Data versioning is the systematic practice of tracking and managing changes to datasets over time, analogous to source code version control. It creates immutable, timestamped snapshots of data, enabling reproducibility, rollback to previous states, and comprehensive lineage tracking. This is critical for machine learning, where model performance is intrinsically linked to specific training data states. Unlike simple backups, versioning treats data as a first-class artifact in the development lifecycle, linking it to specific code commits and model iterations.
