Data Lineage Definition
Data lineage is a record of where data comes from, how it has been transformed, and where it ends up. It traces the full journey of a piece of data across systems and processes, so that anyone looking at a figure in a report can work backwards and understand exactly how it was produced.
What does it look like in practice?
Say a revenue figure appears in an executive dashboard. Data lineage tells you: that number came from the orders table in the ecommerce platform, was joined with refund records from the finance system, had returns subtracted in a transformation step, and was loaded into the data warehouse last night at 2am. If the number looks wrong, lineage tells you exactly where to look.
Most lineage is captured and displayed visually, as a graph showing the chain of sources, pipelines, transformations, and destinations connected by arrows.
Why does it matter?
Without lineage, tracking down a data problem means asking several teams where their numbers come from and hoping someone remembers. With it, the answer is documented and auditable. This matters in a few situations in particular:
- Debugging — when a report shows an unexpected result, lineage points to the step where something went wrong
- Compliance — regulations like GDPR require organisations to know where personal data lives and how it moves; lineage makes that answerable
- Change management — before modifying a database column or a pipeline step, lineage shows what else depends on it and what might break
Who uses it and where does it fit?
Data engineers use lineage to debug data pipelines and assess the impact of changes before they make them. Compliance and legal teams use it to respond to audits, particularly where personal data is involved. Analysts use it to verify that the numbers they are reporting on are calculated the way they expect.
It also plays a practical role in Master Data Management (MDM). When a golden record is produced by merging data from several source systems, lineage records which sources contributed and which rules were applied. If a merged record contains an error, lineage is what lets you trace it back to its origin rather than manually checking each upstream system.