Data Pipeline Architecture: Components, Types, and How to Build One

Key Takeaways

A data pipeline moves data from one or more sources to a destination, applying transformations along the way.
The main components are ingestion, processing, storage, and delivery.
Batch, streaming, and hybrid are the three core pipeline types, each with different trade-offs.
Most pipeline failures trace back to poor data quality, rigid mapping, or missing error handling.
MDM and pipeline design need to be planned together: pipelines carry data, but master data management ensures it means the same thing in every system.
AtroCore provides a configurable, open-source foundation for building automated data pipelines between ERP, e-commerce, PIM, and other business systems.

What a Data Pipeline Actually Is

A data pipeline is a set of automated steps that moves data from a source to a destination. Between those two points, the data gets extracted, transformed, validated, and loaded. The pipeline handles the mechanics so the receiving system gets clean, structured, usable data without manual intervention.

In practice, most companies run multiple pipelines in parallel. One pulls orders from an e-commerce platform into an ERP. Another syncs product data from a PIM to a web store. A third pushes inventory updates to a fulfillment partner. Each of these is a pipeline, and each one has to run reliably, on schedule, and in the right format for the destination.

The phrase "data pipeline" is sometimes used interchangeably with ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform). These are specific implementation patterns within the broader concept. ETL transforms data before loading it into the destination, typically a data warehouse or operational database. ELT loads raw data first into a data lake or cloud warehouse, then runs transformations inside the destination using its own compute. Both patterns describe pipelines, but not all pipelines follow either pattern strictly. A data flow that moves records from an ERP to a web store via scheduled file export is also a data pipeline, even if it never touches a warehouse or runs SQL.
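The difference between the two patterns is mostly a question of where the transformation runs. The sketch below illustrates this with an in-memory SQLite database standing in for the destination; the table names, field names, and sample rows are assumptions made for the example.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
raw_orders = [
    {"id": "1", "total": " 19.90 "},
    {"id": "2", "total": "5.00"},
]

def etl(rows):
    """ETL: clean the rows inside the pipeline, then load the result."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    cleaned = [(int(r["id"]), float(r["total"].strip())) for r in rows]
    db.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
    return db.execute("SELECT id, total FROM orders ORDER BY id").fetchall()

def elt(rows):
    """ELT: load the raw strings first, then transform with the destination's SQL."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_orders (id TEXT, total TEXT)")
    db.executemany("INSERT INTO raw_orders VALUES (:id, :total)", rows)
    db.execute(
        "CREATE TABLE orders AS "
        "SELECT CAST(id AS INTEGER) AS id, CAST(TRIM(total) AS REAL) AS total "
        "FROM raw_orders"
    )
    return db.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
```

Both functions end with the same cleaned table. In the ELT variant the raw table stays available in the destination, which is the usual reason for choosing it.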

Core Components of a Data Pipeline

Every pipeline, regardless of type or complexity, has the same basic structure.

Ingestion

The entry point. Data arrives from one or more sources: databases, APIs, files, message queues, or user inputs. Source connectors handle the specifics of each system: authentication, connection management, and initial data capture. For systems that expose a REST API, the ingestion layer sends HTTP requests and handles pagination and rate limits. For file-based sources, it monitors directories or FTP endpoints for new data. The reliability of the ingestion layer determines the quality of everything downstream.
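As a rough illustration of pagination and rate-limit handling, the sketch below walks through a paginated API. The `fetch_page` callable, the `RateLimited` exception, and the `items` / `next_page` response fields are all hypothetical; a real connector would wrap an HTTP client and follow whatever paging scheme the source actually uses.

```python
import time
from typing import Callable, Iterator

class RateLimited(Exception):
    """Hypothetical error a client might raise when the API answers HTTP 429."""
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def ingest(fetch_page: Callable[[int], dict],
           sleep: Callable[[float], None] = time.sleep,
           max_waits: int = 5) -> Iterator[dict]:
    """Yield every record from a paginated source, pausing when throttled."""
    page, waits = 1, 0
    while page is not None:
        try:
            body = fetch_page(page)
        except RateLimited as exc:
            waits += 1
            if waits > max_waits:
                raise                  # give up instead of waiting forever
            sleep(exc.retry_after)     # honor the server's hint, then retry the same page
            continue
        yield from body["items"]
        page = body.get("next_page")   # None (or missing) marks the last page

def make_fake_api() -> Callable[[int], dict]:
    """In-memory stand-in for a real API; it throttles page 2 once."""
    pages = {
        1: {"items": [{"sku": "A"}, {"sku": "B"}], "next_page": 2},
        2: {"items": [{"sku": "C"}], "next_page": None},
    }
    state = {"throttled": False}

    def fetch_page(page: int) -> dict:
        if page == 2 and not state["throttled"]:
            state["throttled"] = True
            raise RateLimited(retry_after=0.0)
        return pages[page]

    return fetch_page
```

Passing `sleep` as a parameter keeps the function testable without real waiting.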

Processing

This is where transformation happens. In an ETL pipeline, it is the heaviest step: raw data from the source rarely matches the schema the destination expects. Field names differ. Date formats are inconsistent. Some values need to be calculated from others. The processing layer applies mapping rules, data type conversions, deduplication logic, and validation checks. This is also where errors surface, so the processing layer needs clear rules for what to do when a record fails validation: reject it, flag it, quarantine it, or pass it with a warning.
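A minimal sketch of such a processing step follows. The source field names (`ItemNo`, `Price`, `Changed`), the date format, and the choice to quarantine failed records are assumptions for illustration; the point is that every record ends up either accepted or in an explicit holding area with a reason attached.

```python
from datetime import datetime

def transform(record: dict) -> dict:
    """Map one source record to the destination schema; raise on bad data."""
    sku = record["ItemNo"].strip()
    if not sku:
        raise ValueError("empty ItemNo")
    return {
        "sku": sku,
        "price": round(float(record["Price"]), 2),
        # Assumed source format: day.month.year, converted to ISO 8601.
        "updated": datetime.strptime(record["Changed"], "%d.%m.%Y").date().isoformat(),
    }

def process(records):
    """Return (accepted, quarantined); duplicates by SKU are dropped."""
    accepted, quarantined, seen = [], [], set()
    for rec in records:
        try:
            row = transform(rec)
        except (KeyError, ValueError) as exc:
            quarantined.append({"record": rec, "reason": repr(exc)})
            continue
        if row["sku"] in seen:
            continue                   # deduplication: keep the first occurrence
        seen.add(row["sku"])
        accepted.append(row)
    return accepted, quarantined
```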

Storage

Storage sits between ingestion and delivery for pipelines that need it. Not every pipeline writes to intermediate storage, but batch pipelines typically do. Data lands in a staging area, gets processed, then moves to the destination. The staging layer also enables reprocessing: if a transformation rule changes, you can rerun the pipeline against stored raw data without re-ingesting from the source.
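The reprocessing idea can be sketched with a staging directory of JSON Lines files. The file layout, the `run_id` naming, and the function names are assumptions for the example, not a prescribed format.

```python
import json
from pathlib import Path
from typing import Callable

def stage(records, staging_dir: Path, run_id: str) -> Path:
    """Write raw records to a staging file, one JSON document per line."""
    path = staging_dir / f"{run_id}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")
    return path

def reprocess(path: Path, transform: Callable[[dict], dict]) -> list:
    """Apply a (possibly changed) transformation to staged raw data."""
    with path.open(encoding="utf-8") as fh:
        return [transform(json.loads(line)) for line in fh]
```

If a transformation rule changes, `reprocess` can be called again with the new function on the same staged file, and the source system is never contacted.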

Delivery

The output layer. Data arrives at the destination in the format it expects: a database insert, an API call, a file export, or a message sent to a queue. The delivery layer handles confirmation and retry logic. If the destination returns an error, the pipeline decides whether to retry immediately, retry with backoff, or log the failure and alert an operator.
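Retry with backoff can be sketched in a few lines. The `send` callable and the `TransientError` class are hypothetical stand-ins for a real destination client and its timeout or connection errors; the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Any, Callable

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""

def deliver(send: Callable[[Any], Any], payload: Any,
            attempts: int = 4, base_delay: float = 0.5,
            sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call send(payload), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return send(payload)
        except TransientError:
            if attempt == attempts:
                raise                  # out of attempts: let the caller log and alert
            sleep(base_delay * 2 ** (attempt - 1))   # 0.5s, 1s, 2s, ...
```

Only errors classified as transient are retried. Anything else propagates immediately, because retrying a validation error would only repeat the failure.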

Monitoring, Orchestration, and Lineage

A pipeline that runs silently and fails silently is worse than one that doesn't run at all. Every production pipeline needs event logs, error counts, latency metrics, and alerts when thresholds are breached. This broader capability is called pipeline observability: knowing not just whether the pipeline ran, but whether the data it produced is correct and complete.
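One simple way to turn these metrics into alerts is to check each run's statistics against thresholds. The sketch below assumes a hypothetical `RunStats` record and arbitrary limits; real thresholds depend on the pipeline.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    records_in: int
    records_out: int
    errors: int
    duration_s: float

def check_run(stats: RunStats, max_error_rate: float = 0.02,
              max_duration_s: float = 300.0) -> list:
    """Return a list of alert messages; an empty list means the run looks healthy."""
    alerts = []
    if stats.records_in == 0:
        return ["no input records received"]
    error_rate = stats.errors / stats.records_in
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    # Completeness: every input record should be either delivered or counted as an error.
    missing = stats.records_in - stats.records_out - stats.errors
    if missing != 0:
        alerts.append(f"{missing} records unaccounted for")
    if stats.duration_s > max_duration_s:
        alerts.append(f"run took {stats.duration_s:.0f}s, limit is {max_duration_s:.0f}s")
    return alerts
```

The completeness check is what separates "the pipeline ran" from "the pipeline produced complete output".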

Pipeline orchestration sits above all of this. It manages task sequencing, scheduling, dependency resolution, and retry behavior across the full data flow. Simple pipelines can rely on cron-based scheduling. More complex ones with branching logic or cross-system dependencies need a dedicated orchestration layer that tracks each run's state and handles failures without manual intervention.
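The core of an orchestrator (dependency resolution, per-task state, retries) can be sketched with the standard library's `graphlib` module, available from Python 3.9. The task names and the `run_pipeline` function are illustrative assumptions; production orchestrators add scheduling, persistence, and parallelism on top.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
DEPENDENCIES = {
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

def run_pipeline(tasks: dict, dependencies: dict, max_retries: int = 1) -> dict:
    """Run tasks in dependency order and return the final state of each one."""
    state = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if any(state[dep] != "success" for dep in upstream):
            state[name] = "skipped"    # never run a task on top of a failed dependency
            continue
        state[name] = "failed"
        for _attempt in range(max_retries + 1):
            try:
                tasks[name]()
            except Exception:
                continue               # retry until the attempts are used up
            state[name] = "success"
            break
    return state
```

The returned state dictionary is the piece a cron job does not give you: a record of which step failed and which downstream steps were held back as a result.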

Data lineage is the record of where each piece of data came from, what transformations it went through, and where it ended up. It is a governance requirement, but also an operational tool. When a downstream report shows wrong numbers, lineage is how you trace the problem back to the source. When a source schema changes, lineage tells you which pipelines and destinations are affected before you push the change.
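Both uses of lineage come down to walking a graph of recorded data movements. The sketch below stores lineage as (source, transformation, target) triples and traverses it in either direction; the dataset and field names are invented for the example.

```python
from collections import defaultdict, deque

# Hypothetical lineage records: (source field, transformation, target field).
EDGES = [
    ("erp.items.price_net", "add_vat", "staging.products.price_gross"),
    ("staging.products.price_gross", "copy", "webstore.products.price"),
    ("staging.products.price_gross", "sum_by_month", "reports.revenue"),
    ("erp.items.name", "trim", "staging.products.title"),
]

def trace(edges, start: str, direction: str = "downstream") -> set:
    """Return every node reachable from start, following lineage in one direction."""
    graph = defaultdict(set)
    for source, _transformation, target in edges:
        if direction == "downstream":
            graph[source].add(target)
        else:
            graph[target].add(source)
    seen, queue = set(), deque([start])
    while queue:                       # breadth-first walk over the lineage graph
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Tracing downstream from `erp.items.price_net` answers "what is affected if this field changes"; tracing upstream from `reports.revenue` answers "where did this number come from".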

Pipeline Types and When Each Makes Sense

Batch Pipelines

Batch pipelines collect data over a period of time and process it in bulk at scheduled intervals: hourly, nightly, weekly. They are simpler to build and easier to debug than real-time alternatives. Most business data integration scenarios fit batch processing well. Price updates, product data synchronization, order exports, and inventory reconciliation all tolerate a delay of minutes or hours.

The downside is that freshness is bound by the batch interval. If a product price changes and the next batch runs in six hours, the web store shows the old price for six hours. For many use cases, that is acceptable. For others, it isn't.

Streaming Pipelines

Streaming pipelines process data continuously as it arrives, event by event. Latency drops to seconds or milliseconds. Use cases that actually require this include fraud detection, real-time inventory tracking across multiple warehouses, and live pricing engines.

Streaming pipelines are significantly harder to build and operate than batch pipelines. They require infrastructure that handles out-of-order events, state management across a stream, and fault tolerance under high throughput. Unless the business case actually demands sub-minute data freshness, the added complexity is hard to justify.

Hybrid Pipelines

Hybrid architectures combine streaming ingestion with batch processing. Data arrives continuously and is stored in a buffer or queue. Processing runs on that buffer at intervals, or in micro-batches every few seconds. Micro-batch processing is a practical middle ground: you get significantly fresher data than a nightly batch without the full operational complexity of true streaming. Many platforms that advertise "near real-time" are actually running micro-batches.
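Micro-batching can be sketched as grouping a continuous event stream into short time windows. The example assumes events arrive as `(timestamp, payload)` pairs in timestamp order, which real streams do not guarantee; the five-second window is arbitrary.

```python
from typing import Iterable, Iterator, Tuple

def micro_batches(events: Iterable[Tuple[float, str]],
                  window_s: float = 5.0) -> Iterator[list]:
    """Group ordered (timestamp, payload) events into windows of window_s seconds."""
    batch, window_end = [], None
    for ts, payload in events:
        if window_end is None:
            window_end = ts + window_s     # first event opens the first window
        if ts >= window_end:
            yield batch                    # window closed: hand the batch to processing
            batch, window_end = [], ts + window_s
        batch.append(payload)
    if batch:
        yield batch                        # flush whatever is left at the end

EVENTS = [(0.0, "a"), (1.2, "b"), (4.9, "c"), (5.1, "d"), (12.0, "e")]
```

With the sample events, the first three fall into one window and the last two each open a new one.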

Lambda architecture is a well-known hybrid pattern that maintains separate batch and streaming layers with a serving layer that merges outputs. It is powerful but complex to maintain, because the same transformation logic has to be implemented twice. Kappa architecture simplifies this by treating everything as a stream, including historical reprocessing.

A related pattern worth knowing is change data capture (CDC). Rather than extracting a full dataset on every run, CDC monitors the source system's transaction log and captures only the rows that changed since the last run. This reduces load on source systems dramatically and enables continuous, low-latency data integration without requiring a full streaming infrastructure. For manufacturers running ERP systems with high transaction volumes, CDC is often the most practical path to near-real-time data without rebuilding the entire integration layer.
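The mechanics of CDC can be illustrated with a simplified in-memory change log. Real CDC tools read the database's own transaction log; the entry format below (`lsn`, `op`, `key`, `row`) is an assumption for the sketch. The key idea is that the consumer remembers its last position and applies only what came after it.

```python
def apply_changes(change_log: list, replica: dict, last_lsn: int) -> int:
    """Apply log entries newer than last_lsn to replica; return the new position."""
    for entry in change_log:
        if entry["lsn"] <= last_lsn:
            continue                       # already applied in an earlier run
        if entry["op"] == "delete":
            replica.pop(entry["key"], None)
        else:                              # insert and update both set the row
            replica[entry["key"]] = entry["row"]
        last_lsn = entry["lsn"]
    return last_lsn

# Hypothetical log: each entry carries a monotonically increasing sequence number.
CHANGE_LOG = [
    {"lsn": 1, "op": "insert", "key": "A1", "row": {"price": 9.5}},
    {"lsn": 2, "op": "update", "key": "A1", "row": {"price": 9.9}},
    {"lsn": 3, "op": "insert", "key": "B2", "row": {"price": 4.0}},
    {"lsn": 4, "op": "delete", "key": "B2", "row": None},
]
```

Because the position is stored between runs, each run touches only the changed rows rather than re-extracting the full table.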

For most mid-sized manufacturing or distribution companies, a well-built batch pipeline with short intervals covers the large majority of integration needs.

Where Data Pipelines Break

Schema drift is the most common cause of pipeline breakage. A source system updates its API response and adds, renames, or removes fields. The pipeline's mapping logic, written against the old schema, either breaks or silently passes wrong data through. Pipelines need schema validation at ingestion so changes are caught before they corrupt the destination. Data lineage helps here too: knowing which pipelines depend on a given source field means you can assess the blast radius of a schema change before it reaches production.
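A basic schema check at ingestion might look like the sketch below. The expected schema and field names are assumptions; dedicated validation libraries offer far more, but the principle is the same: compare each incoming record against what the mapping logic was written for.

```python
# Hypothetical contract the pipeline's mapping logic was written against.
EXPECTED = {"sku": str, "price": float, "stock": int}

def check_schema(record: dict, expected: dict = EXPECTED) -> list:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for field, typ in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            problems.append(
                f"wrong type for {field}: expected {typ.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields the source added or renamed show up as unexpected.
    for field in sorted(record.keys() - expected.keys()):
        problems.append(f"unexpected field: {field}")
    return problems
```

A renamed field surfaces as one missing field plus one unexpected field, which is usually enough to spot the drift before bad data reaches the destination.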

Data quality issues accumulate downstream. Null values where the destination expects a required field. Text in a numeric column. Duplicate records because the source system allows them. The processing layer has to handle these explicitly, not pass them through and let the destination deal with them.

Tight coupling is the third problem. When pipeline logic is written against the specific field names, data types, or API structure of one system, any change to that system breaks the pipeline. Configurable mapping layers fix this. Transformation rules stored as configuration rather than code can be updated without touching the pipeline itself.
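To make the idea concrete, the sketch below keeps the mapping rules in a JSON document and leaves the pipeline code generic. The rule format (`target`, `source`, `convert`, `default`) is invented for the example and is not the format of any particular tool.

```python
import json

# Mapping rules as configuration; in practice this would live in a file or a UI.
MAPPING_JSON = """
[
  {"target": "sku",   "source": "ItemNo",    "convert": "str"},
  {"target": "price", "source": "UnitPrice", "convert": "float"},
  {"target": "stock", "source": "Qty",       "convert": "int", "default": 0}
]
"""

CONVERTERS = {"str": str, "float": float, "int": int}

def apply_mapping(record: dict, rules: list) -> dict:
    """Build a destination record by applying each configured rule in turn."""
    out = {}
    for rule in rules:
        value = record.get(rule["source"], rule.get("default"))
        convert = CONVERTERS[rule["convert"]]
        out[rule["target"]] = None if value is None else convert(value)
    return out

RULES = json.loads(MAPPING_JSON)
```

If the source renames `UnitPrice`, only the JSON changes; `apply_mapping` stays as it is.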

Missing error handling and retry logic turn transient failures into data loss. Networks fail. APIs time out. Destination systems go down for maintenance. A pipeline without retry logic drops records permanently when these things happen.

Related to this is idempotency. If a pipeline step runs twice on the same data due to a retry, the result should be the same as if it ran once. Pipelines that are not idempotent create duplicate records or incorrect aggregates whenever a retry fires.
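The difference is easy to see by comparing an append-style load with a keyed upsert. The `order_id` key and the in-memory destinations are assumptions for the sketch; in a database the same effect is typically achieved with a unique key and an upsert statement.

```python
def load_append(destination: list, batch: list) -> None:
    """Not idempotent: every run adds the rows again."""
    destination.extend(batch)

def load_upsert(destination: dict, batch: list, key: str = "order_id") -> None:
    """Idempotent: rows are keyed, so a repeated run overwrites instead of duplicating."""
    for row in batch:
        destination[row[key]] = row

BATCH = [{"order_id": 1, "total": 20.0}, {"order_id": 2, "total": 35.0}]
```

Running each loader twice on `BATCH`, as a retry would, leaves four rows in the appended list but still two in the upserted dictionary.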

Data Pipelines and Master Data Management

Pipeline architecture and master data management (MDM) are closely related, and the relationship is often underestimated at the start of integration projects.

MDM is the discipline of creating and maintaining a single, authoritative record for core business entities: customers, suppliers, products, materials, and locations. A master data record is the trusted reference that all systems agree on.

Pipelines carry data between systems, but without a managed master record at the center, each pipeline can introduce its own version of the same entity. One system calls a product "Steel Bracket M6." Another calls it "Bracket, M6, Steel." A third uses an internal code with no label at all. The pipeline moves the data; MDM ensures it means the same thing everywhere it lands.

In practice, this means MDM and pipeline design have to be planned together. The transformation logic inside a pipeline often depends on a master data layer: mapping source codes to canonical identifiers, resolving duplicates against a golden record, and enriching incoming records with attributes from a central repository. Without that layer, transformation rules become a patchwork of hardcoded lookups that grow harder to maintain with every new source system.
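A minimal sketch of that master data lookup follows. The golden record, the crosswalk table, and all identifiers are hypothetical; the pattern is that each pipeline resolves its source's local code against one shared mapping rather than carrying its own lookup.

```python
# Hypothetical golden records, keyed by canonical master ID.
GOLDEN = {
    "P-0001": {"name": "Steel Bracket M6", "material": "steel"},
}

# Hypothetical crosswalk: (source system, local code) -> master ID.
CROSSWALK = {
    ("erp", "100234"): "P-0001",
    ("webstore", "bracket-m6-steel"): "P-0001",
}

def resolve(system: str, local_code: str, payload: dict):
    """Attach the master ID and golden attributes, or return None if unmapped."""
    master_id = CROSSWALK.get((system, local_code))
    if master_id is None:
        return None                    # unmapped code: route to data stewardship
    # Golden attributes are applied last so they win over source-supplied values.
    return {**payload, "master_id": master_id, **GOLDEN[master_id]}
```

Two systems with different local names for the same product resolve to the same master ID and the same canonical name.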

For manufacturers, the most common master data domains flowing through pipelines are product data, supplier records, and bill-of-materials structures. When product master data is managed centrally and pipelines pull from that single source, downstream systems (web stores, ERPs, procurement platforms) receive consistent, validated data on every run. When master data is fragmented across systems and pipelines pull from each independently, inconsistencies compound with every synchronization cycle.

The MDM layer belongs in the architecture from the start, with the same priority as the ingestion or transformation layer.

Building a Data Pipeline: Practical Steps

Start with a clear definition of source and destination. Define the source system, its data format, and whether it delivers on a schedule or a trigger. Define what the destination expects, what schema it requires, and how it handles missing or malformed records.

Map the transformation logic before writing any code or configuring any tool. Every field in the destination schema needs a source. Every mismatch in format, unit, or structure needs a transformation rule. Doing this on paper first surfaces problems early and makes the actual implementation faster.

Build error handling from the start, not as an afterthought. Define explicitly what happens to records that fail validation: reject with logging, quarantine for manual review, or pass with a warning flag. Build the alerting before the pipeline goes to production.

Test with real data, not synthetic data. Synthetic data misses the edge cases that real data carries: encoding issues, empty strings where nulls are expected, locale-specific date formats, values outside expected ranges. Run the pipeline against a sample of actual source data in a staging environment.

Monitor continuously after deployment. Track record counts in versus records out. Alert on error rate thresholds. Log every run with timestamps and row counts. A pipeline with full observability from day one costs almost nothing extra to maintain; one without it accumulates invisible debt until something breaks in production.

How AtroCore Supports Data Pipeline Workflows

In our experience, the recurring problem is the tooling: custom scripts that break on every source system update, or expensive middleware that needs vendor involvement to reconfigure. In several cases, teams were running five or more separate scripts to sync product data between an ERP, a PIM, and two sales channels, with no error logging and no alerting.

AtroCore is a free, open-source data platform with a built-in integration layer. Its Import and Export modules handle ingestion and delivery across REST APIs, FTP, file sources, and databases. Mapping rules are configured through the UI rather than hardcoded, so they stay maintainable when upstream systems change. Runs are logged with record counts and error details, covering pipeline observability without a separate monitoring stack. The platform connects natively to ERP systems, including SAP, Oracle, NetSuite, and Business Central, as well as e-commerce platforms, including Shopify and Adobe Commerce, and acts as the central orchestration layer across all of them.

For companies that also need MDM, AtroCore's broader data platform manages master data alongside pipeline execution in a single instance. Full details on the integration platform are at atrocore.com/en/integration-platform.
