What is Data Deduplication?

Data Deduplication Definition

Data Deduplication is the process of identifying records that refer to the same real-world entity (a customer, product, supplier, or location) and consolidating them into a single, accurate record.

How do duplicates appear in the first place?

Duplicates accumulate through normal business operations: a customer places an order through two different channels and gets two accounts, a supplier is entered manually by two teams with slightly different spellings, or a product is imported from multiple sources with different internal identifiers. Systems that lack validation rules or unique-key enforcement are especially prone to duplication over time.
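As a rough illustration of unique-key enforcement, the sketch below rejects a second supplier whose name differs from an existing one only in letter case or spacing. The `SupplierRegistry` class, the `DuplicateKeyError` exception, and the sample supplier names are all hypothetical and invented for this example; a production system would more likely enforce this with a database constraint.

```python
class DuplicateKeyError(ValueError):
    """Raised when a new record collides with an existing normalised key."""


class SupplierRegistry:
    """Hypothetical in-memory registry that enforces a unique normalised name."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    @staticmethod
    def key(name: str) -> str:
        # Case-fold and collapse runs of whitespace so trivial variants collide.
        return " ".join(name.casefold().split())

    def add(self, name: str) -> None:
        k = self.key(name)
        if k in self._by_key:
            raise DuplicateKeyError(
                f"{name!r} duplicates existing supplier {self._by_key[k]!r}"
            )
        self._by_key[k] = name


registry = SupplierRegistry()
registry.add("Acme Supplies Ltd")  # made-up sample name

try:
    registry.add("ACME  Supplies ltd")  # same supplier, entered by another team
    rejected = False
except DuplicateKeyError:
    rejected = True
```

A check like this only catches variants that normalise to the same key. Genuinely different spellings still get through, which is why fuzzy matching is needed afterwards.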

How does deduplication work?

The process typically involves three steps. First, matching: comparing records using exact or fuzzy logic (for example, recognising that "Müller GmbH" and "Muller GmbH" are likely the same entity). Second, scoring: ranking candidate matches by confidence, so that high-confidence pairs can be merged and borderline ones reviewed. Third, merging: combining the matched records into one, applying survivorship rules (for example, preferring the most recently updated or most complete value) to decide which value to keep for each field when records conflict. The merged result feeds into a Golden Record, the single trusted version of that entity.
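The three steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the field names, the sample records, the 0.85 threshold, and the "newest non-empty value wins" survivorship rule are all hypothetical choices made for the example.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLD = 0.85  # assumed cut-off for treating a pair as a match


def normalise(name: str) -> str:
    """Lower-case, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_accents.lower().split())


def score(a: dict, b: dict) -> float:
    """Step 2, scoring: name similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()


def find_matches(records: list[dict]) -> list[tuple[float, int, int]]:
    """Step 1, matching: compare every pair and keep those above the threshold,
    ranked by confidence (highest first)."""
    pairs = [
        (score(records[i], records[j]), i, j)
        for i, j in combinations(range(len(records)), 2)
    ]
    return sorted((p for p in pairs if p[0] >= THRESHOLD), reverse=True)


def merge(a: dict, b: dict) -> dict:
    """Step 3, merging: assumed survivorship rule that takes each field from the
    most recently updated record and falls back to the older one when empty."""
    newer, older = (a, b) if a["updated"] >= b["updated"] else (b, a)
    return {field: newer.get(field) or older.get(field) for field in {**older, **newer}}


# Made-up sample records for illustration only.
records = [
    {"name": "Müller GmbH", "city": "Berlin", "phone": "", "updated": "2023-01-10"},
    {"name": "Muller GmbH", "city": "", "phone": "+49 30 1234", "updated": "2024-03-02"},
    {"name": "Schmidt AG", "city": "Hamburg", "phone": "", "updated": "2024-01-05"},
]

matches = find_matches(records)
merged = [merge(records[i], records[j]) for _, i, j in matches]
```

Here the two "Müller" records normalise to the same string and are merged, while "Schmidt AG" scores well below the threshold and is left alone. Comparing every pair is only practical for small datasets.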

What is the difference between deduplication and data cleansing?

Deduplication specifically targets duplicate records. Data quality and cleansing address a broader set of problems (incorrect values, missing fields, inconsistent formatting) within individual records, regardless of whether duplicates exist. In practice, the two are usually carried out together as part of a Master Data Management programme.

Related Articles

Data Steward: Role, Responsibilities, and Challenges