Data Deduplication Definition
Data deduplication is the process of identifying records that refer to the same real-world entity (a customer, product, supplier, or location) and consolidating them into a single, accurate record.
How do duplicates appear in the first place?
Duplicates accumulate through normal business operations: a customer places an order through two different channels and gets two accounts, a supplier is entered manually by two teams with slightly different spellings, or a product is imported from multiple sources with different internal identifiers. Systems that lack validation rules or unique-key enforcement are especially prone to duplication over time.
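As a rough illustration of unique-key enforcement at the point of entry, the Python sketch below rejects a new supplier whose normalised name and postcode match an existing one. The `SupplierRegistry` class, the `dedup_key` helper, and the choice of name plus postcode as the key are all hypothetical, chosen only for this example; a real system would more likely enforce this with a database constraint.

```python
def normalise_name(name: str) -> str:
    """Casefold, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in name.casefold() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def dedup_key(name: str, postcode: str) -> tuple[str, str]:
    """Build the key used to detect duplicates (hypothetical choice of fields)."""
    return normalise_name(name), "".join(postcode.split()).upper()


class SupplierRegistry:
    """In-memory stand-in for a table with a unique constraint on the key."""

    def __init__(self) -> None:
        self._by_key: dict[tuple[str, str], str] = {}

    def add(self, name: str, postcode: str) -> None:
        key = dedup_key(name, postcode)
        if key in self._by_key:
            # Reject the entry instead of silently creating a second record.
            raise ValueError(f"possible duplicate of {self._by_key[key]!r}")
        self._by_key[key] = name


registry = SupplierRegistry()
registry.add("Acme Supplies Ltd", "SW1A 1AA")
try:
    # Same supplier, entered by a second team with different formatting.
    registry.add("ACME  Supplies Ltd.", "sw1a1aa")
except ValueError as err:
    print(err)
```

An exact key like this only catches differences in case, spacing, and punctuation. Spelling variants still get through, which is why fuzzy matching is needed after the fact.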
How does deduplication work?
The process typically involves three steps. First, matching: comparing records using exact or fuzzy logic (for example, recognising that "Müller GmbH" and "Muller GmbH" are likely the same entity). Second, scoring: assigning each candidate match a confidence value and ranking the candidates by it. Third, merging: combining the matched records into one, applying survivorship rules to decide which value to keep for each field when records conflict. The merged output serves as the basis for the entity's Golden Record.
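The three steps can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the field names, the 0.9 threshold, and the "most recently updated value wins" survivorship rule are all hypothetical, and production tools generally use more robust matching than a single string-similarity ratio.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def score(a: dict, b: dict) -> float:
    """Steps 1 and 2: fuzzy-compare two records, returning a confidence in [0, 1]."""
    return SequenceMatcher(None, normalise(a["name"]), normalise(b["name"])).ratio()


def merge(a: dict, b: dict) -> dict:
    """Step 3: assumed survivorship rule, newest non-empty value wins per field."""
    newer, older = (a, b) if a["updated"] >= b["updated"] else (b, a)
    return {f: newer.get(f) or older.get(f) for f in newer.keys() | older.keys()}


# Hypothetical sample data.
records = [
    {"id": 1, "name": "Müller GmbH", "city": "Berlin", "phone": "", "updated": "2023-01-10"},
    {"id": 2, "name": "Muller GmbH", "city": "", "phone": "+49 30 1234", "updated": "2024-03-02"},
    {"id": 3, "name": "Schmidt AG", "city": "Hamburg", "phone": "", "updated": "2022-07-19"},
]

THRESHOLD = 0.9  # assumed cut-off; tune against labelled examples

# Rank every pair by confidence and keep those above the threshold.
candidates = sorted(
    ((score(a, b), a, b) for a, b in combinations(records, 2)),
    key=lambda item: item[0],
    reverse=True,
)
matches = [(s, a, b) for s, a, b in candidates if s >= THRESHOLD]

for s, a, b in matches:
    print(f"{a['id']} ~ {b['id']} (confidence {s:.2f}) ->", merge(a, b))
```

Here the two "Müller"/"Muller" records are matched and merged, with the city taken from the older record because the newer one leaves it empty. Comparing every pair is quadratic in the number of records, so larger datasets normally group records into blocks first and compare only within each block.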
What is the difference between deduplication and data cleansing?
Deduplication specifically targets duplicate records. Data cleansing addresses a broader set of data quality problems (incorrect values, missing fields, inconsistent formatting) within individual records, regardless of whether duplicates exist. In practice, the two are usually carried out together as part of a Master Data Management programme.