Duplicate Detector
This content has moved!
We have moved our samples to a separate, dedicated site: https://solutions.atlan.com.
This document is no longer being maintained.
The Duplicate Detector package identifies assets (tables, views, etc.) that are potential duplicates by comparing the sets of columns within them, ignoring both letter case and column order.
Comparison logic
For each asset:
- Retrieve the set of columns in that asset
- Normalize the name of each column (make it case-insensitive, remove `_`'s)
- Ignore ordering of the columns within the asset
- Calculate a unique numeric hash for the set of normalized, unordered columns
Then compare these hashes between assets, to look for identical hashes (indicating an identical set of normalized, unordered columns).
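The comparison steps above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function names (`normalize`, `column_set_hash`, `find_duplicates`), the sample asset names, and the choice of a truncated SHA-256 digest as the numeric hash are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def normalize(name: str) -> str:
    """Lower-case the column name and strip underscores."""
    return name.lower().replace("_", "")


def column_set_hash(columns: list[str]) -> int:
    """Numeric hash of the normalized, unordered set of column names.

    Assumption: SHA-256 truncated to 8 bytes stands in for whatever
    hash the real package uses.
    """
    # A set discards duplicates; sorting makes the result independent
    # of the order in which the columns were supplied.
    normalized = sorted({normalize(c) for c in columns})
    # Join with a separator that cannot appear in a normal column name.
    joined = "\x1f".join(normalized)
    digest = hashlib.sha256(joined.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def find_duplicates(assets: dict[str, list[str]]) -> list[list[str]]:
    """Group asset names that share an identical column-set hash."""
    by_hash: dict[int, list[str]] = defaultdict(list)
    for asset_name, columns in assets.items():
        by_hash[column_set_hash(columns)].append(asset_name)
    # Only hashes shared by two or more assets indicate potential duplicates.
    return [sorted(names) for names in by_hash.values() if len(names) > 1]


# Hypothetical sample data: two tables with the same columns in a
# different case, order, and underscore style, plus one unrelated table.
sample_assets = {
    "db.orders": ["ORDER_ID", "customer_id", "Total"],
    "db.orders_copy": ["total", "OrderId", "CUSTOMERID"],
    "db.customers": ["customer_id", "name"],
}
print(find_duplicates(sample_assets))  # [['db.orders', 'db.orders_copy']]
```

Note that two columns within the same asset that normalize to the same name (for example `user_id` and `UserId`) collapse into one entry in this sketch, because the comparison is made on a set.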
Capture logic
For any assets with identical hashes:
- Idempotently create a new term in a `Duplicate assets` glossary, named `Dup. (00000000)` (where `00000000` is the unique hash)
- Idempotently link the term to each asset that has that unique hash for its unordered set of columns
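To illustrate what "idempotently" means for the capture steps above, here is a small sketch that uses an in-memory stand-in for the glossary. The `InMemoryGlossary` class and the `capture_duplicates` function are hypothetical and exist only for this example; they are not part of any Atlan SDK, and a real implementation would call the platform's own APIs instead.

```python
from __future__ import annotations


class InMemoryGlossary:
    """Hypothetical stand-in for a glossary (not an Atlan API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Maps term name -> set of asset names linked to that term.
        self.terms: dict[str, set[str]] = {}

    def ensure_term(self, term_name: str) -> set[str]:
        # setdefault creates the term only if it is missing, so calling
        # this repeatedly never produces a second copy of the term.
        return self.terms.setdefault(term_name, set())

    def link(self, term_name: str, asset_name: str) -> None:
        # Adding to a set is a no-op when the link already exists.
        self.ensure_term(term_name).add(asset_name)


def capture_duplicates(
    glossary: InMemoryGlossary, assets_by_hash: dict[int, list[str]]
) -> None:
    """Create one term per shared hash and link every matching asset."""
    for hash_value, asset_names in assets_by_hash.items():
        if len(asset_names) < 2:
            continue  # a hash held by a single asset is not a duplicate
        term_name = f"Dup. ({hash_value})"
        glossary.ensure_term(term_name)
        for asset_name in asset_names:
            glossary.link(term_name, asset_name)


# Hypothetical input: hash values mapped to the assets that produced them.
groups = {
    12345678: ["db.orders", "db.orders_copy"],
    87654321: ["db.customers"],
}
glossary = InMemoryGlossary("Duplicate assets")
capture_duplicates(glossary, groups)
capture_duplicates(glossary, groups)  # second run changes nothing
print(glossary.terms)
```

Because both the term creation and the linking are safe to repeat, the detector can be rerun on a schedule without piling up duplicate terms or links.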