The Duplicate Detector package identifies assets (tables, views, etc.) that are potential duplicates by comparing their sets of columns in a case-insensitive, order-independent way.

Comparison logic

For each asset:

  • Retrieve the set of columns in that asset
  • Normalize each column name (fold case so that comparison is case-insensitive, and remove underscores)
  • Ignore ordering of the columns within the asset
  • Calculate a unique numeric hash for the set of normalized, unordered columns

Then compare these hashes across assets: identical hashes indicate an identical set of normalized, unordered columns.
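
The comparison steps can be sketched in Python as follows. This is a minimal illustration, not the package's actual implementation: the function names (normalize, column_set_hash, group_duplicates), the choice of lower-casing for case folding, and the use of a truncated SHA-256 digest as the numeric hash are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize(name: str) -> str:
    # Assumption: fold case by lower-casing, then strip underscores.
    return name.lower().replace("_", "")


def column_set_hash(columns: Iterable[str]) -> int:
    # Build a set to drop repeats, then sort it so the original column
    # order has no effect on the result.
    normalized = sorted({normalize(c) for c in columns})
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).digest()
    # Assumption: take the first 8 bytes of the digest as the numeric hash.
    return int.from_bytes(digest[:8], "big")


def group_duplicates(assets: Dict[str, List[str]]) -> Dict[int, List[str]]:
    # Map each hash to the assets that produced it, keeping only the
    # hashes shared by more than one asset.
    groups = defaultdict(list)
    for asset_name, columns in assets.items():
        groups[column_set_hash(columns)].append(asset_name)
    return {h: names for h, names in groups.items() if len(names) > 1}
```

For example, an asset with columns Order_ID and amount would land in the same group as one with columns AMOUNT and orderid. Because a truncated digest can collide in principle, a matching hash is best treated as a sign of a potential duplicate, which is how the package itself describes its results.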

Capture logic

For any assets with identical hashes:

  • Idempotently create a new term in a Duplicate assets glossary, named Dup. (00000000) (where 00000000 is the unique hash)
  • Idempotently link the term to each asset that has that unique hash for its unordered set of columns
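
The idempotent behavior of the capture steps can be illustrated with a small Python sketch. It does not call any real catalog API: the glossary is modeled as a hypothetical in-memory dictionary that maps each term name to the set of linked asset names, and the names term_name and capture are assumptions made for this example.

```python
from typing import Dict, Iterable, Set

# Hypothetical in-memory stand-in for the "Duplicate assets" glossary:
# term name -> names of the assets linked to that term.
Glossary = Dict[str, Set[str]]


def term_name(column_hash: int) -> str:
    # Follows the "Dup. (<hash>)" naming pattern described above.
    return f"Dup. ({column_hash})"


def capture(glossary: Glossary, column_hash: int, asset_names: Iterable[str]) -> str:
    name = term_name(column_hash)
    # Create the term only if it does not exist yet.
    linked = glossary.setdefault(name, set())
    # Linking through a set means re-linking the same asset changes nothing.
    linked.update(asset_names)
    return name
```

Calling capture twice with the same hash and assets leaves the glossary in the same state as calling it once, which is the property that makes repeated runs of the detector safe.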