Java Interview Question - Detecting Duplicate Product Titles

Engineering With Java
Apr 9, 2026

Key Takeaways

  • Pairwise comparison of titles takes O(n²) time, which is unsuitable for large catalogs
  • A canonical signature built from sorted, lower‑cased words enables O(n·k log k) grouping
  • Keying a HashMap by signature means each title is normalized and sorted once, not once per comparison
  • Extending normalization to strip punctuation and remove stopwords reduces missed duplicates

Pulse Analysis

Detecting duplicate product titles is a classic text‑normalization problem that can cripple an e‑commerce marketplace if handled naively. The original code iterated over every pair of titles and sorted both titles' words again on every comparison. While functionally correct, that approach has quadratic time complexity and repeatedly reprocesses the same word sets: a catalog of one million titles contains roughly 5 × 10¹¹ pairs, which makes the approach impractical at that scale. Modern backend services need algorithms whose cost grows roughly linearly with the number of listings while preserving accuracy.
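The original code is not reproduced here, so the following is only a hypothetical reconstruction of the pairwise approach described above. The class name `PairwiseDuplicateCheck` and its methods are illustrative assumptions; the point is to show where the repeated sorting and the quadratic pair loop come from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical reconstruction of the naive approach; not the article's original code.
final class PairwiseDuplicateCheck {

    // Lower-cases the title, splits it on whitespace, and sorts the words.
    static String[] sortedWords(String title) {
        String[] words = title.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Arrays.sort(words);
        return words;
    }

    // Compares every pair and re-sorts both titles on each comparison,
    // so the total work is O(n^2 * k log k).
    static List<String[]> findDuplicatePairs(List<String> titles) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < titles.size(); i++) {
            for (int j = i + 1; j < titles.size(); j++) {
                String[] left = sortedWords(titles.get(i));
                String[] right = sortedWords(titles.get(j));
                if (Arrays.equals(left, right)) {
                    pairs.add(new String[] {titles.get(i), titles.get(j)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("Red Running Shoes", "shoes red running", "Blue Jacket");
        for (String[] pair : findDuplicatePairs(titles)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Each title is sorted n − 1 times in this sketch, which is the redundant work a per-title signature removes.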

A proven solution is to compute a canonical signature for each title. By converting the title to lower case, splitting on whitespace, sorting the resulting word array, and joining the words back into a single string, every permutation of the same words maps to an identical key. Storing titles in a HashMap keyed by this signature gives expected constant‑time insertion per title, so sorting dominates and the overall time drops to O(n · k log k), where n is the number of titles and k the average word count. Memory usage remains linear, dominated by the storage of the original titles and their signatures.
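A minimal sketch of the signature approach described above, assuming titles arrive as a `List<String>`. The names `TitleSignatureGrouper`, `signature`, and `groupBySignature` are illustrative choices, not taken from the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: class and method names are assumptions made for illustration.
final class TitleSignatureGrouper {

    // Canonical key: lower-case, split on whitespace, sort the words, re-join.
    static String signature(String title) {
        String[] words = title.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Arrays.sort(words);
        return String.join(" ", words);
    }

    // One pass over the titles; each title is normalized and sorted exactly once.
    static Map<String, List<String>> groupBySignature(List<String> titles) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String title : titles) {
            groups.computeIfAbsent(signature(title), key -> new ArrayList<>()).add(title);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("Red Running Shoes", "shoes red running", "Blue Jacket");
        // Any group holding more than one title is a set of duplicates.
        groupBySignature(titles).values().stream()
                .filter(group -> group.size() > 1)
                .forEach(System.out::println);
    }
}
```

Groups with a single entry are unique titles. Keeping the original titles as map values is what makes memory linear in the input, as noted above.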

Real‑world product listings often contain punctuation, symbols, and common stopwords that should not affect duplicate detection. Enhancing the normalization step to strip characters like hyphens, commas, and apostrophes, and to filter out stopwords such as "the" or "of," yields cleaner signatures and fewer false negatives, that is, fewer duplicates that go undetected. This refined pipeline not only catches more true duplicates but also supports downstream features like search relevance scoring and automated moderation, delivering a more trustworthy marketplace experience.
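One way the extended normalization could look is sketched below. The stopword list is a small illustrative assumption, as is the choice to delete apostrophes outright while treating hyphens, commas, and other punctuation as word separators; a production system would tune both to its catalog.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: the stopword list and punctuation rules are illustrative assumptions.
final class NormalizedTitleSignature {

    private static final Set<String> STOPWORDS = Set.of("the", "of", "a", "an", "and", "for");

    static String signature(String title) {
        String cleaned = title.toLowerCase(Locale.ROOT)
                // Delete straight and typographic apostrophes: "men's" becomes "mens".
                .replaceAll("['\u2019]", "")
                // Anything else that is not a letter, digit, or whitespace becomes a separator.
                .replaceAll("[^\\p{L}\\p{N}\\s]", " ");
        return Arrays.stream(cleaned.trim().split("\\s+"))
                .filter(word -> !word.isEmpty() && !STOPWORDS.contains(word))
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Both titles reduce to the same signature under these rules.
        System.out.println(signature("The Men's T-Shirt, Blue"));
        System.out.println(signature("blue mens t shirt"));
    }
}
```

Because hyphens become separators here, "T-Shirt" matches "t shirt" but not "tshirt". Removing hyphens instead would reverse that trade-off, so the rule is worth checking against real listings.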
