Search Duplicate or Similar Content
In most organizations, up to 40% of documents are duplicates or similar content. Such content:
Creates confusion
Complicates search results
Wastes storage space
Duplicate or similar content can be scattered across multiple sources. A single file may have dozens of copies, in the same source or across different folders, drives, emails, and systems.
GEODI automatically detects duplicate or similar content across data sources and helps you focus on clean, deduplicated content, leading to more efficient discovery, classification, and reporting.
A copy is a document with the same content. Duplication and similarity are calculated from content, not format, so a PDF and a DOCX file may be duplicates or similar. Images are included.
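As an illustration of content-based (rather than format-based) comparison, the sketch below groups files by a fingerprint of their extracted text. It is not GEODI's implementation: the helper names `content_fingerprint` and `group_duplicates`, the whitespace and case normalization, and the sample texts are all assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash extracted text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(extracted: dict[str, str]) -> list[list[str]]:
    """Return groups of file names whose extracted text is identical."""
    groups = defaultdict(list)
    for name, text in extracted.items():
        groups[content_fingerprint(text)].append(name)
    # Only groups with more than one member contain duplicates.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical extracted text; real PDF/DOCX text extraction is out of scope here.
extracted = {
    "report.pdf": "Quarterly  results\nfor Q1",
    "report.docx": "Quarterly results for Q1",
    "notes.txt": "Meeting notes",
}
print(group_duplicates(extracted))  # [['report.docx', 'report.pdf']]
```

Because the fingerprint is computed from text and not from file bytes, the PDF and the DOCX land in the same group even though their formats differ.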
Find All Duplicates
These queries find all duplicate content. They are not suitable for remediation; use duplicate2 or original2 instead.
duplicate
→ Lists all documents that have duplicates.
-duplicate
→ Lists only unique documents (those without any duplicates).
duplicate:(doc:a.pdf)
→ Finds copies of a.pdf.
duplicate:"Georgia Aquarium"
→ Finds duplicate documents that contain these words.
Deduplication with GEODI - Identifying and Managing Duplicate Content
The following queries separate originals from duplicates. Assume one document exists as 10 identical copies.
duplicate2:
→ Returns all duplicates within the target dataset (9 documents in this example).
original2:
→ Returns only the originals, one per duplicate set (1 document in this example).
This approach helps you clean your content, reduce storage, and improve the precision of search and reporting.
Used with a Destroy Workflow, duplicate2 effectively removes duplicate content.
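The relationship between the two result sets can be sketched in a few lines. This is only an illustration of the counting: the function name `split_duplicate_set` is hypothetical, and treating the first item as the original is an assumption made for the example, not GEODI's selection rule.

```python
def split_duplicate_set(copies: list[str]) -> tuple[str, list[str]]:
    """Split one duplicate set into (original, duplicates).

    Assumption: the first item is kept as the original.
    """
    if not copies:
        raise ValueError("a duplicate set needs at least one document")
    return copies[0], copies[1:]


copies = [f"copy_{i}.pdf" for i in range(10)]  # ten identical documents
original, duplicates = split_duplicate_set(copies)

print(original)         # copy_0.pdf (the one document original2: would return)
print(len(duplicates))  # 9 (the documents duplicate2: would return)
```

Every document falls into exactly one of the two results, so removing what duplicate2 returns leaves one copy of each document in place.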
Query | Description |
---|---|
duplicate2:(order:New|Old) | Specifies how to choose the original among duplicate files when no source list is defined. |
duplicate2:(i:Source1,Source2,...) | |
duplicate2:(ni:Source1,Source2,...) | |
original2:(..) | Returns only the originals. original2 is the inverse of duplicate2 and takes the same parameters with the same interpretation. |
This query gives you complete control over how GEODI handles duplicates, making cleanup safer, smarter, and aligned with your organizational priorities.
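To make the order parameter concrete, here is a minimal sketch of choosing the original by date. It assumes that New keeps the most recently modified copy and Old keeps the oldest; the text above does not spell this out, so treat that reading, the `Doc` class, and `pick_original` as hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    name: str
    modified: date


def pick_original(copies: list[Doc], order: str) -> Doc:
    """Choose the original within one duplicate set.

    Assumed meaning: "New" keeps the newest copy, "Old" keeps the oldest.
    """
    if order not in ("New", "Old"):
        raise ValueError("order must be 'New' or 'Old'")
    chooser = max if order == "New" else min
    return chooser(copies, key=lambda doc: doc.modified)


copies = [
    Doc("archive/a.pdf", date(2021, 3, 1)),
    Doc("share/a.pdf", date(2023, 7, 15)),
    Doc("mail/a.pdf", date(2022, 1, 9)),
]
print(pick_original(copies, "New").name)  # share/a.pdf
print(pick_original(copies, "Old").name)  # archive/a.pdf
```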
Find Similars
GEODI also finds similar content, for both text and images. Unlike duplication, similarity has a degree. All duplicates are also similar, but you can exclude them from the results.
maxcount:<n>
- Limits the number of similar results to n.
minsimilarity:0.7
- Sets the minimum similarity index. The default is 0.7.
excludeDuplicates:true
- Excludes exact copies. The default is false, which means copies are listed among the similar results.
similar:(doc:a.pdf) → similar content to a.pdf
similar:"Georgia Aquarium" (finds similar documents containing the words)
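The way these three parameters interact can be sketched as follows. The source does not say how GEODI measures similarity, so Jaccard similarity over word sets is used here purely as a stand-in; `find_similar` and its argument names are hypothetical and only mirror maxcount, minsimilarity, and excludeDuplicates.

```python
from typing import Optional


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets (stand-in measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def find_similar(
    target: str,
    corpus: dict[str, str],
    minsimilarity: float = 0.7,
    maxcount: Optional[int] = None,
    exclude_duplicates: bool = False,
) -> list[tuple[str, float]]:
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = []
    for name, text in corpus.items():
        score = word_jaccard(target, text)
        if score < minsimilarity:
            continue
        if exclude_duplicates and score == 1.0:
            continue  # a score of 1.0 is treated as an exact copy here
        scored.append((name, score))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:maxcount]  # maxcount=None keeps every match


target = "the georgia aquarium opens at nine today"
corpus = {
    "copy.docx": "The Georgia Aquarium opens at nine today",
    "edited.pdf": "the georgia aquarium opens at ten today",
    "other.txt": "budget report for march",
}
print(find_similar(target, corpus))
# [('copy.docx', 1.0), ('edited.pdf', 0.75)]
print(find_similar(target, corpus, exclude_duplicates=True))
# [('edited.pdf', 0.75)]
```

With the default settings the exact copy appears among the similar results; setting the exclusion flag leaves only the edited variant.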
Deduplication of Similar Content
When dealing with large sets of similar (but not identical) documents, you often want to retrieve only one representative rather than going through every variant. This improves search efficiency and helps you clean up redundant content through actions like deletion or quarantine.
Query | Description |
---|---|
| Returns one document per group with 70% or more similarity (default threshold) |
| Only returns one document from groups with 90%+ similarity |
| Filters within the results of your custom query (e.g., a folder or date range) |
| Also includes documents that do not have any similar matches in the result set |
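The idea of returning one representative per similarity group can be sketched with a simple greedy grouping. This is an assumption-laden illustration, not GEODI's algorithm: word-set Jaccard stands in for the real similarity measure, each document is compared only against a group's first member, and the names `representatives` and `include_unmatched` are hypothetical.

```python
def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets (stand-in measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def representatives(
    docs: dict[str, str],
    minsimilarity: float = 0.7,
    include_unmatched: bool = False,
) -> list[str]:
    """Return one document name per group of similar documents."""
    groups: list[list[str]] = []  # the first member of each group represents it
    for name, text in docs.items():
        for group in groups:
            if word_jaccard(docs[group[0]], text) >= minsimilarity:
                group.append(name)
                break
        else:
            groups.append([name])  # no close group found, start a new one
    # Groups of size one have no similar match; keep them only on request.
    return [g[0] for g in groups if len(g) > 1 or include_unmatched]


docs = {
    "contract_v1.docx": "service contract between acme and globex for 2023",
    "contract_v2.docx": "service contract between acme and globex for 2024",
    "invoice.pdf": "invoice number 42 for globex",
}
print(representatives(docs))                          # ['contract_v1.docx']
print(representatives(docs, include_unmatched=True))  # ['contract_v1.docx', 'invoice.pdf']
print(representatives(docs, minsimilarity=0.9))       # []
```

Raising the threshold to 0.9 splits the two contract versions apart in this toy data, so no group with more than one member remains.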