Search Duplicate or Similar Content

In most organizations, up to 40% of documents are duplicates or similar. This content:

  • Creates confusion

  • Complicates search results

  • Wastes storage space

Duplicate or similar content can be scattered across multiple sources. A single file may have dozens of copies, in the same source or across different folders, drives, emails, and systems.

GEODI automatically detects duplicate or similar content across data sources and helps you focus on clean, deduplicated content, leading to more efficient discovery, classification, and reporting.

A copy is a document with the same content. Duplication and similarity are calculated from content, not format, so a PDF and a DOCX file may be duplicates or similar. Images are included.
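
To illustrate what content-based (rather than format-based) comparison means, here is a minimal Python sketch. It is not GEODI's implementation, which this page does not describe. It assumes the text has already been extracted from each file, and the function name `content_fingerprint` and the sample strings are hypothetical.

```python
import hashlib
import re


def content_fingerprint(extracted_text: str) -> str:
    """Hash a document's text after normalizing case and whitespace."""
    normalized = re.sub(r"\s+", " ", extracted_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical text already extracted from two files in different formats.
pdf_text = "Georgia Aquarium\nAnnual Visitor Report"
docx_text = "Georgia  Aquarium Annual Visitor Report "

# The layout differs, but the content is the same, so the fingerprints match.
same_content = content_fingerprint(pdf_text) == content_fingerprint(docx_text)
```

Because the fingerprint is computed from normalized text, the file format and layout do not affect the comparison in this sketch.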

 

 

 

Find All Duplicates

These queries find all duplicate content, originals included, so they are not suitable for remediation. Use duplicate2 or original2 for that instead.

  • duplicate → Lists all documents that have duplicates.

  • -duplicate → Lists only unique documents (those without any duplicates).

  • duplicate:(doc:a.pdf) → Finds copies of a.pdf.

  • duplicate:"Georgia Aquarium" → Finds duplicate documents that contain these words.

 

Deduplication with GEODI - Identifying and Managing Duplicate Content

The following queries separate the originals from the duplicates. Assume a set of 10 identical documents.

  • duplicate2: → Returns all duplicates within the target dataset, excluding the original → returns 9 documents

  • original2: → Returns only the originals, one per duplicate set → returns 1 document

This approach helps you clean your content, reduce storage, and improve the precision of search and reporting. When used with a Destroy Workflow, duplicate2 effectively removes duplicate content.

Query

Description

duplicate2:(order:New|Old)

Specifies how to choose the original among duplicate files when no source list is defined:

  • order:Old → Older versions are treated as the originals

  • order:New → Newer versions are kept, older ones are marked as duplicates

duplicate2:(i:Source1,Source2,...)

  • i: stands for Important. Copies in the listed sources are kept; copies that exist elsewhere are marked for deletion or another action.

  • Example: duplicate2:(i:HR_Drive,Legal_Archive) → Keep duplicates found in HR_Drive and Legal_Archive, remove others

duplicate2:(ni:Source1,Source2,...)

  • ni: stands for Not Important. If a duplicate exists outside the listed sources, the files in the listed sources can be safely deleted.

  • Example: duplicate2:(ni:Temp,Shared_Folder) → If a duplicate exists somewhere else, delete the copy in Temp or Shared_Folder

original2:(..)

Use original2 to list only the originals. It is the opposite of duplicate2 and accepts the same parameters with the same interpretation.

 

These parameters give you control over how GEODI handles duplicates, making cleanup safer and aligned with your organizational priorities.
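
The split between originals and duplicates, and the effect of the order parameter, can be sketched in Python as follows. This is only an illustration under stated assumptions, not GEODI's implementation: the `Doc` class, the `split_originals` function, the use of a modification date to decide which copy is older, and the sample data are all hypothetical, and the i: and ni: source lists are not modeled.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    name: str
    modified: date
    fingerprint: str  # identical for documents with the same content


def split_originals(docs, order="Old"):
    """Return (originals, duplicates), with one original per duplicate set.

    Assumption: order="Old" keeps the oldest copy as the original and
    order="New" keeps the newest, judged by the modification date.
    """
    groups = {}
    for doc in docs:
        groups.setdefault(doc.fingerprint, []).append(doc)

    originals, duplicates = [], []
    for copies in groups.values():
        ranked = sorted(copies, key=lambda d: d.modified, reverse=(order == "New"))
        originals.append(ranked[0])      # what original2 would list
        duplicates.extend(ranked[1:])    # what duplicate2 would list
    return originals, duplicates


# Ten copies of the same content, saved on ten consecutive days.
copies = [Doc(f"report_v{i}.pdf", date(2024, 1, i), "abc") for i in range(1, 11)]
originals, duplicates = split_originals(copies, order="Old")
```

With these ten copies, the sketch yields one original and nine duplicates, matching the counts described above. Switching to order="New" changes which copy is kept, not how many.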

 

 

Find Similars

As with duplicates, GEODI finds similarities in both text and image content. Unlike duplication, similarity has a degree. All duplicates are also similar, but you can exclude them.

  • maxcount:<n> → Limits the number of similar documents returned to n.

  • minsimilarity:0.7 → Sets the minimum similarity index. The default is 0.7.

  • excludeDuplicates:true → Excludes exact copies. The default is false, which means copies are listed among the similar documents.

similar:(doc:a.pdf) → Finds content similar to a.pdf

similar:"Georgia Aquarium" → Finds similar documents containing these words
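
The interaction of the three parameters can be illustrated with a small Python sketch. The similarity measure GEODI uses is not described on this page, so the sketch substitutes Jaccard similarity over word sets as a stand-in. The function names and the sample corpus are hypothetical.

```python
import re


def jaccard(text_a: str, text_b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def find_similar(target, corpus, minsimilarity=0.7, maxcount=None,
                 exclude_duplicates=False):
    """Rank documents in `corpus` (name -> text) by similarity to `target`."""
    hits = []
    for name, text in corpus.items():
        score = jaccard(target, text)
        if score < minsimilarity:
            continue
        if exclude_duplicates and score == 1.0:
            continue  # an exact copy, skipped when duplicates are excluded
        hits.append((name, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:maxcount]  # maxcount=None returns every hit


target = "Georgia Aquarium visitor report 2023 final"
corpus = {
    "copy.docx": "Georgia Aquarium visitor report 2023 final",
    "draft.pdf": "Georgia Aquarium visitor report 2023 draft",
    "menu.txt": "Cafeteria lunch menu",
}

all_hits = find_similar(target, corpus)
without_copies = find_similar(target, corpus, exclude_duplicates=True)
```

In this sketch, `all_hits` lists the exact copy first and the draft second, while `without_copies` lists only the draft. The unrelated menu file falls below the 0.7 threshold in both cases.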

 

Deduplication of Similar Content

When dealing with large sets of similar (but not identical) documents, you often want to retrieve only one representative rather than going through every variant. This improves search efficiency and helps you clean up redundant content through actions like deletion or quarantine.

Query

Description

benzer2

Returns one document per group with 70% or more similarity (default threshold)

benzer2:(distance:0.9)

Returns one document per group with 90% or more similarity

benzer2:(<query>)

Filters within the results of your custom query (e.g., a folder or date range)

benzer2:(getnonsimilar:true)

Also includes documents that do not have any similar matches in the result set
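
The idea of returning one representative per group of similar documents can be sketched in Python. This is an illustration under assumptions, not GEODI's algorithm: it uses Jaccard similarity over word sets as a stand-in measure and a simple greedy grouping, and it assumes that documents without any similar match are left out unless explicitly requested. The function names, parameter names, and sample corpus are hypothetical.

```python
import re


def jaccard(text_a: str, text_b: str) -> float:
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def representatives(corpus, threshold=0.7, get_non_similar=False):
    """Return one document name per group of similar documents."""
    groups = []  # each group is a list of names; the first is the representative
    for name, text in corpus.items():
        for group in groups:
            if jaccard(text, corpus[group[0]]) >= threshold:
                group.append(name)  # similar enough to an existing group
                break
        else:
            groups.append([name])  # no match: start a new group
    # Single-member groups are dropped unless get_non_similar is set.
    return [g[0] for g in groups if get_non_similar or len(g) > 1]


corpus = {
    "report_final.pdf": "Georgia Aquarium visitor report 2023 final",
    "report_draft.docx": "Georgia Aquarium visitor report 2023 draft",
    "menu.txt": "Cafeteria lunch menu",
}

default_result = representatives(corpus)
with_unmatched = representatives(corpus, get_non_similar=True)
```

Here `default_result` contains only report_final.pdf, standing in for both report variants, and `with_unmatched` additionally contains menu.txt. Raising the threshold to 0.9 splits the two reports into separate groups in this sketch, because they share only five of seven distinct words.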
