Indexing

Indexing is the result of discovery. GEODI creates a brief summary of the data and uses this summary to answer searches, reports, and other requests. GEODI does not need the original data unless you want to open it in a viewer.

Indexing Speed

  1. GEODI has an indexing speed setting. We suggest that you set it to maximum.

  2. The first indexing may take time, but continuous discovery does not. GEODI indexes only new or changed content. The default is rescans per day, but you may set it to any period.

  3. Other factors also affect indexing speed:

    1. The time depends on the CPU, memory, disk, and other resources of the server that GEODI runs on.

    2. The network and disk throughput of the data sources is also very important.

    3. Options like OCR or FacePro greatly affect the performance.

  4. Sampling is a great way to increase speed. While it gives partial results, it may be enough to provide an insight into the data.

  5. Use file content filtering to eliminate “unnecessary” content. Default rules eliminate many file types not related to GDPR or PCI/DSS.
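The idea of indexing only new or changed content can be illustrated with a minimal sketch. This is not GEODI's implementation; it assumes a simple change check based on file size and modification time, and the function names `snapshot` and `new_or_changed` are hypothetical.

```python
import os

def snapshot(root):
    """Map every file path under root to a (size, modification time) signature."""
    state = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            state[path] = (info.st_size, info.st_mtime_ns)
    return state

def new_or_changed(previous, current):
    """Return the paths that are new, or whose signature differs from the last scan."""
    return sorted(path for path, signature in current.items()
                  if previous.get(path) != signature)

# Two scans represented as plain dictionaries (hypothetical paths and values).
first_scan = {"a.pdf": (100, 1), "b.docx": (200, 1)}
second_scan = {"a.pdf": (100, 1), "b.docx": (250, 2), "c.pdf": (50, 2)}
print(new_or_changed(first_scan, second_scan))  # ['b.docx', 'c.pdf']
```

Only the paths returned by `new_or_changed` would need to be processed again, which is why a rescan is usually much faster than the first indexing.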

 

Options

Start with “Index all Content.” After that, you may use any of the other options. If you set up scheduled indexing and have periodic backups, you will not need the other options except for maintenance.

 

image-20240728-074018.png

 

 

Monitoring Indexing

Once you start indexing, the process can be monitored.

  1. The progress bar shows the approximate percentage of indexing. Note that the progress bar is not linear. GEODI cannot know how long the remaining documents will take to index, so the progress bar is only an estimate based on the indexing time of previous documents.

  2. This area shows the numbers. The graphics show trends in terms of documents/second.

  3. This shows whether any error or warning was generated. You can click to download the report, which resides in the %appdata%\logs folder.

  4. Error reports and detailed reports about the project.

image-20240921-092936.png
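Why an estimate based on previous documents is not linear can be shown with a small sketch. It assumes the simplest possible estimator, the mean indexing time of the documents processed so far; GEODI's actual formula is not documented here, and the function name is hypothetical.

```python
def estimate_remaining_seconds(durations, total_documents):
    """Estimate the remaining time from the mean duration of documents indexed so far."""
    done = len(durations)
    if done == 0:
        return None  # nothing indexed yet, so there is no basis for an estimate
    mean = sum(durations) / done
    return mean * (total_documents - done)

# Four quick documents out of ten: the estimate looks optimistic.
print(estimate_remaining_seconds([0.5, 0.5, 0.5, 0.5], 10))       # 3.0
# One slow document (for example a large scanned PDF) shifts the estimate.
print(estimate_remaining_seconds([0.5, 0.5, 0.5, 0.5, 8.0], 10))  # 10.0
```

Although one more document has been indexed in the second call, the estimated remaining time grows, which is the kind of jump the progress bar can show.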

 

Index Storage

  1. The index needs some storage. The size will be much smaller than the data, but it cannot be predicted exactly. You can generally assume 10% to 20% of the data size.

  2. Options like sampling mode or similarity indexing affect the index size.

  3. A backup space for the index should also be reserved for uninterrupted service.
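As a rough planning aid, the rule of thumb above can be turned into a small calculation. This sketch assumes that the backup needs as much space as the index itself (one full copy); that ratio is an assumption, not a documented GEODI figure, and the function name is hypothetical.

```python
def index_storage_range_gb(corpus_gb, low=0.10, high=0.20, with_backup=True):
    """Return the (minimum, maximum) storage to reserve for the index, in GB."""
    copies = 2 if with_backup else 1  # assumption: the backup is one full copy
    return (corpus_gb * low * copies, corpus_gb * high * copies)

low_gb, high_gb = index_storage_range_gb(2000)
print(f"Reserve roughly {low_gb:.0f} to {high_gb:.0f} GB for a 2000 GB corpus")
```

Options such as sampling mode or similarity indexing change the real index size, so the result is only a starting point.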

 

File Content Filtering

Any corpus contains various file types. Some may not be necessary for the project scope, and some may be too large for the network or the system.

GEODI comes with default rules that skip some file types and set some size limitations, collected from the best practices of many Discovery projects. This section documents the rules and how you can modify them.

These filters apply to all files, whether they come from a folder, an email attachment, or GDE, or are embedded in a database.

Filters (%100)

Explanation

IgnoreRules

Ignore rules contain file extensions, directory names, and some patterns.

The default list contains *.DLL, *.SYS, “program files” and similar.

Any file that matches the ignore rules is neither indexed nor logged.

Settings are in:

  • <geodi>\Settings\IgnoreFileTypes

  • <geodi>\Settings\IgnoreFolders

If you need to override defaults, please do it in %appdata%.

GDE has some additional ignore rules, which are documented on the related page.

KnownFiles

These are the file types for which GEODI has a reader, such as PDF and DOCX. The full list is in Supported Formats.

These files are processed as expected unless an IgnoreRule or a ProtectRule applies. An IgnoreRule makes the file type invisible. A ProtectRule may set a size limitation.

UnknownFiles

By default, unknown file types are ignored.

You may use advanced settings.

If you use “only name and date”, all unknown extensions will be indexed by name and date.

You may add any unwanted extension to the ignore list, but this requires running discovery all over again.

 

image-20240919-115040.png

 

ProtectRules

These rules protect the system and network against large files. Protect rules apply to both known and unknown files.

The content is grouped as local and far. There is no limitation for local content, which resides in local folders and network folders. Far means files from GDE, email attachments, and files from web pages.

By default, Far content is filtered as follows: any file larger than 100 MB, and any compressed file larger than 500 MB, is indexed by name only. You will know that these files exist but not their content.

Settings are in:

<geodi>\Settings\Engine\ResourceBalancing

If you need to override defaults, please do it in %appdata%
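The way the four filters interact can be sketched as a single decision function. This is only an illustration of the rules described above, not GEODI code. The pattern lists are small examples rather than the real defaults, treating ZIP and RAR as known types is an assumption, and so is applying both size limits to far content only. All names are hypothetical.

```python
from fnmatch import fnmatch

MB = 1024 * 1024

# Illustrative values only; the real defaults live in the GEODI settings folders.
IGNORE_PATTERNS = ["*.dll", "*.sys"]
IGNORE_FOLDERS = ["program files"]
COMPRESSED_EXTENSIONS = {".zip", ".rar"}
KNOWN_EXTENSIONS = {".pdf", ".docx"} | COMPRESSED_EXTENSIONS
FAR_FILE_LIMIT = 100 * MB
FAR_COMPRESSED_LIMIT = 500 * MB

def classify(path, size, is_far, unknown_as_name_only=False):
    """Return 'ignored', 'name_only', or 'full' for one file."""
    lowered = path.lower().replace("\\", "/")
    parts = lowered.split("/")
    name, folders = parts[-1], parts[:-1]

    # IgnoreRules: matching files are skipped entirely.
    if any(fnmatch(name, pattern) for pattern in IGNORE_PATTERNS):
        return "ignored"
    if any(folder in IGNORE_FOLDERS for folder in folders):
        return "ignored"

    extension = "." + name.rsplit(".", 1)[-1] if "." in name else ""

    # UnknownFiles: ignored by default, or indexed by name and date only.
    if extension not in KNOWN_EXTENSIONS:
        return "name_only" if unknown_as_name_only else "ignored"

    # ProtectRules: large far files are indexed by name only.
    if is_far:
        compressed = extension in COMPRESSED_EXTENSIONS
        limit = FAR_COMPRESSED_LIMIT if compressed else FAR_FILE_LIMIT
        if size > limit:
            return "name_only"
    return "full"

print(classify("mail/report.pdf", 200 * MB, is_far=True))   # name_only
print(classify("docs/report.pdf", 200 * MB, is_far=False))  # full
```

Files classified as `name_only` correspond to what the status:OnlyName query returns.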

 

Query of Indexed Content

GEODI provides a rich query language for querying content, duplicates, date ranges, and more. The following special queries help you find erroneous content and more.

  1. status:OnlyName → gives the content with only the name and date. These files come from UnknownFiles and ProtectRules.

  2. status:HasScanError → Files that could not be read because of a file error, encryption, or other reasons. They are marked with an ! at the name.

  3. status:IsContainer → Files within a folder or ZIP/RAR.

  4. status:IsCompletedIndex → Content successfully indexed.

  5. status:Crashed → In case of an index recovery after a system crash, GEODI will recover the index. This query shows the unsuccessful content. To avoid these, you should use an index backup.

  6. status:PartialRead → Content partially read because of ProtectRules.

  7. GEODICryptedContent → Encrypted content.

  8. GEODICryptedContentPart → Partly encrypted content.

Troubleshooting

The GEODI discovery engine is one of the fastest. Slow indexing may depend on the machine, settings, or environment.

  1. Check the indexing speed setting and make sure it is set high.

  2. Check engine errors. If a source is throwing too many errors, this may slow down indexing.

  3. Another task may be using too many resources.

  4. Too many recognizers may slow down indexing.

  5. A slow disk may slow down indexing. Consider dividing the index and putting some part on a fast disk, like an SSD.

  6. Use sampling mode if you need a quicker result.

High CPU usage should be expected for an engine like GEODI. GEODI's CPU usage never drives the machine into an unresponsive state. GEODI always leaves one core for other tasks.

  1. CPU usage may be temporary; wait to see if it drops.

  2. If CPU usage stays consistently high, try decreasing the indexing speed.

  3. OCR and FacePro consume more CPU resources. If you are using these options, decrease the indexing speed or wait.
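The principle of leaving one core for other tasks can be sketched in a few lines. This only illustrates the general idea; how GEODI actually schedules its work is not documented here, and the function name `worker_count` is hypothetical.

```python
import os

def worker_count(total_cores=None):
    """Use all cores except one, but never fewer than one worker."""
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    return max(1, cores - 1)

print(worker_count(8))  # 7
print(worker_count(1))  # 1
```

On a machine with very few cores, the single reserved core is a large share of the total, so lowering the indexing speed has a more visible effect there.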

GEODI compresses the index as much as possible. An index size of up to 20 percent of the corpus size should be expected. If it looks too high, check the following.

  1. The similarity index may be enabled, which increases the index size.

  2. Some files (logs, CSV, etc.) may contain too much information, so you may want to exclude them.

GEODI generates error logs during indexing. These logs mostly concern content and should be treated as warnings or information. If a real system error occurs, you will be informed about it.

Most log entries are about:

  1. Unreadable content

  2. Encrypted content

  3. Unreachable content (because of permissions)

 

FAQ

The GEODI content count includes all folders, as well as the files inside compressed files like ZIP and RAR, so it is typical for the count not to match the file count of the source.

Some file types may also be in the ignore lists.

Other than that, you can be sure that GEODI covers all content.