Indexing is the process in which GEODI crawls all data and runs discovery. GEODI creates a brief summary of the data and uses this summary to answer searches, reports, and all other requests. GEODI does not need the original data unless you want to open it in a viewer or start indexing.

Continuous Discovery

Once you index data sources, the process repeats automatically for new data (rows, files, emails, etc.). You do not need to intervene; just tell GEODI the recheck period. This is done in the Project Wizard.

Index Storage

The index needs some storage. Its size will be much smaller than the data, but it cannot be predicted exactly. You may generally assume 10% to 20% of the data size.
Options like sampling mode or similarity indexing affect the index size.
Backup space for the index should also be reserved for uninterrupted service.
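
The 10% to 20% rule of thumb above can be turned into a quick capacity estimate. The following is a minimal sketch, not part of GEODI: the function name is hypothetical, and it assumes the backup is one full copy of the index, which the text does not state.

```python
def estimate_index_storage(corpus_bytes, low_pct=10, high_pct=20, keep_backup=True):
    """Return a (low, high) storage estimate in bytes for the index.

    Assumption: a backup needs as much space as the index itself,
    so the estimate is doubled when keep_backup is True.
    """
    if corpus_bytes < 0:
        raise ValueError("corpus_bytes must be non-negative")
    copies = 2 if keep_backup else 1
    low = corpus_bytes * low_pct // 100 * copies
    high = corpus_bytes * high_pct // 100 * copies
    return low, high


if __name__ == "__main__":
    TB = 1000 ** 4
    low, high = estimate_index_storage(2 * TB)
    print(f"Reserve roughly {low / TB:.1f} to {high / TB:.1f} TB for index and backup")
```

Treat the result as a planning figure only; options such as similarity indexing can push the real size higher.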

Indexing Time

Indexing time depends heavily on the CPU, memory, disk, and other resources of the server that GEODI runs on.
The data throughput of the network and of the disks that hold the data sources is also very important.
Options like OCR or FacePro greatly affect performance.
The GEODI indexing engine multitasks and processes multiple documents simultaneously. The Indexing Speed parameter in the Settings tab controls this. If the server has no task other than GEODI, we suggest that you set it to maximum. GEODI will use the CPU as much as possible but leaves at least one core for user interaction.
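
To illustrate the idea of using as many cores as possible while keeping one free, here is a small sketch. It is not GEODI's implementation: the function name and the mapping of Indexing Speed to a fraction between 0 and 1 are assumptions made for this example.

```python
import os


def suggested_workers(speed_fraction=1.0, total_cores=None):
    """Hypothetical helper: number of parallel indexing workers.

    One core is kept free for user interaction. On a single-core
    machine this is not possible, so one worker is still returned.
    """
    if not 0.0 < speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be in (0, 1]")
    cores = total_cores if total_cores is not None else (os.cpu_count() or 1)
    usable = max(1, cores - 1)  # leave one core free
    return max(1, int(usable * speed_fraction))


if __name__ == "__main__":
    print("Workers at maximum speed:", suggested_workers(1.0))
    print("Workers at half speed:", suggested_workers(0.5))
```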

Options

The first run should use “Index all Content”. After that, you may use any of the other options. If you set up scheduled indexing and have periodic backups, you will not need the other options except for maintenance.

Monitoring Indexing

GEODI informs you about the progress of indexing. Note that the progress bar is not linear. GEODI cannot know how much time future documents will take to index, so the progress bar is only an estimate based on the indexing time of previous documents.
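
The reason such an estimate jumps around can be shown with a short sketch. This is an illustration only, assuming a simple average of past document times; the text does not say which formula GEODI actually uses, and the function name is hypothetical.

```python
def estimate_remaining_seconds(durations, remaining_docs):
    """Estimate remaining time from the average duration of documents indexed so far.

    Returns None when no document has been indexed yet.
    """
    if not durations:
        return None
    average = sum(durations) / len(durations)
    return average * remaining_docs


if __name__ == "__main__":
    # Three fast documents, seven still to go.
    print(estimate_remaining_seconds([1.0, 1.0, 1.0], 7))
    # One slow document (for example a large scanned PDF) raises the
    # estimate even though fewer documents remain.
    print(estimate_remaining_seconds([1.0, 1.0, 1.0, 9.0], 6))
```

A single slow document changes the average, which is why the bar can appear to stall or move backwards relative to expectations.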

Sampling

Sampling is possible for both structured and unstructured data. Each data source asks you for its sampling values. Sampling saves a great deal of time in discovery projects. We suggest that you always use sampling for DB discovery. For unstructured data, sampling is also a good starting point. Start in sampled mode and see what is in the data, whether there are any unnecessary file types, and whether there are any permission problems.
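
As a generic illustration of why sampling saves time, the sketch below picks a fixed fraction of items (rows or files) at random. The actual sampling values and method are defined per data source in GEODI and are not described here; the function name, the fraction parameter, and the seed are assumptions for this example.

```python
import random


def sample_items(items, fraction, seed=0):
    """Return a reproducible random sample holding about `fraction` of the items."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    items = list(items)
    if not items:
        return []
    count = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, count)


if __name__ == "__main__":
    rows = [f"row-{i}" for i in range(1000)]
    subset = sample_items(rows, 0.05)
    print(f"Inspecting {len(subset)} of {len(rows)} rows")
```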

File Content Filtering

Any corpus contains various file types. Some may not be necessary for the project scope, some may be large enough to disrupt the network, and some may be unwanted altogether.

GEODI comes with default rules that exclude some file types and apply some size limits, collected from the best practices of many discovery projects. The rules, and how you can modify them, are documented below.

Filters	Explanation
IgnoreRules	Ignore rules contain file extensions, directory names, and some patterns. The default list contains .DLL, .SYS, “Program Files”, and similar. Any file that matches an ignore rule is neither indexed nor logged. Settings are in: <geodi>\Settings\IgnoreFileTypes and <geodi>\Settings\IgnoreFolders. If you need to override the defaults, please do it in %appdata%.
KnownFiles	These are the files GEODI has a reader for, such as PDF, DOCX, etc. The full list is in Supported Formats. These files are processed as expected unless an IgnoreRule or ProtectRule applies. An IgnoreRule makes the file type invisible. A ProtectRule may impose a size limit.
UnknownFiles	By default, unknown file types are ignored. You may override this setting in the Project Wizard advanced settings. If you use “only name and date”, all unknown extensions will be indexed. You may add any unwanted extensions to the ignore list, but doing so requires running discovery all over again.
ProtectRules	These rules protect the system and the network against overly large files. Protect rules apply to known and unknown files. Content is grouped as local and far. There is no limit for local content, which resides in local folders and network folders. Far means files from GDE, e-mail attachments, and files from web pages. By default, far content is filtered as follows: any file larger than 100 MB, and compressed files larger than 500 MB, are indexed by name only. You will know that these files exist but not their content. Settings are in: <geodi>\Settings\Engine\ResourceBalancing. If you need to override the defaults, please do it in %appdata%.
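
The way the four filters in the table interact can be sketched as a single decision function. This is an illustration under stated assumptions, not GEODI's code or configuration format: the function name, the result labels, the short extension lists, and the treatment of compressed files as known types are all hypothetical. It also takes one reading of the default limits, namely 500 MB for compressed far files and 100 MB for every other far file.

```python
from pathlib import PureWindowsPath

MB = 1024 * 1024
IGNORED_EXTENSIONS = {".dll", ".sys"}             # examples named in the table
IGNORED_FOLDERS = {"program files"}               # example named in the table
COMPRESSED_EXTENSIONS = {".zip", ".rar", ".7z"}   # assumption for this sketch
KNOWN_EXTENSIONS = {".pdf", ".docx"} | COMPRESSED_EXTENSIONS  # tiny illustrative subset
FAR_LIMIT = 100 * MB
FAR_COMPRESSED_LIMIT = 500 * MB


def classify(path, size_bytes, is_far, unknown_as_name_only=False):
    """Return 'ignored', 'name_only', or 'full_content' for one file."""
    p = PureWindowsPath(path)
    ext = p.suffix.lower()
    folders = {part.lower() for part in p.parts[:-1]}

    # IgnoreRules win over everything else.
    if ext in IGNORED_EXTENSIONS or folders & IGNORED_FOLDERS:
        return "ignored"

    # ProtectRules apply only to far content, for known and unknown files.
    if is_far:
        limit = FAR_COMPRESSED_LIMIT if ext in COMPRESSED_EXTENSIONS else FAR_LIMIT
        if size_bytes > limit:
            return "name_only"

    # KnownFiles are read in full; UnknownFiles depend on the project setting.
    if ext in KNOWN_EXTENSIONS:
        return "full_content"
    return "name_only" if unknown_as_name_only else "ignored"


if __name__ == "__main__":
    print(classify(r"D:\docs\report.pdf", 200 * MB, is_far=False))
    print(classify("mail/report.pdf", 200 * MB, is_far=True))
    print(classify("mail/archive.zip", 200 * MB, is_far=True))
```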

Troubleshooting

The GEODI discovery engine is one of the fastest discovery engines. Slow indexing may be caused by the machine, the settings, or the environment.

Check the indexing speed and make sure it is set high.
Check the engine errors. If a source is throwing too many errors, this may slow down indexing.
Another task may be using too many resources.
Too many recognizers may slow down indexing.
A slow disk may slow down indexing. Consider dividing the index and putting part of it on a fast disk, such as an SSD.
Use sampling mode if you need a quicker result.

High CPU usage should be expected for an engine like GEODI. GEODI's CPU usage never makes the machine unresponsive, because GEODI always leaves one core for other tasks.

High CPU usage may be temporary; wait to see if it drops.
If CPU usage stays consistently high, try decreasing the indexing speed.
OCR and FacePro need CPU. If you are using these options, decrease the indexing speed or wait.

GEODI compresses the index as much as possible. An index size of up to 20% of the corpus size should be expected. If it looks too high to you, check the following.

The similarity index may be enabled. This index needs disk space.
Some files may contain too much information (logs, CSV, etc.); you may exclude them.

GEODI generates error logs during indexing. These logs are mostly about content and should be treated as warnings or information. Real system errors may also occur; you will be informed about them.

Most of the content-related entries are about:

Unreadable content
Encrypted content
Unreachable content (because of permissions)