Indexing is the process in which GEODI crawls all data and produces the results of discovery. GEODI creates a brief summary of the data and uses this summary to answer searches, reports, and all other requests. GEODI will not need the original data unless you want to open it in a viewer or start indexing again.
Info |
---|
Continuous Discovery: Once you index data sources, the process repeats automatically for new data (rows, files, emails, etc.). You do not need to intervene; just tell GEODI the recheck period. This is done in the Project Wizard. |
Info |
---|
Index Storage
|
Info |
---|
Indexing Time: The time depends heavily on the CPU, memory, disk, and other resources of the server. See Indexing Speed.
|
|
Info |
---|
Options: Start with “Index all Content.” After that, you may use any of the other options. If you set scheduled indexing and have periodic backups, you will not need the other options except for maintenance. |
...
Info |
---|
Monitoring Indexing: Once you start indexing, GEODI informs you about the progress, and the process can be monitored.
|
Sampling
Sampling is possible for both structured and unstructured data. Each data source asks you for the sampling values. Sampling saves a great deal of time in discovery projects. We suggest you always use sampling for DB discovery. For unstructured data, sampling is also a good starting point. Start in sampled mode and see what is in the data: whether there are unnecessary file types or permission problems.
|
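The idea of a sampled first pass can be illustrated outside GEODI with a short Python sketch. It assumes you have a plain list of file paths; the function names `sample_items` and `summarize_extensions` are hypothetical and are not part of GEODI, which asks for sampling values per data source instead.

```python
import os
import random


def sample_items(items, sample_size, seed=None):
    """Return up to sample_size items chosen at random, without replacement."""
    items = list(items)
    if sample_size >= len(items):
        return items
    rng = random.Random(seed)
    return rng.sample(items, sample_size)


def summarize_extensions(paths):
    """Count file extensions in a sample to spot unnecessary types early."""
    counts = {}
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        counts[ext] = counts.get(ext, 0) + 1
    return counts
```

Running `summarize_extensions` on a small sample gives a quick picture of which file types dominate before committing to a full run.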
File Content Filtering
Any corpus contains various file types. Some may not be necessary for the project scope, and some may be large enough to disrupt the network or the system.
GEODI comes ready with default rules to avoid some file types and place some size limitations, collected from the best practices of many Discovery projects. Here, we documented the rules and how you can modify them.
These filters apply to all files, whether they come from a folder, from an email attachment, from GDE, or embedded into a database.
Filters | Explanation |
---|---|
IgnoreRules | Ignore rules contain file extensions, directory names, and some patterns. The default list contains *.DLL, *.SYS, “program files” and similar. Any file that matches the ignore rules is neither indexed nor logged. Settings are in:
|
KnownFiles | These are the files GEODI has a reader for, such as PDF, DOCX, etc. The full list is in Supported Formats. These files are processed as expected unless an IgnoreRule or ProtectRule applies. An IgnoreRule makes the file type invisible. A ProtectRule may set a size limitation. |
UnknownFiles | By default, unknown file types are ignored. You may override this setting in the advanced settings of the Project Wizard. If you use “only name and date,” then all unknown extensions will be indexed. You may add any unwanted extensions to the ignore list, but these actions require running discovery all over again. |
ProtectRules | These rules protect the system and network against files that are too large. Protect rules apply to known and unknown files. Content is grouped as local and far. There is no limitation for local content, which resides in local folders and network folders. Far means files from GDE, email attachments, and files from web pages. By default, far content is filtered so that any file greater than 100 MB, and any compressed file greater than 500 MB, is indexed as name only. You will know these files exist but not their content. Settings are in: <geodi>\Settings\Engine\ResourceBalancing
|
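The way the four filters interact can be sketched as a small decision function. This is only an illustration of the rules described in the table, not GEODI's implementation: the pattern list, the extension sets, the function name `classify`, and the assumption that archives count as known files are all hypothetical. The real values live in GEODI's settings.

```python
import fnmatch
import os

MB = 1024 * 1024

# Hypothetical values for illustration only.
IGNORE_PATTERNS = ["*.dll", "*.sys"]
KNOWN_EXTENSIONS = {".pdf", ".docx"}
COMPRESSED_EXTENSIONS = {".zip", ".rar"}
FAR_LIMIT = 100 * MB             # default limit for far content
FAR_COMPRESSED_LIMIT = 500 * MB  # default limit for far compressed files


def classify(name, size_bytes, is_far, index_unknown_by_name=False):
    """Return 'ignored', 'name_only', or 'full' for a single file."""
    lower = name.lower()
    # IgnoreRules win first: matching files are neither indexed nor logged.
    if any(fnmatch.fnmatchcase(lower, pat) for pat in IGNORE_PATTERNS):
        return "ignored"
    ext = os.path.splitext(lower)[1]
    compressed = ext in COMPRESSED_EXTENSIONS
    # Assumption: archives are treated as known files here.
    if ext not in KNOWN_EXTENSIONS and not compressed:
        # UnknownFiles: ignored unless "only name and date" is enabled.
        return "name_only" if index_unknown_by_name else "ignored"
    # ProtectRules: size limits apply to far content only.
    if is_far:
        limit = FAR_COMPRESSED_LIMIT if compressed else FAR_LIMIT
        if size_bytes > limit:
            return "name_only"
    return "full"
```

For example, `classify("report.pdf", 200 * MB, is_far=True)` yields `"name_only"`, while the same file in a local folder yields `"full"`.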
Query of Indexed Content
GEODI provides a rich query language to query content, duplicates, date ranges, and more. Here are special queries to help you find erroneous content and more.
status:OnlyName → gives the content with only the name and date. These files come from UnknownFiles and ProtectRules.
status:HasScanError → Files that could not be read because of a file error, encryption, or other reasons. They are marked with an ! next to the name.
status:IsContainer → Files within a folder or ZIP/RAR.
status:IsCompletedIndex → Content successfully indexed.
status:Crashed → In case of an index recovery after a system crash, GEODI will recover the index. This query shows the unsuccessful content. To avoid these, you should use an index backup.
status:PartialRead → Content partially read because of ProtectRules.
GEODICryptedContent → Encrypted content.
GEODICryptedContentPart → Partly encrypted content.
Troubleshooting
The GEODI discovery engine is one of the fastest discovery engines. Slow indexing may be caused by the machine, the settings, or the environment.
High CPU usage should be expected for an engine like GEODI. GEODI's CPU usage never drives the machine into an unresponsive state; GEODI always leaves one core free for other tasks.
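The "leave one core free" policy is a common pattern for CPU-heavy workers. The following Python sketch shows the general idea; it is an assumption about how such a policy can be expressed, not GEODI's actual code, and `worker_count` is a hypothetical name.

```python
import os


def worker_count(total_cores=None):
    """Use every core except one, but never fewer than one worker."""
    if total_cores is None:
        # os.cpu_count() may return None when the count is unknown.
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - 1)
```

On an 8-core server this gives 7 workers; on a single-core machine it still gives 1, so work always proceeds.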
GEODI compresses the index as much as possible. An index size of up to 20 percent of the corpus size should be expected. If it looks too high to you, you may try the following.
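As a quick sanity check, the 20 percent guideline can be turned into simple arithmetic. The helper names below are hypothetical; sizes may be in any unit as long as the corpus and the index use the same one.

```python
def expected_index_upper_bound(corpus_size, percent=20):
    """Upper bound of the expected index size, in the same unit as corpus_size."""
    return corpus_size * percent / 100


def index_looks_too_large(index_size, corpus_size, percent=20):
    """True when the index exceeds the expected share of the corpus."""
    return index_size > expected_index_upper_bound(corpus_size, percent)
```

For a 500 GB corpus the bound is 100 GB, so a 150 GB index would be worth investigating while an 80 GB index is within expectations.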
GEODI generates error logs during indexing. These logs mostly concern content and should be considered warnings or information. There may be real system errors; you will be informed about them. Most of them are
FAQ
The GEODI content count includes all folders and files in compressed files like ZIP and RAR, so it's typical for the count not to match. Some file types may be in ignore lists. Other than that, rest assured that GEODI covers everything.
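To see why the numbers differ, the Python sketch below compares a plain file count with a count that also includes folders and the entries inside ZIP archives. It is an illustration under stated assumptions: `count_content` is a hypothetical helper, only `.zip` files are opened (the standard library has no RAR reader), and GEODI's exact counting rules may differ.

```python
import os
import zipfile


def count_content(root):
    """Return (files_on_disk, items_including_folders_and_zip_entries)."""
    files_on_disk = 0
    total_items = 0
    for dirpath, dirnames, filenames in os.walk(root):
        total_items += len(dirnames)  # folders count as content items
        for name in filenames:
            files_on_disk += 1
            total_items += 1
            path = os.path.join(dirpath, name)
            # Only .zip archives are expanded in this sketch.
            if name.lower().endswith(".zip") and zipfile.is_zipfile(path):
                with zipfile.ZipFile(path) as zf:
                    total_items += len(zf.namelist())
    return files_on_disk, total_items
```

A folder with two documents, one subfolder, and one archive holding three files reports 3 files on disk but 7 content items under these assumptions.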