Indexing
Indexing is the result of discovery. GEODI creates a brief summary of the data and uses this summary to answer searches, reports, and all other requests. GEODI does not need the original data unless you want to open it in a viewer.
Indexing Speed
GEODI has an indexing speed setting. We suggest that you set it to maximum. Please check the details on the Settings page.
Other factors also affect indexing speed:
The time depends on the CPU, memory, disk, and other resources of the server that GEODI runs on.
The data throughput of the network and disks of the data sources is also very important.
Options like OCR or FacePro greatly affect performance.
Options
The beginning should be “Index all Content”. After that, you may use all other options. If you set scheduled indexing and have periodic backups, you will not need the other options except for maintenance.
Monitoring Indexing
Once you start indexing, the process can be monitored.
The progress bar shows the approximate percentage of indexing completed. Please note that the progress bar is not linear. GEODI cannot know how much time it will require to index future documents, so the progress bar is only an estimate based on the indexing time of previous documents.
This area shows the numbers. The graphics show trends in terms of documents/second.
This shows whether any error or warning was generated. You can click and download the report, which resides in the %appdata%\logs folder.
Error reports and detailed reports about the project.
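The reason the progress bar is not linear can be illustrated with a minimal sketch. This is not GEODI's actual algorithm; the function name `estimate_remaining_seconds` and the simple averaging approach are assumptions made for illustration. It estimates the remaining time from the average indexing time of the documents processed so far, so a single slow document (for example one that needs OCR) shifts the estimate noticeably.

```python
def estimate_remaining_seconds(seconds_per_indexed_doc, total_docs):
    """Estimate remaining time from the average time of documents indexed so far.

    Hypothetical helper for illustration only; GEODI's real estimate may differ.
    """
    done = len(seconds_per_indexed_doc)
    if done == 0:
        return None  # nothing indexed yet, so no basis for an estimate
    average = sum(seconds_per_indexed_doc) / done
    remaining_docs = max(total_docs - done, 0)
    return average * remaining_docs


# Four fast documents out of ten: the estimate looks short.
fast_only = [1.0, 1.0, 1.0, 1.0]
print(estimate_remaining_seconds(fast_only, 10))  # 6.0

# One slow document arrives and the estimate jumps, although more work is done.
with_slow_doc = fast_only + [11.0]
print(estimate_remaining_seconds(with_slow_doc, 10))  # 15.0
```

The estimate grows from 6 to 15 seconds even though one more document has been indexed, which is why the percentage shown should be read as an approximation.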
Sampling
Sampling is possible for both structured and unstructured data. Each data source asks you for the sampling values. Sampling saves a great deal of time in discovery projects. We suggest you always use sampling for DB discovery. For unstructured data, sampling is also a good starting point. Start with sampled mode and see what is in the data, whether there are any unnecessary types, and whether there are any permission problems.
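The idea of looking at a sample first to see which file types are present can be sketched outside GEODI as follows. This is only an illustration: GEODI's own sampling is configured per data source, and the function name `sample_extension_counts`, the random selection, and the example paths are assumptions, not part of the product.

```python
import random
from collections import Counter
from pathlib import PurePath


def sample_extension_counts(paths, sample_size, seed=0):
    """Count file extensions in a random sample of paths (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same sample
    paths = list(paths)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    return Counter(PurePath(p).suffix.lower() or "(none)" for p in sample)


# Example paths are made up for illustration.
paths = ["docs/a.pdf", "docs/b.PDF", "bin/c.dll", "notes/readme"]
for extension, count in sample_extension_counts(paths, 3).most_common():
    print(extension, count)
```

Extensions that appear in the sample but are outside the project scope are candidates for the ignore list before a full run.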
Filecontent Filtering
Any corpus contains various file types. Some may not be necessary for the project scope, and some may be too large for the network or the system.
GEODI comes with default rules that skip some file types and apply some size limits, collected from the best practices of many Discovery projects. Here, we document the rules and how you can modify them.
These filters apply to all files, whether they come from a folder, from an email attachment, from GDE, or embedded into a database.
Filters | Explanation |
---|---|
IgnoreRules | Ignore rules contain file extensions, directory names, and some patterns. The default list contains *.DLL, *.SYS, “program files”, and similar. Any file that matches an ignore rule is neither indexed nor logged. Settings are in:
If you need to override the defaults, please do it in %appdata%. GDE has some additional ignore rules documented in the related page. |
KnownFiles | These are the files GEODI has a reader for, such as PDF, DOCX, etc. The full list is in Supported Formats. These files are processed as expected unless an IgnoreRule or ProtectRule applies. An IgnoreRule makes the file type invisible. A ProtectRule may set a size limitation. |
UnknownFiles | By default, unknown file types are ignored. You may override this setting in the Project Wizard advanced settings. If you use “only name and date”, all unknown extensions will be indexed. You may add any unwanted extensions to the ignore list, but this requires running discovery all over again. |
ProtectRules | These rules protect the system and network against files that are too large. Protect rules apply to known and unknown files. Content is grouped as local and far. There is no limitation for local content, which resides in local folders and network folders. Far means files from GDE, email attachments, and files from web pages. By default, far content is filtered: any file greater than 100 MB and compressed files greater than 500 MB are indexed as name only. You will know these files exist but not their content. Settings are in: <geodi>\Settings\Engine\ResourceBalancing If you need to override the defaults, please do it in %appdata% |
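The way the four filters in the table interact can be sketched as a single decision function. This is a simplified illustration under stated assumptions, not GEODI's implementation: the function `classify`, the constant names, the short extension lists, and the reading that compressed far files get the 500 MB limit instead of the 100 MB one are all assumptions. Only the *.DLL and *.SYS patterns and the 100 MB and 500 MB defaults come from the table.

```python
import fnmatch
import os

MB = 1024 * 1024

# Defaults named in the table; the real lists are longer.
IGNORE_PATTERNS = ["*.dll", "*.sys"]
FAR_LIMIT_BYTES = 100 * MB
FAR_COMPRESSED_LIMIT_BYTES = 500 * MB

# Assumed example sets, for illustration only.
KNOWN_EXTENSIONS = {".pdf", ".docx", ".zip", ".rar"}
COMPRESSED_EXTENSIONS = {".zip", ".rar"}


def classify(name, size_bytes, is_far):
    """Return 'ignored', 'name_only', or 'full' for one file (hypothetical)."""
    lower = name.lower()
    # IgnoreRules: matching files are not indexed at all.
    if any(fnmatch.fnmatchcase(lower, pattern) for pattern in IGNORE_PATTERNS):
        return "ignored"
    # UnknownFiles: ignored by default.
    extension = os.path.splitext(lower)[1]
    if extension not in KNOWN_EXTENSIONS:
        return "ignored"
    # ProtectRules: size limits apply to far content only.
    if is_far:
        if extension in COMPRESSED_EXTENSIONS:
            limit = FAR_COMPRESSED_LIMIT_BYTES
        else:
            limit = FAR_LIMIT_BYTES
        if size_bytes > limit:
            return "name_only"
    return "full"


print(classify("driver.SYS", 10, is_far=False))         # ignored
print(classify("report.pdf", 200 * MB, is_far=False))   # full
print(classify("report.pdf", 200 * MB, is_far=True))    # name_only
print(classify("archive.zip", 600 * MB, is_far=True))   # name_only
```

The order of the checks mirrors the table: ignore rules first, then the known/unknown distinction, then the size limits for far content.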
Querying Content/File Status
There are special queries to check how content was indexed.
status:OnlyName → Returns content indexed with only name and date. These files come from UnknownFiles and ProtectRules.
status:HasScanError → Files that could not be read because of a file error, encryption, or similar. These files are marked with an ! next to the name.
status:IsContainer → Files within a folder or ZIP/RAR.
status:IsCompletedIndex → Content successfully indexed.
status:Crashed → In case of a system crash, GEODI will recover the index. This query shows the content that was not recovered successfully. To avoid this, you should use index backup.
Status:PartialRead → Content partially read because of ProtectRules.
GEODICryptedContent → Encrypted content.
GEODICryptedContentPart → Partly encrypted content.
Troubleshooting
FAQ