Source: Webpage

GEODI can also use web pages and RSS news sources as content. Web pages can have very variable structures, and the Web Page data source offers many options to support this variability.

A few important caveats apply to web page indexing. Crawling can generate heavy internet traffic, and some sites may interpret it as a DDoS attack and ban your IP address. In addition, the terms of use of some sites may prohibit processing their content outside the site. Our sole responsibility is to provide the software; DECE cannot be held responsible for any consequences.

Conditions for connection

  1. Access to the web page

  2. Any credentials, tokens, or other authentication details required by pages that need user verification


You can provide a single address or multiple addresses. Domain restriction settings will work independently for each address.

With level=0, only the given page is indexed. The level must be large enough to reach all the pages you want indexed. For sites with paging, a value of 1000 or more may be needed.
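The level behaves like a crawl depth: each link followed from the start page increases the level by one. As a rough sketch (a hypothetical breadth-first crawler over an in-memory link graph, not GEODI's actual implementation):

```python
from collections import deque

def crawl(start_url, get_links, max_level):
    """Breadth-first crawl: index pages up to max_level link-hops from start_url."""
    indexed = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= max_level:
            continue  # with max_level=0, only the start page is indexed
        for link in get_links(url):
            if link not in indexed:
                indexed.add(link)
                queue.append((link, level + 1))
    return indexed

# Hypothetical link graph standing in for real pages
links = {
    "https://sample.com": ["https://sample.com/a", "https://sample.com/b"],
    "https://sample.com/a": ["https://sample.com/a/deep"],
    "https://sample.com/b": [],
    "https://sample.com/a/deep": [],
}
get = lambda u: links.get(u, [])
```

With this graph, level 0 yields only the start page, level 1 adds its two direct links, and level 2 reaches all four pages.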

This option attempts to detect the page's actual content automatically; we recommend keeping it checked in most cases.

Many web pages use URL parameters. By default, GEODI creates content for each unique URL. However, a parameter may not change the page's content at all; in such cases, you can ignore those parameters for better indexing results.

For example, the following two URLs may serve identical content if the backtomail parameter does not change the page:

https://sample.com

https://sample.com?backtomail=true
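Ignoring a parameter effectively canonicalizes URLs before deduplication. A minimal sketch of that idea (the function name and parameter set are illustrative, not GEODI's API):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, ignored_params):
    """Drop the listed query parameters so URLs that differ only in them
    dedupe to a single document."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in ignored_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```

With backtomail ignored, both example URLs above canonicalize to https://sample.com and are indexed once.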


Some websites contain social media links, advertising pages, or similar pages that you don't want in the index.

  1. You may list as many patterns as you need to exclude. Separate entries with “;”.

  2. Wildcards are allowed. For example, *adds* ; *last.html skips any URL that contains “adds” or ends with “last.html”.
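Exclusion patterns of this kind behave like shell-style wildcard matching. A sketch, assuming each pattern is matched against the whole URL (illustrative only, not GEODI's matcher):

```python
from fnmatch import fnmatchcase

def is_excluded(url, pattern_list):
    """pattern_list is a ';'-separated string of wildcard patterns,
    e.g. '*adds* ; *last.html'. A URL matching any pattern is skipped."""
    patterns = [p.strip() for p in pattern_list.split(";") if p.strip()]
    return any(fnmatchcase(url, p) for p in patterns)
```

Here `*adds*` excludes any URL containing "adds" anywhere, while `*last.html` excludes only URLs that end with "last.html".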

GEODI applies rules on a per-site basis, and some rules come pre-configured. For example, on Wikipedia pages only the content-bearing "info box" is processed. Pagination controls found on some web pages (such as links labeled 1, 2, 3, ..., 10 that select pages) are handled automatically.

Some pages are generated with JavaScript, so the raw HTML may not contain the necessary information. In these cases, check the "Render like Browsers" option. Indexing will be slower but will yield the desired results. An alternative web browser module must be installed for this option to work.

Page names are formed using the fallback order og:title → title → page URL.
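That fallback chain can be sketched with a small stdlib HTML parser (an illustration of the naming order; GEODI's internal parsing may differ):

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects og:title and <title> so the fallback og:title -> title -> URL can be applied."""
    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "og:title":
            self.og_title = a.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def page_name(html, url):
    """Return og:title if present, else <title>, else the page URL."""
    p = TitleExtractor()
    p.feed(html)
    return p.og_title or (p.title.strip() if p.title else None) or url
```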

You can query pages using the doc: prefix.

In projects where web pages are crawled, the indexed pages can be grouped by setting EnableSiteGroup to true under Settings in the Shift+Ctrl menu.

Re-crawling is not required for the grouping to take effect.

In DLV, sites are grouped based on their domain.

A "for more results" link opens all pages.

Web Page Indexing sends frequent requests to websites. To prevent this from being perceived as an attack or causing slowdowns, you can use a Proxy List.

You can modify the list of proxies in the GEODI\Settings\ProxyList directory.

To set request limits on a web page, you can configure the following settings under Web Connection Source Advanced Settings:

  • TryProxyCount: When set to a value greater than 0, GEODI retries the connection through other proxy addresses, up to the specified count, when an error occurs. This can help with servers that implement client attack control. However, DomainLockAndSleepMillisecond is recommended instead of this setting.

  • DomainLockAndSleepMillisecond: When set to a value greater than 0, it ensures that only one request is made at a time and enforces a delay (in milliseconds) between consecutive requests.
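The effect of DomainLockAndSleepMillisecond can be sketched as a per-domain lock plus a minimum delay between consecutive requests (an illustration of the described behavior, not GEODI's code):

```python
import threading
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Allows one request at a time per domain and enforces a minimum
    delay (in milliseconds) between consecutive requests to that domain."""
    def __init__(self, sleep_ms):
        self.sleep_s = sleep_ms / 1000.0
        self.locks = {}
        self.last = {}
        self.guard = threading.Lock()  # protects the per-domain lock table

    def _lock_for(self, domain):
        with self.guard:
            return self.locks.setdefault(domain, threading.Lock())

    def wait(self, url):
        """Block until it is safe to send the next request to url's domain."""
        domain = urlsplit(url).netloc
        with self._lock_for(domain):
            elapsed = time.monotonic() - self.last.get(domain, 0.0)
            if elapsed < self.sleep_s:
                time.sleep(self.sleep_s - elapsed)
            self.last[domain] = time.monotonic()
```

Calling `wait()` before every request spaces requests to the same domain by at least the configured delay, which is what keeps the crawler from looking like an attack.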

