GEODI can also use web pages and RSS news sources as content. Web pages can have very variable structures. To support this variability, the Web Page data source offers many options.
Please note a few important points about web page indexing. Indexing web pages can require intensive internet usage. Some sites may interpret such traffic as hacking or a DDoS attack and ban your IP address. In addition, the copyright or legal terms of the pages you index may not allow their content to be indexed, processed, or used outside the site. In these and any other cases, you are fully responsible. DECE only provides the software as a technical solution and cannot be held responsible for any consequences.
You can provide a single address or multiple addresses. Domain restriction settings work independently for each address.
Some websites may have social media links, advertising pages or similar pages that you are not interested in as content. You can exclude as many pages as you wish from the crawl results. Page addresses must be separated by ";". You can generalize by using "*" when defining addresses.
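The exclusion rule above can be illustrated with a minimal sketch. It assumes that "*" stands for any run of characters and that a pattern must match the whole address; GEODI's actual matching rules may differ. The function names and the example addresses are hypothetical.

```python
import re

def compile_exclusions(setting):
    """Split a ';'-separated exclusion setting and compile each pattern.

    Assumption: '*' matches any run of characters; every other character
    is matched literally.
    """
    patterns = []
    for raw in setting.split(";"):
        raw = raw.strip()
        if raw:
            regex = ".*".join(re.escape(part) for part in raw.split("*"))
            patterns.append(re.compile(regex))
    return patterns

def is_excluded(url, patterns):
    """Return True if the whole URL matches at least one exclusion pattern."""
    return any(p.fullmatch(url) is not None for p in patterns)

# Hypothetical exclusion setting, for illustration only.
patterns = compile_exclusions(
    "https://example.com/ads/*; *facebook.com*; https://example.com/login"
)
```

With this setting, any address under `https://example.com/ads/` and any address containing `facebook.com` would be skipped, while `https://example.com/news/1` would still be crawled.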
With level=0, only the given page is indexed. The level must be large enough to reach all pages. For sites with paging, the level may need to be 1000 or more.
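To see why paging can require a large level, the following sketch models the level as a link depth in a breadth-first crawl. This is an assumption about how the level behaves, not GEODI's implementation; the `crawl` function and the small link graph are hypothetical.

```python
from collections import deque

def crawl(start, links, max_level):
    """Breadth-first traversal; returns {page: level} for reachable pages.

    Links are followed only from pages whose level is below max_level,
    so max_level=0 yields just the start page.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        level = seen[page]
        if level >= max_level:
            continue  # do not follow links beyond the limit
        for target in links.get(page, []):
            if target not in seen:
                seen[target] = level + 1
                queue.append(target)
    return seen

# Hypothetical site: each list page links to the next list page and one article.
links = {
    "/news": ["/news?page=2", "/a1"],
    "/news?page=2": ["/news?page=3", "/a2"],
    "/news?page=3": ["/a3"],
}

only_start = crawl("/news", links, 0)
everything = crawl("/news", links, 3)
```

In this model each "next page" link adds one level, so a list with many pages needs a level at least as large as the number of pages.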
Many web pages use URL parameters. By default, GEODI creates content for each unique URL. In some cases, however, a parameter does not change the content of the page. You can ignore such parameters to get a better index result. For example, if two addresses that differ only in the "showComments" parameter open the same page, add "showComments" to the parameters to be ignored. GEODI then treats both addresses as the same page.
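The effect of ignoring a parameter can be sketched as URL normalization: addresses that differ only in an ignored parameter reduce to the same key. This is an illustrative sketch using Python's standard library, not GEODI's code; `canonical_url` and the example addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonical_url(url, ignored):
    """Return the URL with every query parameter named in `ignored` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

ignored = {"showComments"}
# Hypothetical addresses, for illustration only.
with_comments = canonical_url("https://example.com/post?id=7&showComments=1", ignored)
without_comments = canonical_url("https://example.com/post?id=7", ignored)
```

Both calls produce the same address, so a crawler that keys its content on the normalized address would index the page once.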
How page names in query results are formed: if the page's HTML source contains og:title, it is used; otherwise the content of title is used.
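The naming rule can be sketched with Python's standard HTML parser. This is only an illustration of the fallback order (og:title, then title, then the page URL as a last resort), not GEODI's implementation; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TitleFinder(HTMLParser):
    """Collects the first og:title meta value and the <title> text."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:title":
            if self.og_title is None:
                self.og_title = attributes.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

def page_name(html, url):
    """Pick og:title if present, else <title>, else fall back to the URL."""
    finder = TitleFinder()
    finder.feed(html)
    for candidate in (finder.og_title, finder.title):
        if candidate and candidate.strip():
            return candidate.strip()
    return url
```

For a page that declares both values, `page_name` returns the og:title content; a page with neither is named by its address.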
GEODI applies rules on a per-page basis. Some rules are built in. For example, on Wikipedia pages, only the "box" containing the content is processed. Paginators (links that appear as 1, 2, 3, ... 10 and identify pages) on some web pages are processed automatically.
Some websites may have social media links, advertising pages, or similar pages that you don't want in the index.
If the page is not indexed as it appears: normally, the page content is extracted from the HTML. However, some web page content is created dynamically. In such cases, check this option. Indexing will be slower, but the results will be better. The alternative web browser module must be installed for the option to work.
Page names are formed using og:title → title → page URL. You can query pages using