GEODI can also use web pages and RSS news sources as content. Because web pages vary widely in structure, the Web Page data source offers many options to handle them.

Indexing web pages can generate heavy internet traffic. Some sites may interpret such traffic as "hacking" and ban you. In addition, the copyright terms of the pages you index may not legally permit indexing. In these and similar cases you are fully responsible; DECE only offers a technical solution.

 Addresses

You can provide a single address or multiple addresses. Domain restriction settings will work independently for each address.

 Excluding Unwanted Pages

Some websites may have social media links, advertising pages or similar pages that you are not interested in as content. You can exclude as many pages as you wish from the crawl results. Page addresses must be separated by ";". You can generalize by using "*" when defining addresses.

  • For example: If you do not want to crawl the page http://www.dece.com.tr/geodi, you can enter *geodi* or *geodi.html in the omitted pages field.
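
The wildcard exclusion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not GEODI's implementation: the function name `is_excluded` is hypothetical, and the exact matching rules GEODI applies (for example, case sensitivity) are assumptions here.

```python
from fnmatch import fnmatchcase


def is_excluded(url, patterns):
    """Return True if url matches any of the ';'-separated wildcard patterns.

    Assumption: '*' matches any run of characters, including '/'.
    Note that fnmatch also treats '?' and '[' in a pattern as wildcards.
    """
    for pattern in patterns.split(";"):
        pattern = pattern.strip()
        if pattern and fnmatchcase(url, pattern):
            return True
    return False


# Hypothetical content of the omitted pages field.
omitted = "*geodi*;*facebook.com*"

print(is_excluded("http://www.dece.com.tr/geodi", omitted))
print(is_excluded("http://www.dece.com.tr/contact", omitted))
```

The first call matches the `*geodi*` pattern; the second matches no pattern, so that page would stay in the crawl.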

 Processing Web Page Parameters

Many web pages use parameters. GEODI treats different parameterized versions of the same page as different pages. However, there are many cases where parameters do not change the content, and in these cases you can ignore them.

For example:

https://ornek.com

https://ornek.com?ShowComments=true

If both addresses open the same page, add "ShowComments" to the parameters to be ignored. GEODI then treats both addresses as the same page.
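
As a rough sketch of what ignoring a parameter means, the Python snippet below reduces both example addresses to the same canonical address. The function name `canonical_url` is hypothetical, and comparing parameter names case-insensitively is an assumption made for this illustration; it is not a statement about how GEODI compares them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, ignored_params):
    """Drop ignored query parameters so equivalent pages share one address."""
    # Assumption: parameter names are compared case-insensitively.
    ignored = {name.lower() for name in ignored_params}
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


ignored = {"ShowComments"}
a = canonical_url("https://ornek.com", ignored)
b = canonical_url("https://ornek.com?ShowComments=true", ignored)
print(a == b)
```

Parameters that are not in the ignore list are kept, so addresses that differ in those parameters still count as different pages.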

 Page Browsing Rules

GEODI applies rules on a per-web-page basis. Some rules are built in. For example, on Wikipedia pages only the "box" containing the content is processed. Paginators (links that appear as 1, 2, 3, ... 10 and lead to further pages) on some web pages are processed automatically.

 How Page Names in Query Results Are Formed

If the page's HTML source contains og:title, its value is used; otherwise the content of the title element is used.

If this information is not available, the address of the page as it appears in the browser will be used.
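
The fallback order described above (og:title, then title, then the page address) can be sketched with Python's standard HTML parser. The names `page_name` and `_TitleParser` are hypothetical, and treating an empty value as "not available" is an assumption of this sketch.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the first og:title meta value and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("property") == "og:title":
            if self.og_title is None:
                self.og_title = attr.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def page_name(html, url):
    """Pick og:title, else <title>, else the page address."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    og_title = (parser.og_title or "").strip()
    if og_title:
        return og_title
    title = "".join(parser.title_parts).strip()
    return title or url


html = '<html><head><meta property="og:title" content="GEODI"><title>Home</title></head></html>'
print(page_name(html, "https://ornek.com"))
```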

 Level

With level=0, only the given page is indexed. The level must be large enough to reach all the pages you want indexed. For sites with paging, the level value may need to be 1000 or more.
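
To see why paging can require a large level, the sketch below runs a breadth-first crawl over a small in-memory link map. It assumes that the level counts link hops from the given address, which is consistent with level=0 indexing only that page; the function `crawl` and the `site` map are hypothetical and do not perform real network requests.

```python
from collections import deque


def crawl(start, links, max_level):
    """Return the pages reachable from start within max_level link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level == max_level:
            continue  # do not follow links beyond the allowed level
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, level + 1))
    return order


# Hypothetical site: each list page links to the next list page and one item.
site = {
    "/list?page=1": ["/list?page=2", "/item/a"],
    "/list?page=2": ["/list?page=3", "/item/b"],
    "/list?page=3": ["/item/c"],
}

for level in (0, 1, 3):
    print(level, crawl("/list?page=1", site, level))
```

In this model every "next page" link costs one level, so the item on the third list page is only reached at level 3. A site with hundreds of list pages would need a correspondingly large level.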

 If the Page Is Not Indexed as It Appears

Some pages are built with JavaScript, so the HTML content of the page does not contain the necessary information. In such cases, check the "navigate like a browser" option. Indexing will be slower, but the page will be indexed as it appears. The alternative web browser module must be installed for this option to work.
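
The following sketch illustrates the problem with a made-up page. It extracts the visible text from the HTML as delivered by the server and from the HTML as it would look after a browser has run the script. The helper `visible_text` and both HTML strings are hypothetical; the snippet does not run JavaScript and does not show how GEODI's browser-like navigation works.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# HTML as delivered: the content is only created when the script runs.
static_html = (
    '<html><body><div id="app"></div>'
    '<script>document.getElementById("app").textContent = "Product list";</script>'
    "</body></html>"
)
# HTML as a browser would hold it after running the script.
rendered_html = '<html><body><div id="app">Product list</div></body></html>'

print(repr(visible_text(static_html)))
print(repr(visible_text(rendered_html)))
```

The delivered HTML yields no indexable text, while the rendered version does, which is the gap that browser-like navigation is meant to close.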

