Page Comparison

GEODI can also use web pages and RSS news sources as content. Web pages can have very variable structures. To support this variability, the Web Page data source offers many options.

Please note some important issues related to web page indexing. Indexing web pages may require intensive internet usage. Some sites may interpret such traffic as "hacking" or as a DDoS attack and ban your IP address. In addition, the copyright or legal terms of the pages you index may not allow their content to be indexed or used outside the site. In these and other possible cases, you are fully responsible. DECE only provides a technical solution and cannot be held responsible for any consequences.

Tip

Conditions for connection

Access to the web page
Verification information (such as a token) required by pages that verify users

...

Addresses

You can provide a single address or multiple addresses. Domain restriction settings will work independently for each address.


Level

With level=0, only the given page is indexed. Set the level high enough to reach every page you want indexed. For sites with paging, the level may need to be 1000 or more.
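The effect of the level setting can be pictured as a depth-limited, breadth-first walk over links. The sketch below is only an illustration under that assumption, not GEODI's actual crawler. The `crawl` function and the `SITE` link graph are hypothetical names invented for the example.

```python
from collections import deque


def crawl(start, links, max_level):
    """Return {page: level} for pages reachable from `start` within `max_level` link hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        level = seen[page]
        if level == max_level:
            continue  # do not follow links beyond the level limit
        for target in links.get(page, []):
            if target not in seen:
                seen[target] = level + 1
                queue.append(target)
    return seen


# Hypothetical link graph: a paged listing where each page links to the next one.
SITE = {
    "/news": ["/news?page=2", "/about"],
    "/news?page=2": ["/news?page=3"],
    "/news?page=3": ["/news?page=4"],
}

print(sorted(crawl("/news", SITE, 0)))  # only the start page
print(sorted(crawl("/news", SITE, 2)))  # start page plus two hops
```

Because each paging link adds one hop, a long paged listing needs a level at least as large as the number of pages to be followed.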


Parameters to Ignore

Many web pages use URL parameters. By default, GEODI creates content for each unique URL. But in some cases, a parameter may not change the content of the page. In such situations, you can ignore such parameters to get a better index result.

For example:

...
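To illustrate why ignoring a parameter helps, the sketch below removes chosen parameters from a URL so that addresses differing only in those parameters collapse to one. It is a minimal sketch, not how GEODI implements the option. The function name, the parameter names `sessionid` and `utm_source`, and the example.com addresses are all assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ignored_params(url, ignored):
    """Return `url` with every query parameter named in `ignored` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical parameters that do not change the page content.
IGNORED = {"sessionid", "utm_source"}

a = strip_ignored_params("https://example.com/item?id=7&sessionid=abc", IGNORED)
b = strip_ignored_params("https://example.com/item?id=7&utm_source=mail", IGNORED)
print(a)
print(a == b)  # both addresses now refer to the same content
```

With the ignored parameters removed, the two addresses are treated as one page instead of producing two identical contents in the index.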

Pages to Ignore

Some websites may have social media links, advertising pages, or similar pages that you do not want in the index.

You may list as many pages as you need to exclude. Separate the page addresses with ";".
Wildcards are allowed. For example, *adds* ; *last.html ignores any page whose address contains "adds" or ends with "last.html".
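The pattern list can be read as a set of ";"-separated wildcard patterns, any one of which excludes a page. The sketch below shows that reading with Python's `fnmatch` module. It is an assumption about the matching behavior, not GEODI's own matcher, and `is_ignored` and the example.com addresses are hypothetical. Note that `fnmatch` also treats "?" and "[" as special characters, which may differ from GEODI.

```python
from fnmatch import fnmatchcase


def is_ignored(url, pattern_list):
    """Return True if `url` matches any ';'-separated wildcard pattern."""
    patterns = [p.strip() for p in pattern_list.split(";") if p.strip()]
    return any(fnmatchcase(url, pattern) for pattern in patterns)


PATTERNS = "*adds* ; *last.html"

for url in (
    "https://example.com/adds/banner.html",
    "https://example.com/news/last.html",
    "https://example.com/news/first.html",
):
    print(url, is_ignored(url, PATTERNS))
```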


Crawling Rules

GEODI has rules on a per-web-page basis. Some rules come pre-configured. For example, on Wikipedia pages only the "info box" containing content is processed. Pagination controls found on some web pages (such as links appearing as 1, 2, 3, ..., 10 and determining the pages) are processed automatically.



If the page is not indexed as it appears

Render like Browsers

Some pages are generated using JavaScript. In such cases, the HTML content of the page may not provide the necessary information. In these situations, the "Render like Browsers" option should be checked. Indexing will be slower but will yield the desired results. For this option to work, an alternative web browser module must be installed.

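How are page content names generated?

Page names are formed in this order of preference: og:title → title → page URL. If the page's HTML source contains og:title, it is used; otherwise the title is used. If neither is available, the address of the page as it appears in the browser is used.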

You can query pages using doc: ,
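The naming order above (og:title, then title, then the page URL) can be sketched with Python's standard HTML parser. This is a minimal illustration of the fallback chain only, not GEODI's implementation. The `page_name` function, the `TitleParser` class, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the og:title meta value and the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self._title_chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "og:title":
            # Keep the first og:title found.
            self.og_title = self.og_title or attributes.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_chunks.append(data)

    @property
    def title(self):
        return "".join(self._title_chunks).strip()


def page_name(html, url):
    """Return og:title if present, else <title>, else the page URL."""
    parser = TitleParser()
    parser.feed(html)
    return parser.og_title or parser.title or url


sample = '<html><head><meta property="og:title" content="OG Name"><title>Plain</title></head></html>'
print(page_name(sample, "https://example.com/a"))
```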

Version	Old Version 1	New Version Current
Changes made by	Dece Translate (Unlicensed)	Arda Ülgü (Unlicensed)
Saved on	Sept 21, 2022	Jun 14, 2024
