index: reindex indexes all pages #25

Closed
opened 6 months ago by René Wagner · 5 comments
Owner

When reindexing, all pages are processed, not only the pages that have been re-crawled since the last index run.

This results in a very long indexing process.

René Wagner added the bug label 6 months ago
Poster
Owner

probably related to #8

René Wagner added a new dependency 6 months ago
Poster
Owner

first investigation:

  • GUS tries to determine which pages have already been indexed by fetching an array of "indexed_urls" from the Whoosh index
  • this doesn't seem to work, as all pages are indexed again and again
  • this approach is flawed anyway, as pages that had been indexed would not be reindexed if they were updated
  • using the already implemented field "indexed_at" to determine what should be read seems more useful
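To illustrate why the URL-based check is flawed even when it works, here is a minimal sketch. It is not GUS code: the record layout, the "changed_at" field and the example URLs are made up for the illustration; only "indexed_urls" and "indexed_at" come from the notes above.

```python
from datetime import datetime

# Hypothetical page records; "changed_at" is an assumed field for this sketch.
pages = [
    {"url": "gemini://a.example/", "changed_at": datetime(2021, 3, 1),
     "indexed_at": datetime(2021, 2, 1)},   # updated after it was indexed
    {"url": "gemini://b.example/", "changed_at": datetime(2021, 1, 1),
     "indexed_at": datetime(2021, 2, 1)},   # unchanged since it was indexed
    {"url": "gemini://c.example/", "changed_at": datetime(2021, 3, 1),
     "indexed_at": None},                   # never indexed
]

# URLs that are already present in the search index.
indexed_urls = {"gemini://a.example/", "gemini://b.example/"}


def pending_by_url(pages, indexed_urls):
    """URL-based check: only pages missing from the index are selected."""
    return [p["url"] for p in pages if p["url"] not in indexed_urls]


def pending_by_timestamp(pages):
    """Timestamp-based check: also selects pages changed after indexing."""
    return [
        p["url"]
        for p in pages
        if p["indexed_at"] is None or p["indexed_at"] < p["changed_at"]
    ]


print(pending_by_url(pages, indexed_urls))
print(pending_by_timestamp(pages))
```

The URL-based check never selects a.example again although it changed after indexing; the timestamp comparison does.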
René Wagner self-assigned this 5 months ago
René Wagner added a new dependency 5 months ago
Poster
Owner

proposal:

  • update "indexed_at" after the indexing is done
  • introduce new fields "last_crawl" and "last_crawl_success", "last_status", "last_message"
  • reindex only pages where "indexed_at" is lower than "last_crawl_success"
  • get rid of "crawl" table (no history of crawls needed)
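A rough sketch of how the proposed selection could look, using plain sqlite3 for brevity. The table name "page", the column types, the helper names and the example rows are assumptions for the illustration and not the actual GUS schema; only the field names are taken from the proposal. Timestamps are stored as ISO 8601 strings so that string comparison matches chronological order.

```python
import sqlite3


def create_page_table(conn):
    # Assumed layout: one row per page, no separate crawl history table.
    conn.execute(
        """
        CREATE TABLE page (
            url TEXT PRIMARY KEY,
            indexed_at TEXT,
            last_crawl TEXT,
            last_crawl_success TEXT,
            last_status INTEGER,
            last_message TEXT
        )
        """
    )


def pages_to_reindex(conn):
    """Return URLs that were crawled successfully after their last indexing."""
    rows = conn.execute(
        """
        SELECT url FROM page
        WHERE last_crawl_success IS NOT NULL
          AND (indexed_at IS NULL OR indexed_at < last_crawl_success)
        ORDER BY url
        """
    )
    return [row[0] for row in rows]


def mark_indexed(conn, urls, now):
    """Update "indexed_at" once indexing of the given pages is done."""
    conn.executemany(
        "UPDATE page SET indexed_at = ? WHERE url = ?",
        [(now, url) for url in urls],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_page_table(conn)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("gemini://recrawled.example/", "2021-02-01T00:00:00",
         "2021-03-01T00:00:00", "2021-03-01T00:00:00", 20, "text/gemini"),
        ("gemini://unchanged.example/", "2021-02-01T00:00:00",
         "2021-01-01T00:00:00", "2021-01-01T00:00:00", 20, "text/gemini"),
        ("gemini://new.example/", None,
         "2021-03-01T00:00:00", "2021-03-01T00:00:00", 20, "text/gemini"),
        ("gemini://broken.example/", None,
         "2021-03-01T00:00:00", None, 51, "not found"),
    ],
)

pending = pages_to_reindex(conn)
mark_indexed(conn, pending, "2021-03-02T00:00:00")
```

With this, a second reindex run directly after the first one has nothing left to do, which is the behaviour this issue asks for.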
Poster
Owner

The index step only adds documents but never updates existing ones. According to https://whoosh.readthedocs.io/en/latest/indexing.html this is the wrong behaviour.

Poster
Owner

fixed in 87ef15df2e

René Wagner closed this issue 5 months ago