index: reindex indexes all pages
Closed · opened 2 months ago by René Wagner · 5 comments
When reindexing, all pages are processed instead of only those that have been re-crawled since the last index run.
This results in a very long indexing process.
René Wagner added the bug label 2 months ago
- GUS tries to determine which pages have already been indexed by fetching an array of "indexed_urls" from the whoosh index
- this doesn't seem to work as all pages are indexed again and again
- in any case, this approach is flawed: pages that had already been indexed would not be reindexed if they were updated
- the use of the already implemented field "indexed_at" to determine what should be read seems more useful
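
To illustrate the flaw described above, here is a minimal sketch in plain Python, not GUS code. The page records, the "crawled_at" key, and both helper functions are hypothetical; only "indexed_at" and "indexed_urls" are names taken from this issue. It compares selection by URL membership with selection by timestamp:

```python
from datetime import datetime

# Hypothetical page records; "crawled_at" is an illustrative name, not a GUS field.
pages = [
    # indexed and unchanged since then
    {"url": "gemini://a.example/", "indexed_at": datetime(2021, 1, 1),
     "crawled_at": datetime(2021, 1, 1)},
    # indexed, but re-crawled (updated) afterwards
    {"url": "gemini://b.example/", "indexed_at": datetime(2021, 1, 1),
     "crawled_at": datetime(2021, 2, 1)},
    # crawled but never indexed
    {"url": "gemini://c.example/", "indexed_at": None,
     "crawled_at": datetime(2021, 2, 1)},
]


def select_by_membership(pages, indexed_urls):
    """Skip every URL that is already in the index, even if the page changed."""
    return [p["url"] for p in pages if p["url"] not in indexed_urls]


def select_by_timestamp(pages):
    """Pick pages that were never indexed or were crawled after the last indexing."""
    return [
        p["url"]
        for p in pages
        if p["indexed_at"] is None or p["indexed_at"] < p["crawled_at"]
    ]


indexed_urls = {"gemini://a.example/", "gemini://b.example/"}
by_membership = select_by_membership(pages, indexed_urls)
by_timestamp = select_by_timestamp(pages)
```

With this sample data, the membership check selects only the never-indexed page and misses the updated one, while the timestamp comparison selects both.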
René Wagner self-assigned this 3 weeks ago
- update "indexed_at" after the indexing is done
- introduce new fields "last_crawl" and "last_crawl_success", "last_status", "last_message"
- reindex only pages where "indexed_at" is lower than "last_crawl_success"
- get rid of "crawl" table (no history of crawls needed)
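
The plan above could be sketched roughly as follows. This is an assumption-laden illustration using the standard-library sqlite3 module, not the actual GUS schema or code: the table name "page", the column types, the sample rows, and the helpers pages_to_reindex and mark_indexed are all hypothetical. Only the field names come from this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; only the column names are taken from the issue.
# Timestamps are ISO 8601 strings, which compare correctly as text.
conn.execute(
    """
    CREATE TABLE page (
        url TEXT PRIMARY KEY,
        indexed_at TEXT,
        last_crawl TEXT,
        last_crawl_success TEXT,
        last_status INTEGER,
        last_message TEXT
    )
    """
)

# Illustrative sample rows.
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?, ?, ?)",
    [
        # indexed after the last successful crawl: nothing to do
        ("gemini://a.example/", "2021-03-01T00:00:00", "2021-02-01T00:00:00",
         "2021-02-01T00:00:00", 20, "text/gemini"),
        # successfully re-crawled after the last indexing: reindex
        ("gemini://b.example/", "2021-01-01T00:00:00", "2021-02-01T00:00:00",
         "2021-02-01T00:00:00", 20, "text/gemini"),
        # never indexed: index
        ("gemini://c.example/", None, "2021-02-01T00:00:00",
         "2021-02-01T00:00:00", 20, "text/gemini"),
        # last crawl failed, last success predates the indexing: skip
        ("gemini://d.example/", "2021-01-01T00:00:00", "2021-02-01T00:00:00",
         "2020-12-01T00:00:00", 51, "not found"),
    ],
)


def pages_to_reindex(conn):
    """Return URLs whose last successful crawl is newer than their last indexing."""
    rows = conn.execute(
        """
        SELECT url FROM page
        WHERE last_crawl_success IS NOT NULL
          AND (indexed_at IS NULL OR indexed_at < last_crawl_success)
        ORDER BY url
        """
    )
    return [row[0] for row in rows]


def mark_indexed(conn, urls, now):
    """Update "indexed_at" once indexing of the given URLs is done."""
    conn.executemany(
        "UPDATE page SET indexed_at = ? WHERE url = ?",
        [(now, url) for url in urls],
    )
    conn.commit()


first = pages_to_reindex(conn)
mark_indexed(conn, first, "2021-03-01T00:00:00")
second = pages_to_reindex(conn)
```

Because the latest crawl result lives on the page row itself, no separate "crawl" history table is needed for this selection, and a second run right after indexing finds nothing left to do.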
René Wagner added a new dependency 3 weeks ago