crawl: improve deletion of outdated pages #35

Open
opened 2 months ago by René Wagner · 1 comment
Owner

Currently we delete a page from the index when both of the following hold:
The last_crawl is newer than the last_successfull_crawl (to avoid removing pages that have a long crawl interval, like binaries), and the last_successfull_crawl is older than 30 days.
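
For reference, the page-level rule can be sketched as a small predicate. This is only an illustration, not the crawler's actual code: the Page class, the is_outdated function and the MAX_AGE constant are made-up names, and only the two timestamp fields come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the issue text (30 days); the constant name is hypothetical.
MAX_AGE = timedelta(days=30)


@dataclass
class Page:
    # Hypothetical record; only the two timestamp fields are named in the issue.
    url: str
    last_crawl: datetime
    last_successfull_crawl: datetime


def is_outdated(page: Page, now: datetime) -> bool:
    """Return True when the page matches the current deletion rule."""
    # A crawl attempt happened after the last success, so the latest attempt failed.
    # Pages with a long crawl interval that simply were not revisited do not match.
    failed_since_success = page.last_crawl > page.last_successfull_crawl
    # The last success is more than 30 days in the past.
    stale = now - page.last_successfull_crawl > MAX_AGE
    return failed_since_success and stale
```

A page whose two timestamps are equal (last attempt succeeded) is never deleted by this rule, no matter how old it is.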

Because we skip a complete host during a crawl after 5 consecutive failed fetches, it may take a long time until a capsule that has gone down is completely removed from the index.

René Wagner added the
enhancement
label 2 months ago
Poster
Owner

After the crawl is finished:

  • go through all hosts that have reached the 5 consecutive failed crawls
  • for each host that matches this criterion
    • check the latest successful crawl for the host (no matter which page)
    • if the last successful crawl is more than 30 days ago -> delete all pages for that host
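
The steps above could look roughly like the following. Treat it as a sketch under assumptions: the in-memory Page list, the failure_counts mapping and the function names are hypothetical stand-ins for whatever the index actually stores. Hosts without any recorded successful crawl are left alone here, since the steps above do not say what should happen to them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set

# Both thresholds come from the issue text; the constant names are hypothetical.
MAX_CONSECUTIVE_FAILURES = 5
MAX_AGE = timedelta(days=30)


@dataclass
class Page:
    # Hypothetical record standing in for a row of the index.
    host: str
    url: str
    last_successfull_crawl: Optional[datetime]


def hosts_to_purge(failure_counts: Dict[str, int],
                   pages: List[Page],
                   now: datetime) -> Set[str]:
    """Return hosts that hit the failure limit and had no success for 30 days."""
    # Latest successful crawl per host, no matter which page it was.
    latest_success: Dict[str, datetime] = {}
    for page in pages:
        if page.last_successfull_crawl is None:
            continue
        best = latest_success.get(page.host)
        if best is None or page.last_successfull_crawl > best:
            latest_success[page.host] = page.last_successfull_crawl

    purge: Set[str] = set()
    for host, failures in failure_counts.items():
        if failures < MAX_CONSECUTIVE_FAILURES:
            continue  # host is still being crawled normally
        latest = latest_success.get(host)
        # Assumption: a host with no recorded success is skipped, not purged.
        if latest is not None and now - latest > MAX_AGE:
            purge.add(host)
    return purge


def purge_pages(pages: List[Page], hosts: Set[str]) -> List[Page]:
    """Drop every page that belongs to one of the given hosts."""
    return [page for page in pages if page.host not in hosts]
```

Running hosts_to_purge once after each crawl and passing its result to purge_pages would remove a dead capsule in one step instead of waiting for every page to match the page-level rule.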
René Wagner self-assigned this 1 month ago