index: results to be included #33

Closed
opened 1 month ago by René Wagner · 2 comments
Owner

At the moment, because of https://src.clttr.info/rwa/geminispace.info/src/commit/6eedbd4190b8f1feafb34fcca4007655237fa9df/gus/build_index.py#L103, we only index files whose last crawl was a success.

We should be more relaxed about this, which means:

  • index files that had a successful crawl within the last 30 days, or whose last crawl returned status 20
  • filter out redirects, inputs and so on

Maybe we need to adjust the metadata of a page to fulfill these requirements.
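
A minimal sketch of the relaxed rule, to make the two conditions concrete. This is not the code in `build_index.py`; the `PageRecord` class and all of its field names are hypothetical stand-ins for whatever the page metadata ends up holding. It assumes the Gemini status ranges (1x input, 2x success, 3x redirect) and treats only status 20 as indexable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

SUCCESS = 20                   # Gemini status for a successful response
MAX_AGE = timedelta(days=30)   # grace period taken from the issue text


@dataclass
class PageRecord:
    # Hypothetical stand-in for a crawled page, not the actual GUS model.
    url: str
    last_status: int                     # status of the most recent crawl
    last_success_at: Optional[datetime]  # time of the last non-failing crawl
    last_success_status: Optional[int]   # status returned by that crawl


def should_index(page: PageRecord, now: datetime) -> bool:
    """Return True if the page qualifies under the relaxed rule."""
    # Case 1: the most recent crawl returned a plain success.
    if page.last_status == SUCCESS:
        return True
    # Case 2: an older crawl succeeded. Redirects (3x) and inputs (1x)
    # are filtered out by requiring exactly status 20.
    if page.last_success_at is None or page.last_success_status != SUCCESS:
        return False
    return now - page.last_success_at <= MAX_AGE
```

With this shape, a page whose latest crawl failed with 51 but which returned 20 ten days ago stays in the index, while a page whose last good response was a 31 redirect does not.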

René Wagner added the enhancement label 1 month ago
René Wagner changed title from index: to index: results to be included 6 days ago
Poster
Owner

solution:

  • introduce a new field `last_crawl_success_status`
  • index files that had a successful crawl with status 20 in the last 30 days
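
As a rough illustration of how such a field could drive the selection, here is a sketch using the standard-library `sqlite3` module. Only `last_crawl_success_status` comes from this issue; the `page` table, the `url` and `last_crawl_success_at` columns, and the helper names are assumptions made for the example and may not match the real schema or the change in fa2db540f6.

```python
import sqlite3
from datetime import datetime, timedelta


def indexable_urls(conn, now, max_age_days=30):
    """Select URLs whose last successful crawl was a status 20 within the window."""
    # Timestamps are stored as ISO 8601 text, so string comparison is ordered.
    cutoff = (now - timedelta(days=max_age_days)).isoformat()
    rows = conn.execute(
        "SELECT url FROM page "
        "WHERE last_crawl_success_status = 20 "
        "AND last_crawl_success_at >= ? "
        "ORDER BY url",
        (cutoff,),
    )
    return [row[0] for row in rows]


def demo():
    # Hypothetical schema, reduced to the columns the query needs.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        "url TEXT PRIMARY KEY, "
        "last_crawl_success_at TEXT, "
        "last_crawl_success_status INTEGER)"
    )
    now = datetime(2021, 6, 1)
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?)",
        [
            ("gemini://a.example/", (now - timedelta(days=3)).isoformat(), 20),
            ("gemini://b.example/", (now - timedelta(days=45)).isoformat(), 20),
            ("gemini://c.example/", (now - timedelta(days=1)).isoformat(), 31),
            ("gemini://d.example/", None, None),  # never crawled successfully
        ],
    )
    return indexable_urls(conn, now)
```

In the demo data only the first URL qualifies: the second is older than 30 days, the third was a redirect, and the fourth has no successful crawl recorded.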
Poster
Owner

implemented in fa2db540f6
René Wagner closed this issue 3 days ago