
correctly handle robots.txt

Honor the robots.txt entries for "indexer" and "gus" as well
as the default * section.

The robot_file_map.p file must be deleted on a live instance
after this change has been applied, to force a refetch of all
robots.txt files, as previously only empty files were stored.

Signed-off-by: Natalie Pendragon <natpen@natpen.net>
Branch: master
Author: René Wagner (8 months ago)
Committed by: Natalie Pendragon
Commit: 108bfe850a
2 changed files:
  1. docs/handling-robots.md (14)
  2. gus/crawl.py (20)
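The gist of the change: Python's RobotFileParser falls back to the default `*` section when a specific user-agent has no section of its own, but it does not treat "gus" as a kind of "indexer", so each agent has to be checked explicitly. Below is a minimal sketch of that behaviour using the standard library parser, assuming GeminiRobotFileParser follows the same matching rules; the robots.txt content and URL are made up for illustration.

```python
# Minimal sketch (not GUS code): why both user-agents must be checked.
# Assumes GeminiRobotFileParser matches urllib.robotparser's fallback rules.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: indexer
Disallow: /private/

User-agent: gus
Crawl-delay: 10

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "gemini://example.org/private/page.gmi"  # hypothetical URL
print(rp.can_fetch("indexer", url))  # False: blocked by the indexer section
print(rp.can_fetch("gus", url))      # True: the gus section has no Disallow rules
print(rp.can_fetch("*", url))        # True: the default section allows everything

# The crawler should only fetch when both relevant agents are allowed:
can_fetch = rp.can_fetch("indexer", url) and rp.can_fetch("gus", url)
print(can_fetch)                     # False
```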

docs/handling-robots.md (new file, 14 lines added)

@@ -0,0 +1,14 @@
+# robots.txt handling
+
+robots.txt is fetched for each (sub)domain before actually crawling the content.
+
+GUS honors the following User-agents:
+* indexer
+* gus
+* *
+
+## robots.txt caching
+
+Every fetched robots.txt is cached in `index/robot_file_map.p`, even if it was empty/missing.
+
+To force a refetch of _all_ robots.txt for _all_ capsules, simply delete the file named above and run a crawl.
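The `.p` extension suggests the cache is a pickled map keyed by host. The following is only a rough sketch of that load/refetch pattern; the function names and dict layout are assumptions, not GUS internals.

```python
# Rough sketch of a pickled robots.txt cache keyed by host.
# CACHE_PATH comes from the doc above; everything else is assumed, not GUS internals.
import os
import pickle

CACHE_PATH = "index/robot_file_map.p"

def load_robot_file_map(path=CACHE_PATH):
    """Return the cached host -> robots.txt data, or an empty dict if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def save_robot_file_map(robot_file_map, path=CACHE_PATH):
    """Persist the cache after a crawl has fetched new robots.txt files."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(robot_file_map, f)

# Forcing a refetch of every robots.txt is simply deleting the cache file
# (rm index/robot_file_map.p) and running a crawl again, as described above.
```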

gus/crawl.py (20 lines changed)

@@ -392,6 +392,7 @@ def fetch_robots_file(robot_host):
     )
     rp = GeminiRobotFileParser(robot_url)
     rp.read()
+    return rp


 def get_robots_file(robot_host):
@@ -443,12 +444,15 @@ def crawl_page(
     robots_file = get_robots_file(gr.normalized_host)
     crawl_delay = None
     if robots_file is not None:
-        # keep overwriting the value of can_fetch with more specific user-agent values
-        # last one should win, and if not present, RobotFileParser will just return
-        # the higher level's value again
-        can_fetch = robots_file.can_fetch("*", gr.normalized_url)
-        can_fetch = robots_file.can_fetch("indexer", gr.normalized_url)
-        can_fetch = robots_file.can_fetch("gus", gr.normalized_url)
+        logging.debug("Found robots.txt for %s", gr.normalized_url)
+        # only fetch if both user-agents are allowed to fetch
+        # RobotFileParser will return the higher level value (*) if no specific
+        # value is found, but has no understanding that "gus" is a more specific
+        # form of an indexer
+        logging.debug("can_fetch indexer: %s", robots_file.can_fetch("indexer", gr.normalized_url))
+        logging.debug("can_fetch gus: %s", robots_file.can_fetch("gus", gr.normalized_url))
+        can_fetch = (robots_file.can_fetch("indexer", gr.normalized_url) and
+                     robots_file.can_fetch("gus", gr.normalized_url))

         # same approach as above - last value wins
         crawl_delay = robots_file.crawl_delay("*")
@@ -456,8 +460,8 @@ def crawl_page(
             crawl_delay = robots_file.crawl_delay("gus")

         if not can_fetch:
-            logging.debug(
-                "Blocked by robots files, skipping: %s",
+            logging.info(
+                "Blocked by robots.txt, skipping: %s",
                 gus.lib.logging.strip_control_chars(url),
             )
             return
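For the crawl delay, the diff keeps the existing "last value wins" pattern: each `crawl_delay()` call falls back to the `*` section when the asked-for agent has no section of its own, so the most specific available value ends up in `crawl_delay`. A small sketch of that fallback, again assuming GeminiRobotFileParser behaves like the standard library parser; the robots.txt content is made up.

```python
# Sketch of the "last value wins" crawl-delay lookup, assuming
# GeminiRobotFileParser behaves like urllib.robotparser.RobotFileParser.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5

User-agent: gus
Crawl-delay: 30
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

crawl_delay = rp.crawl_delay("*")        # 5
crawl_delay = rp.crawl_delay("indexer")  # 5: no indexer section, falls back to *
crawl_delay = rp.crawl_delay("gus")      # 30: gus has its own section
print(crawl_delay)                       # 30, the most specific value wins
```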
