robots.txt is not honored
geminispace.info currently serves the following robots.txt:
    User-agent: researcher
    User-agent: indexer
    User-agent: archiver
    Disallow: /search
    Disallow: /backlinks

    User-agent: *
    Disallow: /add-seed
    Disallow: /threads
but the site is crawled anyway:
    Feb 02 11:48:39 geminispace-info poetry: 2021-02-02 11:48:39,065 crawl INFO Fetching resource: gemini://geminispace.info/v/search/476?content_type%3Aimage/jpeg
    Feb 02 11:48:39 geminispace-info poetry: 2021-02-02 11:48:39,800 crawl INFO Fetching resource: gemini://geminispace.info/v/search/477?content_type%3Aimage/jpeg
    Feb 02 11:48:40 geminispace-info poetry: 2021-02-02 11:48:40,519 crawl INFO Fetching resource: gemini://geminispace.info/search/477?content_type%3Aimage/jpeg
Seems like robots.txt has not been fetched at all for this crawl run... this is because the robots.txt for geminispace.info is already listed in index/robot_file_map.p (with empty values).
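To confirm this, the cached entry can be inspected directly. A quick sketch, assuming index/robot_file_map.p is a plain pickled dict keyed by domain (the dict layout is my assumption, not verified against the GUS source):

    import pickle

    # assumption: robot_file_map.p maps domain -> cached robots.txt content
    with open("index/robot_file_map.p", "rb") as f:
        robot_file_map = pickle.load(f)

    # prints the cached (empty) value that shadows the real robots.txt
    print(repr(robot_file_map.get("geminispace.info")))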
Possible solution: remove index/robot_file_map.p.
Even after removing index/robot_file_map.p to force a reload of robots.txt, the forbidden segments are still crawled on geminispace.info.
Need to investigate what's going on.
Removing robot_file_map.p is the same thing I do to force a fresh fetch of robots.txt - the downside of that approach is that it invalidates everyone's cached robots.txt. Alas.
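If the map format allows it, one could instead drop just the stale entry so the rest of the cache survives. A sketch under the same pickled-dict assumption:

    import pickle

    with open("index/robot_file_map.p", "rb") as f:
        robot_file_map = pickle.load(f)

    # invalidate only geminispace.info; everyone else keeps their cached robots.txt
    robot_file_map.pop("geminispace.info", None)

    with open("index/robot_file_map.p", "wb") as f:
        pickle.dump(robot_file_map, f)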
I've noticed questionable adherence to robots.txt multiple times - I'm pretty sure there's a bug in the robots.txt functionality somewhere!
A fix is currently being tested on geminispace.info; some more testing is needed, but the first results look promising.
I'll also check if we can skip the check for the "*" user-agent: if I understand it correctly, the check for "indexer" should return the value for "*" anyway if no special handling for that user-agent is set.
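That assumption is easy to verify with the stdlib parser (a standalone snippet of mine, not GUS code):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /search",
    ])

    # a user-agent without a dedicated section inherits the "*" rules
    print(rp.can_fetch("indexer", "/search"))  # False
    print(rp.can_fetch("*", "/search"))        # False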
We have another "logic issue" in the robots.txt parsing. Consider a robots.txt containing:
    User-agent: Indexer
    Disallow: /file.gmi
The RobotFileParser will return the following values:
- can_fetch * -> true
- can_fetch Indexer -> false
- can_fetch gus -> true
This is because the parser does not know that "GUS" is a more specific version of "indexer", and thus returns the value for "*" when no section for "gus" is given.
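The behaviour can be reproduced in a few lines (again a standalone snippet, not GUS code):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.parse([
        "User-agent: Indexer",
        "Disallow: /file.gmi",
    ])

    print(rp.can_fetch("*", "/file.gmi"))        # True
    print(rp.can_fetch("Indexer", "/file.gmi"))  # False
    print(rp.can_fetch("gus", "/file.gmi"))      # True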
Our logic should be:
(can_fetch Indexer && can_fetch gus)
can_fetch * can be omitted because the more specific queries will return the value for "*" if not set.
Does this sound reasonable?
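A minimal sketch of what that combined check could look like (is_fetch_allowed is a hypothetical helper name, not the actual GUS function):

    from urllib.robotparser import RobotFileParser

    def is_fetch_allowed(rp: RobotFileParser, url: str) -> bool:
        # "*" is implied: each specific query falls back to the "*"
        # section when no dedicated section exists for that user-agent
        return all(rp.can_fetch(agent, url) for agent in ("indexer", "gus"))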
this is turning into a can of worms... 😁
gemini://chris.vittal.dev/robots.txt is empty, which seems to be fine for can_fetch, but not for crawl_delay:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/data/gus/gus/crawl.py", line 846, in main
        run_crawl(args.should_run_destructive, seed_urls=args.seed_urls)
      File "/data/gus/gus/crawl.py", line 830, in run_crawl
        crawl_page(resource, 0, should_check_if_expired=False)
      File "/data/gus/gus/crawl.py", line 493, in crawl_page
        crawl_delay = robots_file.crawl_delay("*")
      File "/usr/lib/python3.7/urllib/robotparser.py", line 182, in crawl_delay
        return self.default_entry.delay
    AttributeError: 'NoneType' object has no attribute 'delay'
According to https://bugs.python.org/issue35922#msg345733 this should already be fixed... but Debian Buster is on Python 3.7.3 by default 🤦
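Until the box runs a patched Python, a defensive wrapper along these lines could work around it (safe_crawl_delay is a hypothetical name, not from crawl.py):

    from urllib.robotparser import RobotFileParser

    def safe_crawl_delay(robots_file: RobotFileParser, useragent: str = "*"):
        # Python < 3.7.4 raises AttributeError on an empty robots.txt
        # because default_entry is None (see bpo-35922)
        try:
            return robots_file.crawl_delay(useragent)
        except AttributeError:
            return None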
robots.txt handling is updated and looks good to me.
Patch sent upstream.
Another patch is needed to add the verbose mode (the /v/search segment) to the GUS robots.txt.
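For illustration, the amended stanza could look like this (a sketch of the intent; the actual patch may differ):

    User-agent: researcher
    User-agent: indexer
    User-agent: archiver
    Disallow: /search
    Disallow: /v/search
    Disallow: /backlinks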
@vee brought to my attention that the updated implementation has a flaw as well:
For example, imagine the case where someone wants to
disallow all indexers except GUS, because they like/trust GUS. That
was possible with the old implementation, because the most-specific
setting (i.e., GUS) would override the previous, less-specific setting.
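A hypothetical robots.txt showing the problem with the all-must-allow logic:

    User-agent: indexer
    Disallow: /

    User-agent: gus
    Disallow:

Under the combined check (can_fetch indexer && can_fetch gus), this capsule would now be skipped entirely, even though it explicitly allows GUS.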