robots.txt is not honored #7

Closed
opened 10 months ago by René Wagner · 11 comments
Owner

geminispace.info currently serves the following robots.txt:

User-agent: researcher
User-agent: indexer
User-agent: archiver
Disallow: /search
Disallow: /backlinks

User-agent: *
Disallow: /add-seed
Disallow: /threads

but the site is crawled anyway:

Feb 02 11:48:39 geminispace-info poetry[521216]: 2021-02-02 11:48:39,065 crawl    INFO     Fetching resource: gemini://geminispace.info/v/search/476?content_type%3Aimage/jpeg
Feb 02 11:48:39 geminispace-info poetry[521216]: 2021-02-02 11:48:39,800 crawl    INFO     Fetching resource: gemini://geminispace.info/v/search/477?content_type%3Aimage/jpeg
Feb 02 11:48:40 geminispace-info poetry[521216]: 2021-02-02 11:48:40,519 crawl    INFO     Fetching resource: gemini://geminispace.info/search/477?content_type%3Aimage/jpeg
René Wagner added the bug label 10 months ago
René Wagner self-assigned this 10 months ago
Poster
Owner

Seems like robots.txt has not been fetched at all for this crawl run... this is because the robots.txt for geminispace.info is already listed in index/robot_file_map.p (with empty values).

Possible solution: remove index/robot_file_map.p
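
For reference, a minimal sketch of how the stale cache could be inspected before deleting it. This assumes index/robot_file_map.p is a plain pickled dict keyed by domain, which is how the crawler appears to persist it; the exact structure is an assumption.

```python
# Sketch only: assumes index/robot_file_map.p is a pickled dict mapping
# domains to parsed robots.txt data (empty/None when the fetch stored nothing).
import pickle

with open("index/robot_file_map.p", "rb") as f:
    robot_file_map = pickle.load(f)

# Domains cached with an empty value will never have their robots.txt re-fetched.
for domain, robots in robot_file_map.items():
    if not robots:
        print(f"empty robots.txt cache entry for {domain}")
```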

René Wagner added the wontfix label 10 months ago
René Wagner closed this issue 10 months ago
René Wagner reopened this issue 10 months ago
Poster
Owner

Even after removing index/robot_file_map.p to force a reload of robots.txt, the forbidden segments are still crawled on geminispace.info.

Need to investigate what's going on.

René Wagner removed the wontfix label 10 months ago
Collaborator

Removing robot_file_map.p is the same thing I do to force a fresh fetch of robots.txt - the downside of that approach is that it invalidates everyone's cached robots.txt. Alas.

I've noticed questionable adherence to robots.txt multiple times - I'm pretty sure there's a bug in the robots.txt functionality somewhere!

Poster
Owner

I hope I can spare some time in the next few days to debug this.

Poster
Owner

The condition in [crawl.py L481](https://src.clttr.info/rwa/geminispace.info/src/branch/master/gus/crawl.py#L491) never matches, therefore robots.txt is never checked at all.

I think the reason may be that [fetch_robots_file](https://src.clttr.info/rwa/geminispace.info/src/branch/master/gus/crawl.py#L434) does not return anything and the [robot_file_map](https://src.clttr.info/rwa/geminispace.info/src/branch/master/gus/crawl.py#L445) therefore contains keys with empty values.

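As an illustration of the suspected bug (a sketch, not the actual crawl.py code): if the fetch helper builds a RobotFileParser but never returns it, every cached value ends up as None and the later robots.txt check is skipped entirely.

```python
# Illustrative sketch of the suspected bug pattern; the function and variable
# names mirror the ones mentioned above, but the body is made up.
from urllib.robotparser import RobotFileParser

def fetch_robots_file(robots_txt_body: str):
    parser = RobotFileParser()
    parser.parse(robots_txt_body.splitlines())
    # If this return is missing, the function implicitly returns None:
    return parser

robot_file_map = {}
robot_file_map["geminispace.info"] = fetch_robots_file(
    "User-agent: indexer\nDisallow: /search\n"
)

# With None cached, a guard like this never matches and robots.txt is ignored:
robots = robot_file_map["geminispace.info"]
if robots is not None:
    print(robots.can_fetch("indexer", "/search"))  # False once the parser is actually returned
```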
Poster
Owner

A fix is currently in test on geminispace.info; some more testing is needed but the first results look promising.

I'll also check if we can skip the check for the "*" user-agent as - if I understand it correctly - the check for "indexer" should return the value for "*" anyway if no user-agent-specific handling is set.

Poster
Owner

We have another "logic issue" in the robots.txt parsing.

E.g.:

User-agent: Indexer
Disallow: /file.gmi

The RobotFileParser will return the following values:

  • can_fetch * -> true
  • can_fetch Indexer -> false
  • can_fetch gus -> true

This is because the parser does not know that "GUS" is a more specific version of "indexer" and thus returns the value for * if no section for "gus" is given.

Our logic should be:

(can_fetch Indexer && can_fetch gus)

can_fetch * can be omitted because the more specific queries will return the value for * if it is not set.

Does this sound reasonable?
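
A small, runnable demonstration of the behaviour described above (the URL and user-agent strings are just examples):

```python
# Demonstrates RobotFileParser fallback and the proposed combined check.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse("User-agent: Indexer\nDisallow: /file.gmi\n".splitlines())

url = "gemini://example.org/file.gmi"
print(parser.can_fetch("*", url))        # True  - no rules for "*"
print(parser.can_fetch("Indexer", url))  # False - matches the Indexer section
print(parser.can_fetch("gus", url))      # True  - falls back to "*", not to "Indexer"

# Proposed logic: honour the generic "indexer" rules and any GUS-specific ones;
# an explicit "*" query is redundant because the specific queries fall back to it.
allowed = parser.can_fetch("Indexer", url) and parser.can_fetch("gus", url)
print(allowed)  # False
```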

Poster
Owner

This is getting a can of worms... 😁
gemini://chris.vittal.dev/robots.txt serves an empty robots.txt, which seems to be fine for can_fetch, but not for crawl_delay:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/data/gus/gus/crawl.py", line 846, in main
    run_crawl(args.should_run_destructive, seed_urls=args.seed_urls)
  File "/data/gus/gus/crawl.py", line 830, in run_crawl
    crawl_page(resource, 0, should_check_if_expired=False)
  File "/data/gus/gus/crawl.py", line 493, in crawl_page
    crawl_delay = robots_file.crawl_delay("*")
  File "/usr/lib/python3.7/urllib/robotparser.py", line 182, in crawl_delay
    return self.default_entry.delay
AttributeError: 'NoneType' object has no attribute 'delay'

According to https://bugs.python.org/issue35922#msg345733 this should already be fixed... but Debian Buster ships Python 3.7.3 by default 🤦
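
A possible workaround sketch for the older robotparser (the helper name is made up), guarding the call instead of relying on the Python 3.8 fix:

```python
# Workaround sketch for Python < 3.8, where crawl_delay() raises AttributeError
# on an empty robots.txt instead of returning None (see bpo-35922).
from urllib.robotparser import RobotFileParser

def safe_crawl_delay(parser, useragent="*"):
    try:
        return parser.crawl_delay(useragent)
    except AttributeError:
        # Empty robots.txt: no default entry exists, treat it as "no delay requested".
        return None

empty = RobotFileParser()
empty.parse([])                 # an empty robots.txt like chris.vittal.dev's
print(safe_crawl_delay(empty))  # None instead of a traceback on Python 3.7.3
```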

René Wagner changed title from robots.txt on self is not honored? to robots.txt is not honored 10 months ago
Poster
Owner

robots.txt handling is updated and looks good to me.

Patch sent upstream.

Another patch is needed to add the verbose mode (/v/search segment) to the GUS robots.txt.

René Wagner closed this issue 10 months ago
René Wagner reopened this issue 10 months ago
Poster
Owner

@vee brought to my attention that the updated implementation has a flaw as well:

> For example, imagine the case where someone wants to disallow all indexers except GUS, because they like/trust GUS. That was possible with the old implementation, because the most-specific setting (i.e., GUS) would override the previous, less-specific setting (i.e., indexer).
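
A quick illustration of the case @vee describes (the robots.txt content below is hypothetical): with the AND-combined check, GUS gets locked out even though the capsule owner explicitly allows it.

```python
# A capsule that blocks indexers in general but trusts GUS specifically.
from urllib.robotparser import RobotFileParser

robots_txt = (
    "User-agent: indexer\n"
    "Disallow: /\n"
    "\n"
    "User-agent: gus\n"
    "Disallow:\n"
)
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "gemini://example.org/page.gmi"
print(parser.can_fetch("indexer", url))  # False - indexers in general are blocked
print(parser.can_fetch("gus", url))      # True  - GUS is explicitly allowed

# The AND-combined check denies GUS despite the explicit allowance:
print(parser.can_fetch("indexer", url) and parser.can_fetch("gus", url))  # False
```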

René Wagner removed their assignment 10 months ago
Poster
Owner
https://lists.sr.ht/~natpen/gus/patches/20446
René Wagner closed this issue 9 months ago