crawl: invalid IPv6 error #9

Open
opened 6 months ago by René Wagner · 3 comments
Owner
Feb 06 10:09:01 geminispace-info poetry[876112]: 2021-02-06 10:09:01,917 crawl    INFO     Fetching resource: gemini://mozz.us/files/gemini-links.gmi
Feb 06 10:09:02 geminispace-info poetry[876112]: Traceback (most recent call last):
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "<string>", line 1, in <module>
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 833, in main
Feb 06 10:09:02 geminispace-info poetry[876112]:     run_crawl(args.should_run_destructive, seed_urls=args.seed_urls)
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 817, in run_crawl
Feb 06 10:09:02 geminispace-info poetry[876112]:     crawl_page(resource, 0, should_check_if_expired=False)
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   [Previous line repeated 4 more times]
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 599, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     redirect_chain=redirect_chain + [gr.fetchable_url],
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 599, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     redirect_chain=redirect_chain + [gr.fetchable_url],
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 650, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     resource, current_depth + 1, should_check_if_expired=True
Feb 06 10:09:02 geminispace-info poetry[876112]:   [Previous line repeated 15 more times]
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/crawl.py", line 646, in crawl_page
Feb 06 10:09:02 geminispace-info poetry[876112]:     contained_resources = gr.extract_contained_resources(response.content)
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/lib/gemini.py", line 415, in extract_contained_resources
Feb 06 10:09:02 geminispace-info poetry[876112]:     parent_hostname=self.urlsplit.hostname,
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/lib/gemini.py", line 90, in __init__
Feb 06 10:09:02 geminispace-info poetry[876112]:     parent_hostname=parent_hostname,
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/data/gus/gus/lib/gemini.py", line 114, in urlsplit_featureful
Feb 06 10:09:02 geminispace-info poetry[876112]:     u = urlsplit(url, "gemini")
Feb 06 10:09:02 geminispace-info poetry[876112]:   File "/usr/lib/python3.7/urllib/parse.py", line 459, in urlsplit
Feb 06 10:09:02 geminispace-info poetry[876112]:     raise ValueError("Invalid IPv6 URL")
Feb 06 10:09:02 geminispace-info poetry[876112]: ValueError: Invalid IPv6 URL
René Wagner added the
bug
label 6 months ago
René Wagner changed title from Invalid IPv6 error to crawl: invalid IPv6 error 6 months ago
Collaborator

Haven't seen this one before, but it looks like it's almost certainly just because I made a few too many "assumptions of IPv4-ness" in `/gus/lib/gemini.py` 😅

Poster
Owner

Happened again yesterday, again on `gemini://mozz.us/files/gemini-links.gmi`.

Going to have a look later.

Poster
Owner

This page is a large list of links scraped from the mailing list. It includes nonsense links like `gemini://example.proxy` or `gemini://`.

Although this should not crash the crawler, I decided for now to simply exclude this page.
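
For a longer-term fix, here is a minimal sketch of how the crawler could skip links that `urlsplit` rejects instead of crashing. The names `safe_urlsplit` and `splittable_links` are hypothetical and not part of `gus/lib/gemini.py`. The thread does not identify which link on the page triggered the error, so the malformed URL below is made up; an unbalanced `[` in the host part is one input that makes `urlsplit` raise `ValueError: Invalid IPv6 URL`.

```python
from typing import Iterable, List, Optional
from urllib.parse import SplitResult, urlsplit


def safe_urlsplit(url: str, default_scheme: str = "gemini") -> Optional[SplitResult]:
    """Split a URL, returning None instead of raising on malformed input.

    Hypothetical helper, not the project's actual code.
    """
    try:
        return urlsplit(url, default_scheme)
    except ValueError:
        # urlsplit raises ValueError("Invalid IPv6 URL") when the host part
        # contains a "[" without a matching "]" (or the reverse).
        return None


def splittable_links(links: Iterable[str]) -> List[str]:
    """Keep only the links that urlsplit can parse (hypothetical helper)."""
    return [link for link in links if safe_urlsplit(link) is not None]


links = [
    "gemini://mozz.us/files/gemini-links.gmi",
    "gemini://[not-closed/",  # made-up malformed link, assumed for illustration
]
print(splittable_links(links))
```

Catching the error where the link is parsed would let the crawler log and skip one bad link while still indexing the rest of the page, rather than excluding the whole page.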
