crawl: invalid url crashes crawl #37

Closed
opened 6 months ago by René Wagner · 0 comments
Owner
Nov 10 09:45:50 v2202102141844144675 poetry[264342]: 2021-11-10 09:45:50,205 crawl    ERROR    Error checking for exclude of url: gemini://:memex-domain/bin/9man/rc
Nov 10 09:45:50 v2202102141844144675 poetry[264342]: Traceback (most recent call last):
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "<string>", line 1, in <module>
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 588, in main
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     run_crawl(args.should_run_destructive, seed_urls=args.seed_urls)
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 577, in run_crawl
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     crawl_page(resource, 0, should_check_if_expired=False)
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 501, in crawl_page
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     crawl_page(
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 501, in crawl_page
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     crawl_page(
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 501, in crawl_page
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     crawl_page(
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   [Previous line repeated 93 more times]
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/crawl.py", line 325, in crawl_page
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     if gr.normalized_host in failure_count and failure_count[gr.normalized_host] > constants.MAXIMUM_FAILED_REQUEST_COUNT:
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/lib/gemini.py", line 197, in _get_normalized_host
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     ) = self._get_normalized_url_and_host()
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/home/gus/gus/lib/gemini.py", line 395, in _get_normalized_url_and_host
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     if self.urlsplit.port == 1965:
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:   File "/usr/lib/python3.9/urllib/parse.py", line 175, in port
Nov 10 09:45:50 v2202102141844144675 poetry[264342]:     raise ValueError(message) from None
Nov 10 09:45:50 v2202102141844144675 poetry[264342]: ValueError: Port could not be cast to integer value as 'memex-domain'
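For reference, the failure can be reproduced outside the crawler: `urlsplit()` accepts `gemini://:memex-domain/bin/9man/rc` without complaint, and the `ValueError` is only raised later when `.port` is read. Below is a minimal sketch of a guard that could run before a URL is handed to `crawl_page`. The helper name `has_valid_authority` is hypothetical and not part of the GUS code base, and this is not necessarily how the issue was actually fixed.

```python
from urllib.parse import urlsplit


def has_valid_authority(url: str) -> bool:
    """Return True if the URL has a host and a port that urllib can parse.

    Hypothetical helper for illustration only; not taken from crawl.py.
    """
    parts = urlsplit(url)
    try:
        # Reading .port is what raises ValueError for a non-numeric
        # or out-of-range port; urlsplit() itself does not validate it.
        parts.port
    except ValueError:
        return False
    # An empty host (as in "gemini://:memex-domain/...") yields hostname None.
    return parts.hostname is not None


if __name__ == "__main__":
    for candidate in (
        "gemini://:memex-domain/bin/9man/rc",  # URL from the log above
        "gemini://example.org:1965/",          # placeholder valid URL
    ):
        print(candidate, has_valid_authority(candidate))
```

With a check like this, a malformed link found on a page could be logged and skipped instead of propagating up through the recursive `crawl_page` calls and aborting the whole crawl.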
René Wagner added the bug label 6 months ago
René Wagner self-assigned this 6 months ago
René Wagner closed this issue 6 months ago