source of geminispace.info - the search provider for gemini space
Natalie Pendragon e0ea8f1de5 Add GUS licence 1 year ago
gus [serve] Make seed request handling async again for now 1 year ago
.gitignore [crawl] pickle and unpickle the robot_file_map 1 year ago
LICENSE Add GUS licence 1 year ago
README.md Remove outdated TODO 1 year ago
poetry.lock [crawl] Start indexing response sizes 1 year ago
pyproject.toml Add easy CLI way of removing domains from index 1 year ago

README.md

Developer Quickstart

Note that this currently requires a full crawl of Geminispace. While Geminispace has little content and few people are hacking on GUS, that is probably fine, but we should keep tabs on it so that we stay kind and respectful to content and server owners. I think the solution is a way to create a mock index, sooner rather than later.

  1. Get Python and Poetry
  2. Generate a local Geminispace index with poetry run crawl --destructive
  3. Serve GUS locally with poetry run serve

At this point you should be able to interact with a running local version of GUS, modulo perhaps some mucking about with SSL (which is left as an exercise to the reader because I am not an expert at all in that stuff :).
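
As a rough way to check that the local server answers at all, here is a minimal Gemini client sketch in Python. It is not part of GUS. The host `localhost` and port `1965` (the Gemini default) are assumptions about your local setup, and the function names `fetch` and `parse_response_header` are made up for illustration. Certificate verification is switched off because a local server will typically present a self-signed certificate; do not do that against servers you do not control.

```python
import socket
import ssl


def parse_response_header(header_line: str) -> tuple[int, str]:
    """Split a Gemini response header '<STATUS> <META>' into its two parts."""
    status, _, meta = header_line.strip().partition(" ")
    return int(status), meta


def fetch(url: str, host: str = "localhost", port: int = 1965) -> tuple[int, str, bytes]:
    """Send one Gemini request and return (status, meta, body).

    Assumption: the server listens on host:port with a self-signed
    certificate, so verification is disabled for this local check only.
    """
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE

    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # A Gemini request is just the absolute URL followed by CRLF.
            tls.sendall(url.encode("utf-8") + b"\r\n")
            chunks = []
            while True:
                chunk = tls.recv(4096)
                if not chunk:  # the server closes the connection when done
                    break
                chunks.append(chunk)

    raw = b"".join(chunks)
    header, _, body = raw.partition(b"\r\n")
    status, meta = parse_response_header(header.decode("utf-8"))
    return status, meta, body
```

Calling `fetch("gemini://localhost/")` while `poetry run serve` is running should return a status in the 2x range if everything is wired up; a TLS error at this point usually means the server's certificate setup still needs attention.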

Contributing

Please send patches to ~natpen/gus@lists.sr.ht.

For an introduction to mailing list-based Git collaboration, see this introduction, as well as this guide to mailing list etiquette.

Roadmap / TODOs

  • log output of crawl: I see some errors fly by, and it would be nice to be able to review them later and investigate.
  • get crawl to run on a schedule with systemd
  • add more statistics: this could go in the index statistics page, and, in addition to using the index itself, could also pull information from the jetforce logs.
    • server uptime (from indexes)
    • num new servers per week/month (from indexes)
    • num GUS queries per day (from server logs)
    • most common queries (not sure about this one) (from server logs)
    • num cross-domain redirects
    • num domains with robots
  • add tests: there aren't any yet!
  • add functionality to create a mock index: this would be useful for local hacking on serve.py, so one does not need to perform a real scrape of Geminispace to do said hacking.
  • exclude raw-text links: the Gemini spec has a preformatted-text construct (toggled by lines beginning with three backticks), so the extract_gemini_links function should be refactored to exclude any links found within such a block.
  • track number of inbound links
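
For the raw-text item above, the idea can be sketched as follows. This is an illustration, not the existing extract_gemini_links implementation, whose signature is not shown here; the function name `extract_links_outside_preformatted` is hypothetical. It relies on two gemtext rules: a line beginning with three backticks toggles preformatted mode, and a link line begins with `=>` followed by the URL and an optional label.

```python
def extract_links_outside_preformatted(gemtext: str) -> list[str]:
    """Return link targets from gemtext, skipping preformatted blocks.

    Hypothetical helper for illustration; it returns the raw URL field of
    each link line and does not resolve relative URLs.
    """
    links = []
    in_preformatted = False
    for line in gemtext.splitlines():
        # A line starting with three backticks toggles preformatted mode.
        if line.startswith("```"):
            in_preformatted = not in_preformatted
            continue
        if in_preformatted:
            continue  # anything in here is raw text, even if it looks like a link
        if line.startswith("=>"):
            # The first whitespace-separated field after "=>" is the URL.
            parts = line[2:].split(maxsplit=1)
            if parts:
                links.append(parts[0])
    return links
```

Tracking the toggle state line by line keeps this a single pass over the document, which matches how gemtext is meant to be parsed.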