source of geminispace.info - the search provider for gemini space
René Wagner c5bfdafcf5 exclude godocs.io 3 days ago
docs robots.txt sections "*" and "indexer" are honored 4 months ago
gus exclude godocs.io 3 days ago
infra some exception handling and updated service files 4 weeks ago
scripts [threads] Only work with textual pages 10 months ago
serve exclude godocs.io 3 days ago
tests/gus Add a few more url parsing test cases 4 months ago
.git-blame-ignore-revs Add .git-blame-ignore-revs file 7 months ago
.gitignore remove seed-requests from repo 4 months ago
LICENSE Add GUS licence 1 year ago
README.md introduce systemd-unit for indexer 5 months ago
logging.ini gsi specific updates 2021-01-29 5 months ago
poetry.lock move exclude definition to own file 4 weeks ago
pyproject.toml Update gusmobile clone location in pyproject.toml 8 months ago

README.md

Gemini Universal Search (GUS)

Dependencies

  1. Install Python (>3.5) and Poetry
  2. Run: poetry install

Making an initial index

Start with a few Gemini URLs for testing that are nicely sandboxed, so the crawl does not index huge parts of Gemini space.

  1. Create a "seed-requests.txt" file with your test Gemini URLs
  2. Run: poetry run crawl -d
  3. Run: poetry run build_index -d

These steps create an index.new directory. Rename it to index.
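
The file handling around the crawl can be sketched in Python. This is a minimal sketch under stated assumptions: the one-URL-per-line layout of seed-requests.txt, the example URL, and the helper names write_seed_file and promote_index are hypothetical and are not taken from the GUS source.

```python
from pathlib import Path


def write_seed_file(urls, path="seed-requests.txt"):
    """Write test URLs to the seed file, one per line (assumed format)."""
    seed = Path(path)
    seed.write_text("".join(f"{url}\n" for url in urls), encoding="utf-8")
    return seed


def promote_index(base_dir="."):
    """Rename index.new to index inside base_dir."""
    new_index = Path(base_dir) / "index.new"
    target = Path(base_dir) / "index"
    if not new_index.is_dir():
        raise FileNotFoundError(f"{new_index} does not exist; run build_index first")
    if target.exists():
        # Refuse to overwrite an existing index; move it away by hand first.
        raise FileExistsError(f"{target} already exists")
    new_index.rename(target)
    return target
```

A typical use would be to call write_seed_file(["gemini://example.org/"]) before the crawl and promote_index() after build_index has finished.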

Running the frontend

  1. Run: poetry run serve
  2. Navigate your gemini client to: "gemini://localhost/"
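
If no Gemini client is at hand, the frontend can also be checked with a few lines of Python. This is a minimal sketch, not part of GUS: the function names fetch and parse_header are hypothetical, it assumes the server listens on the default Gemini port 1965, and it skips certificate verification because local test servers usually present self-signed certificates.

```python
import socket
import ssl


def parse_header(header_line):
    """Split a Gemini response header '<STATUS> <META>' into (int, str)."""
    status, _, meta = header_line.strip().partition(" ")
    return int(status), meta


def fetch(url, host="localhost", port=1965, timeout=10):
    """Send one Gemini request and return (status, meta, body bytes)."""
    context = ssl.create_default_context()
    # Assumption: a local test server with a self-signed certificate.
    # Skipping verification is only acceptable for local testing.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # A Gemini request is the absolute URL followed by CRLF.
            tls.sendall(url.encode("utf-8") + b"\r\n")
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    header, _, body = b"".join(chunks).partition(b"\r\n")
    status, meta = parse_header(header.decode("utf-8"))
    return status, meta, body
```

With the frontend running, fetch("gemini://localhost/") should return a status in the 2x range if the start page is served successfully.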

Running the frontend in production with systemd

  1. Update infra/gus.service to match your needs (directory, user)
  2. Copy infra/gus.service to /etc/systemd/system/
  3. Run: systemctl enable gus, then systemctl start gus

Running the crawl to update the index

  1. Run: poetry run crawl
  2. Run: poetry run build_index
  3. Restart the frontend
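
The update cycle above can be chained in a small wrapper so that a failed step stops the run. This is a minimal sketch under stated assumptions: update_index and its parameters are hypothetical names, and the restart command in the usage note assumes the frontend runs as the systemd unit gus.

```python
import subprocess

# Commands from the README, in the order they must run.
UPDATE_STEPS = (
    ("poetry", "run", "crawl"),
    ("poetry", "run", "build_index"),
)


def update_index(run=subprocess.run, restart_cmd=None):
    """Run crawl, then build_index, then optionally restart the frontend.

    check=True makes a failing step raise, so a failed crawl never
    leads to an index build or a restart.
    """
    executed = []
    for step in UPDATE_STEPS:
        run(list(step), check=True)
        executed.append(step)
    if restart_cmd is not None:
        # restart_cmd is supplied by the caller; nothing is assumed here.
        run(list(restart_cmd), check=True)
        executed.append(tuple(restart_cmd))
    return executed
```

For a systemd setup, update_index(restart_cmd=["systemctl", "restart", "gus"]) would cover all three steps. The run parameter exists only so the sequence can be exercised without starting a real crawl.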

Running the crawl & indexer in production with systemd

  1. Update infra/gus-crawl.service and infra/gus-index.service to match your needs (directory, user)
  2. Update infra/gus-crawl.timer to match your needs (OnCalendar definition)
  3. Copy the service and timer files to /etc/systemd/system/
  4. Run: systemctl enable gus-crawl.timer, then systemctl start gus-crawl.timer to start the timer

Running the test suite

Run: poetry run pytest

Roadmap / TODOs

  • TODO: improve crawl and build_index automation
  • TODO: add functionality to create a mock index
  • TODO: exclude raw-text blocks from indexed content
  • TODO: strip control characters from logged output like URLs
  • TODO: fix bug in calculation of backlinks (iirc the bug is visible on gemini.circumlunar.space)
  • TODO: refactor manual exclusion logic to be regex-based instead of prefix-based, which would allow more nuanced exclusion rules
  • TODO: write a "clean" script that removes domains/pages from index, db, and statistics files, in accordance with the various exclusion lists and patterns
  • TODO: speed up statistics page, it's gotten reaaaaaaally slow
  • TODO: speed up newest hosts/pages pages, they've gotten reaaaaaaally slow
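
To illustrate the regex-based exclusion TODO above, the following sketch contrasts the two approaches. It is not the GUS implementation: the exclusion entries, URLs, and function names are hypothetical and chosen for illustration only.

```python
import re

# Hypothetical exclusion entries, for illustration only.
EXCLUDED_PREFIXES = ["gemini://example.org/cgi-bin/"]
EXCLUDED_PATTERNS = [re.compile(r"^gemini://[^/]+/.*\.(?:zip|tar\.gz)$")]


def excluded_by_prefix(url, prefixes=EXCLUDED_PREFIXES):
    """Prefix-based check: the URL must start with a listed prefix."""
    return any(url.startswith(prefix) for prefix in prefixes)


def excluded_by_regex(url, patterns=EXCLUDED_PATTERNS):
    """Regex-based check: any pattern may match anywhere in the URL."""
    return any(pattern.search(url) for pattern in patterns)
```

A prefix list can only exclude whole hosts or path subtrees. A pattern such as the one above can exclude archive files on every host, which a prefix cannot express.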