
add systemd-units for automatic crawling

The template runs the crawler once a week, on Saturday evening
(OnCalendar=Sat 18:00:00). To use a different launch time, modify
gus-crawl.timer.
remotes/src/master
René Wagner 9 months ago
parent
commit
6396d9f186
  1. README.md (31 lines changed)
  2. infra/gus-crawl.service (12 lines changed)
  3. infra/gus-crawl.timer (13 lines changed)

31
README.md

@@ -3,8 +3,8 @@
## Dependencies
1. Install python and poetry
2. Run: "poetry install"
1. Install python (>3.5) and [poetry](https://python-poetry.org)
2. Run: `poetry install`
## Making an initial index
@@ -13,34 +13,45 @@ Make sure you have some gemini URLs for testing which are nicely
sandboxed to avoid indexing huge parts of the gemini space.
1. Create a "seed-requests.txt" file with your test gemini URLs
2. Run: "poetry run crawl -d"
3. Run: "poetry run build_index -d"
2. Run: `poetry run crawl -d`
3. Run: `poetry run build_index -d`
You'll now have an `index.new` directory; rename it to `index`.
## Running the frontend
# Running the frontend
1. Run: "poetry run serve"
1. Run: `poetry run serve`
2. Navigate your gemini client to: "gemini://localhost/"
## automatic frontend with systemd-unit
1. Update `infra/gus.service` to match your needs (directory, user)
2. Copy `infra/gus.service` to `/etc/systemd/system/`
3. Run `systemctl enable gus` and `systemctl start gus`
# Updating the index
1. Run: "poetry run crawl"
2. Run: "poetry run build_index"
1. Run: `poetry run crawl`
2. Run: `poetry run build_index`
3. Restart frontend
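
The three steps above could be wrapped in a small shell function. This is only a sketch: the name `update_index` is made up for this example, and the restart line assumes the frontend runs as the `gus` systemd unit from the previous section.

```shell
# Hypothetical helper, not part of the repository.
# Run it from the repository root so that poetry finds the project.
update_index() {
    # The && chain stops at the first failing step, so a failed crawl
    # is not indexed and the frontend is not restarted.
    poetry run crawl &&
    poetry run build_index &&
    systemctl restart gus   # assumes the frontend is the `gus` unit
}
```

Restarting a system unit normally needs root privileges, so the last step may have to be run separately with `sudo`.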
## systemd-unit for crawling
1. Update `infra/gus-crawl.service` to match your needs (directory, user)
2. Update `infra/gus-crawl.timer` to match your needs (OnCalendar definition)
3. Copy both files to `/etc/systemd/system/`
4. Run `systemctl enable gus-crawl.timer` and `systemctl start gus-crawl.timer` to start the timer
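
Put together, steps 3 and 4 might look like the following when run from the repository root. A sketch only: `UNIT_DIR` and `install_crawl_units` are names invented for this example, and `systemctl daemon-reload` is an extra step, not listed above, that makes systemd pick up newly copied unit files.

```shell
# Hypothetical helper; UNIT_DIR defaults to the directory named in step 3.
UNIT_DIR="${UNIT_DIR:-/etc/systemd/system}"

install_crawl_units() {
    # Step 3: copy the (already edited) unit files into place.
    cp infra/gus-crawl.service infra/gus-crawl.timer "$UNIT_DIR"/ || return 1
    # Not in the list above: reload so systemd sees the new files.
    systemctl daemon-reload || return 1
    # Step 4: enable the timer for future boots, then start it now.
    systemctl enable gus-crawl.timer && systemctl start gus-crawl.timer
}
```

Called as root (`install_crawl_units`), this leaves the timer active; `systemctl list-timers gus-crawl.timer` should then show the next scheduled run.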
## Running test suite
Run: "poetry run pytest"
Run: `poetry run pytest`
## Roadmap / TODOs
- TODO: improve crawl and build_index automation
- TODO: get crawl to run on a schedule with systemd
- TODO: add functionality to create a mock index
- TODO: exclude raw-text blocks from indexed content
- TODO: strip control characters from logged output like URLs

12
infra/gus-crawl.service

@@ -0,0 +1,12 @@
# /etc/systemd/system/gus-crawl.service
[Unit]
Description=Gemini Universal Search - Crawler
[Service]
User=gus
Group=gus
Type=oneshot
WorkingDirectory=/home/gus/code/gus
Environment="PYTHONUNBUFFERED=1"
ExecStart=/home/gus/.poetry/bin/poetry run crawl

13
infra/gus-crawl.timer

@@ -0,0 +1,13 @@
[Unit]
Description=Gemini Universal Search - Crawler Timer
ConditionVirtualization=!container
[Timer]
OnCalendar=Sat 18:00:00
AccuracySec=1h
Persistent=true
RandomizedDelaySec=6000
[Install]
WantedBy=timers.target
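
Changing the launch time means editing the `OnCalendar=` line of the timer template before copying it. One way to do that from the repository root is sketched below; the function name `set_crawl_schedule` is hypothetical, and `sed -i` is assumed to be GNU sed.

```shell
# Hypothetical helper: set_crawl_schedule 'Wed 03:00:00'
set_crawl_schedule() {
    # Rewrite only the OnCalendar line; "|" is the sed delimiter because
    # calendar expressions may themselves contain "/".
    sed -i "s|^OnCalendar=.*|OnCalendar=$1|" infra/gus-crawl.timer
}
```

A candidate expression can be checked beforehand with `systemd-analyze calendar 'Wed 03:00:00'`, which prints the normalized form and the next time it would elapse.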