
use cronjob for automated start

master
René Wagner 2 months ago
parent
commit
f928815d49
Signed by: rwa GPG Key ID: 2B8BCD69606E7F19
  1. README.md (11 lines changed)
  2. infra/gus-crawl.timer (13 lines changed)
  3. infra/gus-index.service (2 lines changed)
  4. infra/rebuild_index.sh (1 line changed)

11
README.md

@ -39,9 +39,8 @@ Now you'll have created `index.new` directory, rename it to `index`.
### Running the crawl & indexer in production with systemd
1. update `infra/gus-crawl.service` & `infra/gus-index.service` to match your needs (directory, user)
-2. update `infra/gus-crawl.timer` to match your needs (OnCalendar definition)
-3. copy both files to `/etc/systemd/system/`
-4. run `systemctl enable gus-crawl.timer` & `systemctl start gus-crawl.timer` to start the timer
+2. copy both files to `/etc/systemd/system/`
+3. set up a cron job for root with the following params: `0 9 */3 * * systemctl start gus-crawl --no-block`
## Running the test suite
@ -50,12 +49,6 @@ Run: `poetry run pytest`
## Roadmap / TODOs
- TODO: improve crawl and build_index automation
- TODO: add functionality to create a mock index
- TODO: exclude raw-text blocks from indexed content
- TODO: strip control characters from logged output like URLs
- TODO: fix bug in calulation of backlinks (iirc the bug is visible on gemini.circumlunar.space)
- TODO: refactor manual exclusion logic to be regex-based instead of prefix-based. we could get more nuanced with exclusion logic this way
- TODO: write a "clean" script that removes domains/pages from index, db, and statistics files, in accordance with the various exclusion lists and patterns
- TODO: speed up statistics page, it's gotten reaaaaaaally slow
- TODO: speed up newest hosts/pages pages, they've gotten reaaaaaaally slow

13
infra/gus-crawl.timer

@ -1,13 +0,0 @@
-[Unit]
-Description=
-ConditionVirtualization=!container
-[Timer]
-OnCalendar=*-*-1/3 08:00:00
-AccuracySec=15m
-Persistent=true
-RandomizedDelaySec=600
-[Install]
-WantedBy=timers.target

2
infra/gus-index.service

@ -9,5 +9,5 @@ Group=gus
Type=oneshot
WorkingDirectory=/home/gus
Environment="PYTHONUNBUFFERED=1"
-ExecStart=/bin/bash -c /home/gus/infra/update_index.sh
+ExecStart=/bin/bash -c /home/gus/infra/rebuild_index.sh
ExecStopPost=sudo systemctl restart gus

1
infra/rebuild_index.sh

@ -1,6 +1,5 @@
cp -r /home/gus/index /home/gus/index.new
/home/gus/.poetry/bin/poetry run build_index -d
rm -rf /home/gus/index.old
#rm -rf /home/gus/index.new/MAIN.tmp/
mv /home/gus/index /home/gus/index.old
mv /home/gus/index.new /home/gus/index
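
Note on the rotation in `infra/rebuild_index.sh`: the lines shown here run the build and then move the directories unconditionally, so a failed `build_index` run could still replace the live index. A minimal sketch of a guarded rotation follows, assuming the build leaves its result in `index.new` under the base directory. The function name `rotate_index` and its arguments are hypothetical and not part of the repository.

```shell
# Sketch only: rotate_index and its arguments are hypothetical, not part of the repository.
# Usage: rotate_index BASE_DIR BUILD_CMD [ARGS...]
rotate_index() {
    local base="$1"
    shift
    # Run the build first; if it fails, leave the live index untouched.
    "$@" || { echo "build failed, keeping current index" >&2; return 1; }
    # Refuse to rotate if the build left no new index behind.
    [ -d "$base/index.new" ] || { echo "missing $base/index.new" >&2; return 1; }
    rm -rf "$base/index.old" || return 1
    # On a first run there may be no live index to keep as a backup.
    if [ -d "$base/index" ]; then
        mv "$base/index" "$base/index.old" || return 1
    fi
    mv "$base/index.new" "$base/index"
}
```

With the paths from this commit, a call might look like `rotate_index /home/gus /home/gus/.poetry/bin/poetry run build_index -d`. Explicit `|| return 1` checks are used instead of `set -e` because `set -e` is ignored inside a function that is called as part of an `if` or `||` expression.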
