TLGS - Totally Legit Gemini Search

Overview

TLGS is a search engine for Gemini. It's slightly overengineered for what it currently is and uses weird tech, and I'm proud of that. The current code base is kind of messy; I promise to clean it up. The main features/characteristics are as follows:

  • Uses state-of-the-art C++20
  • Parses and indexes textual content on Geminispace
  • Highly concurrent and asynchronous
  • Stores index on PostgreSQL
  • Developed for Linux, but should work on Windows, OpenBSD, HaikuOS, macOS, etc.
  • Only fetches headers for files it can't index, to save bandwidth and time
  • Handles all kinds of source encoding
  • Link analysis using the HITS algorithm
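
To give a feel for the link analysis step, here is a minimal sketch of the textbook HITS iteration. It is not TLGS's actual implementation: the `HitsScores` struct, the `hits` function, the edge-list representation, and the fixed iteration count are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical result type for illustration; not TLGS's real data structure.
struct HitsScores {
    std::vector<double> hub;
    std::vector<double> authority;
};

// Scale a vector to unit Euclidean length (no-op for an all-zero vector).
static void normalize(std::vector<double>& v) {
    double sum_sq = 0.0;
    for (double x : v) sum_sq += x * x;
    if (sum_sq == 0.0) return;
    const double norm = std::sqrt(sum_sq);
    for (double& x : v) x /= norm;
}

// edges holds (from, to) pairs of page indices in [0, n).
HitsScores hits(std::size_t n,
                const std::vector<std::pair<std::size_t, std::size_t>>& edges,
                int iterations = 50) {
    HitsScores s{std::vector<double>(n, 1.0), std::vector<double>(n, 1.0)};
    for (int it = 0; it < iterations; ++it) {
        // Authority update: sum the hub scores of every page linking in.
        std::vector<double> auth(n, 0.0);
        for (const auto& [from, to] : edges) auth[to] += s.hub[from];
        normalize(auth);
        // Hub update: sum the fresh authority scores of every page linked to.
        std::vector<double> hub(n, 0.0);
        for (const auto& [from, to] : edges) hub[from] += auth[to];
        normalize(hub);
        s.authority = std::move(auth);
        s.hub = std::move(hub);
    }
    return s;
}
```

Calling `hits(3, edges)` with the edges 0→2 and 1→2 gives page 2 all of the authority score, while pages 0 and 1 share the hub score equally. A real crawler would likely run this on a query-specific subgraph and stop on convergence rather than after a fixed number of rounds.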

As of now, indexing of news sites, RFCs, and documentation is mostly disabled, but it will likely be enabled once I have the means and resources to scale the setup.
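
The header-only fetching mentioned in the feature list works because a Gemini response starts with a single `<STATUS><SPACE><META>` line, and for 2x statuses the META field carries the MIME type. The sketch below shows one way a client could decide whether to read the body. The names `GeminiHeader`, `parse_header`, and `should_fetch_body` are hypothetical, and the "only `text/*` is indexable" rule is an assumption, not TLGS's actual policy.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper type for illustration; not TLGS's actual API.
struct GeminiHeader {
    int status;        // two-digit status code, e.g. 20
    std::string meta;  // for 2x responses this is the MIME type
};

// Parse "<STATUS><SPACE><META>"; a trailing CRLF, if present, is dropped.
std::optional<GeminiHeader> parse_header(std::string_view line) {
    if (line.size() >= 2 && line.substr(line.size() - 2) == "\r\n")
        line.remove_suffix(2);
    if (line.size() < 2) return std::nullopt;
    const char d1 = line[0], d2 = line[1];
    if (d1 < '0' || d1 > '9' || d2 < '0' || d2 > '9') return std::nullopt;
    if (line.size() > 2 && line[2] != ' ') return std::nullopt;
    const int status = (d1 - '0') * 10 + (d2 - '0');
    std::string meta = line.size() > 3 ? std::string(line.substr(3)) : std::string();
    return GeminiHeader{status, std::move(meta)};
}

// Assumption for this sketch: only successful text/* responses are indexable.
bool should_fetch_body(const GeminiHeader& h) {
    if (h.status / 10 != 2) return false;   // not a success status
    std::string_view mime = h.meta;
    mime = mime.substr(0, mime.find(';'));  // strip parameters such as charset
    return mime.substr(0, 5) == "text/";
}
```

With this, a crawler could read just the first line of a response, call `should_fetch_body`, and close the connection early for images, archives, and other content it cannot index.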

Using this project

Requirements

Building and running the project

To build the project, you'll need a fully C++20-capable compiler. The following compilers should work as of writing this README:

  • GCC >= 11.2
  • MSVC >= 16.25

Install all dependencies, then run the following commands to build:

mkdir build
cd build
cmake ..
make -j

Creating and maintaining the index

To create the initial index:

  1. Initialize the database: ./tlgs/tlgs_ctl/tlgs_ctl ../tlgs/config.json populate_schema
  2. Place the seed URLs into seeds.text
  3. In the build folder, run ./tlgs/crawler/tlgs_crawler -s seeds.text -c 4 ../tlgs/config.json

Now the crawler will start crawling Geminispace while also updating outdated indices (if any). To update an existing index, run:

./tlgs/crawler/tlgs_crawler -c 2 ../tlgs/config.json
# -c is the maximum concurrent connections the crawler will make

NOTE: TLGS's crawler is distributable. You can run multiple instances in parallel, but some instances may drop out early towards the end of crawling. This does not affect the result of the crawl.

Running the capsule

openssl req -new -subj "/CN=my.host.name.space" -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 -days 36500 -nodes -out cert.pem -keyout key.pem
cd tlgs/server
./tlgs_server ../../../tlgs/server_config.json

Via systemd

sudo systemctl start tlgs_server
sudo systemctl start tlgs_crawler

TODOs

  • Code cleanup
  • Randomize the order of crawling. Avoid bashing a single capsule
  • Support parsing markdown
  • Try indexing news sites
  • Optimize the crawler even more
  • Link analysis using SALSA
    • SALSA is implemented, but it is slower with the same rank quality
    • Maybe Gemini is not complicated enough for HITS to fail
  • BM25 for text scoring
  • Deduplicate search results
  • Implement filters
  • Proper(?) way to migrate schema
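
For the crawl-order item above, one possible approach is sketched below. Note that it swaps true randomization for a deterministic round-robin across hosts, which also keeps the crawler from hitting one capsule many times in a row. The helpers `host_of` and `interleave_by_host` are hypothetical and not part of TLGS.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Extract the host (including any :port) from a gemini:// URL.
// If the scheme is missing, everything up to the first '/', '?' or '#' is used.
std::string host_of(const std::string& url) {
    const std::string scheme = "gemini://";
    const std::size_t start =
        url.compare(0, scheme.size(), scheme) == 0 ? scheme.size() : 0;
    const std::size_t end = url.find_first_of("/?#", start);
    return url.substr(start, end == std::string::npos ? std::string::npos
                                                      : end - start);
}

// Reorder URLs so consecutive entries come from different hosts where possible.
std::vector<std::string> interleave_by_host(const std::vector<std::string>& urls) {
    // Group URLs per host, preserving their original order within each host.
    std::map<std::string, std::vector<std::string>> by_host;
    for (const auto& u : urls) by_host[host_of(u)].push_back(u);

    std::vector<std::string> out;
    out.reserve(urls.size());
    // Round k emits the k-th URL of every host that still has one left.
    for (std::size_t round = 0; out.size() < urls.size(); ++round)
        for (const auto& [host, list] : by_host)
            if (round < list.size()) out.push_back(list[round]);
    return out;
}
```

Shuffling each per-host list with `std::shuffle` before interleaving would add the randomness the TODO asks for while keeping the per-host spacing.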