The first open-source AI infrastructure focused on Togo
TogoLM is an open-source AI knowledge layer for Togo — a complete pipeline from raw web scraping to a fine-tuned LLM and public REST API that developers, startups, and institutions can build upon.
| Layer | Description |
|---|---|
| Corpus | 62 000+ structured Togolese documents — laws, government data, press, education |
| RAG Engine | Retrieval-Augmented Generation over the Togolese corpus |
| Public API | REST endpoints consumable by any developer or app |
| Fine-tuned LLM | Mistral 7B adapted to the Togolese context (training in progress) |
| Showcase | Next.js interface to explore and query the corpus |
Togolese public data is scattered across dozens of government portals and absent from the training sets of international LLMs. TogoLM provides a reusable, open infrastructure layer for Togo and francophone West Africa.
togolm/
├── corpus/
│ ├── scrapers/
│ │ └── spiders/ # Scrapy spiders — one per source
│ ├── processors/
│ │ ├── cleaner.py # HTML → clean text
│ │ ├── chunker.py # Split into 400-word chunks
│ │ ├── embedder.py # Local (MiniLM) or Gemini embeddings
│ │ └── ingestor.py # JSONL → PostgreSQL + pgvector
│ └── datasets/ # Scraped JSONL files (gitignored)
├── api/
│ ├── app/
│ │ ├── main.py # FastAPI entry point
│ │ ├── routers/ # corpus.py, documents.py, query.py
│ │ └── services/rag.py # Vector retrieval + Gemini generation
│ └── tests/ # 11 pytest integration tests
├── finetuning/
│ ├── dataset/
│ │ ├── generator.py # Q&A pair generation (Gemini)
│ │ └── formatter.py # Alpaca / ShareGPT format
│ ├── train/
│ │ ├── config.py # QLoRA hyperparameters
│ │ └── trainer.py # SFTTrainer fine-tuning script
│ └── notebooks/
│ └── train_colab.ipynb # Google Colab training notebook
├── showcase/ # Next.js 16 + Tailwind v4 frontend
│ ├── app/
│ │ ├── page.tsx # Homepage — stats + source table
│ │ ├── corpus/page.tsx # Browse corpus by source/category
│ │ ├── search/page.tsx # Full-text search UI
│ │ └── chat/page.tsx # Streaming RAG chat UI
│ └── lib/api.ts # Typed API client + SSE stream
├── scripts/
│ └── init.sql # PostgreSQL schema (documents + chunks)
├── .env.example
└── pyproject.toml
git clone https://github.com/togolm/togolm.git
cd togolm
cp .env.example .env
# Edit .env — set POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DBpsql -U $POSTGRES_USER -d $POSTGRES_DB -f scripts/init.sqluv sync# Run from the corpus/ directory (where scrapy.cfg lives)
cd corpus
uv run scrapy crawl service_public \
-o datasets/service_public.jsonl \
-s JOBDIR=.crawls/service_public \
--logfile .crawls/service_public.logAvailable spiders: journal_officiel, presidence, assemblee_nationale,
inseed, gouv_ministry, service_public, icilome, togofirst,
togoactualite, lomeinfos, republicoftogo, savoirnews, letogolais,
otr, wikipedia, cnss, legitogo, inam, international
# Uses the local sentence-transformers model (no API key needed)
uv run python -m corpus.processors.ingestor corpus/datasets/service_public.jsonl
# Multiple files at once
uv run python -m corpus.processors.ingestor corpus/datasets/*.jsonluv run uvicorn api.app.main:app --reload --port 8000# Query the corpus
curl -X POST http://localhost:8000/v1/query \
-H "Content-Type: application/json" \
-d '{"question": "Comment créer une entreprise au Togo ?"}'
# Stream the answer (SSE)
curl -N -X POST http://localhost:8000/v1/query/stream \
-H "Content-Type: application/json" \
-d '{"question": "Quel est le budget de l'\''État togolais ?"}'Full API reference → docs/api-reference.md
cd showcase
npm install
npm run dev # http://localhost:3000| Method | Path | Description |
|---|---|---|
GET |
/v1/stats |
Corpus statistics (doc/chunk counts per source) |
GET |
/v1/categories |
Available corpus categories |
GET |
/v1/documents |
Paginated document list with source/category filters |
GET |
/v1/documents/{id} |
Document detail + chunks |
GET |
/v1/search?q= |
French full-text search (ts_rank + ILIKE fallback) |
POST |
/v1/query |
RAG query → full JSON response |
POST |
/v1/query/stream |
RAG query → SSE stream (chunk by chunk) |
POST |
/v1/embed |
Generate a 384-dim embedding vector |
| Backend | Model | Requires |
|---|---|---|
| Local (default) | paraphrase-multilingual-MiniLM-L12-v2 |
Nothing (auto-downloaded ~120 MB) |
| Gemini | gemini-embedding-001 (384-dim) |
GEMINI_API_KEY in .env |
The embedder auto-selects based on the presence of a valid GEMINI_API_KEY.
The fine-tuning pipeline targets Mistral 7B Instruct v0.3 with QLoRA.
# 1. Generate Q&A pairs from the corpus (requires GEMINI_API_KEY)
uv run python -m finetuning.dataset.generator \
--out finetuning/datasets/qa_raw.jsonl \
--limit 500
# 2. Format to Alpaca / ShareGPT
uv run python -m finetuning.dataset.formatter \
--input finetuning/datasets/qa_raw.jsonl \
--output finetuning/datasets/train.jsonl \
--format alpaca
# 3. Fine-tune (Google Colab recommended)
# Open: finetuning/notebooks/train_colab.ipynb| Source | Category | Docs | Status |
|---|---|---|---|
| icilome.com | Press | 22 387 | ✅ Ingested |
| togoactualite.com | Press | 13 196 | ✅ Ingested |
| togofirst.com | Press / Economy | 8 863 | ✅ Ingested |
| lomeinfos.com | Press | 4 361 | ✅ Ingested |
| jo.gouv.tg | Legal | 4 066 | ✅ Ingested |
| fr.wikipedia.org | Encyclopedic | 1 839 | ✅ Ingested |
| letogolais.com | Press | 1 377 | ✅ Ingested |
| finances.gouv.tg | Economy | 1 027 | ✅ Ingested |
| savoirnews.net | Press | 1 019 | ✅ Ingested |
| commerce.gouv.tg | Economy | 580 | ✅ Ingested |
| environnement.gouv.tg | Agriculture | 509 | ✅ Ingested |
| agriculture.gouv.tg | Agriculture | 438 | ✅ Ingested |
| tourisme.gouv.tg | Economy | 371 | ✅ Ingested |
| otr.tg | Legal / Fiscal | 323 | ✅ Ingested |
| presidenceduconseil.gouv.tg | Politics | 303 | ✅ Ingested |
| education.gouv.tg | Education | 295 | ✅ Ingested |
| urbanisme.gouv.tg | Economy | 228 | ✅ Ingested |
| justice.gouv.tg | Legal | 214 | ✅ Ingested |
| cnss.tg | Administrative | 192 | ✅ Ingested |
| republicoftogo.com | Press | 152 | ✅ Ingested |
| service-public.gouv.tg | Administrative | 141 | ✅ Ingested |
| securite.gouv.tg | Politics | 98 | ✅ Ingested |
| inseed.tg | Economy / Statistics | 96 | ✅ Ingested |
| energie.gouv.tg | Economy | 54 | ✅ Ingested |
| presidence.gouv.tg | Politics | 17 | ✅ Ingested |
| legitogo.gouv.tg | Legal | 9 | ✅ Ingested |
| assemblee-nationale.tg | Legal | 8 | ✅ Ingested |
| inam.tg | Health | 3 | ✅ Ingested |
| europa.eu | International | 1 | ✅ Ingested |
| unicef.org | International | 1 | ✅ Ingested |
| sante.gouv.tg | Health | — | ⏭ React SPA — no sitemap |
| travail.gouv.tg | Legal | — | ⏭ No site deployed |
| infrastructure.gouv.tg | Economy | — | ⏭ SSL error |
| plan.gouv.tg | Economy | — | ⏭ Unreachable |
We welcome contributions — new corpus sources, scrapers, API improvements, translations.
Issue labels: corpus · api · finetuning · showcase · bug · enhancement
| Model | Type | Link |
|---|---|---|
togolm-7b-instruct-v1 |
Fine-tuned LLM (Mistral 7B QLoRA) | 🤗 togolm/togolm-7b-instruct-v1 |
togolm-corpus-v1 |
Training dataset (Q&A pairs) | 🤗 togolm/togolm-corpus-v1 |
from transformers import pipeline
pipe = pipeline("text-generation", model="togolm/togolm-7b-instruct-v1")
result = pipe(
"Comment créer une entreprise au Togo ?",
max_new_tokens=300,
temperature=0.7,
)
print(result[0]["generated_text"])| Component | License |
|---|---|
| Code | MIT |
| Corpus | CC BY 4.0 |
| Fine-tuned model | Apache 2.0 |
Project lead: Omar Farouk KOUGBADA · GDE Flutter · Director, KOF CORPORATION · Lomé, Togo