Indexing Pipeline¶
repobrain's 7-stage async pipeline processes a 1000-file repo in under 5 minutes (vs 25+ minutes for repowise).
Stages¶
Stage 1: Discovery
Walk the repo filesystem, detect languages, write file manifest to SQLite immediately.
Result: complete file list available for progress tracking.
Stage 2: Parse (ProcessPoolExecutor)
Tree-sitter parsing is CPU-bound. Run in a process pool to bypass the GIL.
Each worker returns a ParseResult with symbols and imports.
Stage 3: Graph Build ─┐
As ParseResults stream in, │ These two stages run
add nodes and edges to │ concurrently via
NetworkX graph. │ asyncio.gather()
Stage 4: Git Analysis ─┘
GitPython walks commit history up to max_commits.
TemporalMetricsCalculator applies exponential decay.
OwnershipAnalyzer computes temporal-weighted ownership.
CoChangeAnalyzer finds file pairs changed together.
Stage 5: Embedding (ThreadPoolExecutor + semaphore)
Batch embedding of file content via Anthropic embeddings API.
Semaphore limits concurrency to avoid rate limits.
Stage 6: RAG-Aware Doc Generation
For each file:
1. Fetch existing docs for all dependency files from LanceDB
2. Build prompt: file_content + dependency_docs + graph_centrality + hotspot_score
3. Call claude-sonnet-4-6 to generate documentation
4. Record token usage via TokenspyCostAdapter
Stage 7: Atomic Commit
For each file, wrap all writes in coordinator.transaction():
- SQL: upsert file record + git metrics
- LanceDB: upsert embedding + doc
- NetworkX: node already written in Stage 3
On any exception: SQL rollback + delete LanceDB records + remove graph nodes
Why This Is Faster¶
| Bottleneck | repowise | repobrain |
|---|---|---|
| Parsing | Sequential, single-threaded | ProcessPoolExecutor |
| Git + Graph | Sequential | Concurrent (asyncio.gather) |
| Embedding | Sequential | ThreadPoolExecutor + semaphore |
| Doc generation | Sequential | asyncio.Semaphore(N) |
Incremental Updates¶
On repobrain index --incremental:
1. Compute SHA-256 hash of each file
2. Compare against stored content_hash in SQLite
3. Only re-process files whose hash changed
4. Re-run stages 2–7 for changed files only
5. GitMetricsRepository.upsert() always refreshes global percentile ranks via PERCENT_RANK() window function