
Indexing Pipeline

repobrain's 7-stage async pipeline indexes a 1,000-file repo in under 5 minutes, versus 25+ minutes for repowise.

Stages

Stage 1: Discovery
  Walk the repo filesystem, detect languages, write file manifest to SQLite immediately.
  Result: complete file list available for progress tracking.
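
A minimal sketch of the discovery step, assuming a plain extension lookup for language detection and a single `files` table for the manifest. The `LANGUAGES` map, `SKIP_DIRS` set, table layout, and `discover` name are illustrative assumptions, not repobrain's actual schema or API.

```python
import os
import sqlite3
from pathlib import Path

# Hypothetical extension-to-language map; the real detector may be richer.
LANGUAGES = {".py": "python", ".ts": "typescript", ".go": "go", ".rs": "rust"}
# Hypothetical ignore list.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def discover(repo_root: str, conn: sqlite3.Connection) -> int:
    """Walk repo_root and write one manifest row per recognised source file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, language TEXT NOT NULL)"
    )
    rows = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            language = LANGUAGES.get(Path(name).suffix.lower())
            if language is not None:
                rel = os.path.relpath(os.path.join(dirpath, name), repo_root)
                rows.append((rel, language))
    # Commit right away so later stages can report progress against the full list.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO files (path, language) VALUES (?, ?)", rows
        )
    return len(rows)
```

Writing the manifest before any parsing starts means the total file count is known up front, which is what makes a progress indicator possible.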

Stage 2: Parse (ProcessPoolExecutor)
  Tree-sitter parsing is CPU-bound. Run in a process pool to bypass the GIL.
  Each worker returns a ParseResult with symbols and imports.
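
The fan-out could look roughly like the sketch below. It uses Python's standard-library `ast` module as a stand-in for Tree-sitter so the example stays dependency-free. The `ParseResult` field names and the `parse_source`, `parse_all`, and `parse_in_processes` helpers are assumptions for illustration only.

```python
import ast
from concurrent.futures import Executor, ProcessPoolExecutor
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParseResult:
    # Field names are assumptions; the text above only says "symbols and imports".
    path: str
    symbols: list[str] = field(default_factory=list)
    imports: list[str] = field(default_factory=list)


def parse_source(path: str, source: str) -> ParseResult:
    """CPU-bound worker. Stand-in parser: stdlib ast instead of Tree-sitter."""
    result = ParseResult(path=path)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            result.symbols.append(node.name)
        elif isinstance(node, ast.Import):
            result.imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            result.imports.append(node.module)
    return result


def parse_all(sources: dict[str, str], executor: Executor) -> list[ParseResult]:
    """Fan files out to the executor; results come back in input order."""
    paths = list(sources)
    return list(executor.map(parse_source, paths, [sources[p] for p in paths]))


def parse_in_processes(
    sources: dict[str, str], max_workers: Optional[int] = None
) -> list[ParseResult]:
    """Run the parse in worker processes so it is not serialised by the GIL."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return parse_all(sources, pool)
```

The worker is a module-level function and `ParseResult` is a plain dataclass, so both can be pickled across the process boundary. On platforms that start workers with `spawn`, the call to `parse_in_processes` needs to sit under an `if __name__ == "__main__":` guard.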

Stage 3: Graph Build  ─┐
  As ParseResults stream in,   │ These two stages run
  add nodes and edges to        │ concurrently via
  NetworkX graph.               │ asyncio.gather()

Stage 4: Git Analysis ─┘
  GitPython walks commit history up to max_commits.
  TemporalMetricsCalculator applies exponential decay.
  OwnershipAnalyzer computes temporal-weighted ownership.
  CoChangeAnalyzer finds file pairs changed together.
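
To make the three analyzers concrete, here is one way the arithmetic might work. The `Commit` record, the 90-day half-life, and the `min_count` threshold are all assumptions; the text above only states that decay is exponential, that ownership is temporally weighted, and that co-change looks at files changed together.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Commit:
    # Hypothetical minimal record; GitPython commit objects carry far more.
    author: str
    age_days: float
    files: tuple[str, ...]


def decay_weight(age_days: float, half_life_days: float = 90.0) -> float:
    """Exponential decay: a commit half_life_days old counts half as much."""
    return 0.5 ** (age_days / half_life_days)


def ownership(
    commits: list[Commit], half_life_days: float = 90.0
) -> dict[str, dict[str, float]]:
    """Per file, each author's share of the decayed commit weight (shares sum to 1)."""
    weights: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for commit in commits:
        w = decay_weight(commit.age_days, half_life_days)
        for path in commit.files:
            weights[path][commit.author] += w
    return {
        path: {author: w / sum(by_author.values()) for author, w in by_author.items()}
        for path, by_author in weights.items()
    }


def co_changes(commits: list[Commit], min_count: int = 2) -> dict[tuple[str, str], int]:
    """Count how often each pair of files appears in the same commit."""
    pairs: Counter = Counter()
    for commit in commits:
        # Sort so (a, b) and (b, a) are counted as the same pair.
        pairs.update(combinations(sorted(set(commit.files)), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}
```

With this weighting, one commit made today counts as much as two commits made a half-life ago, so recent contributors are favoured over historical ones.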

Stage 5: Embedding (ThreadPoolExecutor + semaphore)
  Batch embedding of file content via Anthropic embeddings API.
  Semaphore limits concurrency to avoid rate limits.
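
The thread pool plus semaphore pattern can be sketched as follows. The `embed_batch` callable is a placeholder for whatever client call actually produces the vectors, and the batch size and limits are illustrative defaults rather than repobrain's settings.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Vector = list[float]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],  # hypothetical API wrapper
    batch_size: int = 16,
    max_in_flight: int = 2,
    max_workers: int = 8,
) -> list[Vector]:
    """Embed texts in batches, never running more than max_in_flight requests at once."""
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    gate = threading.Semaphore(max_in_flight)

    def guarded(batch: Sequence[str]) -> list[Vector]:
        # Threads beyond the limit block here instead of hitting the API.
        with gate:
            return embed_batch(batch)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves batch order, so vectors line up with the input texts.
        return [vec for vecs in pool.map(guarded, batches) for vec in vecs]
```

Keeping the semaphore separate from the pool size lets the request limit be tuned to the provider's rate limit without resizing the pool.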

Stage 6: RAG-Aware Doc Generation
  For each file:
    1. Fetch existing docs for all dependency files from LanceDB
    2. Build prompt: file_content + dependency_docs + graph_centrality + hotspot_score
    3. Call claude-sonnet-4-6 to generate documentation
    4. Record token usage via TokenspyCostAdapter
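
Step 2 above could be sketched as a pure function that assembles the prompt text. The `FileContext` fields and the prompt wording are assumptions. The LanceDB lookup, the model call, and the token accounting are deliberately left out because their exact interfaces are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class FileContext:
    # All field names are assumptions that mirror the prompt inputs listed above.
    path: str
    content: str
    graph_centrality: float
    hotspot_score: float
    dependency_docs: dict[str, str] = field(default_factory=dict)


def build_prompt(ctx: FileContext) -> str:
    """Combine file content, dependency docs, and graph/git signals into one prompt."""
    if ctx.dependency_docs:
        # Sort by path so the same inputs always produce the same prompt.
        deps = "\n\n".join(
            f"[{dep_path}]\n{doc}"
            for dep_path, doc in sorted(ctx.dependency_docs.items())
        )
    else:
        deps = "(no dependency docs available yet)"
    return (
        f"Document the file {ctx.path}.\n\n"
        f"Graph centrality: {ctx.graph_centrality:.3f}\n"
        f"Hotspot score: {ctx.hotspot_score:.3f}\n\n"
        f"Docs for its dependencies:\n{deps}\n\n"
        f"File content:\n{ctx.content}\n"
    )
```

Because the prompt includes docs already generated for dependencies, processing files in dependency order (leaves first) would give each file the richest context. Whether repobrain orders files this way is not stated above.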

Stage 7: Atomic Commit
  For each file, wrap all writes in coordinator.transaction():
    - SQL: upsert file record + git metrics
    - LanceDB: upsert embedding + doc
    - NetworkX: node already written in Stage 3
  On any exception: SQL rollback + delete LanceDB records + remove graph nodes
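
Since LanceDB and NetworkX do not take part in the SQL transaction, the rollback has to be emulated with compensating actions. A rough sketch of that idea follows, with a dict and a set standing in for the vector store and the graph. The `Coordinator` class shown here is hypothetical and only mirrors the behaviour described above.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


class Coordinator:
    """Hypothetical stand-in: a dict plays LanceDB and a set plays the NetworkX graph."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.vectors: dict[str, list[float]] = {}  # stand-in for the LanceDB table
        self.graph_nodes: set[str] = set()         # stand-in for NetworkX nodes

    @contextmanager
    def transaction(self, path: str) -> Iterator["Coordinator"]:
        try:
            yield self
            self.conn.commit()
        except Exception:
            # SQL has real transactions, so a rollback undoes the pending writes.
            self.conn.rollback()
            # The other stores do not, so undo them with compensating deletes.
            self.vectors.pop(path, None)
            self.graph_nodes.discard(path)
            raise
```

Usage would look like `with coordinator.transaction(path) as tx: ...`, with all three writes inside the block. One caveat of this simple version: the compensating delete also removes a record that existed before the failed upsert, so the file has to be re-indexed afterwards.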

Why This Is Faster

Bottleneck      repowise                      repobrain
Parsing         Sequential, single-threaded   ProcessPoolExecutor
Git + Graph     Sequential                    Concurrent (asyncio.gather)
Embedding       Sequential                    ThreadPoolExecutor + semaphore
Doc generation  Sequential                    asyncio.Semaphore(N)
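
The two asyncio rows in the comparison can be illustrated with a short sketch. The stage callables and function names are placeholders, not repobrain's actual coroutines.

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def run_graph_and_git(
    graph_stage: Callable[[], Awaitable[Any]],
    git_stage: Callable[[], Awaitable[Any]],
) -> tuple[Any, Any]:
    """Run both stages concurrently; results come back in argument order."""
    graph, metrics = await asyncio.gather(graph_stage(), git_stage())
    return graph, metrics


async def generate_docs(
    paths: Sequence[str],
    generate_one: Callable[[str], Awaitable[str]],  # hypothetical per-file LLM call
    limit: int,
) -> list[str]:
    """Generate docs for every path with at most `limit` calls in flight."""
    gate = asyncio.Semaphore(limit)

    async def guarded(path: str) -> str:
        async with gate:
            return await generate_one(path)

    return await asyncio.gather(*(guarded(p) for p in paths))
```

Note that `asyncio.gather` only overlaps work at `await` points. A stage that makes blocking calls would have to be moved off the event loop, for example with `asyncio.to_thread`, to actually run alongside the other stage.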

Incremental Updates

On repobrain index --incremental:
  1. Compute the SHA-256 hash of each file
  2. Compare against the stored content_hash in SQLite
  3. Only re-process files whose hash changed
  4. Re-run stages 2–7 for changed files only
  5. GitMetricsRepository.upsert() always refreshes global percentile ranks via the PERCENT_RANK() window function
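
The change detection and the rank refresh could look roughly like this. Only the `content_hash` column name comes from the text above. The `files` and `git_metrics` table layouts, the `churn` metric, and the helper names are assumptions. Window functions require SQLite 3.25 or newer.

```python
import hashlib
import sqlite3


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def changed_files(conn: sqlite3.Connection, current: dict[str, bytes]) -> list[str]:
    """Return paths that are new or whose hash differs from the stored content_hash."""
    stored = dict(conn.execute("SELECT path, content_hash FROM files"))
    return sorted(
        path for path, data in current.items()
        if stored.get(path) != content_hash(data)
    )


def refresh_percentiles(conn: sqlite3.Connection) -> None:
    """Recompute every row's percentile: one file's change can shift all ranks."""
    ranks = conn.execute(
        "SELECT PERCENT_RANK() OVER (ORDER BY churn), path FROM git_metrics"
    ).fetchall()
    with conn:
        conn.executemany(
            "UPDATE git_metrics SET churn_percentile = ? WHERE path = ?", ranks
        )
```

Percentile ranks are relative to the whole repo, which is why they are refreshed on every upsert even when only a few files were re-processed. This sketch does not handle deleted files, which would need a separate pass over stored paths missing from `current`.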