Skip to main content

Knowledge ingest pipeline

What this page is

The bulk RAG ingest path engineers use to load runbooks, advisories, and prior post-incident reports into the agent memory plane. Covers the chunker, the deduplication contract, the embedding flow, and the retrieval-quality probe operators run before promoting a corpus.

Why it exists this way

Two existing memory stores cover narrow corpora: EpisodicMemoryStore writes one row per closed case; ThreatIntelMemory writes one row per IOC. Loading larger documents (runbooks, vendor advisories) previously needed bespoke scripts per source. aurorasoc.memory.ingest gives that work a single async entry point with measurable retrieval quality.

How it works

The module is at packages/backend/aurorasoc/memory/ingest.py.

KnowledgeIngester accepts a sequence of KnowledgeDocument values and:

  1. Chunks each document into overlapping windows that prefer sentence boundaries. Chunk size and overlap are constructor knobs so corpora with different shapes can tune them.
  2. Produces stable, content-addressed chunk identifiers ({source_id}#{index}#{hash[:12]}) so re-ingesting the same document is idempotent.
  3. Embeds chunks in batches via the shared TextEmbedder, letting one corpus pass amortise the model load.
  4. Upserts into the existing vector_embeddings table under a caller-named collection. Retrieval at query time uses the same SQL EpisodicMemoryStore already exercises.

The pipeline is fail-soft. An embedding batch that fails because the model is unreachable does not abort the corpus; the failure count surfaces in the returned IngestReport so the CLI exits non-zero on the next CI run and operators see the regression.

measure_recall_at_k is the quality probe. Given a labelled probe set (queries plus the relevant document IDs), it computes recall@k and mean-reciprocal-rank using the same embed-then-cosine-search code path as production retrieval. The probe's relevant_source_ids matches against the chunk identifier's document prefix so a hit on any chunk of a relevant document counts; this keeps probe authoring simple (one ID per document, not one per chunk).

A CLI wrapper at tools/scripts/rag/ingest_corpus.py walks a directory of .md/.txt files and ingests them under a named collection. It exits non-zero on any embedding failure so batch jobs can rely on it as a CI gate.

What goes wrong

  • TextEmbedder.is_degraded returns True, the fallback SHA-256 embedder ran instead of sentence-transformers. Retrieval quality drops sharply; the warning is logged and the corpus is usable but should be reingested once the embedding model is reachable.
  • Recall@5 below threshold on a probe run, investigate the chunker (overly aggressive splitting destroys context), then the embedder (model swap regressed), then the probe set itself (queries too synthetic to represent real analyst intent).
  • A re-ingest writes nothing because every chunk hashes to the same identifier the upsert finds, this is the idempotency contract working as designed; the report's chunks_skipped_dedup reflects it.