Knowledge ingest pipeline
What this page is
The bulk RAG ingest path engineers use to load runbooks, advisories, and prior post-incident reports into the agent memory plane. Covers the chunker, the deduplication contract, the embedding flow, and the retrieval-quality probe operators run before promoting a corpus.
Why it exists this way
Two existing memory stores cover narrow corpora:
EpisodicMemoryStore writes one row per closed case;
ThreatIntelMemory writes one row per IOC. Loading larger
documents (runbooks, vendor advisories) previously needed
bespoke scripts per source. aurorasoc.memory.ingest gives
that work a single async entry point with measurable retrieval
quality.
How it works
The module is at packages/backend/aurorasoc/memory/ingest.py.
KnowledgeIngester accepts a sequence of
KnowledgeDocument values and:
- Chunks each document into overlapping windows that prefer sentence boundaries. Chunk size and overlap are constructor knobs so corpora with different shapes can tune them.
- Produces stable, content-addressed chunk identifiers
(
{source_id}#{index}#{hash[:12]}) so re-ingesting the same document is idempotent. - Embeds chunks in batches via the shared
TextEmbedder, letting one corpus pass amortise the model load. - Upserts into the existing
vector_embeddingstable under a caller-namedcollection. Retrieval at query time uses the same SQLEpisodicMemoryStorealready exercises.
The pipeline is fail-soft. An embedding batch that fails
because the model is unreachable does not abort the corpus;
the failure count surfaces in the returned IngestReport so
the CLI exits non-zero on the next CI run and operators see
the regression.
measure_recall_at_k is the quality probe. Given a labelled
probe set (queries plus the relevant document IDs), it
computes recall@k and mean-reciprocal-rank using the same
embed-then-cosine-search code path as production retrieval.
The probe's relevant_source_ids matches against the chunk
identifier's document prefix so a hit on any chunk of a
relevant document counts; this keeps probe authoring simple
(one ID per document, not one per chunk).
A CLI wrapper at
tools/scripts/rag/ingest_corpus.py
walks a directory of .md/.txt files and ingests them
under a named collection. It exits non-zero on any embedding
failure so batch jobs can rely on it as a CI gate.
What goes wrong
TextEmbedder.is_degradedreturns True, the fallback SHA-256 embedder ran instead of sentence-transformers. Retrieval quality drops sharply; the warning is logged and the corpus is usable but should be reingested once the embedding model is reachable.- Recall@5 below threshold on a probe run, investigate the chunker (overly aggressive splitting destroys context), then the embedder (model swap regressed), then the probe set itself (queries too synthetic to represent real analyst intent).
- A re-ingest writes nothing because every chunk hashes to
the same identifier the upsert finds, this is the
idempotency contract working as designed; the report's
chunks_skipped_dedupreflects it.