Automatic Extraction

Knowledge can scan your existing sources and extract implicit rules, decisions, and constraints automatically. Nothing is published without human review.


Why Automatic Extraction?

Most teams already have rules; they're just not structured. They live in READMEs, architecture decisions, runbooks, code comments, Slack threads, and wiki pages.

Manually entering each rule into a registry takes time most teams don't have. Automatic extraction solves the cold-start problem: point it at your sources and get a populated registry in minutes.

As your documentation evolves, you can re-extract regularly to surface new implicit rules, so your registry stays current without manual data entry.


How It Works

1. Point at your sources

> "Extract rules from ./docs and ./CLAUDE.md for the Engineering scope"

Target specific directories and file types:

> "Extract rules from ./docs (*.md), ./src (README.md), and ./CLAUDE.md for Engineering"

2. Each chunk is analyzed

Files are split into contextual chunks (~1500 characters with 10% overlap). Each chunk is analyzed with a structured extraction prompt that identifies:

  • Invariant candidates: absolute constraints ("All API endpoints must require authentication")
  • Rule candidates: active directives, mandatory or advisory ("Use conventional commits")
  • Decision candidates: historical choices with context ("We chose PostgreSQL for transactional data")

Each extraction includes:

  • Confidence score (0.0 - 1.0, minimum 0.6 to be kept)
  • Source excerpt that motivated the extraction
  • Suggested tags and namespace
  • An explanation of why this was identified
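The chunking step described above can be sketched as follows. This is a minimal illustration, not the product's actual splitter: the function name `chunk_text`, the blank-line paragraph boundary, and the way the overlap is taken from the tail of the previous chunk are all assumptions. Only the ~1500 character target and the 10% overlap come from the description above.

```python
def chunk_text(text: str, target_size: int = 1500, overlap_ratio: float = 0.10) -> list[str]:
    """Split text on blank lines into chunks of roughly target_size characters.

    Hypothetical sketch: each chunk after the first starts with the last
    `target_size * overlap_ratio` characters of the previous chunk, so that
    context carries across chunk boundaries.
    """
    overlap = int(target_size * overlap_ratio)  # 150 characters with the defaults
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > target_size:
            # Adding this paragraph would exceed the target: close the chunk
            # and seed the next one with the tail of the chunk just closed.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

In this sketch a single paragraph longer than the target size is kept whole rather than cut mid-sentence, so chunk sizes are approximate, which matches the "~1500 characters" wording.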

3. Deduplication filters noise

Every extraction is compared against existing entries in the target scope using semantic similarity:

| Similarity | Result |
|---|---|
| >= 0.92 | Exact duplicate - automatically filtered out and logged in the extraction report |
| >= 0.80 | Similar - draft created with REPLACES relation pointing to the existing entry |
| < 0.80 | New - draft created without relations |
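The threshold logic can be expressed as a small decision function. The sketch below assumes the similarity score against the closest existing entry has already been computed; the names `DedupOutcome` and `classify` and the action labels are hypothetical, and only the 0.92 and 0.80 thresholds come from the table above.

```python
from dataclasses import dataclass
from typing import Optional

EXACT_THRESHOLD = 0.92    # at or above: exact duplicate
SIMILAR_THRESHOLD = 0.80  # at or above (and below exact): similar


@dataclass
class DedupOutcome:
    action: str                        # "filter", "draft_replaces", or "draft_new"
    replaces_id: Optional[str] = None  # set only for "draft_replaces"


def classify(best_similarity: float, existing_id: Optional[str]) -> DedupOutcome:
    """Map the best similarity score to one of the three outcomes.

    `existing_id` is the closest existing entry in the target scope,
    or None when the scope is empty.
    """
    if existing_id is not None and best_similarity >= EXACT_THRESHOLD:
        return DedupOutcome("filter")  # dropped and logged in the report
    if existing_id is not None and best_similarity >= SIMILAR_THRESHOLD:
        return DedupOutcome("draft_replaces", replaces_id=existing_id)
    return DedupOutcome("draft_new")
```

Checking the exact threshold first matters: a score of 0.95 also satisfies `>= 0.80`, so the order of the comparisons decides which outcome wins.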

4. Human review

Nothing is published without validation. Every extraction becomes a draft visible in the dashboard, showing:

  • The proposed entry with its detected type (Invariant, Rule, or Decision) and confidence score
  • The source excerpt that motivated the extraction
  • Detected relations to existing entries - duplicates, replacements, tensions
  • An explanation of why this was identified

Three actions:

  • Approve: publishes the draft as a real Knowledge entry with source: auto_extracted attribution
  • Reject: discards the draft
  • Edit: modify the content, tags, or namespace before approving

Source Types

Git Repositories

Ask your AI agent to scan a repository. It reads local files, chunks them, and sends them to Knowledge for analysis. Implicit rules and constraints surface, including ones that appear only in code or configuration and were never written down as documentation.

> "Scan /path/to/repo for .ts, .py, .yaml, and .md files in the Engineering scope"

Specific Documents

> "Extract rules from runbook.md and architecture.md for Engineering"

Ingestion API

For sources that don't live on disk, push documents directly via the API:

curl -X POST https://api.asplenz.com/knowledge/v1/extract/stream \
  -H "Authorization: Bearer kn_..." \
  -H "Content-Type: application/json" \
  -d '{
    "scope_id": "scp-...",
    "documents": [
      {
        "content": "All deployments must go through staging first.",
        "metadata": {"author": "ops-team", "source": "runbook-v3"}
      }
    ],
    "auto_run": true
  }'
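The same request can be built from Python with only the standard library. This is a sketch that mirrors the curl call above: the helper name `build_extract_request` is hypothetical, the key and scope values are placeholders, and the response format is not documented here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

API_URL = "https://api.asplenz.com/knowledge/v1/extract/stream"


def build_extract_request(api_key: str, scope_id: str, documents: list[dict],
                          auto_run: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the curl example."""
    payload = {"scope_id": scope_id, "documents": documents, "auto_run": auto_run}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (placeholders, not real credentials):
# req = build_extract_request(
#     api_key="kn_...",
#     scope_id="scp-...",
#     documents=[{
#         "content": "All deployments must go through staging first.",
#         "metadata": {"author": "ops-team", "source": "runbook-v3"},
#     }],
# )
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```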

Additional Connectors

Slack, Teams, Notion, Confluence, and Excel connectors are available on Team and Scale plans.

See integrations →

AI Configuration

Extraction requires AI access. Two options:

| Option | Description |
|---|---|
| Asplenz-managed | No configuration needed. AI usage billed at cost on your invoice. |
| Your own API key | Bring your own key. You control your provider contract and data residency. |

Organizations with strict data residency or Zero Data Retention requirements should use their own API key.


Permissions

| Action | Required permission | Minimum role |
|---|---|---|
| Launch extraction | extract_run | senior-dev |
| View runs and drafts | extract_read | developer |
| Approve / reject / edit drafts | extract_review | tech-lead |
| Push via Ingestion API | extract_stream | admin |
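To make "minimum role" concrete, the sketch below checks a role against a permission. It assumes the roles form a strict hierarchy in the order developer, senior-dev, tech-lead, admin. That ordering is inferred from the table, not stated by it, and the names `ROLE_ORDER`, `MIN_ROLE`, and `can` are hypothetical.

```python
# Assumed ordering, lowest to highest privilege (inferred, not documented).
ROLE_ORDER = ["developer", "senior-dev", "tech-lead", "admin"]

# Minimum role per permission, taken from the permissions table.
MIN_ROLE = {
    "extract_read": "developer",
    "extract_run": "senior-dev",
    "extract_review": "tech-lead",
    "extract_stream": "admin",
}


def can(role: str, permission: str) -> bool:
    """Return True if `role` is at or above the minimum role for `permission`."""
    return ROLE_ORDER.index(role) >= ROLE_ORDER.index(MIN_ROLE[permission])
```

Under this assumption a tech-lead can launch extractions and review drafts, but cannot push documents through the Ingestion API.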

Configuration

| Parameter | Default | Description |
|---|---|---|
| Model | Configurable | AI model used for extraction |
| Temperature | 0.1 | Low for factual extraction |
| Min confidence | 0.6 | Below this, extractions are discarded |
| Max extractions per chunk | 5 | Limits noise |
| Max drafts per run | 200 | Caps total output |
| Chunk size | ~1500 chars | Paragraph-based splitting with 10% overlap |
| Dedup exact threshold | 0.92 | Similarity at or above this = duplicate |
| Dedup similar threshold | 0.80 | Similarity at or above this (and below the exact threshold) = REPLACES relation |
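The way these limits interact can be sketched with a small settings object. The defaults below are the ones listed in the table. The class and field names are hypothetical, and the selection order (highest confidence first within a chunk, then truncation at the run cap) is an assumption about how the caps might be applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    temperature: float = 0.1
    min_confidence: float = 0.6
    max_extractions_per_chunk: int = 5
    max_drafts_per_run: int = 200
    chunk_size: int = 1500
    chunk_overlap_ratio: float = 0.10
    dedup_exact_threshold: float = 0.92
    dedup_similar_threshold: float = 0.80


def select_candidates(per_chunk_scores: list[list[float]],
                      cfg: ExtractionConfig = ExtractionConfig()) -> list[float]:
    """Apply the confidence floor, the per-chunk cap, then the per-run cap."""
    kept: list[float] = []
    for scores in per_chunk_scores:
        # Discard anything below the floor, keep the most confident first.
        passing = sorted((s for s in scores if s >= cfg.min_confidence), reverse=True)
        kept.extend(passing[: cfg.max_extractions_per_chunk])
    return kept[: cfg.max_drafts_per_run]
```

For example, two chunks with confidence scores `[0.9, 0.5, 0.7]` and `[0.61, 0.59]` would yield three candidates under the defaults, since 0.5 and 0.59 fall below the 0.6 floor.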

Best Practices

Start broad, then refine. Run extraction on your entire docs/ directory first. Review the results, then narrow your patterns to the most productive sources.

Re-extract regularly. Run extraction quarterly, after a major rewrite, or whenever new docs appear. Deduplication filters out extractions that match existing entries (similarity >= 0.92), so repeat runs won't pollute your registry with duplicates.

Review in batches. Review all pending drafts for a run in a single session: reject the noise, approve the good ones.

Use tags consistently. The extraction suggests tags, but review them for consistency. A clean tagging system makes the registry more searchable.


Learn More