Automatic Extraction

Knowledge can scan your existing sources and extract implicit rules, decisions, and constraints automatically. Nothing is published without human review.

Why Automatic Extraction?

Most teams already have rules - they're just not structured. They live in READMEs, architecture decisions, runbooks, code comments, Slack threads, and wiki pages.

Manually entering each rule into a registry takes time most teams don't have. Automatic extraction solves the cold-start problem: point it at your sources and get a set of draft entries ready for review in minutes.

As your documentation evolves, re-extract regularly to surface new implicit rules, so the registry stays current without manual data entry.

How It Works

1. Point at your sources

> "Extract rules from ./docs and ./CLAUDE.md for the Engineering scope"

Target specific directories and file types:

> "Extract rules from ./docs (*.md), ./src (README.md), and ./CLAUDE.md for Engineering"

2. Each chunk is analyzed

Files are split into contextual chunks (~1500 characters with 10% overlap). Each chunk is analyzed with a structured extraction prompt that identifies:

Invariant candidates: absolute constraints ("All API endpoints must require authentication")
Rule candidates: active directives, mandatory or advisory ("Use conventional commits")
Decision candidates: historical choices with context ("We chose PostgreSQL for transactional data")

Each extraction includes:

Confidence score (0.0 - 1.0, minimum 0.6 to be kept)
Source excerpt that motivated the extraction
Suggested tags and namespace
An explanation of why this was identified
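
The chunking step can be sketched as follows. This is a minimal illustration, not the actual Knowledge implementation: the function name `chunk_text`, the blank-line paragraph separator, and the way the overlap is carried forward are all assumptions. Only the ~1500 character target and the 10% overlap come from the description above.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap_ratio: float = 0.10) -> list[str]:
    """Split text into paragraph-based chunks of roughly chunk_size characters.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    A single paragraph longer than chunk_size is kept whole, and the carried
    overlap is added on top, so chunks can exceed the target size.
    """
    overlap = int(chunk_size * overlap_ratio)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Close the current chunk and start a new one.
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap context.
            tail = current[-overlap:] if overlap else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a rule stated at the end of one chunk still has some of its surrounding context when the next chunk is analyzed.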

3. Deduplication filters noise

Every extraction is compared against existing entries in the target scope using semantic similarity:

Similarity	Result
>= 0.92	Exact duplicate - automatically filtered out and logged in the extraction report
>= 0.80 and < 0.92	Similar - draft created with a REPLACES relation pointing to the existing entry
< 0.80	New - draft created without relations
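
The threshold logic in this table could look like the sketch below. The thresholds are the documented ones; everything else is an assumption for illustration. In particular, the text above says only "semantic similarity", so the use of cosine similarity over embedding vectors, the comparison against the single closest entry, and the names `classify` and `dedupe` are hypothetical.

```python
import math

EXACT_THRESHOLD = 0.92    # documented: at or above this, exact duplicate
SIMILAR_THRESHOLD = 0.80  # documented: at or above this, REPLACES relation


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors (assumed metric, not confirmed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(similarity: float) -> str:
    """Map a similarity score to the outcome listed in the table."""
    if similarity >= EXACT_THRESHOLD:
        return "duplicate"  # filtered out and logged in the extraction report
    if similarity >= SIMILAR_THRESHOLD:
        return "similar"    # draft created with a REPLACES relation
    return "new"            # draft created without relations


def dedupe(candidate: list[float], existing: dict[str, list[float]]):
    """Return (outcome, id of the closest existing entry or None)."""
    best_id, best_score = None, 0.0
    for entry_id, vector in existing.items():
        score = cosine_similarity(candidate, vector)
        if score > best_score:
            best_id, best_score = entry_id, score
    outcome = classify(best_score)
    return outcome, (best_id if outcome != "new" else None)
```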

4. Human review

Nothing is published without validation. Every extraction becomes a draft visible in the dashboard, showing:

The proposed entry with its detected type (Invariant, Rule, or Decision) and confidence score
The source excerpt that motivated the extraction
Detected relations to existing entries - duplicates, replacements, tensions
An explanation of why this was identified

Three actions:

Approve: publishes the draft as a real Knowledge entry with source: auto_extracted attribution
Reject: discards the draft
Edit: modify the content, tags, or namespace before approving
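
As a rough mental model, the three review actions can be sketched as operations on a draft record. The `Draft` class, its field names, and the shape of the published entry are hypothetical. Only the editable fields (content, tags, namespace) and the `source: auto_extracted` attribution come from the list above.

```python
from dataclasses import dataclass, replace

EDITABLE_FIELDS = {"content", "tags", "namespace"}


@dataclass(frozen=True)
class Draft:
    """Hypothetical draft record; not the actual Knowledge data model."""
    content: str
    tags: tuple[str, ...] = ()
    namespace: str = ""
    status: str = "pending"


def edit(draft: Draft, **changes) -> Draft:
    """Return a modified copy; only content, tags, and namespace may change."""
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"not editable: {sorted(unknown)}")
    return replace(draft, **changes)


def approve(draft: Draft) -> dict:
    """Publish the draft as an entry attributed to automatic extraction."""
    return {
        "content": draft.content,
        "tags": list(draft.tags),
        "namespace": draft.namespace,
        "source": "auto_extracted",
    }


def reject(draft: Draft) -> Draft:
    """Discard the draft (modeled here as a status change)."""
    return replace(draft, status="rejected")
```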

Source Types

Git Repositories

Ask your AI agent to scan a repository. It reads local files, chunks them, and sends them to Knowledge for analysis. Implicit rules and constraints surface, including ones that were never written down as explicit rules.

> "Scan /path/to/repo for .ts, .py, .yaml, and .md files in the Engineering scope"

Specific Documents

> "Extract rules from runbook.md and architecture.md for Engineering"

Ingestion API

For sources that don't live on disk, push documents directly via the API:

curl -X POST https://api.asplenz.com/knowledge/v1/extract/stream \
  -H "Authorization: Bearer kn_..." \
  -H "Content-Type: application/json" \
  -d '{
    "scope_id": "scp-...",
    "documents": [
      {
        "content": "All deployments must go through staging first.",
        "metadata": {"author": "ops-team", "source": "runbook-v3"}
      }
    ],
    "auto_run": true
  }'
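
The same request can be built from Python using only the standard library. This is a sketch that mirrors the curl call above: the URL, headers, and payload fields are taken from it, while the function names are hypothetical. The response format is not described here, so `send` returns the raw body without parsing it.

```python
import json
import urllib.request

EXTRACT_STREAM_URL = "https://api.asplenz.com/knowledge/v1/extract/stream"


def build_extract_request(api_key: str, scope_id: str, documents: list[dict],
                          auto_run: bool = True) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"scope_id": scope_id, "documents": documents, "auto_run": auto_run}
    return urllib.request.Request(
        EXTRACT_STREAM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything leaves your network.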

Additional Connectors

Slack, Teams, Notion, Confluence, and Excel connectors are available on Team and Scale plans.

See integrations →

AI Configuration

Extraction requires AI access. Two options:

Option	Description
Asplenz-managed	No configuration needed. AI usage billed at cost on your invoice.
Your own API key	Bring your own key. You control your provider contract and data residency.

Organizations with strict data residency or Zero Data Retention requirements should use their own API key.

Permissions

Action	Required permission	Minimum role
Launch extraction	extract_run	senior-dev
View runs and drafts	extract_read	developer
Approve / reject / edit drafts	extract_review	tech-lead
Push via Ingestion API	extract_stream	admin
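
Read as a lookup, the table above amounts to a minimum-role check. The sketch below assumes that roles are cumulative and ordered developer < senior-dev < tech-lead < admin. That ordering is inferred from the "Minimum role" column rather than stated, and the `is_allowed` helper is hypothetical.

```python
# Assumed ordering, lowest to highest; inferred from the table, not documented.
ROLE_ORDER = ["developer", "senior-dev", "tech-lead", "admin"]

# Permission -> minimum role, copied from the table above.
MINIMUM_ROLE = {
    "extract_run": "senior-dev",
    "extract_read": "developer",
    "extract_review": "tech-lead",
    "extract_stream": "admin",
}


def is_allowed(role: str, permission: str) -> bool:
    """True if the role ranks at or above the permission's minimum role."""
    return ROLE_ORDER.index(role) >= ROLE_ORDER.index(MINIMUM_ROLE[permission])
```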

Configuration

Parameter	Default	Description
Model	Configurable	AI model used for extraction
Temperature	0.1	Low for factual extraction
Min confidence	0.6	Below this, extractions are discarded
Max extractions per chunk	5	Limits noise
Max drafts per run	200	Caps total output
Chunk size	~1500 chars	Paragraph-based splitting with 10% overlap
Dedup exact threshold	0.92	Similarity at or above this = duplicate
Dedup similar threshold	0.80	Similarity at or above this (and below the exact threshold) = REPLACES relation
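
To show how the limits interact, here is a sketch that applies the confidence floor, the per-chunk limit, and the per-run cap in sequence. The defaults are the documented ones. The `ExtractionConfig` class, its field names, and the choice to keep the highest-confidence extractions when a limit is hit are assumptions; the table does not say how Knowledge decides which extractions to drop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionConfig:
    """Documented defaults under hypothetical field names."""
    temperature: float = 0.1
    min_confidence: float = 0.6
    max_extractions_per_chunk: int = 5
    max_drafts_per_run: int = 200
    chunk_size: int = 1500
    chunk_overlap: float = 0.10
    dedup_exact_threshold: float = 0.92
    dedup_similar_threshold: float = 0.80


def select_drafts(chunks: list[list[dict]],
                  config: ExtractionConfig = ExtractionConfig()) -> list[dict]:
    """Filter per-chunk extractions down to the drafts kept for a run."""
    selected: list[dict] = []
    for extractions in chunks:
        # Discard extractions below the confidence floor.
        kept = [e for e in extractions if e["confidence"] >= config.min_confidence]
        # Assumption: prefer the most confident extractions within a chunk.
        kept.sort(key=lambda e: e["confidence"], reverse=True)
        for extraction in kept[: config.max_extractions_per_chunk]:
            if len(selected) >= config.max_drafts_per_run:
                return selected
            selected.append(extraction)
    return selected
```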

Best Practices

Start broad, then refine. Run extraction on your entire docs/ directory first. Review the results, then narrow your patterns to the most productive sources.

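Re-extract regularly. Run extraction quarterly, after a major rewrite, or whenever new docs appear. Deduplication filters out exact duplicates (similarity >= 0.92) and links near-duplicates to the existing entry with a REPLACES relation, so repeated runs do not flood the registry with copies.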

Review in batches. Review all pending drafts for a run in a single session - reject the noise, approve the good ones.

Use tags consistently. The extraction suggests tags, but review them for consistency. A clean tagging system makes the registry more searchable.