Automatic Extraction
Knowledge can scan your existing sources and extract implicit rules, decisions, and constraints automatically. Nothing is published without human review.
Why Automatic Extraction?
Most teams already have rules - they're just not structured. They live in READMEs, architecture decisions, runbooks, code comments, Slack threads, and wiki pages.
Manually entering each rule into a registry takes time most teams don't have. Automatic extraction solves the cold-start problem: point it at your sources and get a populated registry in minutes.
As your documentation evolves, you can re-run extraction to surface new implicit rules. The registry stays current without manual data entry; new drafts still go through human review.
How It Works
1. Point at your sources
> "Extract rules from ./docs and ./CLAUDE.md for the Engineering scope"
Target specific directories and file types:
> "Extract rules from ./docs (*.md), ./src (README.md), and ./CLAUDE.md for Engineering"
2. Each chunk is analyzed
Files are split into contextual chunks (~1500 characters with 10% overlap). Each chunk is analyzed with a structured extraction prompt that identifies:
- Invariant candidates: absolute constraints ("All API endpoints must require authentication")
- Rule candidates: active directives, mandatory or advisory ("Use conventional commits")
- Decision candidates: historical choices with context ("We chose PostgreSQL for transactional data")
Each extraction includes:
- Confidence score (0.0 - 1.0, minimum 0.6 to be kept)
- Source excerpt that motivated the extraction
- Suggested tags and namespace
- An explanation of why this was identified
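The fields and the confidence cutoff above can be sketched as a small data structure. This is an illustrative model only: the class name, field names, and the `keep_candidates` helper are hypothetical and not part of the Knowledge API. The cap of 5 per chunk is the documented default, but keeping the highest-confidence candidates first is an assumption.

```python
from dataclasses import dataclass, field

MIN_CONFIDENCE = 0.6  # documented default: below this, extractions are discarded
MAX_PER_CHUNK = 5     # documented default: max extractions per chunk


@dataclass
class Candidate:
    """Hypothetical shape of one extraction; field names are assumptions."""
    kind: str            # "invariant", "rule", or "decision"
    statement: str
    confidence: float    # 0.0 - 1.0
    source_excerpt: str
    explanation: str
    namespace: str = ""
    tags: list = field(default_factory=list)


def keep_candidates(candidates, min_confidence=MIN_CONFIDENCE,
                    max_per_chunk=MAX_PER_CHUNK):
    """Drop low-confidence candidates, then keep the strongest ones.

    Ranking by confidence before applying the cap is an assumption.
    """
    kept = [c for c in candidates if c.confidence >= min_confidence]
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return kept[:max_per_chunk]
```

A candidate scored exactly 0.6 is kept; one scored 0.59 is dropped.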
3. Deduplication filters noise
Every extraction is compared against existing entries in the target scope using semantic similarity:
| Similarity | Result |
|---|---|
| >= 0.92 | Exact duplicate - automatically filtered out and logged in the extraction report |
| 0.80 to < 0.92 | Similar - draft created with REPLACES relation pointing to the existing entry |
| < 0.80 | New - draft created without relations |
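The thresholds in the table translate into a simple decision function. The sketch below assumes the similarity scores have already been computed; how Knowledge computes them is not specified here. The function names are hypothetical, and comparing against the single best-matching existing entry is an assumption.

```python
EXACT_THRESHOLD = 0.92    # documented: at or above this, exact duplicate
SIMILAR_THRESHOLD = 0.80  # documented: at or above this, REPLACES relation


def classify(similarity):
    """Map one similarity score to the outcome described in the table."""
    if similarity >= EXACT_THRESHOLD:
        return "duplicate"  # filtered out and logged in the report
    if similarity >= SIMILAR_THRESHOLD:
        return "similar"    # draft with a REPLACES relation
    return "new"            # draft without relations


def dedup_decision(similarities):
    """Decide the outcome from scores against existing entries.

    `similarities` maps an existing entry id to its score. Using only the
    best match is an assumption made for this sketch.
    """
    if not similarities:
        return ("new", None)
    best_id = max(similarities, key=similarities.get)
    outcome = classify(similarities[best_id])
    return (outcome, best_id if outcome != "new" else None)
```

For example, `dedup_decision({"e1": 0.5, "e2": 0.81})` yields `("similar", "e2")`, so the new draft would point at `e2`.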
4. Human review
Nothing is published without validation. Every extraction becomes a draft visible in the dashboard, showing:
- The proposed entry with its detected type (Invariant, Rule, or Decision) and confidence score
- The source excerpt that motivated the extraction
- Detected relations to existing entries - duplicates, replacements, tensions
- An explanation of why this was identified
Three actions:
- Approve: publishes the draft as a real Knowledge entry with source: auto_extracted attribution
- Reject: discards the draft
- Edit: modify the content, tags, or namespace before approving
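One way to picture the three review actions is as transitions on a draft. This is a sketch under assumptions, not the dashboard's implementation: the `Draft` class, the status names, and the rule that only pending drafts can change are all hypothetical. Only the `auto_extracted` source value and the editable fields come from the description above.

```python
from dataclasses import dataclass, replace

EDITABLE_FIELDS = {"content", "tags", "namespace"}  # per the Edit action


@dataclass(frozen=True)
class Draft:
    content: str
    namespace: str
    tags: tuple = ()
    status: str = "pending"  # assumed states: pending, published, rejected
    source: str = ""


def _require_pending(draft):
    if draft.status != "pending":
        raise ValueError(f"draft is already {draft.status}")


def edit(draft, **changes):
    """Return a modified copy; only content, tags, and namespace may change."""
    _require_pending(draft)
    unknown = set(changes) - EDITABLE_FIELDS
    if unknown:
        raise ValueError(f"not editable: {sorted(unknown)}")
    return replace(draft, **changes)


def approve(draft):
    """Publish the draft with the auto_extracted attribution."""
    _require_pending(draft)
    return replace(draft, status="published", source="auto_extracted")


def reject(draft):
    """Discard the draft."""
    _require_pending(draft)
    return replace(draft, status="rejected")
```

Editing returns a new pending draft, so a reviewer can fix tags or namespace and then approve the corrected version.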
Source Types
Git Repositories
Ask your AI agent to scan a repository. It reads local files, chunks them, and sends them to Knowledge for analysis. Implicit rules and constraints surface, including ones that live only in code or configuration and are not written down in any document.
> "Scan /path/to/repo for .ts, .py, .yaml, and .md files in the Engineering scope"
Specific Documents
> "Extract rules from runbook.md and architecture.md for Engineering"
Ingestion API
For sources that don't live on disk, push documents directly via the API:
```shell
curl -X POST https://api.asplenz.com/knowledge/v1/extract/stream \
  -H "Authorization: Bearer kn_..." \
  -H "Content-Type: application/json" \
  -d '{
    "scope_id": "scp-...",
    "documents": [
      {
        "content": "All deployments must go through staging first.",
        "metadata": {"author": "ops-team", "source": "runbook-v3"}
      }
    ],
    "auto_run": true
  }'
```
Additional Connectors
Slack, Teams, Notion, Confluence, and Excel connectors are available on Team and Scale plans.
See integrations →
AI Configuration
Extraction requires AI access. Two options:
| Option | Description |
|---|---|
| Asplenz-managed | No configuration needed. AI usage billed at cost on your invoice. |
| Your own API key | Bring your own key. You control your provider contract and data residency. |
Organizations with strict data residency or Zero Data Retention requirements should use their own API key.
Permissions
| Action | Required permission | Minimum role |
|---|---|---|
| Launch extraction | extract_run | senior-dev |
| View runs and drafts | extract_read | developer |
| Approve / reject / edit drafts | extract_review | tech-lead |
| Push via Ingestion API | extract_stream | admin |
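The table reads naturally as a "minimum role" check. The sketch below assumes the roles form a strict hierarchy in the order developer, senior-dev, tech-lead, admin. That ordering is inferred from the table and is not stated explicitly, and the `can` helper is hypothetical.

```python
# Assumed hierarchy, lowest to highest; inferred from the table, not confirmed.
ROLE_ORDER = ["developer", "senior-dev", "tech-lead", "admin"]

# Minimum role per permission, copied from the table above.
MIN_ROLE = {
    "extract_read": "developer",
    "extract_run": "senior-dev",
    "extract_review": "tech-lead",
    "extract_stream": "admin",
}


def can(role, permission):
    """True if `role` is at or above the minimum role for `permission`."""
    return ROLE_ORDER.index(role) >= ROLE_ORDER.index(MIN_ROLE[permission])
```

Under this assumption, a tech-lead can launch runs and review drafts, but only an admin can push documents through the Ingestion API.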
Configuration
| Parameter | Default | Description |
|---|---|---|
| Model | Configurable | AI model used for extraction |
| Temperature | 0.1 | Low for factual extraction |
| Min confidence | 0.6 | Below this, extractions are discarded |
| Max extractions per chunk | 5 | Limits noise |
| Max drafts per run | 200 | Caps total output |
| Chunk size | ~1500 chars | Paragraph-based splitting with 10% overlap |
| Dedup exact threshold | 0.92 | Similarity at or above this = exact duplicate, filtered out |
| Dedup similar threshold | 0.80 | Similarity at or above this, and below the exact threshold = REPLACES relation |
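To make the chunk size and overlap parameters concrete, here is a minimal sketch of paragraph-based chunking. It is not the splitter Knowledge uses, whose details are not documented here. The `chunk_text` function, the blank-line paragraph boundary, and carrying the tail of the previous chunk forward as overlap are all assumptions.

```python
def chunk_text(text, chunk_size=1500, overlap_ratio=0.10):
    """Group paragraphs into chunks of roughly `chunk_size` characters.

    Each new chunk starts with the last `overlap` characters of the
    previous one. Sizes are approximate: a chunk can exceed `chunk_size`
    by the overlap, and a single long paragraph is never split.
    """
    overlap = int(chunk_size * overlap_ratio)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            # Guard the zero case: current[-0:] would copy the whole chunk.
            tail = current[-overlap:] if overlap else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The overlap means a rule that straddles a chunk boundary still appears with some of its surrounding context in the next chunk.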
Best Practices
Start broad, then refine. Run extraction on your entire docs/ directory first. Review the results, then narrow your patterns to the most productive sources.
Re-extract regularly. Run extraction quarterly, after a major rewrite, or whenever new docs appear. Deduplication filters out exact duplicates and links near-duplicates to the existing entry, so repeated runs do not fill the registry with copies.
Review in batches. Review all pending drafts for a run in a single session - reject the noise, approve the good ones.
Use tags consistently. The extraction suggests tags, but review them for consistency. A clean tagging system makes the registry more searchable.