This experiment matters because it shows a complete path from model comparison to a controlled offline application: it makes sentence-level clause extraction with local embeddings operationally usable by less technical users, enforces an explicit output contract via a CSV schema, and handles real ingestion failures through explicit skipped/failed document states. It also demonstrates how to reason about extraction errors by distinguishing threshold issues from model/data coverage limitations, supported by a fixed evaluation harness and automated tests. These are helpful characteristics for deploying clause extraction where offline execution and structured outputs matter more than a demo interface.
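For concreteness, the scoring loop and output contract could look like the following minimal sketch. Everything specific here is an assumption for illustration: the model name, the 0.55 threshold, the status labels, the CSV column names, and the naive sentence splitter are invented, and the actual project's schema will differ. It assumes `sentence-transformers` is installed for local embeddings.

```python
import csv
from pathlib import Path
from sentence_transformers import SentenceTransformer, util

# Illustrative assumptions: model choice, threshold, statuses, and column
# names are placeholders, not the project's actual contract.
MODEL = SentenceTransformer("all-MiniLM-L6-v2")
THRESHOLD = 0.55
FIELDNAMES = ["document", "status", "clause_type", "sentence", "score"]
SUPPORTED = {".pdf", ".txt"}

def best_sentence(sentences, query):
    """Embed sentences locally and return the best match for a clause query."""
    scores = util.cos_sim(MODEL.encode([query], convert_to_tensor=True),
                          MODEL.encode(sentences, convert_to_tensor=True))[0]
    idx = int(scores.argmax())
    return sentences[idx], float(scores[idx])

def process_document(path, load_text, clause_queries):
    """Emit one CSV row per clause type; ingestion problems still yield a
    row, so every document is accounted for in the output."""
    if Path(path).suffix.lower() not in SUPPORTED:
        return [{"document": path, "status": "skipped", "clause_type": "",
                 "sentence": "", "score": ""}]
    try:
        sentences = [s.strip() for s in load_text(path).split(".") if s.strip()]
    except Exception as exc:  # read/parse failure becomes a 'failed' row
        return [{"document": path, "status": "failed", "clause_type": "",
                 "sentence": str(exc), "score": ""}]
    if not sentences:
        return [{"document": path, "status": "failed", "clause_type": "",
                 "sentence": "no extractable text", "score": ""}]
    rows = []
    for clause_type, query in clause_queries.items():
        sent, score = best_sentence(sentences, query)
        hit = score >= THRESHOLD
        rows.append({"document": path,
                     "status": "extracted" if hit else "no_match",
                     "clause_type": clause_type,
                     "sentence": sent if hit else "",
                     "score": round(score, 3)})
    return rows

if __name__ == "__main__":
    rows = process_document("contract.txt", lambda p: Path(p).read_text(),
                            {"liability": "limitation of liability clause"})
    with open("clauses.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
```

Run against a missing or unparsable file, this still emits a `failed` row rather than crashing, which is the property the CSV output contract relies on.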
Key future improvements include expanding annotation coverage for under-detected liability phrasings, reconstructing multi-sentence clauses instead of returning only a single sentence (see the sketch below), adding OCR support for scanned PDFs, extending input formats beyond .pdf and .txt, and exposing threshold configuration if future evaluation justifies it.
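If the multi-sentence improvement is pursued, one plausible approach is to grow the returned span outward from the best-scoring sentence while neighbours remain close to the peak score. The `grow_clause_span` helper and its 0.8 relative cutoff are invented for illustration, not the project's stated plan.

```python
def grow_clause_span(sentences, scores, best_idx, rel_cutoff=0.8):
    """Hypothetical span growth: starting from the best-scoring sentence,
    absorb adjacent sentences whose similarity score stays within
    rel_cutoff of the peak, so a clause spanning several sentences is
    returned whole instead of a single sentence."""
    lo = hi = best_idx
    peak = scores[best_idx]
    while lo > 0 and scores[lo - 1] >= rel_cutoff * peak:
        lo -= 1
    while hi < len(sentences) - 1 and scores[hi + 1] >= rel_cutoff * peak:
        hi += 1
    return " ".join(sentences[lo:hi + 1])

# Example: the middle two sentences form one liability clause.
sents = ["Intro.", "Liability is limited.", "Except for gross negligence.", "Misc."]
print(grow_clause_span(sents, [0.1, 0.9, 0.8, 0.2], best_idx=1))
# -> "Liability is limited. Except for gross negligence."
```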