Shepherd Insurance

Shepherd Insurance achieves 270x productivity boost in claims data extraction with OCR automation

Productivity Boost: 270x faster than manual entry

Processing Time (350-page loss run): Under 40 seconds (vs. 3+ hours manually)

Extraction Accuracy (known carriers): 100%

The Challenge

In Property & Casualty insurance, underwriting decisions depend heavily on loss runs — structured claims history documents submitted by brokers in PDF format. The challenge is that every carrier formats these documents differently, making automated parsing difficult. For Shepherd Insurance, a startup scaling its underwriting operations, this meant that processing a single 350-page loss run required an underwriter to spend 3+ hours on manual data entry. With no standardized format across carriers, each document demanded custom interpretation. This bottleneck constrained underwriting throughput and introduced the risk of human error in a domain where data accuracy directly affects pricing and risk decisions.

The Solution

Shepherd's engineering team built an automated document processing pipeline centered on SenseML from Sensible, a JSON-based domain-specific language that sits on top of AWS and Azure OCR services. Rather than attempting a general solution upfront, the team took a deliberate approach: they identified the top 8 most common carrier formats in their submission volume and wrote dedicated extraction configurations for each. This gave the platform deterministic, auditable parsing logic for the majority of incoming documents. Underwriters could upload PDFs directly into the platform, with claims data automatically extracted and persisted — no manual entry required. The team evaluated LLM-based approaches, including GPT-4 with large context windows, RAG pipelines, and vector database solutions, before concluding that NLP-driven OCR with a structured DSL was the right fit for this use case.
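The per-carrier approach described above can be pictured as a small routing layer: identify which known carrier format a document belongs to, then hand it to the parser written for that format. The sketch below is illustrative only. It is not SenseML, not Sensible's API, and not Shepherd's code. The names CarrierConfig, parse_acme, route and process, the "ACME MUTUAL LOSS RUN" fingerprint, and the "CLAIM <id> PAID <amount>" layout are all hypothetical, and the "needs_review" fallback for unrecognized formats is an assumption, since the source does not say how such documents are handled.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# All names below are hypothetical; this is not Sensible's API or Shepherd's code.


@dataclass(frozen=True)
class CarrierConfig:
    name: str
    fingerprint: str                      # text that identifies this carrier's layout
    extract: Callable[[str], list[dict]]  # deterministic parser for that layout


def parse_acme(text: str) -> list[dict]:
    """Parse lines shaped like 'CLAIM <id> PAID <amount>' (a made-up layout)."""
    claims = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "CLAIM" and parts[2] == "PAID":
            claims.append({"claim_id": parts[1], "paid": float(parts[3])})
    return claims


# One entry per supported carrier format; the case study describes eight.
CONFIGS = [CarrierConfig("acme", "ACME MUTUAL LOSS RUN", parse_acme)]


def route(text: str) -> Optional[CarrierConfig]:
    """Return the first config whose fingerprint appears in the OCR text."""
    for config in CONFIGS:
        if config.fingerprint in text:
            return config
    return None


def process(text: str) -> dict:
    """Extract claims for a known format, or flag the document (assumed fallback)."""
    config = route(text)
    if config is None:
        # Unknown format: flag for manual handling instead of guessing.
        return {"status": "needs_review", "carrier": None, "claims": []}
    return {
        "status": "extracted",
        "carrier": config.name,
        "claims": config.extract(text),
    }
```

Because each parser is a plain function tied to one format, the same input always yields the same output, which is the property the team wanted for auditability. Supporting a new carrier would mean adding one more entry to CONFIGS rather than changing shared logic.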

Results

The automated pipeline delivers a 270x productivity improvement over manual extraction. A 350-page loss run that previously took an underwriter over 3 hours to process manually is now completed in under 40 seconds. For known carrier formats — the top 8 configurations — the system achieves 100% extraction accuracy, eliminating a class of data entry errors that could affect underwriting decisions.

Processing time: <40 seconds vs. 3+ hours manually
Productivity gain: 270x faster
Accuracy: 100% for supported carrier formats
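
The 270x figure is consistent with the two processing times quoted above. As a quick check, assuming exactly 3 hours for the manual case and exactly 40 seconds for the automated one (the source gives these as "3+ hours" and "under 40 seconds", so the true ratio may differ):

```python
# Bounds taken from the case study; the exact timings are not published.
manual_seconds = 3 * 60 * 60   # 3 hours, the lower bound of "3+ hours"
automated_seconds = 40         # the upper bound of "under 40 seconds"
pages = 350                    # size of the example loss run

speedup = manual_seconds / automated_seconds
pages_per_second = pages / automated_seconds

print(f"Speedup: {speedup:.0f}x")
print(f"Throughput: {pages_per_second} pages per second")
```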

Beyond the raw numbers, underwriters shifted from data entry work to higher-value analysis, and the pipeline removed a key scaling constraint as submission volume grew.

Key Takeaways

For structured table parsing where 100% accuracy and auditability are non-negotiable, Shepherd found that OCR with deterministic extraction outperformed LLMs, which introduced latency and non-determinism that were unacceptable in this context.
Covering the top N formats first is a pragmatic strategy: building configurations for the 8 most common carriers captured the majority of volume before investing in a general-purpose solution.
Experimentation is load-bearing work, not a detour — testing RAG, vector DBs, GPT-4, and paid AI solutions was what surfaced the right architecture.
Fit-for-purpose tooling matters: a DSL-based OCR layer (SenseML) provided the control and auditability that off-the-shelf LLM wrappers could not.
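
One way to apply the "top N formats first" takeaway is to measure how much submission volume the N most common carriers account for before writing any configurations. The sketch below is a minimal illustration under assumptions: top_n_coverage is a hypothetical helper, and the sample list uses synthetic labels, not Shepherd's submission data.

```python
from collections import Counter


def top_n_coverage(carriers: list[str], n: int) -> float:
    """Fraction of submissions that come from the n most frequent carriers."""
    if not carriers:
        return 0.0
    counts = Counter(carriers)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(carriers)


# Synthetic labels for illustration only, one entry per submitted document.
sample = ["a"] * 5 + ["b"] * 3 + ["c"] * 1 + ["d"] * 1

for n in range(1, 5):
    print(n, top_n_coverage(sample, n))
```

Plotting or tabulating this value for increasing N shows where coverage flattens out, which is a reasonable point to stop writing dedicated configurations.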