Extraction comparison
Same prompt, same emails, two backends. Hosted: Claude Haiku via the Anthropic CLI subscription (the demo’s default). Local: qwen3:8b running on a single RTX 3080 with 10 GB VRAM via Ollama. Field-level scoring treats the Claude extract as the reference baseline, so the accuracy figure reads as “does the local model agree with the hosted one?”
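For context, a minimal sketch of that field-level scoring, assuming both extracts are flat key-value dicts with the hosted Claude extract as the reference (the demo’s actual scorer, field names, and example values below are illustrative, not its real code):

```python
# Sketch of field-level agreement scoring: every field in the reference (hosted)
# extract is checked against the candidate (local) extract, after light normalization.
def field_agreement(reference: dict, candidate: dict) -> tuple[float, int, int]:
    """Return (agreement ratio, matched fields, total reference fields)."""
    total = len(reference)
    if total == 0:
        return 1.0, 0, 0
    matched = sum(
        1
        for key, ref_value in reference.items()
        if str(candidate.get(key, "")).strip().lower() == str(ref_value).strip().lower()
    )
    return matched / total, matched, total


# Hypothetical example values, for illustration only.
hosted = {"valve_type": "skyvespjeld", "material": "syrefast", "dn": "DN150"}
local = {"valve_type": "sluseventil", "material": "syrefast", "dn": "DN150"}
score, matched, total = field_agreement(hosted, local)
print(f"{score:.1%} ({matched}/{total})")  # -> 66.7% (2/3)
```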
Local Qwen 8B agrees with hosted Claude on 78.3% of fields.
That number is achieved by an 8-billion-parameter model running on a single consumer GPU at zero marginal cost. Where it disagrees, the gap is mostly specialty Norwegian valve vocabulary the model hasn’t seen enough of during training (e.g. skyvespjeld “gate valve”, syrefast “acid-resistant steel”, ensilasje “silage”). A larger Qwen (14B / 32B) or a fine-tune on J.S. Cock terminology would close most of that gap. For now the demo runs hosted by default; self-hosted is a real fallback if cost or data residency forces it.
What we ran: 10 inquiries from your archive, 9,869 output tokens generated locally in 174 seconds (roughly 57 tokens/second).
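For reference, a sketch of how the local backend can be driven through Ollama’s HTTP API on its default port. The model name matches the setup above; the prompt and function below are placeholders, not the demo’s actual extraction code:

```python
# Sketch: call a local Ollama model and return its JSON extract.
import json
import requests

def extract_locally(email_text: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3:8b",
            "prompt": f"Extract the inquiry fields from this email as JSON:\n\n{email_text}",
            "format": "json",   # ask Ollama to constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    # eval_count / eval_duration (nanoseconds) are where token-per-second figures come from.
    tokens = body.get("eval_count", 0)
    seconds = body.get("eval_duration", 0) / 1e9
    print(f"{tokens} tokens in {seconds:.1f}s")
    return json.loads(body["response"])
```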
Where Qwen agreed and where it slipped
| Inquiry | Status | Field agreement | Verdict (rule) | Latency | Sample diffs |
|---|---|---|---|---|---|
| JT118202 | OK | 77.8% (14/18) | partial / partial | 19.4s | +2 more |
| JT119284 | OK | 86.2% (25/29) | partial / partial | 18.8s | +2 more |
| JT119626 | schema-drift | 83.3% (15/18) | partial / complete | 7.5s | +1 more |
| JT121885 | OK | 83.3% (15/18) | partial / partial | 8.6s | +1 more |
| JT122083 | OK | 68.6% (35/51) | partial / complete | 22.2s | +14 more |
| JT122170 | OK | 88.9% (16/18) | partial / partial | 9.1s | |
| JT122579 | schema-drift | 88.9% (16/18) | partial / partial | 19.8s | |
| JT122623 | schema-drift | 88.9% (16/18) | unusable / partial | 19.5s | |
| JT122961 | OK | 65.5% (19/29) | unusable / unusable | 28.3s | +8 more |
| JT124009 | schema-drift | 72.2% (13/18) | complete / complete | 20.5s | +3 more |
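The “schema-drift” status presumably means the local model returned valid JSON whose keys didn’t match the expected extraction schema exactly. A minimal sketch of such a check, with hypothetical field names (not the demo’s real schema):

```python
# Sketch of a schema-drift check: flag extracts whose keys differ from the expected field set.
# EXPECTED_FIELDS is an assumed, illustrative schema.
EXPECTED_FIELDS = {"customer", "valve_type", "material", "dn", "pn", "quantity"}

def classify_schema(extract: dict) -> str:
    keys = set(extract)
    missing = EXPECTED_FIELDS - keys
    unexpected = keys - EXPECTED_FIELDS
    return "OK" if not missing and not unexpected else "schema-drift"

print(classify_schema({"customer": "X", "valve_type": "skyvespjeld"}))  # -> schema-drift
```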
What this means for J.S. Cock
- Hosted Claude is the default. Higher accuracy on Norwegian industrial vocabulary, no GPU to manage, runs on Sebastian’s subscription.
- Self-hosted Qwen is a real option. 8B at 4-bit fits on a mid-range GPU you may already own (a rough VRAM estimate is sketched after this list). 78.3% agreement on this sample without any fine-tuning; a larger model plus a domain-specific fine-tune would close the remaining gap.
- The deciding factor is data residency, not accuracy. If your customers want their inquiry contents to never leave your infrastructure, self-hosted is technically achievable. Otherwise hosted wins on accuracy + ops.
- This was tested once on Sebastian’s RTX 3080. Real production benchmarking happens on your hardware with your terminology — that’s a Phase 2 / retainer task, not a demo deliverable.
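A back-of-the-envelope estimate of why an 8B model at 4-bit fits in 10 GB of VRAM; the per-parameter and overhead figures are rough assumptions, not measurements:

```python
# Rough VRAM estimate for an 8B model quantized to 4-bit (assumed figures).
params = 8e9
bytes_per_param = 0.6          # ~4-bit weights plus quantization scales, approximate
weights_gb = params * bytes_per_param / 1e9
overhead_gb = 2.0              # KV cache, runtime buffers, CUDA context (assumed)
print(f"~{weights_gb + overhead_gb:.1f} GB")  # ~6.8 GB, inside a 10 GB RTX 3080
```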