Extraction comparison
Same prompt, same emails, two backends. Hosted: Claude Haiku via the Anthropic CLI subscription (the demo’s default). Local: qwen3:8b running on a single RTX 3080 with 10 GB VRAM via Ollama. Field-level scoring treats the Claude extract as the reference baseline, so the accuracy figure reads as “does the local model agree with the hosted one?”
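For context, a minimal sketch of that field-level scoring, assuming both extracts are flat key-value dicts with the hosted Claude extract as the reference (the demo’s actual scorer, field names, and example values below are illustrative, not its real code):

```python
# Sketch of field-level agreement scoring: every field in the reference (hosted)
# extract is checked against the candidate (local) extract, after light normalization.
def field_agreement(reference: dict, candidate: dict) -> tuple[float, int, int]:
    """Return (agreement ratio, matched fields, total reference fields)."""
    total = len(reference)
    if total == 0:
        return 1.0, 0, 0
    matched = sum(
        1
        for key, ref_value in reference.items()
        if str(candidate.get(key, "")).strip().lower() == str(ref_value).strip().lower()
    )
    return matched / total, matched, total


# Hypothetical example values, for illustration only.
hosted = {"valve_type": "skyvespjeld", "material": "syrefast", "dn": "DN150"}
local = {"valve_type": "sluseventil", "material": "syrefast", "dn": "DN150"}
score, matched, total = field_agreement(hosted, local)
print(f"{score:.1%} ({matched}/{total})")  # -> 66.7% (2/3)
```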
Local Qwen 8B agrees with hosted Claude on 78.3% of fields.
That number is achieved by an 8-billion-parameter model running on a single consumer GPU at zero marginal cost. Where it disagrees, the gap is mostly specialty Norwegian valve vocabulary the model hasn’t seen enough of during training (e.g. skyvespjeld “gate valve”, syrefast “acid-resistant steel”, ensilasje “silage”). A larger Qwen (14B / 32B) or a fine-tune on J.S. Cock terminology would close most of that gap. For now the demo runs hosted by default; self-hosted is a real fallback if cost or data residency forces it.
What we ran: 10 inquiries from your archive, 9,869 output tokens generated locally in 174 seconds (roughly 57 tokens/second).
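For reference, a sketch of how the local backend can be driven through Ollama’s HTTP API on its default port. The model name matches the setup above; the prompt and function below are placeholders, not the demo’s actual extraction code:

```python
# Sketch: call a local Ollama model and return its JSON extract.
import json
import requests

def extract_locally(email_text: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3:8b",
            "prompt": f"Extract the inquiry fields from this email as JSON:\n\n{email_text}",
            "format": "json",   # ask Ollama to constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    body = resp.json()
    # eval_count / eval_duration (nanoseconds) are where token-per-second figures come from.
    tokens = body.get("eval_count", 0)
    seconds = body.get("eval_duration", 0) / 1e9
    print(f"{tokens} tokens in {seconds:.1f}s")
    return json.loads(body["response"])
```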
Where Qwen agreed and where it slipped
| Inquiry | Status | Field agreement | Verdict (rule) | Latency | Sample diffs |
|---|---|---|---|---|---|
| JT118202 | OK | 77.8% (14/18) | partial / partial | 19.4s | +2 more |
| JT119284 | OK | 86.2% (25/29) | partial / partial | 18.8s | +2 more |
| JT119626 | schema-drift | 83.3% (15/18) | partial / complete | 7.5s | +1 more |
| JT121885 | OK | 83.3% (15/18) | partial / partial | 8.6s | +1 more |
| JT122083 | OK | 68.6% (35/51) | partial / complete | 22.2s | +14 more |
| JT122170 | OK | 88.9% (16/18) | partial / partial | 9.1s | |
| JT122579 | schema-drift | 88.9% (16/18) | partial / partial | 19.8s | |
| JT122623 | schema-drift | 88.9% (16/18) | unusable / partial | 19.5s | |
| JT122961 | OK | 65.5% (19/29) | unusable / unusable | 28.3s | +8 more |
| JT124009 | schema-drift | 72.2% (13/18) | complete / complete | 20.5s | +3 more |
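The “schema-drift” status presumably means the local model returned valid JSON whose keys didn’t match the expected extraction schema exactly. A minimal sketch of such a check, with hypothetical field names (not the demo’s real schema):

```python
# Sketch of a schema-drift check: flag extracts whose keys differ from the expected field set.
# EXPECTED_FIELDS is an assumed, illustrative schema.
EXPECTED_FIELDS = {"customer", "valve_type", "material", "dn", "pn", "quantity"}

def classify_schema(extract: dict) -> str:
    keys = set(extract)
    missing = EXPECTED_FIELDS - keys
    unexpected = keys - EXPECTED_FIELDS
    return "OK" if not missing and not unexpected else "schema-drift"

print(classify_schema({"customer": "X", "valve_type": "skyvespjeld"}))  # -> schema-drift
```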
What this means for J.S. Cock
- Hosted Claude is the default. Higher accuracy on Norwegian industrial vocabulary, no GPU to manage, runs on Sebastian’s subscription.
- Self-hosted Qwen is a real option. 8B at 4-bit fits on a mid-range GPU you may already own (a rough VRAM estimate is sketched after this list). 78.3% agreement on this sample without any fine-tuning; a larger model plus a domain-specific fine-tune would close the remaining gap.
- The deciding factor is data residency, not accuracy. If your customers want their inquiry contents to never leave your infrastructure, self-hosted is technically achievable. Otherwise hosted wins on accuracy + ops.
- This was tested once on Sebastian’s RTX 3080. Real production benchmarking happens on your hardware with your terminology — that’s a Phase 2 / retainer task, not a demo deliverable.
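A back-of-the-envelope estimate of why an 8B model at 4-bit fits in 10 GB of VRAM; the per-parameter and overhead figures are rough assumptions, not measurements:

```python
# Rough VRAM estimate for an 8B model quantized to 4-bit (assumed figures).
params = 8e9
bytes_per_param = 0.6          # ~4-bit weights plus quantization scales, approximate
weights_gb = params * bytes_per_param / 1e9
overhead_gb = 2.0              # KV cache, runtime buffers, CUDA context (assumed)
print(f"~{weights_gb + overhead_gb:.1f} GB")  # ~6.8 GB, inside a 10 GB RTX 3080
```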