
Speech recognition compared: Scriptix, Speechmatics and HappyScribe

An in-depth procurement-grade comparison of three European-relevant speech-to-text vendors: accuracy methodology, language coverage, custom model mechanics, deployment topology, sovereignty posture, accessibility fit and total cost of ownership.

Frans Olsthoorn
Founder & CEO, Scriptix

Speech-to-text procurement used to begin and end with one number: word error rate on a public benchmark. That single metric is still useful, but it is no longer enough. Today the decisive questions for European buyers are about fit: which deployment topologies are supported, where the audio is processed, how the engine adapts to domain-specific speech, how the output integrates with publishing workflows under WCAG 2.2 and the European Accessibility Act, and what the total cost of ownership looks like once you account for editors, integrations and compliance work.

This piece compares three vendors that procurement teams in Europe regularly shortlist: Scriptix, Speechmatics and HappyScribe. Each has a clear identity, a defensible audience and a real product. None of them is the right choice for every buyer. The goal here is to make the trade-offs explicit, with references to the underlying standards and vendor documentation, so a procurement team can match a vendor to a workflow without unpleasant surprises six months into a contract.

The three vendors at a glance

Before drilling into methodology, it helps to anchor the comparison in a single table. The rows below summarise capabilities that procurement teams ask about most often. Detail and citations follow in the body.

| Capability | Scriptix | Speechmatics | HappyScribe |
| --- | --- | --- | --- |
| Primary audience | Public sector, broadcasters, courts, accessibility teams | Enterprise platforms embedding ASR, captioning vendors | Creators, researchers, podcasters, SMB |
| Languages (ASR) | 48 with strong EU and minority-language focus | 50+ broad coverage with autonomous language identification | 120+ via a mix of in-house and third-party engines |
| Realtime / live captioning | Yes, broadcast-grade streaming API + realtime editor | Yes, streaming API with partial revision | Limited; primarily file-based |
| Custom acoustic & language models | Full custom training on paired audio + transcripts, per tenant | Custom dictionary and limited acoustic adaptation (enterprise) | Vocabulary lists only |
| Subtitle and transcript editors | Transcript, subtitle (frame-accurate) and realtime editors included | None, API only; BYO editor | Polished web editor for creators |
| EU data residency by default | Yes (Netherlands) | Configurable, multi-region | EU region available but engine-dependent |
| On-premise deployment | Yes, licence-key activated inside customer infra | Yes, container (enterprise) | No, SaaS only |
| Fully air-gapped tenant | Yes, on request: no callbacks, telemetry or licence phone-home | Not offered as a productised tier | No |
| SSO (SAML / OIDC) & RBAC | Yes, standard on all plans | Enterprise tier | Limited |
| WCAG / EAA workflow tooling | Hybrid human-in-the-loop editors aligned to EBU and NER metrics | Engine output, BYO editor and QA | Yes, creator-focused |
| Pricing model | Volume bundles + fixed-fee on-premise licence | Volume-tiered, enterprise-negotiated | Per-minute, transparent |

Every row in this table is unpacked below with the reasoning, the standard and the source. Sources are listed at the end of the article.

1. Accuracy: what the numbers actually mean

On clean studio audio in well-resourced languages (English, French, German, Spanish), modern ASR engines from all three vendors land within a few percentage points of each other on word error rate. Recent academic surveys of end-to-end vs cascaded ASR converge on the same finding: the headline WER on LibriSpeech and similar clean benchmarks has plateaued. The interesting variance shows up the moment the audio stops being clean.

Council meetings with cross-talk, courtrooms with mixed accents and lavalier dropouts, multilingual press briefings, regional dialects of Dutch or Catalan, and broadcaster archives recorded on equipment from three decades ago all have one thing in common: a generic acoustic model trained on internet-scale clean speech will be wrong more often, and the errors will cluster on exactly the words the audience cares about (names, places, numbers, technical terms).

This is why the right question during a vendor evaluation is not 'what is your WER on LibriSpeech?' but 'what does WER look like on a representative sample of my own audio, after the adaptation mechanism your platform offers has been applied?'. The three vendors answer that question very differently — see Section 3 — and the difference matters more than any benchmark.
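To make that question concrete, here is a minimal sketch of how a procurement team might score a vendor on its own audio. It computes word error rate as word-level edit distance divided by reference length, pooled over a sample set. The function names and the simple lowercase/whitespace normalisation are assumptions for illustration; a real evaluation would agree a normalisation scheme (punctuation, numerals, casing) with the vendor up front.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference to hyp[:j]
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split is enough for this sketch.
    return text.lower().split()


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs, pooled across the whole sample."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = tokenize(reference), tokenize(hypothesis)
        errors += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_words
```

Running `corpus_wer` twice on the same human-verified references, once against the out-of-the-box output and once against the output after adaptation, gives the before/after figure that matters for the decision.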

2. Language coverage and the long tail

HappyScribe advertises 120+ languages, achieved by orchestrating multiple underlying engines. The breadth is real, but quality varies sharply by language: a top-tier language like English will run on a first-class engine, while a long-tail language may fall through to a generic backend with materially lower accuracy. For procurement, the practical question is which exact languages your workload actually needs, and which underlying engine handles each one.

Speechmatics covers 50+ languages on a single in-house architecture, with autonomous language identification useful for switchboards or surveillance use cases. Coverage is consistent within the supported set but does not stretch to the long tail of regional minority languages.

Scriptix supports 48 languages with a deliberate bias toward European public-sector and broadcaster requirements: high-quality Dutch, Flemish, German, French, Spanish, Italian, the Nordics, Polish, Czech, Greek and several regional variants that matter for accessibility compliance in multilingual jurisdictions such as Belgium, Switzerland and Catalonia. Coverage outside Europe exists but is not the design centre.

The right lens here is not the headline language count but whether the languages you need are first-class on the engine you would actually deploy.

3. Custom models: vocabulary lists vs trained adaptation

All three vendors offer something they call customisation, but the mechanisms are fundamentally different and the outcomes differ accordingly.

Vocabulary lists (HappyScribe)

A vocabulary list is a hint to the language model at decode time. It biases the decoder toward expected strings and helps with spelling of named entities once the engine has already roughly recognised the sound. It does not change the acoustic model. It does not learn from your data. If the engine cannot hear the way your domain pronounces 'voir dire', 'amicus curiae' or 'Eindhoven', adding those strings to a word list will not help — it only fixes spelling on the rare occasion the decoder guesses correctly.
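The limitation can be illustrated with a toy model. The sketch below treats a vocabulary list as a score boost applied to the engine's candidate transcripts (an n-best list). This is a simplification and not any vendor's implementation: production engines typically apply the bias inside the decoder, and the `Hypothesis` type, the `boost` value and the scores are invented for illustration. The point it demonstrates carries over, though: the list can only promote a candidate the acoustic model already produced.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Hypothesis:
    text: str
    score: float  # log-probability-like engine score; higher is better


def rescore_with_vocabulary(
    nbest: Sequence[Hypothesis],
    vocabulary: Sequence[str],
    boost: float = 2.0,  # hypothetical bias strength per matched term
) -> Hypothesis:
    """Pick the best candidate after boosting those containing listed terms."""

    def biased_score(hyp: Hypothesis) -> float:
        lowered = hyp.text.lower()
        hits = sum(1 for term in vocabulary if term.lower() in lowered)
        return hyp.score + boost * hits

    return max(nbest, key=biased_score)


vocabulary: List[str] = ["Eindhoven"]

# Case 1: the engine heard something close enough to propose the right word.
heard = [
    Hypothesis("the train to and hoven", -4.0),
    Hypothesis("the train to Eindhoven", -5.0),
]

# Case 2: the acoustic model never proposes the word, so nothing can be boosted.
not_heard = [
    Hypothesis("the train to and hoven", -4.0),
    Hypothesis("the train to end oven", -4.5),
]

print(rescore_with_vocabulary(heard, vocabulary).text)
print(rescore_with_vocabulary(not_heard, vocabulary).text)
```

In the first case the list rescues the correct spelling; in the second the output is unchanged, because no candidate contains the term.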

Custom dictionary plus limited acoustic adaptation (Speechmatics)

Speechmatics offers a richer customisation surface in its enterprise tier, including a custom dictionary and bounded acoustic adaptation. This sits between a pure vocabulary list and full retraining: meaningful for domain terminology, less effective when the underlying acoustic conditions (room, microphones, accents) diverge significantly from the training distribution.

Full custom acoustic and language models (Scriptix)

Scriptix trains per-tenant custom acoustic and language models on paired audio and transcripts supplied by the customer. The engine learns the mapping between the way your speakers actually sound and the way those sounds should be transcribed in your context. Twenty to fifty hours of well-transcribed paired audio reliably shifts accuracy on a narrow domain; 100 to 300 hours typically closes about half of the remaining error gap; beyond 500 hours the curve flattens but rare named entities and edge cases keep improving. The trained model is an artefact owned by the customer, versioned and rolled back like any other production asset. Training happens inside the tenant; the paired data never leaves the deployment.

Pragmatically: if your audio looks like generic broadcast English, a vocabulary list is enough. If it does not — and most courts, councils, broadcasters and ministries fall into 'does not' — only trained adaptation will close the gap.

4. Workflow and the WCAG / EAA reality

Both WCAG 2.2 (specifically Success Criterion 1.2.2, Captions (Prerecorded), at Level A, and Success Criterion 1.2.4, Captions (Live), at Level AA) and the European Accessibility Act (Directive (EU) 2019/882, enforceable since 28 June 2025) require that captions are accurate and synchronised, and that prerecorded transcripts are made available for audio content on in-scope services. Public-sector bodies in the EU operate under the parallel Web Accessibility Directive (Directive (EU) 2016/2102). Pure ASR output, without human review, rarely meets the accuracy bar these texts require for anything beyond informal video.

This is why workflow matters as much as engine quality. Speechmatics ships an excellent engine but no editor — every customer builds (or buys) the surrounding QA, role management, exports and audit trail. For a software vendor embedding ASR into their own product, that flexibility is a feature. For a broadcaster, court or ministry that needs to publish a WCAG-compliant transcript next week, it is months of engineering.

HappyScribe ships a polished web editor that is genuinely pleasant to use, but it stops at the SaaS boundary. There is no on-premise option, no fine-grained role-based access suitable for evidentiary chains of custody, and no audit trail of the kind procurement teams in regulated environments expect.

Scriptix occupies the middle ground deliberately. It ships a production-grade transcript editor, a subtitle editor with frame-accurate timing aligned to EBU R 128 and EBU R 037 conventions, a realtime editor for live captioning, an API for integration, and the operational features (SAML/OIDC SSO, audit logs, role-based access, retention policies) that public-sector procurement actually requires.

5. Live subtitling and the NER metric

For live broadcast and live government feeds, word error rate is the wrong yardstick. The metric that European regulators and broadcasters have converged on is the NER model (Number of words, Edition errors, Recognition errors), developed by Pablo Romero-Fresco and Juan Martínez and adopted by Ofcom for its live subtitling quality measurements and, in some form, by most major EBU members. NER weights errors by their impact on the viewer: a missed filler counts for almost nothing, while a wrong negation or a wrong number counts for a lot. A captioning service with a clean WER but a poor NER score is failing its viewers.
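As a rough sketch of how the score is put together: in the NER model, accuracy is (N - E - R) / N x 100, where N is the number of words in the subtitles, E is the weighted sum of edition errors and R is the weighted sum of recognition errors. The severity weights used below (0.25 minor, 0.5 standard, 1.0 serious) follow the published model; the class and function names are illustrative, and grading each error by severity remains a human reviewer's job that this code does not attempt.

```python
from enum import Enum
from typing import Sequence


class Severity(Enum):
    MINOR = 0.25     # barely noticed by the viewer
    STANDARD = 0.5   # noticed, meaning still recoverable
    SERIOUS = 1.0    # changes the meaning, e.g. a wrong negation or number


def ner_accuracy(
    word_count: int,
    edition_errors: Sequence[Severity],
    recognition_errors: Sequence[Severity],
) -> float:
    """NER accuracy in percent: (N - E - R) / N * 100."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    edition = sum(error.value for error in edition_errors)
    recognition = sum(error.value for error in recognition_errors)
    return (word_count - edition - recognition) / word_count * 100


# Illustrative sample: 1,000 subtitle words, five graded errors.
score = ner_accuracy(
    word_count=1000,
    edition_errors=[Severity.MINOR, Severity.STANDARD],
    recognition_errors=[Severity.SERIOUS, Severity.SERIOUS, Severity.MINOR],
)
print(f"{score:.2f}")  # prints 99.70
```

Two serious errors cost as much here as eight minor ones, which is exactly the asymmetry a flat word error rate cannot express.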

All three vendors can produce live captions, but only Scriptix and Speechmatics treat live as a first-class product. HappyScribe is primarily file-based. Scriptix exposes a tunable commit policy (when to lock a segment and stop revising) and is regularly measured against NER on broadcaster-supplied samples; Speechmatics provides the engine and leaves the policy and measurement to the integrator.
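To show what a commit policy decides, here is a minimal sketch of one common approach: lock a word once it has stayed unchanged across a set number of consecutive partial hypotheses. This is an assumption-laden illustration, not Scriptix's or Speechmatics' actual policy; the class name and the `stable_updates` parameter are hypothetical.

```python
from typing import List


class StabilityCommitPolicy:
    """Commit words once they are stable across recent partial hypotheses."""

    def __init__(self, stable_updates: int = 3) -> None:
        if stable_updates < 1:
            raise ValueError("stable_updates must be at least 1")
        self.stable_updates = stable_updates
        self.history: List[List[str]] = []  # most recent partials
        self.committed: List[str] = []      # locked words, never revised

    def update(self, partial: List[str]) -> List[str]:
        """Feed the latest partial hypothesis; return newly committed words."""
        self.history.append(list(partial))
        self.history = self.history[-self.stable_updates:]
        if len(self.history) < self.stable_updates:
            return []

        # Length of the prefix on which all recent partials agree.
        stable_len = 0
        for words in zip(*self.history):
            if len(set(words)) != 1:
                break
            stable_len += 1

        # Only words beyond the already committed prefix can be newly locked.
        newly_committed = self.history[-1][len(self.committed):stable_len]
        self.committed.extend(newly_committed)
        return newly_committed


policy = StabilityCommitPolicy(stable_updates=2)
for partial in (
    ["the"],
    ["the", "minister"],
    ["the", "minister", "set"],          # engine's first guess
    ["the", "minister", "said", "no"],   # engine revises "set" to "said"
    ["the", "minister", "said", "no"],
):
    print(policy.update(partial))
```

The trade-off is visible in the single parameter: a higher `stable_updates` delays captions but locks in fewer words the engine would later have corrected, while a lower value reduces latency at the cost of more frozen errors.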

6. Deployment topology and digital sovereignty

Under GDPR (Regulation (EU) 2016/679), residency is a baseline expectation rather than a differentiator: any vendor serious about European buyers offers EU processing. The meaningful distinctions sit one and two levels above residency:

  • Residency. Audio is processed and stored in a named EU region. HappyScribe (EU region available), Speechmatics (configurable) and Scriptix (Netherlands by default) all clear this bar.
  • On-premise. The engine runs inside the customer's own datacentre, Kubernetes cluster or VPC. Scriptix and Speechmatics offer this, both through licence-key activation. HappyScribe does not.
  • Air-gapped. The engine runs with no outbound network: no callbacks, no telemetry, no licence phone-home. Scriptix supports this on request and runs it in production for courts and supreme courts. Speechmatics does not offer this as a productised tier. HappyScribe cannot.

For ministries, courts, intelligence services and any organisation operating under NIS2 or sector-specific sovereignty requirements, the air-gapped tier is often a hard procurement gate.
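Treating these tiers as hard gates rather than weighted scores can be sketched in a few lines. The capability values below simply encode the three bullets above as stated in this article; the `Vendor` type and `shortlist` function are illustrative names, and a real evaluation would verify each flag against current vendor documentation and contract terms.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Vendor:
    name: str
    eu_residency: bool
    on_premise: bool
    air_gapped: bool  # True only if offered as a deployable option


# Values taken from the comparison in this article, not independently verified.
VENDORS = [
    Vendor("Scriptix", eu_residency=True, on_premise=True, air_gapped=True),
    Vendor("Speechmatics", eu_residency=True, on_premise=True, air_gapped=False),
    Vendor("HappyScribe", eu_residency=True, on_premise=False, air_gapped=False),
]


def shortlist(vendors: Sequence[Vendor], required: Sequence[str]) -> List[str]:
    """Return vendors that satisfy every hard requirement (no partial credit)."""
    return [
        vendor.name
        for vendor in vendors
        if all(getattr(vendor, requirement) for requirement in required)
    ]


print(shortlist(VENDORS, ["eu_residency"]))
print(shortlist(VENDORS, ["eu_residency", "on_premise"]))
print(shortlist(VENDORS, ["eu_residency", "on_premise", "air_gapped"]))
```

Applying the gates before any accuracy or price comparison keeps the evaluation effort focused on vendors that can actually be deployed.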

7. Pricing and total cost of ownership

Headline price-per-minute is a poor indicator of TCO once volume grows. The shape of each vendor's pricing reveals the audience they are optimising for.

  • HappyScribe: per-minute, transparent, self-serve. Excellent for low to medium volume and for projects that fit cleanly into the SaaS editor.
  • Speechmatics: volume-tiered, negotiated at enterprise scale. Customers usually integrate the engine into their own surface and absorb the editor / workflow build cost themselves.
  • Scriptix: volume bundles for cloud usage and a fixed-fee on-premise licence that decouples cost from minutes once usage is high enough to justify it. For a national broadcaster processing thousands of hours a month, the fixed-fee model often pays back inside the first year and continues to compound as volume grows.

The right TCO comparison sums engine cost, editor and workflow cost (build or buy), compliance and audit cost, integration cost and any premium for sovereignty. The vendor with the lowest minute price rarely wins that calculation at scale.
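The calculation can be sketched as follows. Every figure in the example is hypothetical and chosen only to make the arithmetic easy to follow; none of it is a real price from any of the three vendors. The `Offer` fields mirror the cost components listed above, and `break_even_minutes` returns the annual volume above which a flat-fee offer becomes cheaper than a metered one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    name: str
    price_per_minute: float            # engine cost per audio minute
    annual_fixed_fee: float = 0.0      # e.g. a fixed-fee on-premise licence
    workflow_cost: float = 0.0         # editor and workflow, build or buy
    compliance_cost: float = 0.0       # compliance and audit work
    integration_cost: float = 0.0
    sovereignty_premium: float = 0.0

    @property
    def fixed_costs(self) -> float:
        return (
            self.annual_fixed_fee
            + self.workflow_cost
            + self.compliance_cost
            + self.integration_cost
            + self.sovereignty_premium
        )


def annual_tco(offer: Offer, minutes_per_year: float) -> float:
    return offer.price_per_minute * minutes_per_year + offer.fixed_costs


def break_even_minutes(metered: Offer, flat: Offer) -> Optional[float]:
    """Annual minutes above which `flat` is cheaper; None if it never is."""
    rate_gap = metered.price_per_minute - flat.price_per_minute
    if rate_gap <= 0:
        return None
    return max((flat.fixed_costs - metered.fixed_costs) / rate_gap, 0.0)


# Hypothetical figures for illustration only.
metered = Offer("metered SaaS", price_per_minute=0.10,
                workflow_cost=40_000, compliance_cost=10_000,
                integration_cost=5_000)
flat = Offer("fixed-fee on-premise", price_per_minute=0.0,
             annual_fixed_fee=60_000, compliance_cost=5_000,
             integration_cost=5_000)

print(break_even_minutes(metered, flat))
print(annual_tco(metered, 600_000), annual_tco(flat, 600_000))
```

With these invented inputs the crossover sits at roughly 150,000 minutes (2,500 hours) a year; the useful exercise is rerunning the calculation with quoted prices and realistic internal cost estimates.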

8. Which vendor fits which buyer

  • Pick HappyScribe if you are a creator, researcher or small team that wants a polished editor, per-minute billing and minimal procurement overhead.
  • Pick Speechmatics if you are building a product around ASR, you need a strong engine and you have the engineering capacity to build the surrounding workflow, QA and compliance layer yourself.
  • Pick Scriptix if you operate in the European public sector, broadcast or accessibility space, you need EU residency or on-premise/air-gapped deployment, you have domain-specific audio that benefits from real custom-model training, and you want the editor suite and operational controls in the box rather than as a homework assignment.

Conclusion

There is no single right answer to 'which ASR vendor should we use?'. There is, however, a wrong one: choosing on base accuracy alone and discovering six months later that the deployment model, the workflow or the licence terms do not survive contact with your real operating environment. WCAG 2.2 and the European Accessibility Act have raised the floor, the NER metric is becoming the lingua franca for live captioning, and sovereignty has graduated from a contract clause to a gating criterion. Match the vendor to the constraint that matters most to your organisation, and pressure-test the rest on your own audio before you sign.

Sources & references

  1. WCAG 2.2, W3C Recommendation (5 October 2023)
  2. Directive (EU) 2019/882 — European Accessibility Act
  3. Directive (EU) 2016/2102 — Web Accessibility Directive
  4. EBU R 128 / R 037 — Subtitling standards
  5. Ofcom — Measuring live subtitling quality (NER model)
  6. Speechmatics — Documentation
  7. HappyScribe — Help center
  8. Scriptix — Developer documentation
  9. Regulation (EU) 2016/679 — GDPR
  10. Schmidt & Köhn — End-to-end vs cascaded ASR (Interspeech 2023)

About the author


Frans Olsthoorn founded Scriptix in 2010 and has spent more than fifteen years shipping speech recognition into European broadcasters, courts and government bodies. He writes about ASR, accessibility regulation and the realities of running AI workloads inside customer infrastructure.
