Custom models · 7 min read

Vocabulary lists vs custom models: what actually moves accuracy

Most ASR vendors call a glossary a custom model. It is not. Here is why Scriptix does not ship vocabulary lists at all, and only offers real custom models trained on your audio and transcripts.

Frans Olsthoorn
Founder & CEO, Scriptix

Almost every speech-to-text vendor offers something they call a custom model. In practice the term covers two very different things, and confusing them costs customers a lot of time, money and trust in ASR as a whole. Scriptix deliberately does not offer vocabulary lists. We only offer real custom models. This piece explains why.

A vocabulary is a hint, not a model

A vocabulary (sometimes called a custom dictionary, boost list or word list) is a list of words and phrases you tell the recognizer to expect. The engine biases its language model towards those terms at decoding time. It can nudge the spelling of proper nouns, product names, jargon and acronyms when the engine was already close. It does not change how the system hears the audio. It does not learn anything. Remove the list and the engine forgets entirely.

Crucially, a vocabulary only helps the engine write a word down once it has already decided what it heard. It never teaches the engine what the word sounds like. If the acoustic model has never encountered the way your domain pronounces "Eindhoven", "voir dire" or "amicus curiae", adding those strings to a word list does not help it recognise them in audio.
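
To make the distinction concrete, here is a minimal Python sketch of one simple way a boost list can be applied: rescoring a short list of candidate transcripts. It is an illustration only, not the implementation used by Scriptix or any other vendor. The `Hypothesis` class, the `rescore_with_boost` function, the scores and the bonus value are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    score: float  # hypothetical combined log score, higher is better


def rescore_with_boost(nbest, boost_terms, bonus=2.0):
    """Add a fixed bonus for each boosted term found in a candidate, then pick the best."""
    rescored = []
    for hyp in nbest:
        lowered = hyp.text.lower()
        hits = sum(1 for term in boost_terms if term.lower() in lowered)
        rescored.append(Hypothesis(hyp.text, hyp.score + bonus * hits))
    return max(rescored, key=lambda h: h.score)


boost = ["voir dire"]

# Case 1: the engine was already close, so the right phrase is among the candidates.
close = [
    Hypothesis("the judge began vwar deer", -4.0),
    Hypothesis("the judge began voir dire", -4.5),
]

# Case 2: the engine never produced the phrase, so there is nothing to boost.
far = [
    Hypothesis("the judge began for deer", -4.0),
    Hypothesis("the judge began four dear", -4.2),
]

best_close = rescore_with_boost(close, boost)
best_far = rescore_with_boost(far, boost)
print(best_close.text)
print(best_far.text)
```

In the first case the bonus promotes a candidate the engine had already produced. In the second case no candidate contains the phrase, so the list has nothing to promote and the top result is unchanged.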

What a vocabulary will not fix: an accent the base model has never heard, a microphone setup that distorts certain phonemes, a phrasing pattern that is alien to the training data, or a domain where almost every other sentence contains technical content the base model has no exposure to. That is most of the work our customers actually need to get right.

A custom model is trained on your data

A real custom model is the result of training. Two ingredients matter:

  • Audio that represents your real recording conditions (microphones, room acoustics, accents, crosstalk, background noise).
  • Matching, timestamped or paragraph-aligned transcripts of that audio, ideally produced by humans you trust.

From those, a custom acoustic model learns the way your speakers actually sound, and a custom language model learns the way they actually phrase things. Because the audio is paired with the correct text, the engine learns the mapping between the two: how the words come out in your room, with your microphones, your speakers, your accents and your jargon. The output is a new model artefact that runs in place of the generic one, not a list bolted on at decode time. The improvement is permanent and compounding: every time you retrain on a larger corpus, the model gets closer to your reality.
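
As a rough illustration of what "paired" means in practice, the Python sketch below filters a small corpus down to usable audio and transcript pairs and totals the hours that remain. The `TrainingPair` structure, the file names and the durations are hypothetical; this is not the Scriptix upload format.

```python
from dataclasses import dataclass


@dataclass
class TrainingPair:
    audio_path: str          # hypothetical file name
    transcript: str          # human-produced text for that audio
    duration_seconds: float


def usable_pairs(pairs):
    """Keep only pairs with an audio path, a non-empty transcript and a positive duration."""
    return [
        p for p in pairs
        if p.audio_path and p.transcript.strip() and p.duration_seconds > 0
    ]


def total_hours(pairs):
    """Sum the audio duration of the given pairs, in hours."""
    return sum(p.duration_seconds for p in pairs) / 3600.0


corpus = [
    TrainingPair("hearing_001.wav", "The court is now in session.", 5400.0),
    TrainingPair("hearing_002.wav", "", 3600.0),  # audio without a transcript: unusable
    TrainingPair("hearing_003.wav", "Counsel may proceed.", 1800.0),
]

clean = usable_pairs(corpus)
print(len(clean), "usable pairs,", total_hours(clean), "hours")
```

Audio without a matching transcript is dropped here because, without the text, there is no mapping for the model to learn from.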

How much data is enough

A practical rule of thumb that works for most of our customers:

  • 20 to 50 hours of well-transcribed audio is enough to meaningfully shift accuracy on a narrow domain.
  • 100 to 300 hours unlocks a clearly different model, often closing half of the remaining error gap.
  • Beyond 500 hours the curve flattens, but rare named entities and edge cases keep improving.

Quality of the transcripts matters more than raw quantity. Fifty hours of clean, human-verified transcripts will outperform two hundred hours of lightly corrected ASR output every time.
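
The rule of thumb can also be written down as a small lookup, sketched here in Python. The function name `data_tier` and the wording of the labels are assumptions made for the example. The figures above say nothing about corpora under 20 hours or about the ranges between 50 and 100 hours and between 300 and 500 hours, so the sketch reports those as not covered rather than guessing.

```python
def data_tier(hours: float) -> str:
    """Map hours of transcribed audio onto the rule-of-thumb tiers from the article."""
    if hours < 20:
        return "not covered by the rule of thumb (under 20 hours)"
    if hours <= 50:
        return "meaningful accuracy shift on a narrow domain"
    if hours < 100:
        return "not covered by the rule of thumb (50 to 100 hours)"
    if hours <= 300:
        return "clearly different model"
    if hours <= 500:
        return "not covered by the rule of thumb (300 to 500 hours)"
    return "curve flattens; rare named entities and edge cases keep improving"


for hours in (35, 150, 800):
    print(hours, "->", data_tier(hours))
```

The lookup only counts hours. It assumes the transcripts are already clean and human-verified, which is the more important condition.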

When the investment pays back

Custom models pay back when three things are true: you have a recurring volume of similar audio, the generic model is making the same class of mistakes again and again, and you have access to (or can produce) reliable transcripts to train on. Courts, councils, parliamentary services and large broadcasters typically tick all three boxes. A one-off podcast project typically does not.
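
One way to check the second condition is to compare generic model output against trusted transcripts and count which word substitutions keep coming back. The Python sketch below does this with the standard library `difflib` module. The sample sentences are invented, the `substitutions` helper is hypothetical, and the alignment is a rough one rather than a full word error rate calculation.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitutions(reference: str, hypothesis: str):
    """Return (reference_word, recognised_word) pairs where one word replaced another."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    pairs = []
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count replacements of equal length, so words can be paired one to one.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(ref[i1:i2], hyp[j1:j2]))
    return pairs


# (trusted transcript, generic model output); invented examples
samples = [
    ("the amicus brief was filed", "the amigos brief was filed"),
    ("an amicus brief arrived late", "an amigos brief arrived late"),
    ("the clerk read the motion", "the clerk red the motion"),
]

tally = Counter()
for reference, hypothesis in samples:
    tally.update(substitutions(reference, hypothesis))

for (expected, heard), count in tally.most_common():
    print(f"{expected} -> {heard}: {count}")
```

A substitution that tops the tally across many recordings is the kind of recurring mistake the paragraph above describes.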

Why Scriptix does not offer vocabularies

We took the decision not to ship a vocabulary feature at all. The honest reason is that, in our experience, it sets the wrong expectation. Customers add a word list, see a marginal change, and conclude that ASR cannot solve their problem. The problem was never that the engine could not spell the word. It was that the engine could not hear it. A vocabulary cannot fix that. A trained model can.

So Scriptix only offers real custom models. You upload paired audio and transcripts from your own domain and we train a custom acoustic and language model on top of the base engine. Training happens inside your tenant; your audio and your transcripts never leave the deployment. The trained model becomes an artefact you own, version and roll back like any other production asset.

If a vendor describes a boost list as a custom model, it is worth asking the next question. Are you training on my audio, or are you giving the decoder a hint? Both are legitimate tools. They are not the same product, and only one of them will move the numbers on audio that does not already look like the training set.

About the author

Frans Olsthoorn founded Scriptix in 2010 and has spent more than fifteen years shipping speech recognition into European broadcasters, courts and government bodies. He writes about ASR, accessibility regulation and the realities of running AI workloads inside customer infrastructure.