Configuration options
Part of the beauty of the Whisper API is the flexibility it offers. You can control every aspect of the transcription process by passing in a wide range of configuration options. In this guide, we will explore some of the configuration options available to you.
Setting the Language
We automatically detect the language of the audio file, but our detection is not always accurate. You can set the language explicitly by passing the `language` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
The `language` parameter is a two-letter ISO 639-1 language code. You can find the list of supported languages here.
Setting the Output Format
We currently support two output formats: `text` and `srt`. The default output format is `text`. You can set it by passing the `format` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -F "format=srt" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
When the transcription is complete, you will receive a response like this:
```json
{
  "task_id": "b9e2e1dc-7fda-4d4b-b414-4289dfc409e1",
  "status": "completed",
  "result": "1\n00:00:00,000 --> 00:00:06,520\nHello, world!\n\n2\n00:00:06,520 --> 00:00:10,520\nThis is a test.\n\n3\n00:00:10,520 --> 00:00:14",
  "language": "en",
  "format": "srt"
}
```
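When you request `format=srt`, the `result` field contains a standard SRT string that you can parse yourself. Here is a minimal sketch in Python; the `parse_srt` helper is our own illustration, not part of the API:

```python
import re

def parse_srt(srt_text):
    """Parse an SRT string into (index, start, end, text) tuples."""
    entries = []
    # SRT blocks are separated by blank lines: index, "start --> end", text.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        index = int(lines[0])
        start, end = [t.strip() for t in lines[1].split("-->")]
        text = " ".join(lines[2:])
        entries.append((index, start, end, text))
    return entries
```

You would call this on the `result` field of the JSON response above.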
Setting the Model Size
The model size determines both the quality of the transcription and how long it takes; processing time also depends on your file size and audio quality. A smaller model is faster but may be less accurate, while a larger model is more accurate but takes longer to process.
We currently support the following model sizes, roughly ordered from smallest to largest:
| Model Size | Description | Tier |
|---|---|---|
| tiny.en | Smallest English-only model | Free |
| tiny | Smallest multilingual model | Free |
| base.en | Base English-only model | Free |
| base | Base multilingual model | Free |
| small.en | Small English-only model | Free |
| small | Small multilingual model | Free |
| medium.en | Medium English-only model | Free |
| medium | Medium multilingual model | Free |
| large-v1 | Original large model | Paid |
| large-v2 | Improved large model | Paid |
| large-v3 | Latest large model | Paid |
| large | Alias for latest large model | Paid |
| distil-large-v2 | Distilled version of large-v2 | Paid |
| distil-medium.en | Distilled medium English-only model | Paid |
| distil-small.en | Distilled small English-only model | Paid |
| distil-large-v3 | Distilled version of large-v3 | Paid |
| large-v3-turbo | Optimized large-v3 for speed | Paid |
| turbo | Fastest model, balanced accuracy | Paid |
The default model size is `base`. You can set the model size by passing the `model_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -F "format=srt" \
  -F "model_size=large" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Advanced Search Parameters
Beam Search Parameters
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes. In the context of speech recognition, it maintains multiple hypotheses (beams) during decoding to find the most likely transcription.
Beam Size
Beam size controls how many alternative hypotheses are maintained during decoding. A larger beam size increases the chance of finding the optimal transcription but requires more computation time and memory.
The default beam size is 5. You can set the beam size by passing the `beam_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
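To build intuition for what `beam_size` controls, here is a toy beam search over a hypothetical scoring function. It illustrates the keep-the-top-k idea only; it is not the API's actual decoder:

```python
def beam_search(score_fn, vocab, start, beam_size, steps):
    """Keep the `beam_size` highest-scoring partial sequences at each step.

    `score_fn(seq, token)` returns the log probability of `token` given
    the partial sequence `seq`. Toy sketch for illustration only.
    """
    beams = [(start, 0.0)]  # (sequence, cumulative log probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token in vocab:
                candidates.append((seq + [token], score + score_fn(seq, token)))
        # Prune back down to the best `beam_size` hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

A larger `beam_size` keeps more hypotheses alive in the pruning step, which is exactly why it costs more time and memory.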
Best of
This parameter determines how many of the highest-scoring beam search results to consider for the final output. It works in conjunction with `beam_size` to balance between exploration and final selection quality.

The default value is 5. You can set it by passing the `best_of` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -F "best_of=3" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Patience
Patience controls how long the beam search continues searching for better hypotheses. A higher value means the search will be more thorough but potentially slower.
The default value is 1.0. You can set the patience value by passing the `patience` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -F "best_of=3" \
  -F "patience=2.0" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Penalty Settings
These parameters help control the quality and characteristics of the generated text by penalizing certain undesirable behaviors in the model’s output.
Text Generation Penalties
Length Penalty
This parameter influences how the model handles the length of generated sequences. Values less than 1.0 encourage shorter outputs, while values greater than 1.0 encourage longer outputs.
The default value is 1.0. You can set the length penalty by passing the `length_penalty` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "length_penalty=0.8" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
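One common way to apply a length penalty, and an illustration of why values above 1.0 favor longer outputs, is to divide a hypothesis's cumulative log probability by its length raised to the penalty. The API's exact normalization formula is not documented here, so treat this as a sketch:

```python
def length_normalized_score(logprob_sum, length, length_penalty=1.0):
    """Length-normalize a cumulative log probability (always negative).

    With length_penalty > 1, longer sequences are divided by a larger
    number, moving their score closer to zero, so they rank higher.
    With length_penalty < 1, shorter sequences rank higher.
    Illustrative formulation only; the API may use a different one.
    """
    return logprob_sum / (length ** length_penalty)
```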
Repetition Penalty
This helps prevent the model from getting stuck in repetitive patterns by applying a penalty to tokens that have already been generated. Higher values make repetition less likely.
The default value is 1.0. You can set the repetition penalty by passing the `repetition_penalty` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "repetition_penalty=1.2" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
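A common implementation of repetition penalty (the CTRL-style scheme; the API's internals may differ) rescales the logits of tokens that have already been generated:

```python
def apply_repetition_penalty(logits, generated_tokens, penalty=1.0):
    """Penalize previously generated tokens, CTRL-style.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so for penalty > 1 a repeated token always becomes
    less likely. Sketch of a common scheme, not the API's exact code.
    """
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token in adjusted:
            if adjusted[token] > 0:
                adjusted[token] /= penalty
            else:
                adjusted[token] *= penalty
    return adjusted
```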
No Repeat N-Gram Size
This parameter prevents the model from generating the same sequence of n tokens that appeared before. It’s particularly useful for avoiding phrase-level repetition.
The default value is 0. You can set the no-repeat n-gram size by passing the `no_repeat_ngram_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "no_repeat_ngram_size=3" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
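The check behind `no_repeat_ngram_size` can be sketched as follows; a decoder would ban any next token that makes the final n-gram a repeat of an earlier one (our own illustration, not the API's code):

```python
def violates_no_repeat_ngram(tokens, n):
    """Return True if the last n tokens already appeared earlier as a
    contiguous n-gram. A value of n == 0 disables the check."""
    if n == 0 or len(tokens) < n:
        return False
    last = tuple(tokens[-n:])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return last in seen
```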
Sampling Temperature
Temperature modifies the probability distribution over the model’s predictions. Lower values make the model more confident and deterministic, while higher values increase randomness and creativity.
The default temperature values are [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]. You can set the temperature by passing the `temperature` parameter in the request.

Single Temperature

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "temperature=0.8" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```

Multiple Temperatures

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "temperature=[0.0,0.5,1.0]" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
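Temperature scaling itself is simple: divide the logits by the temperature before applying the softmax. The sketch below (our own illustration, not the API's code) shows why low temperatures sharpen the distribution and high ones flatten it:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature.

    Temperatures below 1 sharpen the distribution (more deterministic);
    above 1 flatten it (more random). Temperature 0 degenerates to a
    greedy argmax, handled as a special case here.
    """
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```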
Threshold Settings
These thresholds help control the model’s behavior in edge cases and improve the quality of transcription by filtering out low-confidence or problematic outputs.
Detection Thresholds
Compression Ratio Threshold
This threshold helps detect and filter out hallucinated speech in silent parts of the audio. It works by comparing the length of the generated text to what would be expected given the audio duration.
The default value is 2.4. You can set the compression ratio threshold by passing the `compression_ratio_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "compression_ratio_threshold=2.0" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
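The ratio in question can be computed with a general-purpose compressor: highly repetitive (often hallucinated) text compresses very well, producing a high ratio. The open-source Whisper implementation uses zlib this way; we assume the API does something similar:

```python
import zlib

def compression_ratio(text):
    """Ratio of raw UTF-8 byte length to zlib-compressed length.

    Repetitive text yields a high ratio; segments above the threshold
    can be treated as likely hallucinations. Assumed to mirror the
    open-source Whisper check; the API's internals may differ.
    """
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))
```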
Log Probability Threshold
This threshold filters out transcribed segments based on their average log probability. Lower values allow more uncertain transcriptions, while higher values enforce stricter filtering.
The default value is -1.0. You can set the log probability threshold by passing the `log_prob_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "log_prob_threshold=-1.5" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
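The filtering idea reduces to comparing a segment's average token log probability against the threshold; a hypothetical sketch:

```python
def passes_log_prob_threshold(token_log_probs, threshold=-1.0):
    """A segment passes if its average token log probability is at or
    above the threshold; otherwise it is treated as low-confidence.
    Sketch of the filtering idea, not the API's exact logic."""
    avg = sum(token_log_probs) / len(token_log_probs)
    return avg >= threshold
```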
No Speech Detection
This parameter controls how sensitive the model is to detecting silence or non-speech audio segments. Higher values make the model more likely to classify audio as non-speech.
The default value is 0.6. You can set the no-speech threshold by passing the `no_speech_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "no_speech_threshold=0.5" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Text Conditioning
Text conditioning allows the model to use additional context to improve transcription accuracy and maintain consistency across segments.
Prompt and Prefix Settings
Condition on Previous Text
This parameter determines whether the model should consider previously transcribed text when generating new segments. It helps maintain consistency and context across the entire transcription.
The default value is `true`. You can set this behavior by passing the `condition_on_previous_text` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "condition_on_previous_text=false" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Initial Prompt
This provides initial context to the model before it begins transcription. It can help guide the model’s style and context understanding from the start.
The default value is `null`. You can set the initial prompt by passing the `initial_prompt` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "initial_prompt=Meeting transcript:" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Prefix
Similar to `initial_prompt`, but specifically prepended to each segment. It's useful for maintaining consistent formatting or speaker identification throughout the transcription.

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "prefix=Speaker 1:" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Timestamp Options
Timestamps are crucial for aligning transcribed text with the original audio and enabling features like subtitle generation.
Timestamp Configuration
Without Timestamps
This option allows you to skip timestamp generation entirely. This can speed up transcription when timing information isn’t needed.
The default value is `false`. You can disable timestamps by passing the `without_timestamps` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "without_timestamps=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Word-Level Timestamps
This enables precise timing information for each word, rather than just at the segment level. It’s particularly useful for applications requiring word-level synchronization.
The default value is `false`. You can enable word-level timestamps by passing the `word_timestamps` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "word_timestamps=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Voice Activity Detection (VAD)
VAD Settings
VAD helps identify and focus on segments containing speech, improving efficiency and accuracy by filtering out non-speech audio sections before processing.
Enable VAD Filter
This parameter enables voice activity detection to filter out non-speech audio segments before transcription.
The default value is `false`. You can enable the VAD filter by passing the `vad_filter` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "vad_filter=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
VAD Parameters
This allows fine-tuning of the voice activity detection system through custom parameters, enabling optimization for specific audio characteristics or requirements.
The default value is `null`.
The following VAD parameters can be configured:
- `threshold` (default 0.5): Controls how sensitive the system is in detecting speech. A lower value (e.g., 0.3) will classify more audio as speech, while a higher value (e.g., 0.7) will be more selective. Values range from 0 to 1.
- `min_speech_duration_ms` (default 250): Sets the shortest duration, in milliseconds, that will be considered valid speech. Shorter segments are filtered out, which helps eliminate brief sounds or noise that shouldn't be transcribed.
- `min_silence_duration_ms` (default 2000): Defines how long, in milliseconds, a quiet period must last to be considered actual silence rather than just a pause in speech. This helps prevent over-segmentation of natural speech patterns.
- `window_size_samples` (default 1024): Determines how much audio is analyzed at once when detecting speech. Larger windows provide more context but require more processing power. This is typically best left at the default unless you have specific requirements.
- `speech_pad_ms` (default 400): Adds extra time before and after detected speech segments so that no speech is cut off. This helps capture natural fade-ins and fade-outs of speech.
You can set these parameters by passing the `vad_parameters` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "vad_filter=true" \
  -F 'vad_parameters={"threshold":0.6,"min_speech_duration_ms":300}' \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
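To see how the duration parameters interact, here is an illustrative post-processing pass over raw `(start_ms, end_ms)` speech segments. It is our own sketch of what these knobs control, not the API's VAD code:

```python
def postprocess_vad_segments(segments, min_speech_duration_ms=250,
                             min_silence_duration_ms=2000, speech_pad_ms=400):
    """Apply the duration parameters to raw (start_ms, end_ms) segments:
    merge segments separated by silences shorter than
    min_silence_duration_ms, drop segments shorter than
    min_speech_duration_ms, then pad each side by speech_pad_ms."""
    if not segments:
        return []
    # Merge across short silences so natural pauses don't split speech.
    merged = [list(segments[0])]
    for start, end in segments[1:]:
        if start - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # Drop too-short speech, then pad what remains.
    return [(max(0, s - speech_pad_ms), e + speech_pad_ms)
            for s, e in merged if e - s >= min_speech_duration_ms]
```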
Multilingual Mode
This setting enables the model to handle multiple languages within the same audio file. It’s particularly useful for content with language switching or mixed language use.
The default value is `false`. You can enable multilingual mode by passing the `multilingual` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "multilingual=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Chunk Length
This parameter controls how the audio is split into smaller segments for processing. It can help manage memory usage and enable parallel processing of longer audio files.
The default value is `null`. You can set the chunk length by passing the `chunk_length` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "chunk_length=30" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Conclusion
Whoa! That was a lot of information. But don’t worry, you don’t have to remember everything. You can always refer back to this guide whenever you need to customize your transcription requests.