Configuration options
Part of the beauty of the Whisper API is the flexibility it offers. You can control every aspect of the transcription process by passing in a wide range of configuration options. In this guide, we will explore some of the configuration options available to you.
Setting the Language
We automatically detect the language of the audio file, but our detection is not always accurate. You can set the language explicitly by passing the `language` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
The `language` parameter is a two-letter ISO 639-1 language code. You can find the list of supported languages here.
Setting the Output Format
We currently support two output formats: `text` and `srt`. The default output format is `text`. You can set it by passing the `format` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -F "format=srt" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
When the transcription is complete, you will receive a response like this:
```json
{
  "task_id": "b9e2e1dc-7fda-4d4b-b414-4289dfc409e1",
  "status": "completed",
  "result": "1\n00:00:00,000 --> 00:00:06,520\nHello, world!\n\n2\n00:00:06,520 --> 00:00:10,520\nThis is a test.\n\n3\n00:00:10,520 --> 00:00:14",
  "language": "en",
  "format": "srt"
}
```
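When you request `format=srt`, the `result` field contains a standard SRT string that you can parse yourself. Here is a minimal sketch in Python; the `parse_srt` helper is our own illustration, not part of the API:

```python
import re

def parse_srt(srt_text):
    """Parse an SRT string into (index, start, end, text) tuples."""
    entries = []
    # SRT blocks are separated by blank lines: index, "start --> end", text.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 2:
            continue
        index = int(lines[0])
        start, end = [t.strip() for t in lines[1].split("-->")]
        text = " ".join(lines[2:])
        entries.append((index, start, end, text))
    return entries
```

You would call this on the `result` field of the JSON response above.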
Setting the Model Size
The model size determines both the quality of the transcription and how long it takes; processing time also depends on your file size and audio quality. A smaller model is faster but may be less accurate, while a larger model is more accurate but takes longer to process.
We currently support the following model sizes, roughly ordered from smallest to largest:
| Model Size | Description | Tier |
|---|---|---|
| tiny.en | Smallest English-only model | Free |
| tiny | Smallest multilingual model | Free |
| base.en | Base English-only model | Free |
| base | Base multilingual model | Free |
| small.en | Small English-only model | Free |
| small | Small multilingual model | Free |
| medium.en | Medium English-only model | Free |
| medium | Medium multilingual model | Free |
| large-v1 | Original large model | Paid |
| large-v2 | Improved large model | Paid |
| large-v3 | Latest large model | Paid |
| large | Alias for latest large model | Paid |
| distil-large-v2 | Distilled version of large-v2 | Paid |
| distil-medium.en | Distilled medium English-only model | Paid |
| distil-small.en | Distilled small English-only model | Paid |
| distil-large-v3 | Distilled version of large-v3 | Paid |
| large-v3-turbo | Optimized large-v3 for speed | Paid |
| turbo | Fastest model, balanced accuracy | Paid |
The default model size is `base`. You can set the model size by passing the `model_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "language=en" \
  -F "format=srt" \
  -F "model_size=large" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Advanced Search Parameters
Beam Search Parameters
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes. In the context of speech recognition, it maintains multiple hypotheses (beams) during decoding to find the most likely transcription.
Beam Size
Beam size controls how many alternative hypotheses are maintained during decoding. A larger beam size increases the chance of finding the optimal transcription but requires more computation time and memory.
The default beam size is 5. You can set the beam size by passing the `beam_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
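To build intuition for what `beam_size` controls, here is a toy beam search over a hypothetical scoring function. It illustrates the keep-the-top-k idea only; it is not the API's actual decoder:

```python
def beam_search(score_fn, vocab, start, beam_size, steps):
    """Keep the `beam_size` highest-scoring partial sequences at each step.

    `score_fn(seq, token)` returns the log probability of `token` given
    the partial sequence `seq`. Toy sketch for illustration only.
    """
    beams = [(start, 0.0)]  # (sequence, cumulative log probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token in vocab:
                candidates.append((seq + [token], score + score_fn(seq, token)))
        # Prune back down to the best `beam_size` hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

A larger `beam_size` keeps more hypotheses alive in the pruning step, which is exactly why it costs more time and memory.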
Best of
This parameter determines how many of the highest-scoring beam search results to consider for the final output. It works in conjunction with `beam_size` to balance between exploration and final selection quality.

The default value is 5. You can set it by passing the `best_of` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -F "best_of=3" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Patience
Patience controls how long the beam search continues searching for better hypotheses. A higher value means the search will be more thorough but potentially slower.
The default value is 1.0. You can set the patience value by passing the `patience` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "beam_size=10" \
  -F "best_of=3" \
  -F "patience=2.0" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Penalty Settings
These parameters help control the quality and characteristics of the generated text by penalizing certain undesirable behaviors in the model’s output.
Text Generation Penalties
Length Penalty
This parameter influences how the model handles the length of generated sequences. Values less than 1.0 encourage shorter outputs, while values greater than 1.0 encourage longer outputs.
The default value is 1.0. You can set the length penalty by passing the `length_penalty` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "length_penalty=0.8" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
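One common way to apply a length penalty, and an illustration of why values above 1.0 favor longer outputs, is to divide a hypothesis's cumulative log probability by its length raised to the penalty. The API's exact normalization formula is not documented here, so treat this as a sketch:

```python
def length_normalized_score(logprob_sum, length, length_penalty=1.0):
    """Length-normalize a cumulative log probability (always negative).

    With length_penalty > 1, longer sequences are divided by a larger
    number, moving their score closer to zero, so they rank higher.
    With length_penalty < 1, shorter sequences rank higher.
    Illustrative formulation only; the API may use a different one.
    """
    return logprob_sum / (length ** length_penalty)
```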
Repetition Penalty
This helps prevent the model from getting stuck in repetitive patterns by applying a penalty to tokens that have already been generated. Higher values make repetition less likely.
The default value is 1.0. You can set the repetition penalty by passing the `repetition_penalty` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "repetition_penalty=1.2" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
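A common implementation of repetition penalty (the CTRL-style scheme; the API's internals may differ) rescales the logits of tokens that have already been generated:

```python
def apply_repetition_penalty(logits, generated_tokens, penalty=1.0):
    """Penalize previously generated tokens, CTRL-style.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so for penalty > 1 a repeated token always becomes
    less likely. Sketch of a common scheme, not the API's exact code.
    """
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token in adjusted:
            if adjusted[token] > 0:
                adjusted[token] /= penalty
            else:
                adjusted[token] *= penalty
    return adjusted
```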
No Repeat N-Gram Size
This parameter prevents the model from generating the same sequence of n tokens that appeared before. It’s particularly useful for avoiding phrase-level repetition.
The default value is 0. You can set the no-repeat n-gram size by passing the `no_repeat_ngram_size` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "no_repeat_ngram_size=3" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
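The check behind `no_repeat_ngram_size` can be sketched as follows; a decoder would ban any next token that makes the final n-gram a repeat of an earlier one (our own illustration, not the API's code):

```python
def violates_no_repeat_ngram(tokens, n):
    """Return True if the last n tokens already appeared earlier as a
    contiguous n-gram. A value of n == 0 disables the check."""
    if n == 0 or len(tokens) < n:
        return False
    last = tuple(tokens[-n:])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
    return last in seen
```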
Sampling Temperature
Temperature modifies the probability distribution over the model’s predictions. Lower values make the model more confident and deterministic, while higher values increase randomness and creativity.
The default temperature values are [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]. You can set the temperature by passing the `temperature` parameter in the request.

Single Temperature

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "temperature=0.8" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```

Multiple Temperatures

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "temperature=[0.0,0.5,1.0]" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
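Temperature scaling itself is simple: divide the logits by the temperature before applying the softmax. The sketch below (our own illustration, not the API's code) shows why low temperatures sharpen the distribution and high ones flatten it:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits scaled by 1/temperature.

    Temperatures below 1 sharpen the distribution (more deterministic);
    above 1 flatten it (more random). Temperature 0 degenerates to a
    greedy argmax, handled as a special case here.
    """
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```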
Threshold Settings
These thresholds help control the model’s behavior in edge cases and improve the quality of transcription by filtering out low-confidence or problematic outputs.
Detection Thresholds
Compression Ratio Threshold
This threshold helps detect and filter out hallucinated speech in silent parts of the audio. It works by comparing the length of the generated text to what would be expected given the audio duration.
The default value is 2.4. You can set the compression ratio threshold by passing the `compression_ratio_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "compression_ratio_threshold=2.0" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
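The ratio in question can be computed with a general-purpose compressor: highly repetitive (often hallucinated) text compresses very well, producing a high ratio. The open-source Whisper implementation uses zlib this way; we assume the API does something similar:

```python
import zlib

def compression_ratio(text):
    """Ratio of raw UTF-8 byte length to zlib-compressed length.

    Repetitive text yields a high ratio; segments above the threshold
    can be treated as likely hallucinations. Assumed to mirror the
    open-source Whisper check; the API's internals may differ.
    """
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))
```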
Log Probability Threshold
This threshold filters out transcribed segments based on their average log probability. Lower values allow more uncertain transcriptions, while higher values enforce stricter filtering.
The default value is -1.0. You can set the log probability threshold by passing the `log_prob_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "log_prob_threshold=-1.5" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
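The filtering idea reduces to comparing a segment's average token log probability against the threshold; a hypothetical sketch:

```python
def passes_log_prob_threshold(token_log_probs, threshold=-1.0):
    """A segment passes if its average token log probability is at or
    above the threshold; otherwise it is treated as low-confidence.
    Sketch of the filtering idea, not the API's exact logic."""
    avg = sum(token_log_probs) / len(token_log_probs)
    return avg >= threshold
```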
No Speech Detection
This parameter controls how sensitive the model is to detecting silence or non-speech audio segments. Higher values make the model more likely to classify audio as non-speech.
The default value is 0.6. You can set the no-speech threshold by passing the `no_speech_threshold` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "no_speech_threshold=0.5" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Text Conditioning
Text conditioning allows the model to use additional context to improve transcription accuracy and maintain consistency across segments.
Prompt and Prefix Settings
Condition on Previous Text
This parameter determines whether the model should consider previously transcribed text when generating new segments. It helps maintain consistency and context across the entire transcription.
The default value is `true`. You can set this behavior by passing the `condition_on_previous_text` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "condition_on_previous_text=false" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Initial Prompt
This provides initial context to the model before it begins transcription. It can help guide the model’s style and context understanding from the start.
The default value is `null`. You can set the initial prompt by passing the `initial_prompt` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "initial_prompt=Meeting transcript:" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Prefix
Similar to `initial_prompt`, but specifically prepended to each segment. It's useful for maintaining consistent formatting or speaker identification throughout the transcription.

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "prefix=Speaker 1:" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Timestamp Options
Timestamps are crucial for aligning transcribed text with the original audio and enabling features like subtitle generation.
Timestamp Configuration
Without Timestamps
This option allows you to skip timestamp generation entirely. This can speed up transcription when timing information isn’t needed.
The default value is `false`. You can disable timestamps by passing the `without_timestamps` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "without_timestamps=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Word-Level Timestamps
This enables precise timing information for each word, rather than just at the segment level. It’s particularly useful for applications requiring word-level synchronization.
The default value is `false`. You can enable word-level timestamps by passing the `word_timestamps` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "word_timestamps=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Voice Activity Detection (VAD)
VAD Settings
VAD helps identify and focus on segments containing speech, improving efficiency and accuracy by filtering out non-speech audio sections before processing.
Enable VAD Filter
This parameter enables voice activity detection to filter out non-speech audio segments before transcription.
The default value is `false`. You can enable the VAD filter by passing the `vad_filter` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "vad_filter=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
VAD Parameters
This allows fine-tuning of the voice activity detection system through custom parameters, enabling optimization for specific audio characteristics or requirements.
The default value is `null`.
The following VAD parameters can be configured:
- `threshold` (default 0.5): Controls how sensitive the system is in detecting speech. A lower value (e.g., 0.3) will classify more audio as speech, while a higher value (e.g., 0.7) will be more selective. Values range from 0 to 1.
- `min_speech_duration_ms` (default 250): Sets the shortest duration, in milliseconds, that will be considered valid speech. Shorter segments are filtered out, which helps eliminate brief sounds or noise that shouldn't be transcribed.
- `min_silence_duration_ms` (default 2000): Defines how long, in milliseconds, a quiet period must last to be considered actual silence rather than just a pause in speech. This helps prevent over-segmentation of natural speech patterns.
- `window_size_samples` (default 1024): Determines how much audio is analyzed at once when detecting speech. Larger windows provide more context but require more processing power. This is typically best left at the default unless you have specific requirements.
- `speech_pad_ms` (default 400): Adds extra time before and after detected speech segments so that no speech is cut off. This helps capture natural fade-ins and fade-outs of speech.
You can set these parameters by passing the `vad_parameters` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "vad_filter=true" \
  -F 'vad_parameters={"threshold":0.6,"min_speech_duration_ms":300}' \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
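To see how the duration parameters interact, here is an illustrative post-processing pass over raw `(start_ms, end_ms)` speech segments. It is our own sketch of what these knobs control, not the API's VAD code:

```python
def postprocess_vad_segments(segments, min_speech_duration_ms=250,
                             min_silence_duration_ms=2000, speech_pad_ms=400):
    """Apply the duration parameters to raw (start_ms, end_ms) segments:
    merge segments separated by silences shorter than
    min_silence_duration_ms, drop segments shorter than
    min_speech_duration_ms, then pad each side by speech_pad_ms."""
    if not segments:
        return []
    # Merge across short silences so natural pauses don't split speech.
    merged = [list(segments[0])]
    for start, end in segments[1:]:
        if start - merged[-1][1] < min_silence_duration_ms:
            merged[-1][1] = end
        else:
            merged.append([start, end])
    # Drop too-short speech, then pad what remains.
    return [(max(0, s - speech_pad_ms), e + speech_pad_ms)
            for s, e in merged if e - s >= min_speech_duration_ms]
```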
Multilingual Mode
This setting enables the model to handle multiple languages within the same audio file. It’s particularly useful for content with language switching or mixed language use.
The default value is `false`. You can enable multilingual mode by passing the `multilingual` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "multilingual=true" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Chunk Length
This parameter controls how the audio is split into smaller segments for processing. It can help manage memory usage and enable parallel processing of longer audio files.
The default value is `null`. You can set the chunk length by passing the `chunk_length` parameter in the request:

```bash
curl \
  -F "url=https://files.whisper-api.com/example.mp4" \
  -F "chunk_length=30" \
  -H "X-API-Key: YOUR_API_KEY" \
  https://api.whisper-api.com/transcribe
```
Conclusion
Whoa! That was a lot of information. But don’t worry, you don’t have to remember everything. You can always refer back to this guide whenever you need to customize your transcription requests.