Transcription API
The transcription API converts audio files to text. It supports automatic language detection, chunk-level prompting presets, and streaming language-lock updates.
Workflow
- Presign: get a presigned upload URL
- Upload: upload the audio file to that URL
- Create: start the transcription job
- Poll: check the status until it is `completed` or `failed`
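The four workflow steps can be sketched as a single function. This is a minimal sketch under stated assumptions, not an official client: the `http` object and its `post_json`, `put_bytes`, and `get_json` methods are hypothetical stand-ins for whatever HTTP layer you use, uploading with PUT is an assumption (common for presigned URLs but not stated in this document), and authentication is left out entirely.

```python
import time


def transcribe(http, audio: bytes, filename: str, content_type: str,
               duration_seconds: int, poll_interval: float = 2.0,
               sleep=time.sleep) -> dict:
    """Run presign -> upload -> create -> poll and return the final job.

    `http` is a hypothetical transport exposing post_json(path, body),
    put_bytes(url, data, content_type) and get_json(path).
    """
    file_info = {
        "filename": filename,
        "content_type": content_type,
        "size_bytes": len(audio),
    }

    # 1. Presign: ask for an upload URL and an object key.
    presigned = http.post_json("/v1/uploads/presign", file_info)

    # 2. Upload: send the bytes to the presigned URL (PUT is an assumption).
    http.put_bytes(presigned["upload_url"], audio, content_type)

    # 3. Create: start the job, referencing the uploaded object.
    created = http.post_json("/v1/transcriptions", {
        **file_info,
        "object_key": presigned["object_key"],
        "duration_seconds": duration_seconds,
    })
    job_id = created["transcription"]["id"]

    # 4. Poll: fetch the job until it reaches a terminal status.
    while True:
        job = http.get_json(f"/v1/transcriptions/{job_id}")
        if job["status"] in ("completed", "failed"):
            return job
        sleep(poll_interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays.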
Endpoints
POST /v1/uploads/presign
Get a presigned URL to upload your audio file.
Request:
{
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760
}
Response:
{
  "upload_url": "https://storage.example.com/...",
  "object_key": "uploads/abc123/interview.mp3"
}
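The presign request body can be derived from a local file rather than typed by hand. The helper below is a hypothetical sketch (`build_presign_request` is not part of the API): it infers the MIME type from the file extension with the standard library and reads the size from disk.

```python
import mimetypes
import os


def build_presign_request(path: str) -> dict:
    """Build the JSON body for POST /v1/uploads/presign from a local file."""
    # guess_type looks only at the file extension, not at the file contents.
    content_type, _encoding = mimetypes.guess_type(path)
    if content_type is None:
        raise ValueError(f"cannot infer a MIME type for {path!r}")
    return {
        "filename": os.path.basename(path),
        "content_type": content_type,
        "size_bytes": os.path.getsize(path),
    }
```

The same three values are required again when creating the transcription job, so it is worth keeping the returned dictionary around.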
POST /v1/transcriptions
Create a transcription job.
Request:
{
  "object_key": "uploads/abc123/interview.mp3",
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760,
  "duration_seconds": 120,
  "language": "auto",
  "model": "gemma4-e2b-asr-stream",
  "chunk_prompt_preset": "default"
}
Parameters:
| Field | Type | Required | Description |
|---|---|---|---|
| `object_key` | string | yes | Object key from the presign response |
| `filename` | string | yes | Original filename |
| `content_type` | string | yes | MIME type of the audio |
| `size_bytes` | integer | yes | File size in bytes |
| `duration_seconds` | integer | yes | Audio duration in seconds |
| `language` | string | no | Language code or `"auto"` (default: `"auto"`) |
| `model` | string | no | Gemma ASR model (default: `"gemma4-e2b-asr-stream"`) |
| `chunk_prompt_preset` | string | no | Chunk-level ASR preset: `default`, `verbatim`, `remove_fillers`, or `concise` |
Available models:
| Model | Description |
|---|---|
| `gemma4-e2b-asr-stream` | Default. Streaming Gemma ASR with chunk-level language locking |
| `gemma4-e2b-asr` | Non-streaming Gemma ASR |
Available chunk presets:
| Preset | Description |
|---|---|
| `default` | Standard transcript with digits preserved |
| `verbatim` | Keep filler words and repetitions |
| `remove_fillers` | Remove filler words per chunk while preserving meaning |
| `concise` | Make each chunk more concise while preserving meaning |
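The parameter, model, and preset tables translate directly into client-side validation. The following is a sketch of that idea, assuming you want to catch mistakes before sending the request; `build_create_request` is a hypothetical helper, and the server remains the authority on what is accepted. Because the table gives no default for `chunk_prompt_preset`, the field is omitted when the caller does not set it.

```python
MODELS = {"gemma4-e2b-asr-stream", "gemma4-e2b-asr"}
PRESETS = {"default", "verbatim", "remove_fillers", "concise"}
REQUIRED = {
    "object_key": str,
    "filename": str,
    "content_type": str,
    "size_bytes": int,
    "duration_seconds": int,
}
OPTIONAL = {"language", "model", "chunk_prompt_preset"}


def build_create_request(**fields) -> dict:
    """Validate fields and return a body for POST /v1/transcriptions."""
    unknown = set(fields) - set(REQUIRED) - OPTIONAL
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    for name, expected in REQUIRED.items():
        if name not in fields:
            raise ValueError(f"missing required field: {name}")
        value = fields[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")

    body = {name: fields[name] for name in REQUIRED}
    # Documented defaults for the optional fields.
    body["language"] = fields.get("language", "auto")
    body["model"] = fields.get("model", "gemma4-e2b-asr-stream")
    if body["model"] not in MODELS:
        raise ValueError(f"unknown model: {body['model']!r}")

    preset = fields.get("chunk_prompt_preset")
    if preset is not None:
        if preset not in PRESETS:
            raise ValueError(f"unknown chunk_prompt_preset: {preset!r}")
        body["chunk_prompt_preset"] = preset
    return body
```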
Response:
{
  "user": {
    "id": "550e8400-e29b-41d4-a716-446655440001",
    "email": "user@example.com",
    "name": "Jane Doe"
  },
  "transcription": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "source_filename": "interview.mp3",
    "source_content_type": "audio/mpeg",
    "requested_language": "auto",
    "model": "gemma4-e2b-asr-stream",
    "created_at": "2026-04-20T10:00:00Z"
  },
  "usage": {
    "free_remaining": 480,
    "package_seconds_remaining": 0
  },
  "estimate": {
    "queue_position": 1,
    "estimated_wait_seconds": 30
  },
  "quote": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  }
}
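A client usually needs only a few values from this response: the job ID for later calls, the queue estimate, and the quoted charge. The sketch below pulls those out into a small dataclass. `CreatedJob` and `parse_created` are hypothetical names, and the conversion assumes a kopek is one hundredth of the major currency unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreatedJob:
    job_id: str
    status: str
    queue_position: int
    estimated_wait_seconds: int
    charged_kopeks: int

    @property
    def charged_major_units(self) -> float:
        # Assumes 100 kopeks per major currency unit.
        return self.charged_kopeks / 100


def parse_created(response: dict) -> CreatedJob:
    """Extract the fields a client typically needs from the create response."""
    job = response["transcription"]
    estimate = response["estimate"]
    return CreatedJob(
        job_id=job["id"],
        status=job["status"],
        queue_position=estimate["queue_position"],
        estimated_wait_seconds=estimate["estimated_wait_seconds"],
        charged_kopeks=response["quote"]["charged_amount_kopeks"],
    )
```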
GET /v1/transcriptions/:id
Get a transcription by ID.
Response (completed):
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "transcript_text": "Hello, this is the transcribed text...",
  "transcript_segments": [
    { "start": 0.0, "end": 2.5, "text": "Hello," },
    { "start": 2.5, "end": 5.1, "text": "this is the transcribed text..." }
  ],
  "detected_language": "en",
  "duration_seconds": 120,
  "pricing_breakdown": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  },
  "created_at": "2026-04-20T10:00:00Z",
  "completed_at": "2026-04-20T10:00:45Z"
}
GET /v1/transcriptions
List all transcriptions for the authenticated user.
GET /v1/transcriptions/:id/stream
Server-Sent Events (SSE) stream for real-time transcription updates. Useful for showing partial results as the audio is being processed.
Partial updates may include `language_state`, which reports the detected language, whether the language is locked, and how many chunks have been processed. After 3 chunks have been returned, the remaining chunks use the detected language from the ASR response.
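Consuming the stream means parsing SSE framing and then inspecting each update. The sketch below handles the standard `data:` framing over any iterable of text lines. The payload shape is an assumption: this document does not specify the event format, so JSON payloads and the `detected_language` and `locked` keys inside `language_state` are hypothetical and should be checked against real responses.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event (assumes JSON payloads)."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other SSE fields (event:, id:, retry:) are ignored in this sketch.


def locked_language(events: Iterable[dict]) -> Optional[str]:
    """Return the language once an update reports it as locked, else None."""
    for event in events:
        state = event.get("language_state")  # optional per the text above
        if state and state.get("locked"):  # hypothetical key
            return state.get("detected_language")  # hypothetical key
    return None
```

The parser takes plain lines so it can be fed from any HTTP client that exposes the response body line by line.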
GET /v1/transcriptions/:id/eta
Get the estimated time of arrival (ETA) for a queued transcription.
Statuses
| Status | Description |
|---|---|
| `queued` | Waiting in the processing queue |
| `processing` | Currently being transcribed |
| `completed` | Transcription finished successfully |
| `failed` | Transcription failed (see `error_message`) |
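The status table implies a small state machine for pollers: `queued` and `processing` mean keep waiting, `completed` and `failed` are terminal. A minimal sketch of a polling helper with a timeout follows. `wait_for_completion` and `TranscriptionFailed` are hypothetical names, `fetch_job` stands in for a call to `GET /v1/transcriptions/:id`, and reading `error_message` from the top level of the job object is an assumption.

```python
import time


class TranscriptionFailed(Exception):
    """Raised when a job ends in the `failed` status."""


def wait_for_completion(fetch_job, interval: float = 2.0,
                        timeout: float = 600.0,
                        sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll fetch_job() until the job completes, fails, or the timeout expires."""
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        status = job["status"]
        if status == "completed":
            return job
        if status == "failed":
            # Assumes error_message sits at the top level of the job object.
            raise TranscriptionFailed(job.get("error_message", "unknown error"))
        if status not in ("queued", "processing"):
            raise ValueError(f"unexpected status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout} seconds")
        sleep(interval)
```

Rejecting unknown statuses instead of looping keeps the poller from spinning forever if the API gains a new terminal state.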