Transcription API

The transcription API converts audio files to text. It supports automatic language detection, chunk-level prompt presets, and streaming updates with language locking.

Workflow

1. Presign: get a presigned upload URL.
2. Upload: upload the audio file to that URL.
3. Create: start the transcription job.
4. Poll: check the job status until it is `completed` or `failed`.
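The four steps above can be sketched end to end as follows. This is a minimal illustration rather than an official client: `BASE_URL` is a placeholder host, the upload is assumed to be an HTTP `PUT` to the presigned URL, and authentication is left out because this section does not describe it. The `call` and `put` parameters are hypothetical hooks that make the transport swappable.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: placeholder host


def http_json(method, url, body=None):
    """Send an optional JSON body and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def http_put(url, data, content_type):
    """Upload raw bytes to the presigned URL (assumption: HTTP PUT)."""
    req = urllib.request.Request(
        url, data=data, method="PUT",
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req):
        pass


def transcribe(audio, filename, content_type, duration_seconds,
               call=http_json, put=http_put, poll_interval=2.0):
    """Run presign, upload, create, and poll; return the final job object."""
    meta = {"filename": filename, "content_type": content_type,
            "size_bytes": len(audio)}
    # 1. Presign: ask for an upload URL and object key.
    presigned = call("POST", f"{BASE_URL}/v1/uploads/presign", meta)
    # 2. Upload: send the audio bytes to the presigned URL.
    put(presigned["upload_url"], audio, content_type)
    # 3. Create: start the job, relying on server defaults for optional fields.
    created = call("POST", f"{BASE_URL}/v1/transcriptions",
                   {**meta, "object_key": presigned["object_key"],
                    "duration_seconds": duration_seconds})
    job_id = created["transcription"]["id"]
    # 4. Poll: stop on either terminal status.
    while True:
        job = call("GET", f"{BASE_URL}/v1/transcriptions/{job_id}")
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
```

A production client would likely add a timeout or a maximum number of polls so that a stuck job cannot block forever.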

Endpoints

`POST /v1/uploads/presign`

Get a presigned URL to upload your audio file.

Request:

{
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760
}

Response:

{
  "upload_url": "https://storage.example.com/...",
  "object_key": "uploads/abc123/interview.mp3"
}
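As an illustration, the presign request body could be derived from a local file like this. The helper name `presign_payload` is hypothetical, and rejecting files whose guessed MIME type is not `audio/*` is an assumption of this sketch, not a documented server rule.

```python
import mimetypes
import os


def presign_payload(path):
    """Build the JSON body for POST /v1/uploads/presign from a local file."""
    content_type, _ = mimetypes.guess_type(path)
    # Assumption: only audio MIME types are worth sending to this API.
    if content_type is None or not content_type.startswith("audio/"):
        raise ValueError(f"cannot determine an audio MIME type for {path!r}")
    return {
        "filename": os.path.basename(path),
        "content_type": content_type,
        "size_bytes": os.path.getsize(path),
    }
```

The same `filename`, `content_type`, and `size_bytes` values are needed again when creating the transcription job, so it can be convenient to keep this dictionary around.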

`POST /v1/transcriptions`

Create a transcription job.

Request:

{
  "object_key": "uploads/abc123/interview.mp3",
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760,
  "duration_seconds": 120,
  "language": "auto",
  "model": "gemma4-e2b-asr-stream",
  "chunk_prompt_preset": "default"
}

Parameters:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `object_key` | string | yes | Object key from the presign response |
| `filename` | string | yes | Original filename |
| `content_type` | string | yes | MIME type of the audio |
| `size_bytes` | integer | yes | File size in bytes |
| `duration_seconds` | integer | yes | Audio duration in seconds |
| `language` | string | no | Language code or `"auto"` (default: `"auto"`) |
| `model` | string | no | Gemma ASR model (default: `"gemma4-e2b-asr-stream"`) |
| `chunk_prompt_preset` | string | no | Chunk-level ASR preset: `default`, `verbatim`, `remove_fillers`, or `concise` |

Available models:

| Model | Description |
| --- | --- |
| `gemma4-e2b-asr-stream` | Default. Streaming Gemma ASR with chunk-level language locking |
| `gemma4-e2b-asr` | Non-streaming Gemma ASR |

Available chunk presets:

| Preset | Description |
| --- | --- |
| `default` | Standard transcript with digits preserved |
| `verbatim` | Keep filler words and repetitions |
| `remove_fillers` | Remove filler words per chunk while preserving meaning |
| `concise` | Make each chunk more concise while preserving meaning |
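A client may want to catch bad parameters before sending the request. The sketch below checks the required fields, model names, and preset names listed above; the function name `build_transcription_request` is hypothetical, and the server remains the authority on what is valid. Optional fields that are not given are omitted so the documented defaults apply.

```python
MODELS = {"gemma4-e2b-asr-stream", "gemma4-e2b-asr"}
PRESETS = {"default", "verbatim", "remove_fillers", "concise"}
REQUIRED = ("object_key", "filename", "content_type",
            "size_bytes", "duration_seconds")


def build_transcription_request(*, language=None, model=None,
                                chunk_prompt_preset=None, **fields):
    """Validate inputs and return the JSON body for POST /v1/transcriptions."""
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    unknown = sorted(set(fields) - set(REQUIRED))
    if unknown:
        raise ValueError(f"unknown fields: {', '.join(unknown)}")
    if model is not None and model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if chunk_prompt_preset is not None and chunk_prompt_preset not in PRESETS:
        raise ValueError(f"unknown preset: {chunk_prompt_preset!r}")

    body = dict(fields)
    optional = {"language": language, "model": model,
                "chunk_prompt_preset": chunk_prompt_preset}
    # Leave out unset optional fields so server-side defaults are used.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body
```

For example, passing `chunk_prompt_preset="verbatim"` together with the five required fields yields a body with six keys, while a misspelled preset raises `ValueError` locally.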

Response:

{
  "user": {
    "id": "550e8400-e29b-41d4-a716-446655440001",
    "email": "user@example.com",
    "name": "Jane Doe"
  },
  "transcription": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "source_filename": "interview.mp3",
    "source_content_type": "audio/mpeg",
    "requested_language": "auto",
    "model": "gemma4-e2b-asr-stream",
    "created_at": "2026-04-20T10:00:00Z"
  },
  "usage": {
    "free_remaining": 480,
    "package_seconds_remaining": 0
  },
  "estimate": {
    "queue_position": 1,
    "estimated_wait_seconds": 30
  },
  "quote": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  }
}
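The create response bundles several objects, and a client will often need only a few values from it. As a rough sketch, the hypothetical helper below pulls out the job ID, queue estimate, and quote. It assumes the response always has the shape shown above and converts kopeks to major currency units at 100 kopeks per unit.

```python
def summarize_created(response):
    """Return a one-line, human-readable summary of a create-job response."""
    job = response["transcription"]
    estimate = response["estimate"]
    quote = response["quote"]
    kopeks = quote["charged_amount_kopeks"]
    # Integer arithmetic avoids float rounding; 100 kopeks make one major unit.
    charge = f"{kopeks // 100}.{kopeks % 100:02d}"
    return (
        f"{job['id']} is {job['status']} "
        f"(position {estimate['queue_position']}, "
        f"about {estimate['estimated_wait_seconds']} s wait); "
        f"{quote['seconds']} s quoted as {quote['mode']}, charge {charge}"
    )
```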

`GET /v1/transcriptions/:id`

Get a transcription by ID.

Response (completed):

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "transcript_text": "Hello, this is the transcribed text...",
  "transcript_segments": [
    { "start": 0.0, "end": 2.5, "text": "Hello," },
    { "start": 2.5, "end": 5.1, "text": "this is the transcribed text..." }
  ],
  "detected_language": "en",
  "duration_seconds": 120,
  "pricing_breakdown": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  },
  "created_at": "2026-04-20T10:00:00Z",
  "completed_at": "2026-04-20T10:00:45Z"
}
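One common use of `transcript_segments` is generating subtitles. The sketch below converts the segments into SubRip (SRT) text. It assumes `start` and `end` are offsets in seconds, which the example values suggest but this section does not state. The function names are illustrative.

```python
def srt_timestamp(seconds):
    """Format a second offset as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render transcript segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

Applied to the two segments in the response above, the first cue spans `00:00:00,000 --> 00:00:02,500` and the second `00:00:02,500 --> 00:00:05,100`.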

`GET /v1/transcriptions`

List all transcriptions for the authenticated user.

`GET /v1/transcriptions/:id/stream`

Server-Sent Events (SSE) stream of real-time transcription updates, useful for showing partial results while the audio is still being processed.

Partial updates may include `language_state`, which reports the detected language, whether the language is locked, and how many chunks have been processed. Once 3 chunks have been returned, the remaining chunks use the detected language from the ASR response.
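A minimal way to consume the stream is to parse the standard SSE line format and decode each `data` payload. The parser below follows the generic SSE rules (comment lines, `event:` and `data:` fields, a blank line ends an event). Everything about the payload is an assumption: this section does not give the event names or the JSON shape, so the `partial` event name and the `locked` key inside `language_state` are hypothetical.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, data_string) for each event in an SSE line stream."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line dispatches the buffered event, if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def language_is_locked(payload):
    """Return True if the update reports a locked language.

    Hypothetical payload shape: {"language_state": {"locked": bool, ...}}.
    """
    state = json.loads(payload).get("language_state") or {}
    return bool(state.get("locked"))
```

Because `iter_sse_events` accepts any iterable of text lines, it can be fed decoded lines from an HTTP response or a list of strings in a test.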

`GET /v1/transcriptions/:id/eta`

Get the estimated time of arrival (ETA) for a queued transcription.

Statuses

| Status | Description |
| --- | --- |
| `queued` | Waiting in the processing queue |
| `processing` | Currently being transcribed |
| `completed` | Transcription finished successfully |
| `failed` | Transcription failed (see `error_message`) |
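When polling, a client has to treat `completed` and `failed` as terminal and everything else as "keep waiting". One possible way to encode the table above is shown below. The exception class and function name are hypothetical, and `error_message` is assumed to be a top-level field of the job object.

```python
class TranscriptionFailed(Exception):
    """Raised when a job reports the `failed` status."""


def transcript_or_none(job):
    """Return the transcript for a completed job, or None while it is pending.

    Raises TranscriptionFailed for failed jobs and ValueError for any
    status that is not in the documented list.
    """
    status = job["status"]
    if status == "completed":
        return job["transcript_text"]
    if status == "failed":
        raise TranscriptionFailed(job.get("error_message", "unknown error"))
    if status in ("queued", "processing"):
        return None
    raise ValueError(f"unexpected status: {status!r}")
```

A polling loop can call this after each `GET /v1/transcriptions/:id` and sleep whenever it returns `None`.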