Transcription API

The transcription API converts audio files to text. It supports automatic language detection, chunk-level prompting presets, and streaming language-lock updates.

Workflow

  1. Presign — Get a presigned upload URL
  2. Upload — Upload the audio file
  3. Create — Start the transcription job
  4. Poll — Check status until completed

Endpoints

POST /v1/uploads/presign

Get a presigned URL to upload your audio file.

Request:

{
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760
}

Response:

{
  "upload_url": "https://storage.example.com/...",
  "object_key": "uploads/abc123/interview.mp3"
}
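As a rough illustration, the presign request body can be derived from a local file. This is a minimal sketch, not the official client: the helper name build_presign_request is hypothetical, and the fallback MIME type is an assumption, since the accepted content types are not listed here.

```python
import mimetypes
import os


def build_presign_request(path: str) -> dict:
    """Build the JSON body for POST /v1/uploads/presign from a local file."""
    # Guess the MIME type from the file extension (e.g. .mp3 -> audio/mpeg).
    content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        # Assumption: a generic fallback; the API may reject non-audio types.
        content_type = "application/octet-stream"
    return {
        "filename": os.path.basename(path),
        "content_type": content_type,
        "size_bytes": os.path.getsize(path),
    }
```

The same three values are required again when creating the transcription job, so it may be convenient to keep the returned dictionary around.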

POST /v1/transcriptions

Create a transcription job.

Request:

{
  "object_key": "uploads/abc123/interview.mp3",
  "filename": "interview.mp3",
  "content_type": "audio/mpeg",
  "size_bytes": 10485760,
  "duration_seconds": 120,
  "language": "auto",
  "model": "gemma4-e2b-asr-stream",
  "chunk_prompt_preset": "default"
}

Parameters:

Field                Type     Required  Description
object_key           string   yes       Object key from presign response
filename             string   yes       Original filename
content_type         string   yes       MIME type of the audio
size_bytes           integer  yes       File size in bytes
duration_seconds     integer  yes       Audio duration in seconds
language             string   no        Language code or "auto" (default: "auto")
model                string   no        Gemma ASR model (default: "gemma4-e2b-asr-stream")
chunk_prompt_preset  string   no        Chunk-level ASR preset: default, verbatim, remove_fillers, or concise

Available models:

Model                  Description
gemma4-e2b-asr-stream  Default. Streaming Gemma ASR with chunk-level language locking
gemma4-e2b-asr         Non-streaming Gemma ASR

Available chunk presets:

Preset          Description
default         Standard transcript with digits preserved
verbatim        Keep filler words and repetitions
remove_fillers  Remove filler words per chunk while preserving meaning
concise         Make each chunk more concise while preserving meaning
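The parameters, models, and presets listed above could be checked on the client before the job is submitted. The sketch below is one possible approach under stated assumptions: build_transcription_request is a hypothetical helper, and omitting chunk_prompt_preset when it is not given is a guess, because no default is documented for that field.

```python
ALLOWED_MODELS = {"gemma4-e2b-asr-stream", "gemma4-e2b-asr"}
ALLOWED_PRESETS = {"default", "verbatim", "remove_fillers", "concise"}


def build_transcription_request(object_key, filename, content_type, size_bytes,
                                duration_seconds, language="auto",
                                model="gemma4-e2b-asr-stream",
                                chunk_prompt_preset=None):
    """Build the JSON body for POST /v1/transcriptions with basic validation."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if chunk_prompt_preset is not None and chunk_prompt_preset not in ALLOWED_PRESETS:
        raise ValueError(f"unknown chunk preset: {chunk_prompt_preset!r}")
    body = {
        "object_key": object_key,
        "filename": filename,
        "content_type": content_type,
        "size_bytes": int(size_bytes),
        "duration_seconds": int(duration_seconds),  # documented as an integer
        "language": language,
        "model": model,
    }
    # Assumption: leave the optional preset out unless the caller sets it.
    if chunk_prompt_preset is not None:
        body["chunk_prompt_preset"] = chunk_prompt_preset
    return body
```

Validating locally only catches typos early; the server remains the authority on which values are accepted.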

Response:

{
  "user": {
    "id": "550e8400-e29b-41d4-a716-446655440001",
    "email": "user@example.com",
    "name": "Jane Doe"
  },
  "transcription": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "source_filename": "interview.mp3",
    "source_content_type": "audio/mpeg",
    "requested_language": "auto",
    "model": "gemma4-e2b-asr-stream",
    "created_at": "2026-04-20T10:00:00Z"
  },
  "usage": {
    "free_remaining": 480,
    "package_seconds_remaining": 0
  },
  "estimate": {
    "queue_position": 1,
    "estimated_wait_seconds": 30
  },
  "quote": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  }
}
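A client might condense this response into a single status line for logs or a UI. The following is only a sketch: summarize_created_job is a hypothetical helper, and it assumes every field shown in the example response is present.

```python
def summarize_created_job(response: dict) -> str:
    """Return a one-line summary of a POST /v1/transcriptions response."""
    job = response["transcription"]
    estimate = response["estimate"]
    quote = response["quote"]
    return (
        f"{job['id']} is {job['status']} at queue position "
        f"{estimate['queue_position']} "
        f"(about {estimate['estimated_wait_seconds']} s wait); "
        f"quote: {quote['seconds']} s, mode {quote['mode']}, "
        f"{quote['charged_amount_kopeks']} kopeks"
    )
```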

GET /v1/transcriptions/:id

Get a transcription by ID.

Response (completed):

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "transcript_text": "Hello, this is the transcribed text...",
  "transcript_segments": [
    { "start": 0.0, "end": 2.5, "text": "Hello," },
    { "start": 2.5, "end": 5.1, "text": "this is the transcribed text..." }
  ],
  "detected_language": "en",
  "duration_seconds": 120,
  "pricing_breakdown": {
    "mode": "free_test",
    "seconds": 120,
    "charged_amount_kopeks": 0
  },
  "created_at": "2026-04-20T10:00:00Z",
  "completed_at": "2026-04-20T10:00:45Z"
}
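The transcript_segments array lends itself to timestamped output. The sketch below assumes that start and end are offsets in seconds from the beginning of the audio, which the example values suggest but the reference does not state; both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render an offset in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments: list) -> list:
    """Turn transcript_segments into '[start - end] text' lines."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text']}"
        for s in segments
    ]
```

Calling format_segments(job["transcript_segments"]) on a completed job yields one line per segment, ready to print or write to a file.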

GET /v1/transcriptions

List all transcriptions for the authenticated user.

GET /v1/transcriptions/:id/stream

Server-Sent Events (SSE) stream of real-time transcription updates, useful for showing partial results while the audio is still being processed. Partial updates may include language_state, which reports the detected language, whether the language is locked, and how many chunks have been processed. After three chunks have been returned, the remaining chunks use the language detected in the ASR response.
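A stream like this can be consumed by a small parser that follows the standard SSE framing: data lines accumulate until a blank line ends the event. The sketch below assumes each event's data is a JSON object, which is not confirmed here, and iter_sse_events is a hypothetical name. The language_state keys used in the usage note are likewise illustrative; the actual field names are not documented in this section.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each event in an SSE line stream."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # The SSE format strips one optional leading space from the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    # An unterminated trailing event is dropped, as the SSE format specifies.
```

With a payload shaped like {"language_state": {"locked": true}} (hypothetical keys), a client could stop displaying a "detecting language" indicator once the locked flag turns true.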

GET /v1/transcriptions/:id/eta

Get the estimated time of arrival (ETA) for a queued transcription.

Statuses

Status      Description
queued      Waiting in the processing queue
processing  Currently being transcribed
completed   Transcription finished successfully
failed      Transcription failed (see error_message)
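The statuses above suggest a simple polling loop: keep fetching while the job is queued or processing, and stop on completed or failed. The following is a minimal sketch under stated assumptions. wait_for_transcription is a hypothetical helper, fetch stands for any callable that performs GET /v1/transcriptions/:id and returns the decoded JSON, and the interval and timeout defaults are arbitrary rather than recommended values.

```python
import time


def wait_for_transcription(fetch, transcription_id, interval_seconds=2.0,
                           timeout_seconds=600.0, sleep=time.sleep,
                           clock=time.monotonic):
    """Poll until the job is completed; raise if it fails or times out."""
    deadline = clock() + timeout_seconds
    while True:
        job = fetch(transcription_id)
        status = job["status"]
        if status == "completed":
            return job
        if status == "failed":
            # error_message is the field named in the status table.
            raise RuntimeError(job.get("error_message", "transcription failed"))
        if clock() >= deadline:
            raise TimeoutError(f"{transcription_id} still {status} after timeout")
        sleep(interval_seconds)  # queued or processing: wait and retry
```

Injecting sleep and clock keeps the loop easy to test without real delays; the SSE stream endpoint is an alternative when partial results are needed.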