How We Built Nagovori: Architecture, Security, and Scale

We regularly get questions: "Where are my recordings processed?", "Who can access my data?", "Why is it so fast?" This post answers those questions with a candid look at our architecture — enough detail to understand how things work without exposing implementation specifics.

High-Level Architecture

Nagovori is a web application with a job queue. When you upload a file, here's what happens:

Upload — the file is transmitted over an encrypted connection (HTTPS/TLS 1.3)
Queue — the file enters a processing queue. You see your position and estimated wait time
Processing — a worker picks the file, converts it to the required format, and runs it through the speech recognition model
Result — the transcript is saved and appears in your dashboard
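The four steps above can be read as a small state machine. The sketch below is illustrative only and is not Nagovori's actual code: the Job class, the state names, and the estimate_wait_minutes helper are hypothetical, and the wait estimate assumes every queued job takes the same average time.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    UPLOADED = "uploaded"
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"


# Allowed transitions, mirroring Upload -> Queue -> Processing -> Result.
NEXT_STATE = {
    JobState.UPLOADED: JobState.QUEUED,
    JobState.QUEUED: JobState.PROCESSING,
    JobState.PROCESSING: JobState.DONE,
}


@dataclass
class Job:
    job_id: str
    state: JobState = JobState.UPLOADED

    def advance(self) -> JobState:
        """Move the job to the next pipeline stage; DONE is terminal."""
        if self.state not in NEXT_STATE:
            raise ValueError(f"job {self.job_id} is already finished")
        self.state = NEXT_STATE[self.state]
        return self.state


def estimate_wait_minutes(position: int, avg_job_minutes: float, workers: int) -> float:
    """Rough wait estimate: the jobs ahead of you, spread across the workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    jobs_ahead = max(position - 1, 0)
    return jobs_ahead * avg_job_minutes / workers
```

A real queue would refine the estimate with actual file durations, but even this crude version shows why the displayed wait time shrinks as workers are added.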

The system is built on a microservice architecture: the web frontend, API server, processing workers, and storage are separate components that scale independently.

Recognition Models

We use multiple models depending on the task:

Primary model — optimized for Russian, delivers 95%+ accuracy on clean audio
Multilingual model — for English and other supported languages
Lightweight model — for short voice messages where speed matters more than marginal accuracy gains
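Choosing between the three models can be pictured as a simple routing function. This is a hedged sketch, not the production logic: the choose_model function, the language code "ru", and the 60-second cutoff for a short voice message are all assumptions made for illustration.

```python
from enum import Enum


class Model(Enum):
    PRIMARY = "primary"            # optimized for Russian
    MULTILINGUAL = "multilingual"  # English and other supported languages
    LIGHTWEIGHT = "lightweight"    # short voice messages, speed first


# Assumption: the cutoff for a "short" voice message is illustrative only.
SHORT_MESSAGE_SECONDS = 60.0


def choose_model(language: str, duration_seconds: float, is_voice_message: bool) -> Model:
    """Pick a model from the language, the duration, and the source of the audio."""
    if is_voice_message and duration_seconds <= SHORT_MESSAGE_SECONDS:
        return Model.LIGHTWEIGHT
    if language == "ru":
        return Model.PRIMARY
    return Model.MULTILINGUAL
```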

Models run on GPU servers with NVIDIA hardware. This lets us process one hour of audio in 2–5 minutes.
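To put the 2–5 minute figure in perspective, it helps to express it as a speed-up over real time. The helper below is just arithmetic on the numbers quoted above; the function name is ours.

```python
def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """How many times faster than real time the audio is processed."""
    if processing_minutes <= 0:
        raise ValueError("processing time must be positive")
    return audio_minutes / processing_minutes


slowest = realtime_factor(60, 5)  # 12.0: one hour of audio in 5 minutes
fastest = realtime_factor(60, 2)  # 30.0: one hour of audio in 2 minutes
```

In other words, one hour of audio in 2–5 minutes corresponds to roughly 12 to 30 times faster than real time.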

Data Security

Encryption

All data in transit uses HTTPS (TLS 1.3)
Files at rest are encrypted
Internal service-to-service communication is secured
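For readers integrating from their own code, here is one way a client could refuse anything older than TLS 1.3 when talking to an HTTPS service. It uses Python's standard ssl module and only illustrates the in-transit requirement; it says nothing about how our servers are actually configured, and the function name is hypothetical.

```python
import ssl


def make_tls13_client_context() -> ssl.SSLContext:
    """Client-side TLS settings that reject protocol versions below TLS 1.3."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


ctx = make_tls13_client_context()
```

A context like this can be passed to standard-library clients such as urllib.request.urlopen(url, context=ctx).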

Storage

Servers are located in Russia
Files are retained for processing and remain accessible in user history
Users can delete their data at any time through their dashboard

Access Control

Only the account owner can access their transcriptions and files
System administrators do not have access to user audio content or transcripts
Authentication is handled through a dedicated identity service (OIDC-based)
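The owner-only rule boils down to comparing the authenticated user with the owner recorded on each transcript. The sketch below is a simplified illustration under stated assumptions: the Transcript type, the in-memory store, and get_transcript are hypothetical, and requester_id is assumed to be the subject (sub) claim from an already verified OIDC token.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    owner_id: str
    text: str


class AccessDenied(Exception):
    """Raised when the requester may not read the transcript."""


def get_transcript(store: Mapping[str, Transcript], transcript_id: str, requester_id: str) -> Transcript:
    """Return the transcript only if the requester is its owner."""
    transcript = store.get(transcript_id)
    # Treat "missing" and "not yours" identically so IDs cannot be probed.
    if transcript is None or transcript.owner_id != requester_id:
        raise AccessDenied(transcript_id)
    return transcript
```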

Compliance

Personal data processing complies with Russian Federal Law No. 152-FZ
Privacy policy and terms of service are publicly available
Data is not shared with third parties for marketing purposes

Messenger Integrations

Bots for Telegram, VK, and Max follow the same pattern:

The user forwards a voice message to the bot
The bot receives the file via the messenger's API
The file enters the same processing queue as web uploads
The result is returned to the user in the chat
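The key point in the flow above is that bot messages and web uploads become the same kind of job in one queue. A minimal sketch of that idea follows, with hypothetical names throughout (TranscriptionJob, enqueue_web_upload, enqueue_bot_message); real messenger API calls are deliberately left out.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionJob:
    source: str    # "web", "telegram", "vk", or "max"
    reply_to: str  # dashboard user ID, or the chat ID the result is sent back to
    audio: bytes


def enqueue_web_upload(jobs: queue.Queue, user_id: str, audio: bytes) -> None:
    jobs.put(TranscriptionJob("web", user_id, audio))


def enqueue_bot_message(jobs: queue.Queue, messenger: str, chat_id: str, audio: bytes) -> None:
    if messenger not in ("telegram", "vk", "max"):
        raise ValueError(f"unsupported messenger: {messenger}")
    jobs.put(TranscriptionJob(messenger, chat_id, audio))
```

Because workers only ever see a TranscriptionJob, they do not need to know which channel the audio came from; only the final delivery step differs.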

Important: bots don't store messages beyond processing, don't read conversations, and don't have access to chats where they haven't been added.

Text-to-Speech (TTS)

TTS works in reverse: the user inputs text, the system sends it to a synthesis model, and returns an audio file. Multiple professional-quality voices are available. TTS and transcription share the same minute balance — one account, one pool of minutes.
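The shared balance can be modeled as a single counter that both services draw from. This is a sketch under assumptions: the Account class and the service labels are hypothetical, and how TTS usage is converted into minutes is not specified here, so the example simply takes the number of minutes to charge as input.

```python
from dataclasses import dataclass


class InsufficientMinutes(Exception):
    """Raised when a request needs more minutes than the account has left."""


@dataclass
class Account:
    minutes_left: float

    def charge(self, service: str, minutes: float) -> float:
        """Deduct minutes for either service from the one shared pool."""
        if service not in ("transcription", "tts"):
            raise ValueError(f"unknown service: {service}")
        if minutes < 0:
            raise ValueError("minutes must not be negative")
        if minutes > self.minutes_left:
            raise InsufficientMinutes(f"{minutes} requested, {self.minutes_left} left")
        self.minutes_left -= minutes
        return self.minutes_left
```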

Scaling

The job queue absorbs traffic spikes. If 100 users upload files simultaneously, the system doesn't crash: files queue up and are processed in order as workers become free. When load increases, we add more workers. This approach is simpler and more reliable than trying to process everything in real time.
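The queue-plus-workers pattern is easy to demonstrate with the Python standard library. The sketch below is a toy model, not our worker code: run_workers is a hypothetical name, and the actual conversion and recognition step is replaced by a placeholder comment.

```python
import queue
import threading
from typing import List


def run_workers(files: List[str], worker_count: int) -> List[str]:
    """Drain a FIFO queue of file names with a fixed pool of worker threads."""
    jobs: queue.Queue = queue.Queue()
    for name in files:
        jobs.put(name)

    finished: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                name = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            # Placeholder: format conversion and speech recognition go here.
            with lock:
                finished.append(name)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

A burst of 100 uploads simply makes the queue longer. Raising worker_count drains it faster without changing any other part of the code.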

Current capacity is thousands of hours of audio per day. During peak periods, the average wait time for a one-hour file stays under 5 minutes.
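For a feel of what such a volume means in worker terms, here is a back-of-the-envelope calculation. The 2,000 hours per day figure is purely illustrative, the function name is ours, and the model assumes load is spread evenly across the day, which real traffic peaks are not.

```python
import math


def workers_needed(audio_hours_per_day: float, processing_minutes_per_audio_hour: float) -> int:
    """Minimum number of always-busy workers needed to keep up with daily volume."""
    busy_minutes = audio_hours_per_day * processing_minutes_per_audio_hour
    minutes_per_worker_per_day = 24 * 60
    return math.ceil(busy_minutes / minutes_per_worker_per_day)


at_slowest = workers_needed(2000, 5)  # 7 workers at 5 minutes per audio hour
at_fastest = workers_needed(2000, 2)  # 3 workers at 2 minutes per audio hour
```

Real deployments need headroom above this minimum to keep queue waits short during peaks.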

What We're Building Next

Our roadmap includes:

Faster processing — targeting 1 minute for a 1-hour file
Speaker diarization — identifying who said what
Developer API — RESTful API for integrating transcription into third-party products
Improved punctuation — better handling of complex sentence structures

Technology Choices

A few notable decisions:

Next.js for the frontend — server-side rendering for SEO, React for interactivity
Go for the backend — concurrency model fits the queue-based architecture
PostgreSQL + ClickHouse — relational data in Postgres, analytics and metrics in ClickHouse
Docker and Kubernetes — containerized deployment with automated scaling

Conclusion

Nagovori's architecture prioritizes simplicity for the user and security for their data. Upload a file, wait briefly, use the text — everything else happens behind the scenes. If you have specific questions about our security practices, reach out through our support channels.