Break language barriers in Korea. Call anyone, in any language.
Real-time voice translation for phone calls. Powered by dual AI sessions and software-only echo cancellation, WIGVO lets you call any phone number in any language. The other person just answers a normal call — no app needed.
K-Culture brought millions to Korea — but when they need to make a phone call, the language wall hits hard. Booking a restaurant, calling a hospital, reaching a landlord — everyday calls that locals handle in seconds become impossible without Korean. And Koreans living abroad face the same wall in reverse.
WIGVO bridges the gap with real-time phone translation. Our dual-session architecture runs two parallel AI interpreters — one for each speaker — delivering natural, bidirectional voice translation over standard phone lines.
Two parallel OpenAI Realtime sessions handle each side of the conversation independently, ensuring natural turn-taking.
Call any phone number on any carrier. Works with landlines, mobile phones, and VoIP — the recipient doesn't need any app.
Software-only echo cancellation eliminates feedback loops without hardware, keeping conversations natural.
The person you're calling just answers their phone normally. Zero setup, zero downloads on their end.
Speak naturally in your language. WIGVO translates your voice in real-time and delivers it as natural speech to the other person — and vice versa.
Type what you want to say, and WIGVO speaks it to the other person in their language. Perfect for noisy environments or precise messages.
Let WIGVO's AI agent handle the call for you. Describe what you need — book a reservation, schedule an appointment — and the agent makes the call.
Audio Quality Gap: The Realtime API assumes a high-bandwidth environment — wideband audio (16-24kHz PCM16) with client-side AEC. PSTN runs on G.711 μ-law, an 8kHz narrowband codec, with 80-600ms of variable delay and constant codec compression noise.
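To make the codec gap concrete, here is a minimal sketch of standard G.711 μ-law expansion (the ITU-T companding algorithm; this helper is illustrative, not WIGVO's code). Note that the byte 0xFF expands to linear zero, which is why it serves as digital silence later in the pipeline.

```python
def mulaw_decode(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    BIAS = 0x84                  # standard mu-law bias (132)
    u = ~byte & 0xFF             # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample

# 0xFF is the mu-law code for digital silence (linear 0)
assert mulaw_decode(0xFF) == 0
```

Only 8 bits per sample at 8kHz: every frame that crosses the PSTN is quantized through this nonlinear curve, which matters again below when correlation-based echo detection fails.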
Echo Loop: AI-translated TTS audio returns through the PSTN after 80-600ms and re-enters the STT → translation → TTS pipeline, creating an infinite loop. 8 out of 10 initial test calls hit it. The client-side AEC that high-bandwidth app environments rely on doesn't exist on the PSTN, so a software-only solution is required.
VAD Failure: OpenAI Server VAD assumes clean wideband audio. PSTN background noise (RMS 50-200) registers as "speech in progress", causing speech_stopped to fire 15-72 seconds late or not at all.
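The RMS figures above can be measured per frame once the μ-law stream is expanded to linear PCM16. A minimal sketch (frame size and layout are assumptions — 160 samples is one 20ms frame at 8kHz; this is not WIGVO's production code):

```python
import array
import math

def frame_rms(pcm16: bytes) -> float:
    """RMS amplitude of a little-endian PCM16 frame
    (e.g. 160 samples = 20 ms at 8 kHz)."""
    samples = array.array('h')
    samples.frombytes(pcm16)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

silence = b'\x00\x00' * 160          # digital silence -> RMS 0
noise = b'\x64\x00' * 160            # constant amplitude 100 -> RMS 100
assert frame_rms(silence) == 0.0
assert frame_rms(noise) == 100.0
```

A constant noise floor in the 50-200 range sits well above digital silence, which is exactly what keeps a threshold-free Server VAD convinced that speech never ended.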
| System | PSTN | Bidirectional | S2S | Echo Handling | Accessibility |
|---|---|---|---|---|---|
| SeamlessM4T | | O | O | N/A | |
| Moshi / Hibiki | | | O | N/A | |
| Google Duplex | O | | | N/D | |
| Samsung Galaxy AI | O | O | O | HW AEC | |
| SKT A.dot | O | O | O | Carrier Infra | |
| WIGVO | O | O | O | Software | O |
When a browser client connects to the relay server over WebSocket, the server manages two independent Realtime LLM sessions and a Twilio phone gateway. An AudioRouter delegates events to one of three pipelines (V2V, T2V, FullAgent) via the Strategy pattern.
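The delegation described above can be sketched as a classic Strategy setup — a thin router holding interchangeable pipeline objects. Class and mode names below mirror the text (V2V, T2V, FullAgent) but the interface itself is an assumption, not WIGVO's actual API:

```python
from abc import ABC, abstractmethod

class Pipeline(ABC):
    """Strategy interface: each pipeline handles relay events independently."""
    @abstractmethod
    def handle(self, event: dict) -> str: ...

class V2VPipeline(Pipeline):
    def handle(self, event: dict) -> str:
        return f"v2v:{event['type']}"

class T2VPipeline(Pipeline):
    def handle(self, event: dict) -> str:
        return f"t2v:{event['type']}"

class FullAgentPipeline(Pipeline):
    def handle(self, event: dict) -> str:
        return f"agent:{event['type']}"

class AudioRouter:
    """Thin delegator: selects a strategy per call mode and forwards events,
    holding no pipeline-specific logic of its own."""
    def __init__(self) -> None:
        self._pipelines: dict[str, Pipeline] = {
            "v2v": V2VPipeline(),
            "t2v": T2VPipeline(),
            "full_agent": FullAgentPipeline(),
        }

    def route(self, mode: str, event: dict) -> str:
        return self._pipelines[mode].handle(event)

router = AudioRouter()
assert router.route("v2v", {"type": "media"}) == "v2v:media"
```

Keeping the router this thin is what enables the 73% code reduction mentioned later: every mode-specific branch lives inside its own pipeline class.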

STT-Translation Separation: Delegating translation to the Realtime API causes hallucinations that add content not present in the original speech. STT uses Realtime API's built-in Whisper-1, while translation is handled by GPT-4o-mini Chat API (temperature=0). context_prune_keep=0 completely blocks the Realtime API's own translation.
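The separation amounts to keeping the Realtime session as transcription-only and issuing a deterministic Chat Completions call for translation. A sketch of how such a request payload might be built — the model name and temperature come from the text above, but the prompt wording is an assumption:

```python
def build_translation_request(transcript: str, src: str, tgt: str) -> dict:
    """Build a Chat Completions payload for deterministic translation.

    The Realtime session only transcribes (Whisper-1); translation is a
    separate temperature=0 call so the model cannot improvise content
    that was never spoken.
    """
    return {
        "model": "gpt-4o-mini",
        "temperature": 0,
        "messages": [
            {
                "role": "system",
                "content": (
                    f"Translate the user's message from {src} to {tgt}. "
                    "Output only the translation; add nothing."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    }

req = build_translation_request("예약하고 싶어요", "Korean", "English")
assert req["temperature"] == 0
assert req["messages"][1]["content"] == "예약하고 싶어요"
```

With temperature pinned to 0 and a single-purpose prompt, the translation step has far less room to hallucinate than a generative Realtime session juggling both jobs.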
The Echo Gate blocks the loop in which outgoing TTS audio returns through the PSTN and re-enters the pipeline.

The Critical Breakthrough — Drop vs Replace: "Dropping" audio causes Server VAD to interpret it as a stream interruption and freeze. "Replacing" with μ-law silence (0xFF) maintains stream continuity while VAD correctly recognizes it as silence. This "Drop vs Replace" paradigm is the core principle applied consistently across both Echo Gate and VAD.
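The principle reduces to a one-line decision per frame: while the gate is open, substitute a same-length μ-law silence frame (0xFF, per the text above) rather than dropping the frame. A minimal sketch of that gating step (function name and frame size are illustrative):

```python
MULAW_SILENCE = b'\xff'  # G.711 mu-law code for digital silence

def gate_frame(frame: bytes, echo_gate_open: bool) -> bytes:
    """Replace, never drop: while the echo gate is open, substitute a
    same-length silence frame so the downstream VAD sees a continuous
    stream instead of a gap it would treat as a stream interruption."""
    if echo_gate_open:
        return MULAW_SILENCE * len(frame)
    return frame

frame = bytes(160)  # one 20 ms mu-law frame at 8 kHz
assert gate_frame(frame, echo_gate_open=True) == b'\xff' * 160
assert gate_frame(frame, echo_gate_open=False) == frame
```

Because the output frame has the same length and timing as the input, the stream's cadence never changes — only its content does, and 0xFF is exactly what a VAD should classify as silence.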
7-Stage Evolution: (1) Audio Fingerprint via Pearson correlation — failed completely because G.711 μ-law's nonlinear quantization destroys the correlation. (2) Fixed 2.5s Echo Gate — solved echo but disrupted conversation flow. (3) Dynamic Cooldown — proportional to TTS length, but an AGC noise spike appeared after gate release. (4) Final: Silence Injection + RMS + Dynamic Settling + Silero.
Result: Echo loop rate reduced from 8/10 initial calls → 0/148 production calls.
OpenAI Server VAD is a black box with no frame-level control during echo windows. RMS thresholds of 150→80→30→20 were all attempted, but no single stable threshold exists for PSTN. The solution: switch to local Silero VAD with a PSTN-specific independent architecture.
Result: speech_stopped latency reduced from 15-72 seconds → 480ms.
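A local VAD like Silero emits a per-frame speech probability, so endpointing becomes a small state machine the relay fully controls. Below is a hedged sketch of such a wrapper — the threshold and frame size are assumptions (Silero commonly processes ~32ms chunks), and only the 480ms hangover figure comes from the text:

```python
class Endpointer:
    """Turn per-frame speech probabilities (e.g. from a local Silero VAD)
    into speech_started / speech_stopped events with a fixed hangover,
    independent of the remote Server VAD. Tunables are illustrative."""

    def __init__(self, threshold: float = 0.5,
                 frame_ms: int = 32, hangover_ms: int = 480):
        self.threshold = threshold
        self.hangover_frames = hangover_ms // frame_ms
        self.in_speech = False
        self.silent_frames = 0

    def push(self, prob: float):
        """Feed one frame's speech probability; return an event or None."""
        if prob >= self.threshold:
            self.silent_frames = 0
            if not self.in_speech:
                self.in_speech = True
                return "speech_started"
        elif self.in_speech:
            self.silent_frames += 1
            if self.silent_frames >= self.hangover_frames:
                self.in_speech = False
                return "speech_stopped"
        return None

ep = Endpointer()
events = [ep.push(p) for p in [0.9, 0.9] + [0.1] * 15]
assert "speech_started" in events and "speech_stopped" in events
```

Because the hangover counter runs locally on every frame, speech_stopped fires a fixed interval after the last speech frame — there is no opaque server heuristic left to stall for tens of seconds.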
When PSTN noise enters Whisper-1, it generates "plausible" text learned from training data (YouTube, broadcasts). Broadcast-style patterns like "MBC News, this is Lee Deokyoung" and "Thanks for watching" leaked into the translation pipeline and actually reached recipients' phones in production.
Result: Hallucination leak rate below 0.3%, average 0.7 blocks per call (148 calls). 95%+ cases handled by L1 with zero additional latency.
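A zero-latency L1 stage can be as simple as a pattern check on the Whisper transcript before it enters translation. The sketch below is hypothetical — the patterns are examples of the broadcast-style hallucinations quoted above, not WIGVO's production rule set:

```python
import re

# Hypothetical L1 filter: zero-added-latency pattern check on STT output.
# Patterns model broadcast-style Whisper hallucinations ("sign-off" phrases,
# news-anchor openings); a production list would be far more extensive.
BROADCAST_PATTERNS = [
    re.compile(r"thanks for watching", re.IGNORECASE),
    re.compile(r"\bMBC\s*(뉴스|News)", re.IGNORECASE),
    re.compile(r"구독과\s*좋아요"),  # "like and subscribe"
]

def l1_filter(transcript: str) -> bool:
    """Return True if the transcript should be dropped before translation."""
    return any(p.search(transcript) for p in BROADCAST_PATTERNS)

assert l1_filter("Thanks for watching!") is True
assert l1_filter("내일 저녁 7시에 예약하고 싶어요") is False
```

Regex matching costs microseconds per utterance, which is consistent with the claim that 95%+ of cases are handled with zero additional latency; only transcripts that pass L1 would need any slower downstream check.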
The initial monolithic AudioRouter was refactored into a thin delegator + 3 independent pipelines via the Strategy pattern (73% code reduction).
Latency:
Echo & Safety:
Cost:


| Method | Echo Loop | Conversation Delay | Adopted |
|---|---|---|---|
| Audio Fingerprint (Pearson) | Unresolved | — | |
| Fixed Echo Gate (2.5s) | Resolved | Disrupted | |
| Dynamic Cooldown | Resolved | Improved | |
| Silence Injection + RMS + Dynamic Settling + Silero | Resolved | Minimized | O |
Finding: In PSTN environments, signal-correlation-based echo detection does not work; only direct control of echo windows via silence-frame replacement is stable. The Realtime API's generation characteristics suit STT but not translation — moving translation to a temperature=0 Chat API improves both accuracy and stability.