Industry · Restaurants
Region · New York City, USA
Deployed · Production since 2024

Milina: an AI voice agent that handles 50+ calls per weekend night at $0.09 per call

A NYC restaurant was losing dinners to unanswered calls. We built a LiveKit-based voice agent on Deepgram, GPT-4o-mini, and Cartesia that handles reservations, inquiries, and the waitlist in English and Spanish, and callers routinely don't realize they're talking to AI.

91% · Task completion rate
$0.09 · Average cost per call
50+ · Calls per weekend night
<700ms · Response latency p50

Client

Independent full-service restaurant in Manhattan. One location, ~90 covers, weekend-heavy reservation pattern with a long waitlist on Fri–Sat. Team of 25 front-of-house.

Engagement

6-week build, 2-week shadow mode, live since 2024. Monthly retainer for script tuning, menu updates, and new language support.

The phone was the bottleneck — not the kitchen.

On a busy Friday, the host stand runs two phones. One of them is almost always in use with a reservation request, a modification, or someone asking for directions. The second phone rolls to voicemail roughly 30% of the time during the 6–9pm rush, and roughly 60% of voicemails never convert into a booking — most callers just dial the next place.

The owner had tried OpenTable's widget, a third-party reservations call center, and a human answering service. OpenTable doesn't capture the 40% of reservations that come by phone; the call center charged $1.80 per call and the scripts were wooden; the human service couldn't access the reservation system directly and kept double-booking. None of them answered in Spanish, which many callers in the neighborhood needed.

The brief was concrete: answer every call within three rings, book directly into their Resy stack, speak natural English and Spanish, and keep per-call cost under $0.25 so it would be cheaper than the call center they were replacing.

A production-grade LiveKit voice agent with two-model turn-taking and native reservation-system integration.

We picked every component for latency, bilingual quality, and per-minute economics. No platform lock-in — just the stack that worked.

Voice infra: LiveKit Cloud. We started on Vapi for the MVP, then migrated to LiveKit at week 3 when we needed deterministic control over turn-taking behavior and lower media-layer latency. LiveKit's agents framework lets us run the turn-taking model, STT, LLM, and TTS in the same process with direct audio routing — which is where the sub-700ms p50 response comes from.
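A minimal sketch of how a p50 response figure like this can be computed, assuming each agent turn logs two timestamps: when the caller stopped speaking and when the first TTS audio chunk went out. `TurnTiming`, its field names, and the sample values are hypothetical illustrations, not LiveKit APIs or measurements from this deployment.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TurnTiming:
    """Timestamps for one agent turn, in milliseconds (hypothetical log shape)."""
    end_of_utterance_ms: float  # turn-taking model decided the caller finished
    first_audio_ms: float       # first TTS audio chunk was sent to the caller


def response_latencies(turns: list[TurnTiming]) -> list[float]:
    """Per-turn latency: end of utterance to first audio chunk."""
    return [t.first_audio_ms - t.end_of_utterance_ms for t in turns]


def p50(latencies_ms: list[float]) -> float:
    """Median latency; raises StatisticsError on an empty list."""
    return median(latencies_ms)


def share_within(latencies_ms: list[float], budget_ms: float) -> float:
    """Fraction of turns that responded strictly faster than the budget."""
    return sum(1 for v in latencies_ms if v < budget_ms) / len(latencies_ms)


# Illustrative timestamps only.
turns = [
    TurnTiming(1_000, 1_520),
    TurnTiming(5_000, 5_640),
    TurnTiming(9_000, 9_610),
    TurnTiming(12_000, 13_200),  # one slow turn pulls the tail, not the median
    TurnTiming(15_000, 15_690),
]
latencies = response_latencies(turns)
print(p50(latencies), share_within(latencies, 700))
```

The median is deliberately insensitive to the occasional slow turn, which is why a tail metric or a share-within-budget figure is worth tracking next to it.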

STT: Deepgram Nova-2. We evaluated Whisper Large-v3, Deepgram Nova-2, and AssemblyAI Universal-2 against a 200-call test set recorded from the client's actual phone line (NYC accents, Spanish-accented English, restaurant background noise). Nova-2 won on word error rate for our specific audio profile, at roughly one-third the latency of Whisper.
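Word error rate is the word-level edit distance between a human reference transcript and the engine's output, divided by the number of reference words. The sketch below shows that calculation over a set of calls. It assumes simple lowercase, whitespace tokenization; a real evaluation would normally also normalize punctuation, numbers, and language-specific spelling.

```python
def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        errors += word_edit_distance(ref_words, hypothesis.lower().split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("no reference words to score against")
    return errors / words


# One substituted word out of five reference words.
print(corpus_wer([("table for two at eight", "table for you at eight")]))
```

Summing errors and words across the whole test set, rather than averaging per-call WER, keeps short calls from dominating the score.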

LLM: GPT-4o-mini with structured tool calls. The agent is not free-form: it runs a state machine with six intents (reservation, modify, cancel, waitlist, hours/info, handoff) and 14 sub-intents. GPT-4o-mini is more than enough for this scope at one-twentieth the cost of GPT-4o, and the structured-output schema keeps it on rails. Claude Haiku is the fallback for complex rescheduling logic where the reasoning step matters.
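To illustrate what "on rails" can mean in practice, here is a minimal sketch that validates the model's structured output against the six intents and falls back to a human handoff on anything off-schema. The JSON shape, the slot names, and `REQUIRED_SLOTS` are assumptions made for the example; the 14 sub-intents are not modeled because they are not listed here.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Intent(str, Enum):
    RESERVATION = "reservation"
    MODIFY = "modify"
    CANCEL = "cancel"
    WAITLIST = "waitlist"
    INFO = "hours_info"
    HANDOFF = "handoff"


# Hypothetical slots each intent needs before a tool can be called.
REQUIRED_SLOTS: dict[Intent, set[str]] = {
    Intent.RESERVATION: {"party_size", "date", "time"},
    Intent.MODIFY: {"booking_ref"},
    Intent.CANCEL: {"booking_ref"},
    Intent.WAITLIST: {"party_size", "date"},
    Intent.INFO: set(),
    Intent.HANDOFF: set(),
}


@dataclass(frozen=True)
class Turn:
    intent: Intent
    slots: dict
    missing: frozenset  # slots the agent still has to ask the caller for


def parse_turn(raw: str) -> Turn:
    """Parse the model's JSON; anything malformed or unknown becomes a handoff."""
    try:
        payload = json.loads(raw)
        intent = Intent(payload["intent"])
        slots = payload.get("slots", {})
        if not isinstance(slots, dict):
            raise TypeError("slots must be an object")
    except (ValueError, KeyError, TypeError, AttributeError):
        return Turn(Intent.HANDOFF, {}, frozenset())
    missing = frozenset(REQUIRED_SLOTS[intent] - set(slots))
    return Turn(intent, slots, missing)


turn = parse_turn('{"intent": "reservation", "slots": {"party_size": 4, "date": "2024-06-07"}}')
print(turn.intent.value, sorted(turn.missing))
```

The design choice is that the model never decides what happens on bad output: the parser does, and its only failure mode is handing the call to a person.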

TTS: Cartesia Sonic English + Spanish. We A/B tested ElevenLabs Flash, Cartesia Sonic, and Azure Neural against blind listener panels of 40 real customers. Cartesia Sonic won on bilingual consistency: the same voice in English and Spanish, without the jarring voice switch most TTS engines produce. Latency budget: 120ms to first audio chunk.

Reservation system integration: Resy API + SevenRooms fallback. The agent makes direct API calls for table availability, booking, modification, and cancellation, and writes a structured note to the POS (Toast) so the host sees context when the guest arrives. The entire conversation transcript is saved and tagged, so the owner can search it to find out why a specific booking happened.

Architecture (data flow)

1. Caller dials: Twilio SIP trunk → LiveKit agent room
2. Turn-taking: LiveKit VAD + end-of-utterance model → <120ms decision
3. STT: Deepgram Nova-2 streaming → partial and final transcripts
4. Intent + response: GPT-4o-mini with function calling → Resy/SevenRooms tools
5. Tool calls: check_availability, create_booking, modify_booking, add_to_waitlist, handoff_to_host
6. TTS: Cartesia Sonic streaming → first audio chunk at 120ms
7. Post-call: Transcript + intent tags → Toast POS note + owner dashboard
8. Observability: LangSmith traces + Helicone LLM logs + Matomo call analytics

Stack: LiveKit Cloud · Deepgram Nova-2 · GPT-4o-mini · Cartesia Sonic · Claude Haiku (fallback) · Twilio SIP · Resy API · SevenRooms · Toast POS · LangSmith · Helicone · Python 3.12 · FastAPI · PostgreSQL
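The tool-call step in the flow above can be sketched as a small dispatch layer: the model names a tool and supplies arguments, and the code decides whether the call is valid before anything touches the reservation system. Only the tool names come from the architecture; the handler bodies, argument names, and the handoff-on-failure policy are assumptions for illustration, and this is not the OpenAI or LiveKit API itself.

```python
import inspect
from typing import Any, Callable

ToolHandler = Callable[..., dict[str, Any]]
TOOLS: dict[str, ToolHandler] = {}


def tool(name: str) -> Callable[[ToolHandler], ToolHandler]:
    """Register a handler under the tool name the model is allowed to call."""
    def register(fn: ToolHandler) -> ToolHandler:
        TOOLS[name] = fn
        return fn
    return register


@tool("check_availability")
def check_availability(date: str, time: str, party_size: int) -> dict[str, Any]:
    # Stub: a real handler would query the reservation system here.
    return {"available": party_size <= 6, "date": date, "time": time}


@tool("handoff_to_host")
def handoff_to_host(reason: str = "unspecified") -> dict[str, Any]:
    # Stub: a real handler would transfer the call to the host stand.
    return {"transferred": True, "reason": reason}


def dispatch(name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Run a model-requested tool; unknown tools or bad arguments go to a human."""
    handler = TOOLS.get(name)
    if handler is None:
        return handoff_to_host(reason=f"unknown tool: {name}")
    try:
        # Validate argument names against the handler signature before calling it.
        bound = inspect.signature(handler).bind(**arguments)
    except TypeError:
        return handoff_to_host(reason=f"bad arguments for {name}")
    return handler(*bound.args, **bound.kwargs)


print(dispatch("check_availability", {"date": "2024-06-07", "time": "19:30", "party_size": 4}))
```

The remaining tools (create_booking, modify_booking, add_to_waitlist) would register the same way.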

Four months in production. Numbers are from live traffic, not pilots.

Every metric below is a live-traffic aggregate from the client's call analytics. No cherry-picked test runs.

91%

Task completion rate

Calls where the caller's goal — book, modify, cancel, or get an answer — was achieved without a human transfer.

$0.09

Average cost per call

All-in cost per call: STT + LLM + TTS + telephony. 20x cheaper than the $1.80-per-call call center it replaced.

+22%

Reservations month-over-month

Captured from the previously-lost voicemail traffic plus recovered after-hours call volume.

<700ms

p50 response latency

Median end-of-utterance to first TTS audio chunk. The conversation feels like a conversation, not a script.

100%

Inbound calls answered

Zero rolled to voicemail over the last 60 days. Bilingual (English + Spanish) from call one.

0

Double-bookings in 4 months

The agent holds the slot through Resy before it confirms to the caller, so it never commits a reservation it can't hold.
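A minimal sketch of a hold-then-confirm booking flow with caller-friendly retries. `ReservationClient`, `hold_slot`, `confirm`, and `TransientAPIError` are hypothetical names standing in for whatever the reservation API exposes; they are not Resy's actual endpoints. The hold and the confirmation are retried separately so that a failed confirmation never triggers a second hold.

```python
from typing import Callable, Optional, Protocol

OPENING_LINE = "Let me grab that for you."
RETRY_LINE = "Hmm, let me check one more thing."


class TransientAPIError(Exception):
    """Hypothetical error for timeouts and other retryable failures."""


class ReservationClient(Protocol):
    def hold_slot(self, slot_id: str) -> str: ...
    def confirm(self, hold_id: str) -> str: ...


def _retry(action: Callable[[], str], say: Callable[[str], None], max_attempts: int) -> str:
    """Run action, speaking a filler line between attempts instead of an error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientAPIError:
            if attempt == max_attempts:
                raise
            say(RETRY_LINE)
    raise ValueError("max_attempts must be at least 1")


def book_with_hold(
    client: ReservationClient,
    slot_id: str,
    say: Callable[[str], None],
    max_attempts: int = 3,
) -> Optional[str]:
    """Hold the slot first, then confirm it. Returns None if the API stays down."""
    say(OPENING_LINE)
    try:
        hold_id = _retry(lambda: client.hold_slot(slot_id), say, max_attempts)
        return _retry(lambda: client.confirm(hold_id), say, max_attempts)
    except TransientAPIError:
        return None  # the caller of this function should hand off to the host
```

In this sketch the agent only tells the caller the table is booked after `book_with_hold` returns a confirmation; a `None` result would route to a human rather than to an error message.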

In the first month I kept getting regulars texting me saying "who's the new host? She's really good." Nobody realized it was AI until we told them.

— Restaurant owner, Manhattan

Four decisions where most voice-AI projects fail.

1. Two weeks of shadow mode before going live. The agent listened to every call alongside the host and generated a response, but only the host spoke. We compared what the agent would have said against what the host actually said on 400+ real calls, and fixed every edge case before a customer heard the agent's voice.

2. Bilingual TTS that doesn't voice-switch. 35% of callers switch between English and Spanish mid-sentence. Most TTS engines handle the language switch by changing voice entirely, which is jarring and immediately tells the caller they're talking to a bot. Cartesia Sonic holds the same voice identity across languages.

3. Atomic booking with optimistic confirmation. The agent says "let me grab that for you" and holds the slot via Resy's hold endpoint before confirming. If the API call fails mid-sentence, the agent says "hmm, let me check one more thing" and retries. The caller never hears "there was an error."

4. Per-call budget as a first-class constraint. We built a cost dashboard on day one and watched every intent's cost. When GPT-4o-mini drifted up on a tool-call-heavy intent, we caught it in hours, not weeks. Without per-call tracking, a voice agent quietly becomes a line item that kills the ROI case.
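Per-intent cost tracking of the kind described in the fourth decision can be sketched as follows. The `CallCost` record and its fields are a hypothetical log shape, and the dollar amounts are made-up illustrations rather than vendor pricing or figures from this deployment. The $0.25 default comes from the per-call ceiling in the brief.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallCost:
    """Cost of one call, split by pipeline stage (hypothetical log shape)."""
    intent: str
    stt_usd: float
    llm_usd: float
    tts_usd: float
    telephony_usd: float

    @property
    def total_usd(self) -> float:
        return self.stt_usd + self.llm_usd + self.tts_usd + self.telephony_usd


def mean_cost_by_intent(calls: list[CallCost]) -> dict[str, float]:
    """Average all-in cost per call, grouped by intent."""
    totals: dict[str, list[float]] = defaultdict(list)
    for call in calls:
        totals[call.intent].append(call.total_usd)
    return {intent: sum(costs) / len(costs) for intent, costs in totals.items()}


def intents_over_budget(calls: list[CallCost], budget_usd: float = 0.25) -> list[str]:
    """Intents whose average cost per call exceeds the budget, sorted by name."""
    means = mean_cost_by_intent(calls)
    return sorted(intent for intent, mean in means.items() if mean > budget_usd)


# Illustrative numbers only.
calls = [
    CallCost("reservation", 0.02, 0.03, 0.02, 0.01),
    CallCost("reservation", 0.02, 0.05, 0.02, 0.01),
    CallCost("modify", 0.03, 0.30, 0.03, 0.02),  # tool-call-heavy, LLM cost drifted up
]
print(intents_over_budget(calls))
```

Grouping by intent is what makes drift visible: a blended average can stay under budget while one tool-heavy intent runs well above it.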

From discovery to production in 6 weeks.

We can tell you in a 20-minute call whether a voice agent will ship for you.

Bring us your call volume, your reservation / scheduling system, and your current conversion rate — we'll say yes, no, or "not yet" with numbers. No pitch deck.