CallKaro: Agent to Call Indian Shops and Haggle in Hindi

A full-stack voice pipeline that researches products, finds local stores, calls them in Hindi, negotiates prices, and compares quotes — autonomously.

We spend a lot of time buying electronics in India. You know how it works — you don’t go to a website and click “Add to Cart.” You call three or four local dealers, ask for their best price, mention that the shop down the road offered you less, and eventually settle on a number. This ritual plays out millions of times a day across the country, entirely over the phone, entirely in Hindi.

I wanted to see if I could get an AI to do that. CallKaro is what came out of it — a voice agent that calls real shops, speaks Hindi, asks about prices, pushes back gently on high quotes, and compares results across stores.

Why Indian Electronics Prices Don’t Exist Online

Three things are true about buying electronics in India:

Local dealers don’t list prices online
Walk into any appliance market — Lajpat Rai in Delhi, SP Road in Bangalore, Lamington Road in Mumbai — and you’ll find hundreds of shops. Almost none of them publish prices. The price depends on who’s asking, what day it is, and how much inventory they have.

Negotiation is expected
The first price a shopkeeper quotes is never the final price. You’re expected to push back. Mentioning a competitor’s quote or simply saying “thoda zyada lag raha hai” can shave thousands of rupees off.

Hindi-first market
Over 600 million people in India speak Hindi. In Tier 2 and Tier 3 cities, most shopkeepers do business exclusively in Hindi. Anything that only works in English misses the majority of the market.

CallKaro addresses all three: it makes real phone calls, negotiates in natural Hindi, and aggregates results across multiple shops into a structured comparison.

Five Stages: From “I Want an AC” to a Price Comparison

The whole thing is a five-stage pipeline. You say what you want to buy and where you live. CallKaro researches the product, finds nearby stores, calls each one, and gives you a ranked comparison.

Stage 1: Intake
An LLM pulls out what you actually need from what you type — product type, capacity, budget, location, brand preferences.

Stage 2: Research
The LLM searches the web for current market prices, dealer margins, seasonal pricing, and negotiation tactics. All of this gets injected into the voice agent’s prompt so it knows what a fair price looks like and can push back credibly.

Stage 3: Store Discovery
Google Maps scraping and web search find nearby stores, phone numbers, ratings, and reviews.

Stage 4: Voice Calling
For each store, CallKaro spins up a LiveKit WebRTC room, dispatches a voice agent, and dials the store over SIP. The agent speaks Hindi, asks about price, warranty, installation, delivery, negotiates gently, and hangs up once it has enough information.

Stage 5: Cross-Store Analysis
Transcripts are fed to an LLM that extracts structured data and ranks stores by total estimated cost.

How the Voice Pipeline Turns Phone Audio Into Hindi Conversation

This is where the most interesting engineering lives. Here’s the signal flow for a single conversational turn:

VAD (Silero)
Detects when someone is speaking. I tuned min_speech_duration to 80ms and min_silence_duration to 800ms so the agent waits a beat before assuming the shopkeeper has finished.
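
In code, that tuning looks roughly like this (a sketch assuming the livekit-plugins-silero API; values are in seconds):

```python
from livekit.plugins import silero

# Sketch of the VAD tuning described above; the kwargs assume the
# livekit-plugins-silero API, and values are in seconds.
vad = silero.VAD.load(
    min_speech_duration=0.08,   # 80 ms of audio before it counts as speech
    min_silence_duration=0.8,   # 800 ms of silence before the turn can end
)
```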

STT (Sarvam saaras:v3)
Converts Hindi speech to Romanized (Latin-script) text — “Adtees hazaar ka hai” rather than Devanagari script.

LLM (Claude Haiku 4.5)
Receives the shopkeeper’s transcribed speech and generates the next response in Romanized Hindi.

Normalization Layer
Sits between the LLM and TTS. Handles number-to-Hindi-word conversion, Devanagari transliteration, action marker stripping, and spacing fixes. Most of the bugs I’ll talk about below lived here.

TTS (Sarvam bulbul:v3)
Converts Romanized Hindi text to natural-sounding Hindi speech. The “shubh” voice at 8kHz for telephony, 16kHz for browser sessions.

Turn Detection
A transformer-based model that predicts when someone’s done talking, running on top of VAD signals. This is critical for Hindi — people pause mid-sentence all the time, and you don’t want the agent jumping in too early.

Bugs I ran into

Streaming Broke Hindi Numbers: “28” + “000” Became “28” and “Zero”

The LLM outputs prices as digits: “38000”. The normalization layer converts this to Hindi words for TTS: “adtees hazaar.” Simple enough — until streaming broke it.

LLM responses arrive as token chunks. The number “28000” might arrive as "28" and "000". If you normalize each chunk independently, you get “attaaees” (28) followed by “zero” instead of “attaaees hazaar” (28,000).

The fix was a buffered normalizer that holds trailing digits until the next chunk arrives:

import re

class _NumberBufferedNormalizer:
    def __init__(self):
        self._buffer = ""

    def process(self, chunk: str) -> str:
        # Prepend any digits held over from the previous chunk.
        chunk = self._buffer + chunk
        self._buffer = ""
        # Hold back trailing digits -- the number may continue in the next chunk.
        m = re.search(r"(\d+)$", chunk)
        if m:
            self._buffer = m.group(1)
            chunk = chunk[:m.start()]
        # _normalize_for_tts is the digit-to-Hindi-words helper described above.
        return _normalize_for_tts(chunk) if chunk else ""

    def flush(self) -> str:
        # End of stream: normalize whatever digits are still buffered.
        if self._buffer:
            result = _normalize_for_tts(self._buffer)
            self._buffer = ""
            return result
        return ""

Hindi number conversion covers every number from 0 to 99 with individual words (Hindi does not have a regular tens-and-ones pattern like English), plus compound forms for thousands, lakhs, and crores. It also handles special patterns: 1,500 becomes “dedh hazaar” (one-and-a-half thousand), 2,500 becomes “dhaai hazaar”, and 37,500 becomes “saadhe saintees hazaar”.
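
As a sketch of how those special patterns can be handled (only a handful of the 0-99 word-table entries are shown; names are illustrative, not the project's actual code):

```python
# Illustrative subset of the 0-99 lookup table; the real one is complete
# because Hindi tens-and-ones words are irregular.
HINDI_WORDS = {1: "ek", 2: "do", 28: "attaaees", 37: "saintees", 38: "adtees"}

def hindi_price_words(n: int) -> str:
    """Convert a rupee amount to Romanized Hindi words (partial sketch)."""
    if n == 1500:
        return "dedh hazaar"    # one-and-a-half thousand
    if n == 2500:
        return "dhaai hazaar"   # two-and-a-half thousand
    if n % 1000 == 500:
        # "saadhe X hazaar" = X-and-a-half thousand
        return f"saadhe {HINDI_WORDS[n // 1000]} hazaar"
    if n % 1000 == 0 and n < 100000:
        return f"{HINDI_WORDS[n // 1000]} hazaar"
    raise NotImplementedError("full table omitted in this sketch")
```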

The LLM Would Randomly Switch to English Mid-Call

Every now and then, the LLM would just… respond in fluent English. Mid-call. The Hindi TTS would crash because it doesn’t know what to do with English text.

I built a heuristic detector that checks for common Hindi marker words — achha, ji, haan, theek, kya, hai, etc. If a response longer than 20 characters has none of them, it gets flagged as a character break and a canned Hindi fallback is used instead.
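
A minimal version of that detector might look like this (the marker list here is a small illustrative subset):

```python
# Small subset of the Hindi marker words checked by the heuristic.
HINDI_MARKERS = {"achha", "ji", "haan", "theek", "kya", "hai", "ka", "ke", "mein"}

def looks_like_character_break(response: str) -> bool:
    """Flag responses over 20 chars that contain no Hindi marker words."""
    if len(response) <= 20:
        return False  # too short to judge reliably
    words = {w.strip(".,!?").lower() for w in response.split()}
    return not (words & HINDI_MARKERS)
```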

Background Noise Transcribed as Random Words

Real phone calls have a lot of nothing — silence, hold music, background chatter, line noise. The Sarvam STT model would sometimes transcribe these as random English words like “table,” “the,” “and” — and the LLM would try to respond to them. Single-word transcripts matching a known set of common STT artifacts now get flagged as garbage and ignored.
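
The filter itself is simple; a sketch with an abbreviated artifact list:

```python
# Abbreviated set of single-word STT hallucinations seen in real calls.
STT_ARTIFACTS = {"table", "the", "and", "a", "so", "yeah"}

def is_noise_transcript(text: str) -> bool:
    """Treat a lone known-artifact word as line noise, not speech."""
    words = text.strip().lower().split()
    return len(words) == 1 and words[0] in STT_ARTIFACTS
```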

Hindi Script Characters Leaked Into Latin-Only TTS and Crashed It

Despite explicit instructions to output only Romanized Hindi, the LLM would occasionally leak Devanagari characters. Even a single one could mess up the TTS. So I added a fast single-pass transliterator as a safety net — it detects any Devanagari and converts it to Romanized equivalent, handling consonant-matra combinations correctly.
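
A stripped-down sketch of the idea (the real transliterator handles consonant-matra combinations systematically; this toy mapping only covers one word):

```python
# Devanagari occupies the Unicode block U+0900 to U+097F.
# Tiny illustrative mapping: enough to transliterate "haan".
DEVANAGARI_TO_LATIN = {
    "\u0939": "ha",  # consonant ha
    "\u093e": "a",   # aa matra
    "\u0901": "n",   # chandrabindu (nasalization)
}

def has_devanagari(text: str) -> bool:
    return any("\u0900" <= ch <= "\u097f" for ch in text)

def transliterate(text: str) -> str:
    return "".join(DEVANAGARI_TO_LATIN.get(ch, ch) for ch in text)
```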

Stripping Whitespace Killed Natural Pauses in Speech

This one was subtle. The Sarvam TTS engine uses whitespace to figure out where sentences end for natural pauses. When the normalization layer called .strip() on LLM output chunks, it removed leading spaces that the TTS was relying on as sentence boundaries. The result was unnatural, run-together speech with no pauses.

The fix: don’t call .strip(). Preserve all whitespace from LLM tokens through the entire pipeline.
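
The failure mode is easy to demonstrate in isolation:

```python
# Two LLM chunks; the leading space on the second chunk is the
# sentence boundary the TTS relies on for a natural pause.
chunks = ["Achha ji.", " Theek hai."]

joined_raw = "".join(chunks)                          # spaces preserved
joined_stripped = "".join(c.strip() for c in chunks)  # boundary destroyed
```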

Sarvam STT/TTS: Monkey-Patches, Preprocessing, and Audio Settings

STT WebSocket keeps dying after ~90 seconds
The LiveKit plugin’s _run() method breaks on normal stream completion instead of reconnecting. I monkey-patched SpeechStream._run to loop and reconnect until the session actually ends.

TTS just works with Romanized Hindi
Sarvam’s enable_preprocessing=True handles pronunciation internally, so I don’t need a custom dictionary. The LLM outputs “achha” and bulbul:v3 pronounces it correctly. I use the “shubh” voice at 8kHz for telephony, 16kHz for browser testing.

Phone calls are 8kHz, browser is 16kHz
The voice agent picks the sample rate automatically based on whether there’s a phone number in the call metadata.
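
That selection is a one-liner; a sketch (the metadata key name is an assumption):

```python
def pick_sample_rate(call_metadata: dict) -> int:
    """8 kHz for SIP/telephony legs, 16 kHz for browser sessions."""
    # A phone number in the metadata means this is an outbound SIP call.
    return 8000 if call_metadata.get("phone_number") else 16000
```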

Adding Testing to the Pipeline

How do you test a voice agent that speaks Hindi and negotiates prices? I ended up with three layers.

Layer 1: Shopkeeper Scenario Simulation

I created 11 scripted shopkeeper scenarios from real call transcripts. Each one has a personality — cooperative shopkeepers who give prices right away, defensive ones who refuse to negotiate, evasive ones who go off-topic, ones who put you on hold, ones who interrupt. The test framework feeds shopkeeper lines one at a time into the actual LLM and checks the agent’s response after each turn.

Layer 2: ConstraintChecker (8 Rules)

Every agent response is checked against 8 behavioral constraints:

  1. No Devanagari — Hindi script characters crash the TTS
  2. Single question — no stacking 3 questions in one breath
  3. Response length — under 300 characters
  4. No action markers — no *pauses* or (laughs) text
  5. No newlines — TTS reads everything as one utterance
  6. No English translations — no (Yes, I'm listening) parentheticals
  7. No end_call text — use the tool, do not type [end_call]
  8. No invented details — do not fabricate specific brands or addresses
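
A few of these rules are mechanical enough to sketch directly (regexes and thresholds here are illustrative, not the project's exact code):

```python
import re

def check_constraints(response: str) -> list[str]:
    """Return the names of violated rules (sketch of a few of the 8)."""
    violations = []
    if re.search(r"[\u0900-\u097f]", response):   # Devanagari crashes the TTS
        violations.append("no_devanagari")
    if response.count("?") > 1:                   # one question per turn
        violations.append("single_question")
    if len(response) > 300:                       # keep turns short
        violations.append("response_length")
    if re.search(r"\*[^*]+\*", response):         # *pauses*-style action markers
        violations.append("no_action_markers")
    if "\n" in response:                          # TTS reads it as one utterance
        violations.append("no_newlines")
    return violations
```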

Layer 3: ConversationScorer (Weighted Scoring)

Full conversations are scored on five dimensions:

Dimension       Weight  What It Measures
Constraints     40%     Average ConstraintChecker score across all turns
Topic coverage  25%     How many of price/warranty/installation/delivery/exchange were discussed
Price echo      15%     Did the agent repeat the shopkeeper’s exact price number back
Brevity         10%     Average response length (under 100 chars = perfect)
No repetition   10%     Adjacent responses are not copies of each other
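
The weighted sum itself is straightforward; a sketch using the weights from the table:

```python
WEIGHTS = {
    "constraints": 0.40,
    "topics": 0.25,
    "price_echo": 0.15,
    "brevity": 0.10,
    "no_repetition": 0.10,
}

def conversation_score(dims: dict) -> float:
    """Weighted sum of per-dimension scores, each already in [0, 1]."""
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)
```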

The full test suite has 188 unit tests covering normalization, sanitization, transcript parsing, store ranking, prompt building, and constraint checking, plus 26 live API tests that hit the actual Sarvam and Anthropic endpoints.

A real call transcript

Here’s a trimmed transcript from a real call to a local electronics shop in Bangalore, asking about a 7 kg washing machine:

Agent:  "Hello, yeh Prime Electronics hai? Saat kg fully
         automatic top load washing machine ke baare mein
         poochna tha."
Shop:   "Haan, batao."
Agent:  "Aapke paas saat kg fully automatic washing machine
         kitne ka hai?"
Shop:   "Baaees hazaar se start hai. LG, Whirlpool, Samsung,
         Haier -- sab mil jaega."
Agent:  "Achha, LG ka kya price hai?"
Shop:   "LG ka attaaees hazaar."
Agent:  "Achha, attaaees hazaar. Chaubees hazaar ke aas paas
         de sakte ho?"
Shop:   "24000 nahi ho paega bhaiya. 28000 fixed price hai."
Agent:  "Theek hai. Warranty kitni milegi?"
Shop:   "LG mein das saal ki warranty hai."
Agent:  "Achha, das saal. Installation aur delivery free hai?"
Shop:   "Haan haan, wo free kar denge. Ek do din mein ho
         jaegi delivery."
Agent:  "Theek hai ji, bahut badiya. Dhanyavaad, namaste."

The agent asked about price, tried to negotiate down from ₹28,000 to ₹24,000 (got shut down), confirmed warranty, installation, and delivery — all in 8 turns. Across the full 11-scenario test suite, the agent consistently scores above 0.85 with constraint compliance above 0.95.

What it costs to run

A single comparison involves 3-4 phone calls to different stores. Each call hits three paid APIs, so the calls have to be cheap to execute.

Sarvam AI (STT + TTS)
₹0.35/hour for speech-to-text, ₹0.18 per 10K characters for text-to-speech. A typical 2-3 minute call costs almost nothing. Sarvam gives you ₹1,000 in free credits on signup, which goes a long way.

Anthropic (Claude Haiku 4.5)
Haiku is the cheapest Claude model and it’s more than good enough for this. Each call uses a few hundred tokens of input (system prompt + conversation history) and maybe 50-100 tokens of output per turn. A full comparison of 3-4 calls costs a few cents.

LiveKit (WebRTC + SIP)
LiveKit Cloud’s free tier gives you 5,000 participant-minutes per month. Each call uses roughly 2-3 minutes, so you can do a lot of comparisons before hitting the limit. SIP trunking for the actual phone calls isn’t set up yet, since it’s cumbersome for a side project, but it will add some cost to the equation once it is.

The bottom line: a full price comparison across 3-4 stores costs somewhere in the low single-digit rupees. I haven’t done exact unit economics yet, but it’s cheap enough that I’m not worried about it at this stage.

What’s Next

Smarter negotiation
The research phase already pulls dealer margins and online reference prices. Right now the agent doesn’t use this aggressively enough during calls. I want it to dynamically adjust how hard it pushes based on how the quoted price compares to what it already knows.

Voice A/B testing
Sarvam’s bulbul:v3 has 39 speakers. I’m curious whether different voices — male vs female, formal vs casual — affect what price the shopkeeper quotes. My hypothesis: voice characteristics influence perceived social status, which changes the starting price.

Product-specific conversations
Right now the agent asks the same generic questions regardless of what you’re buying. But an AC purchase involves tonnage and copper piping costs; a washing machine involves load capacity and drum type; a laptop involves use case and RAM. I want product-specific conversation trees that go deeper on what actually matters.


CallKaro is live at callkaro.nikamma.in. The code is on GitHub. The stack is LiveKit for WebRTC, Sarvam AI for Hindi STT/TTS, Claude Haiku 4.5 for the LLM, and a plain Python HTTP server holding it all together.