What AYRA Solves (The Actual Problem)
Most “voice assistants” break down in predictable ways:
- No real memory: every interaction resets the context, or memory becomes unreliable/hallucination‑prone.
- Unstable orchestration: a single prompt tries to do everything—routing, reasoning, tool selection, memory writing.
- Voice latency kills UX: streaming STT/TTS is hard, and agentic reasoning adds even more latency.
- No safety or governance model: tools, permissions, and action-taking are bolted on late.
AYRA is designed as a private, loyal, emotionally aware digital companion, but the engineering goal is more concrete: durable orchestration, real-time voice interaction, tiered memory, tool-ready execution, and graceful degradation.
Two Repos, One System: Why AYRA Uses Mridu
1) Mridu — the engine (agent + voice backend)
Mridu contains the runtime: real-time voice conversation over WebSockets, STT/TTS providers, LangGraph agent orchestration, memory (Pinecone + session tracking), and integration docs that describe the end-to-end pipeline.
Voice conversation flow (conceptual)
User speaks
→ WebSocket (/ws/audio)
→ Deepgram (STT)
→ Utterance end
→ Enhanced LLM → LangGraph workflow
→ (intent → retrieval → reasoning → tools → memory extraction)
→ TTS (Sarvam / ElevenLabs)
→ WebSocket streams audio back
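The flow above can be sketched as a single asynchronous turn. This is a minimal stand-in, not the actual Mridu code: the three provider functions are assumptions that substitute for streaming Deepgram STT, the LangGraph workflow, and Sarvam/ElevenLabs TTS.

```python
import asyncio

# Stand-ins for the real providers; none of these names are Mridu's interfaces.
async def transcribe(chunk: bytes) -> str:
    return chunk.decode()            # real system: streaming Deepgram events

async def run_agent(utterance: str) -> str:
    return f"reply to: {utterance}"  # real system: LangGraph workflow

async def synthesize(text: str) -> bytes:
    return text.encode()             # real system: streamed TTS audio

async def handle_turn(audio_in: bytes) -> bytes:
    """One turn of the loop above: STT -> agent -> TTS."""
    utterance = await transcribe(audio_in)
    reply = await run_agent(utterance)
    return await synthesize(reply)

audio_out = asyncio.run(handle_turn(b"hello"))
```

In the real pipeline each stage streams partial results rather than awaiting a complete value, which is where most of the latency and correctness work lives.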
2) Ayra — the representation (site / project presence)
The Ayra repository is the public-facing layer: the website and narrative that explains the architecture, privacy model, and roadmap. It makes the engineering legible—especially important for trust-centered systems.
System Architecture (High-Level)
AYRA is built like a pipeline, not a prompt:
- Frontend: captures audio and provides a voice-first UX.
- Backend (Mridu): WebSocket voice gateway + agent runtime.
- Agentic brain: a LangGraph workflow to keep orchestration modular and inspectable.
- Memory: Pinecone for semantic long-term recall; database for sessions/logs; optional Redis tier for speed.
The key design choice: voice is not “a feature on top of chat.” Voice is the primary loop, so orchestration must behave correctly under streaming conditions and partial events.
The Agentic Core: Why LangGraph (Not Just “LLM Calls”)
In Mridu, the agent system is isolated as its own package (for maintainability, testing, and clean boundaries):
Agent package structure (simplified)
backend/app/agent/
├─ core/ # graph orchestration + nodes
├─ memory/ # retrieval + storage policy
├─ tools/ # registry + execution contracts
├─ models/ # state + message schemas
└─ api/ # FastAPI integration surface
Durable state with a typed AgentState
A typed, explicit state model allows debugging and evolution without turning the system into “prompt spaghetti.” The state tracks session metadata, transcript events, working memory, retrieved memories, tool calls, outputs, and debug info.
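A typed state of that kind might look like the following. The field names here are illustrative assumptions, not Mridu's actual `AgentState` schema (which lives in `backend/app/agent/models/state.py`):

```python
from typing import Any, TypedDict

# Illustrative shape only; field names are assumptions, not the real schema.
class AgentState(TypedDict, total=False):
    session_id: str                            # session metadata
    transcript: list[str]                      # streaming STT events
    working_memory: list[str]                  # short-lived context
    retrieved_memories: list[dict[str, Any]]   # Pinecone hits
    tool_calls: list[dict[str, Any]]           # executed tool contracts
    output_text: str                           # response to synthesize
    debug: dict[str, Any]                      # per-node timing, traces

state: AgentState = {"session_id": "s1", "transcript": ["hello"]}
```

Because every node reads and writes the same typed structure, a bad turn can be replayed and inspected field by field instead of being reconstructed from logs.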
Orchestration as explicit workflow nodes
The LangGraph workflow is composed of explicit nodes:
stt_input → intent_classifier → retriever → reasoner → tool_executor → memory_extractor → tts_output.
Each node has a single responsibility, which is how we keep the system extensible and safer to modify.
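A pure-Python stand-in shows the shape of that node chain; the real orchestrator uses LangGraph's graph API, and this sketch trims the workflow to three stages with a toy intent heuristic:

```python
# Stand-in for the explicit node chain; the real system builds a LangGraph
# graph. Node names mirror the workflow above.
State = dict

def intent_classifier(state: State) -> State:
    # Toy heuristic standing in for real intent classification.
    state["needs_retrieval"] = "remember" in state["utterance"].lower()
    return state

def retriever(state: State) -> State:
    state["memories"] = ["stub-memory"] if state["needs_retrieval"] else []
    return state

def reasoner(state: State) -> State:
    state["output"] = f"answer grounded in {len(state['memories'])} memories"
    return state

PIPELINE = [intent_classifier, retriever, reasoner]

def run(state: State) -> State:
    for node in PIPELINE:  # each node has a single responsibility
        state = node(state)
    return state

result = run({"utterance": "Do you remember my name?"})
```

The payoff of single-responsibility nodes: adding a stage (say, a safety filter) is an insertion into the chain, not a rewrite of one giant prompt.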
Voice Pipeline: WebSockets + Streaming STT/TTS
Voice experience lives or dies by streaming correctness and latency. Mridu treats voice as a first-class pipeline:
- Transport: WebSocket endpoint (/ws/audio)
- STT: Deepgram real-time transcription
- TTS: Sarvam / ElevenLabs (streamable audio output)
- Integration pattern: provider factory + adapter layer so the audio stack stays stable while the agent evolves
This separation is what lets us iterate on intelligence without constantly risking regressions in STT/TTS streaming.
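The factory + adapter pattern behind that separation can be sketched as follows. The registry keys and class names here are assumptions, not Mridu's actual `provider_factory`:

```python
# Hypothetical factory/adapter sketch; keys and class names are assumptions.
class TTSProvider:
    def synthesize(self, text: str) -> bytes:
        raise NotImplementedError

class SarvamTTS(TTSProvider):
    def synthesize(self, text: str) -> bytes:
        return b"sarvam:" + text.encode()    # real adapter: Sarvam API call

class ElevenLabsTTS(TTSProvider):
    def synthesize(self, text: str) -> bytes:
        return b"11labs:" + text.encode()    # real adapter: ElevenLabs API call

_TTS_REGISTRY = {"sarvam": SarvamTTS, "elevenlabs": ElevenLabsTTS}

def make_tts(name: str) -> TTSProvider:
    """Swap providers by config key; callers only see TTSProvider."""
    if name not in _TTS_REGISTRY:
        raise ValueError(f"unknown TTS provider: {name}")
    return _TTS_REGISTRY[name]()

audio = make_tts("sarvam").synthesize("hello")
```

Because callers depend only on the abstract interface, swapping Sarvam for ElevenLabs is a config change, not a code change in the agent.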
Memory: From “Chat History” to “Personal Continuity”
A companion that doesn’t remember isn’t a companion—but “just store everything” creates noise and erodes trust. Our memory pipeline is intentionally staged:
- Intent classification decides whether retrieval is needed.
- Retriever fetches relevant semantic memories (Pinecone).
- Reasoner grounds the response in retrieved context.
- Memory extractor decides what (if anything) becomes durable memory.
This prevents two common failure modes: storing everything (garbage memory) or storing nothing (no continuity).
Key Engineering Challenges (and What We Did About Them)
1) Latency: agentic reasoning + voice streaming
Voice has a strict latency budget. Adding orchestration, retrieval, and tool routing can push it over the edge. What helped:
- Explicit stages so we can measure and optimize per-node latency.
- Intent-based retrieval so we don’t hit vector DB on every turn.
- Graceful fallbacks when optional services are unavailable.
2) Streaming reliability (duplicates, partial events, race conditions)
WebSockets and partial transcripts create edge cases: utterance-end triggers, reconnects, and duplicate events. The main lesson: treat “same message twice” as normal and design for idempotency and deduplication.
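One way to make duplicate delivery a no-op is a bounded dedup window keyed on event IDs. The window size and ID format here are assumptions, not Mridu's actual implementation:

```python
from collections import OrderedDict

# Bounded dedup window so "same message twice" is a no-op.
class Deduplicator:
    def __init__(self, max_size: int = 1024) -> None:
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max_size = max_size

    def is_new(self, event_id: str) -> bool:
        """True the first time an event ID is seen, False on replays."""
        if event_id in self._seen:
            return False
        self._seen[event_id] = None
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = Deduplicator()
first = dedup.is_new("utt-42")   # first delivery: process it
replay = dedup.is_new("utt-42")  # duplicate delivery: drop it
```

The bound matters: an unbounded "seen" set is a slow memory leak on long-lived WebSocket sessions.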
3) Memory correctness is harder than memory storage
Vector DB integration is easy; deciding what is worth remembering is not. That’s why memory extraction is a dedicated stage: it keeps memory writes auditable, adjustable, and safer over time.
4) Production-grade integration: graceful degradation
A serious system must boot even when some keys/providers are missing. Mridu initializes the agent service safely and continues without agent capabilities if needed—critical for local development and real deployments.
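A degraded-boot check of that kind can be as simple as mapping present credentials to enabled capabilities. The environment variable names below are assumptions for illustration:

```python
# Degraded-boot sketch: enable only capabilities whose credentials exist.
# Env var names are assumptions, not Mridu's actual configuration.
def init_capabilities(env: dict) -> dict:
    return {
        "agent": bool(env.get("GROQ_API_KEY")),
        "memory": bool(env.get("PINECONE_API_KEY")),
        "tts": bool(env.get("SARVAM_API_KEY") or env.get("ELEVENLABS_API_KEY")),
    }

# Local dev with only an LLM key: the server still boots, with
# memory and TTS gracefully disabled rather than crashing on import.
caps = init_capabilities({"GROQ_API_KEY": "dev-key"})
```

Routing every optional dependency through a check like this is what lets the same codebase run in a bare local environment and a fully provisioned deployment.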
What Makes This Engineering‑First (Not a Demo)
- Graph-based orchestration instead of a monolithic “agent prompt.”
- Typed state instead of unstructured dicts flowing through the system.
- Providers + adapters so voice services remain stable while intelligence evolves.
- Observable boundaries (health endpoints, integration docs, structured logging hooks).
- Tool execution framework designed for safety and future extensibility.
SEO Notes (Built Into the Structure)
This post is intentionally structured around search intent while still reading like a human engineering write-up:
- Clear primary topic: voice-first agentic AI companion.
- Scannable headings: architecture, LangGraph orchestration, WebSocket voice pipeline, memory.
- Concrete technical nouns: FastAPI, WebSockets, Deepgram, Sarvam/ElevenLabs, Pinecone, Groq.
- Problem → approach → result flow (what engineers actually want).
What’s Next
The next engineering wins for AYRA:
- Stronger deduplication/idempotency for streaming events
- Better observability (per-node timing, tracing)
- More robust memory write policies (confidence + user consent)
- Expanding tool execution with a permissioned read / suggest / act model
References (Repo Anchors)
- Mridu voice + agent integration overview: backend/app/INTEGRATION_COMPLETE.md
- Agent package entry point: backend/app/agent/__init__.py
- LangGraph orchestrator: backend/app/agent/core/orchestrator.py
- Agent state model: backend/app/agent/models/state.py
- Provider factory (STT/TTS/Agent adapter): backend/app/core/provider_factory.py
- Ayra frontend technical docs: docs/tech.md (Ayra repo)
- Ayra overview and architecture narrative: README.md (Ayra repo)