Module 6A — Voice AI¶

Parent: Phase 3 — Artificial Intelligence · Track A

Speech-to-text, text-to-speech, and real-time voice pipelines — the audio workloads that drive edge AI hardware.

Prerequisites: Module 1 (Neural Networks — transformers, attention), Module 2 (Frameworks — PyTorch).

Role targets: Voice AI Engineer · Speech/Audio ML Engineer · Edge Audio Engineer · Conversational AI Engineer

Why Voice AI Matters for Hardware Engineers¶

Voice is one of the most latency-sensitive AI workloads. Users notice 200ms+ delay in conversation. This creates hard requirements for hardware:

STT (Speech-to-Text): Real-time transcription needs streaming inference with <100ms latency per chunk
TTS (Text-to-Speech): Natural voice synthesis requires autoregressive generation (like LLMs) or fast diffusion models
On-device voice: Privacy-critical applications (medical, military, automotive) need inference without cloud — your edge chip must handle it
VAD (Voice Activity Detection): Always-on, ultra-low-power — runs on MCU or dedicated DSP (L4/L5 hardware)

Voice task	Compute pattern	Hardware implication
VAD	Tiny CNN, always-on	L4: MCU/DSP, < 1mW power budget
STT (streaming)	Encoder-decoder, chunked	L1/L3: streaming inference, low latency
TTS (neural)	Autoregressive or diffusion	L1/L3: memory-bound generation, like LLM decode
Keyword spotting	Small CNN/RNN, always-on	L4: TinyML, dedicated audio NPU
Speaker verification	Embedding model, one-shot	L1: inference + vector similarity
Noise suppression	U-Net or RNNoise, real-time	L4/L6: DSP or FPGA, strict latency budget

1. Speech-to-Text (STT / ASR)¶

How Modern STT Works¶

Audio Input → Feature Extraction → Encoder → Decoder → Text Output
             (mel spectrogram)    (transformer)  (CTC/attention)

Key Models and Architectures¶

Model	Architecture	Strengths	Use case
Whisper (OpenAI)	Encoder-decoder transformer	Multi-language, robust, open-source	General-purpose STT
Wav2Vec 2.0 (Meta)	Self-supervised encoder + CTC	Pre-trained on unlabeled audio	Low-resource languages
Conformer (Google)	Convolution + transformer	Best accuracy on benchmarks	Production ASR
DeepSpeech (Mozilla)	RNN + CTC	Simple, lightweight	Legacy / embedded
Whisper.cpp	GGML quantized Whisper	Runs on CPU/edge without GPU	On-device STT
Faster-Whisper	CTranslate2 backend	4x faster than original Whisper	Production serving

Audio Feature Extraction¶

import torchaudio

# Load audio
waveform, sample_rate = torchaudio.load("speech.wav")

# Convert to mel spectrogram (the "image" that the model sees)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,        # FFT window size
    hop_length=160,   # 10ms hop (160 samples at 16kHz)
    n_mels=80         # 80 mel frequency bins
)
mel = mel_transform(waveform)
# Shape: [1, 80, T] — 80 frequency bins x T time frames
# This is the input to the encoder

Why this matters for hardware: - Mel spectrogram computation is an FFT + filterbank — can be accelerated on DSP or FPGA (L6) - The encoder processes the mel spectrogram — this is the compute-heavy part (matrix multiply) - Streaming STT processes chunks (e.g., 1 second at a time) — requires careful state management

Streaming vs Batch STT¶

Mode	Latency	Accuracy	Use case
Batch	Process entire audio file at once	Highest	Transcription, subtitles
Streaming	Process chunks in real-time	Slightly lower	Live conversation, voice assistant

Streaming requires: - Chunked input (e.g., 1-second windows with overlap) - Encoder state caching between chunks - Partial hypothesis output with correction

Projects¶

Run Whisper on a sample audio file. Measure inference time and WER (word error rate).
Whisper on Jetson — deploy Whisper with TensorRT. Measure latency and compare with CPU.
Streaming STT — implement chunked Whisper inference. Measure time-to-first-word.
Whisper.cpp on edge — run quantized Whisper on Raspberry Pi or Jetson. Benchmark INT8 vs FP16.

2. Text-to-Speech (TTS)¶

How Modern TTS Works¶

Text Input → Text Analysis → Acoustic Model → Vocoder → Audio Output
             (phonemes,       (mel spectrogram   (waveform
              prosody)         generation)         synthesis)

Key Models and Architectures¶

Model	Type	Quality	Speed	Use case
VITS	End-to-end (text → audio)	High	Fast	Production TTS
Bark (Suno)	GPT-style autoregressive	Very high, expressive	Slow	Creative, multi-language
Tortoise TTS	Autoregressive + diffusion	Highest quality	Very slow	Voice cloning
Piper	VITS-based, optimized	Good	Very fast	On-device, embedded
Coqui TTS	Multiple architectures	High	Medium	Open-source toolkit
F5-TTS	Flow matching	High, zero-shot	Fast	Voice cloning, multilingual
XTTS (Coqui)	GPT + VITS	High, voice cloning	Medium	Multi-speaker, multi-language

TTS Pipeline Components¶

Text analysis: - Text normalization (numbers, abbreviations, dates → words) - Grapheme-to-phoneme (G2P) conversion - Prosody prediction (duration, pitch, energy)

Acoustic model (mel generation): - Generates mel spectrogram from phoneme sequence - Autoregressive (Tacotron 2) or non-autoregressive (FastSpeech 2, VITS) - Non-autoregressive is faster — better for edge deployment

Vocoder (waveform synthesis): - Converts mel spectrogram → audio waveform - HiFi-GAN: fast, high-quality, lightweight - WaveGlow / WaveNet: higher quality, much slower

# Piper TTS — fast on-device TTS
import piper

voice = piper.PiperVoice.load("en_US-lessac-medium.onnx")
audio = voice.synthesize("Hello, I am running on edge hardware.")
# Runs on CPU, ~10x real-time on Raspberry Pi 4

Projects¶

Run Piper TTS on CPU. Measure real-time factor (RTF). Target: RTF < 0.1 (10x faster than real-time).
VITS on Jetson — deploy VITS with ONNX Runtime or TensorRT. Measure latency per sentence.
Voice cloning — use XTTS or F5-TTS to clone a voice from a 10-second sample.
HiFi-GAN vocoder — run standalone, benchmark on GPU vs CPU. Understand the mel → waveform bottleneck.

3. Voice Activity Detection (VAD) & Keyword Spotting¶

VAD — Is Someone Speaking?¶

Always-on, ultra-low-power. Runs continuously to wake up the full STT pipeline.

Model	Size	Latency	Power	Platform
Silero VAD	1.5 MB	<1ms per frame	~10mW on MCU	CPU, edge
WebRTC VAD	<100 KB	<0.1ms	~1mW	Any CPU
Custom CNN VAD	50–500 KB	<1ms	<5mW	MCU, DSP

# Silero VAD — production-quality, lightweight
import torch
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio('speech.wav', sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
# Returns: [{'start': 1000, 'end': 15000}, ...] — speech segments

Keyword Spotting — "Hey [Device]"¶

Always-on wake word detection. Must run at < 1mW for battery-powered devices.

Models: Small CNNs (DS-CNN), RNNs, or attention-based
Training: Typically custom-trained on the specific wake word
Deployment: TFLite Micro on Cortex-M, dedicated audio DSP
Connection to L4/L5: This is a workload you'd design custom silicon for (always-on NPU)

Noise Suppression / Enhancement¶

Real-time audio cleanup before STT.

Model	Approach	Latency	Quality
RNNoise	GRU-based, handcrafted features	<5ms	Good
DTLN	Dual-signal transformer	~10ms	High
DeepFilterNet	Attention-based filterbank	~10ms	Very high
NSNet2	Dense + GRU	<5ms	Good

Projects¶

Silero VAD — run on a continuous audio stream. Measure detection latency and false positive rate.
Keyword spotter on MCU — train a small CNN for a custom wake word. Deploy with TFLite Micro on Cortex-M.
RNNoise on FPGA — implement the GRU-based noise suppression in HLS (connection to Phase 4A).

4. End-to-End Voice Pipeline¶

Architecture for Edge Voice Assistant¶

┌─────────────────────────────────────────────────────────┐
│                   Edge Device (Jetson / MCU+NPU)         │
│                                                          │
│  ┌─────────┐   ┌──────────┐   ┌──────────┐            │
│  │  VAD    │──▶│   STT    │──▶│  NLU /   │            │
│  │(always  │   │(Whisper  │   │  LLM     │            │
│  │  on)    │   │ or       │   │(on-device│            │
│  └─────────┘   │Conformer)│   │ or cloud)│            │
│                └──────────┘   └────┬─────┘            │
│                                    │                    │
│                                    ▼                    │
│                              ┌──────────┐              │
│  ┌─────────┐                │   TTS    │              │
│  │ Speaker │◀───────────────│  (VITS/  │              │
│  │         │                │  Piper)  │              │
│  └─────────┘                └──────────┘              │
└─────────────────────────────────────────────────────────┘

Latency budget for natural conversation:

Stage	Target	Model
VAD detection	< 50ms	Silero VAD
STT (streaming)	< 300ms to first word	Whisper / Conformer
NLU / LLM response	< 500ms	On-device small LLM or cloud
TTS synthesis	< 200ms to first audio	VITS / Piper
Total round-trip	< 1 second

Projects¶

Full pipeline on Jetson — VAD → Whisper STT → simple NLU → Piper TTS. Measure end-to-end latency.
Optimize for latency — quantize STT to INT8, use streaming chunked inference, pre-warm TTS. Target < 800ms round-trip.
Compare cloud vs edge — same pipeline on Jetson vs cloud API. Measure latency, accuracy, and privacy trade-off.

5. Voice AI for Hardware Design Context¶

Voice workload	Compute pattern	Why hardware engineers care
Mel spectrogram	FFT + filterbank	Can be accelerated in DSP/FPGA — fixed-function vs general compute trade-off
Encoder (Conformer)	Attention + conv	Same matmul-heavy compute as vision — systolic arrays apply
Autoregressive decode	Sequential token generation	Memory-bound like LLM decode — HBM bandwidth matters
Vocoder (HiFi-GAN)	Transposed conv, upsampling	Unique compute pattern — not well-served by standard matmul accelerators
VAD / keyword	Tiny CNN/RNN	Target for always-on NPU design (< 1mW) — Phase 5F AI Chip Design
Noise suppression	GRU / filterbank	Real-time streaming with strict latency — FPGA or dedicated DSP

Resources¶

Resource	What it covers
Whisper	Open-source STT model
Faster-Whisper	CTranslate2-based fast Whisper
Whisper.cpp	C/C++ port for edge deployment
Piper	Fast on-device TTS
Coqui TTS	Open-source TTS toolkit
Silero VAD	Production-quality VAD
RNNoise	Real-time noise suppression
ESPnet	End-to-end speech processing toolkit
SpeechBrain	PyTorch speech toolkit

Next¶

→ Module 5A — Edge AI & Model Optimization — quantize and deploy these voice models on edge hardware.