Module 6A — Voice AI
Parent: Phase 3 — Artificial Intelligence · Track A
Speech-to-text, text-to-speech, and real-time voice pipelines — the audio workloads that drive edge AI hardware.
Prerequisites: Module 1 (Neural Networks — transformers, attention), Module 2 (Frameworks — PyTorch).
Role targets: Voice AI Engineer · Speech/Audio ML Engineer · Edge Audio Engineer · Conversational AI Engineer
Why Voice AI Matters for Hardware Engineers
Voice is one of the most latency-sensitive AI workloads: users notice conversational delays above roughly 200 ms. That creates hard requirements for hardware:
- STT (Speech-to-Text): Real-time transcription needs streaming inference with <100ms latency per chunk
- TTS (Text-to-Speech): Natural voice synthesis uses autoregressive generation (like LLMs), diffusion or flow models, or fast non-autoregressive models such as VITS
- On-device voice: Privacy-critical applications (medical, military, automotive) need inference without the cloud, so the edge chip must handle it
- VAD (Voice Activity Detection): Always-on and ultra-low-power; runs on an MCU or dedicated DSP (L4/L5 hardware)
| Voice task | Compute pattern | Hardware implication |
|---|---|---|
| VAD | Tiny CNN, always-on | L4: MCU/DSP, < 1mW power budget |
| STT (streaming) | Encoder-decoder, chunked | L1/L3: streaming inference, low latency |
| TTS (neural) | Autoregressive or diffusion | L1/L3: memory-bound generation, like LLM decode |
| Keyword spotting | Small CNN/RNN, always-on | L4: TinyML, dedicated audio NPU |
| Speaker verification | Embedding model, one-shot | L1: inference + vector similarity |
| Noise suppression | U-Net or RNNoise, real-time | L4/L6: DSP or FPGA, strict latency budget |
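The "inference + vector similarity" pattern in the speaker verification row comes down to comparing two fixed-length embeddings. A minimal sketch, assuming the embeddings have already been produced by some speaker model; the toy vectors and the 0.7 threshold are illustrative placeholders, not values from any real system:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)

def same_speaker(enrolled, probe, threshold=0.7):
    """Accept the probe if it is close enough to the enrolled embedding.

    The threshold is hypothetical; real systems tune it on held-out data.
    """
    return cosine_similarity(enrolled, probe) >= threshold

# Toy 4-dimensional embeddings (real ones are much larger)
enrolled = [0.2, 0.9, -0.1, 0.4]
probe_same = [0.25, 0.85, -0.05, 0.35]
probe_other = [-0.7, 0.1, 0.6, -0.2]

print(same_speaker(enrolled, probe_same))   # True
print(same_speaker(enrolled, probe_other))  # False
```

The compute cost here is one dot product and two norms per comparison, which is why the embedding model, not the similarity step, dominates the hardware budget.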
1. Speech-to-Text (STT / ASR)
How Modern STT Works
Audio Input → Feature Extraction (mel spectrogram) → Encoder (transformer) → Decoder (CTC/attention) → Text Output
Key Models and Architectures
| Model | Architecture | Strengths | Use case |
|---|---|---|---|
| Whisper (OpenAI) | Encoder-decoder transformer | Multi-language, robust, open-source | General-purpose STT |
| Wav2Vec 2.0 (Meta) | Self-supervised encoder + CTC | Pre-trained on unlabeled audio | Low-resource languages |
| Conformer (Google) | Convolution + transformer | Strong benchmark accuracy (e.g., LibriSpeech) | Production ASR |
| DeepSpeech (Mozilla) | RNN + CTC | Simple, lightweight | Legacy / embedded |
| Whisper.cpp | GGML quantized Whisper | Runs on CPU/edge without GPU | On-device STT |
| Faster-Whisper | CTranslate2 backend | Up to 4x faster than the reference Whisper implementation | Production serving |
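Several models in the table end in a CTC output layer. A minimal sketch of greedy CTC decoding, assuming the per-frame argmax over the vocabulary has already been taken; the vocabulary and frame sequence are made up for illustration:

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated labels, then drop blanks (the CTC collapsing rule)."""
    out = []
    prev = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if idx != prev and idx != blank_id:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

# Hypothetical 5-symbol vocabulary; id 0 is the CTC blank
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}

# One label per audio frame; the blank between the two runs of 3
# is what lets the decoder output a double "l".
frames = [1, 1, 0, 2, 2, 3, 3, 0, 3, 4, 4, 0]

print(ctc_greedy_decode(frames, vocab))  # hello
```

Greedy decoding is a single pass over the frames with no search, which makes it attractive on small devices; beam search with a language model typically trades extra compute for accuracy.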
Audio Feature Extraction
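import torchaudio
import torchaudio.functional as F
# Load audio; the file's native sample rate may not be 16 kHz
waveform, sample_rate = torchaudio.load("speech.wav")
# Resample so the transform parameters below match the actual data
if sample_rate != 16000:
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=16000)
    sample_rate = 16000
# Convert to mel spectrogram (the "image" that the model sees)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,       # FFT window size: 25 ms (400 samples at 16 kHz)
    hop_length=160,  # 10 ms hop (160 samples at 16 kHz)
    n_mels=80,       # 80 mel frequency bins
)
mel = mel_transform(waveform)
# Shape: [channels, 80, T], i.e. [1, 80, T] for mono: 80 frequency bins x T time frames
# This is the input to the encoder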
Why this matters for hardware:
- Mel spectrogram computation is an FFT + filterbank, which can be accelerated on DSP or FPGA (L6)
- The encoder processes the mel spectrogram; this is the compute-heavy part (matrix multiply)
- Streaming STT processes chunks (e.g., 1 second at a time), which requires careful state management
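The feature-extraction parameters fix how many frames per second the encoder has to process. A short worked calculation, assuming centered framing (torchaudio's default), which gives one frame per hop plus one; the 1-second and 30-second durations are just examples:

```python
SAMPLE_RATE = 16_000   # Hz
N_FFT = 400            # samples per analysis window
HOP_LENGTH = 160       # samples between consecutive frames

def frames_for(seconds, sample_rate=SAMPLE_RATE, hop=HOP_LENGTH):
    """Number of spectrogram frames for a clip, assuming centered framing."""
    n_samples = int(seconds * sample_rate)
    return 1 + n_samples // hop

window_ms = 1000 * N_FFT / SAMPLE_RATE         # 25.0 ms window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE       # 10.0 ms hop
frames_per_second = SAMPLE_RATE / HOP_LENGTH   # 100.0 frames per second

print(window_ms, hop_ms, frames_per_second)
print(frames_for(1))    # 101
print(frames_for(30))   # 3001
```

At 80 mel bins and 100 frames per second, the encoder input is about 8,000 values per second of audio, which is tiny compared with the encoder's own activations and weights.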
Streaming vs Batch STT
| Mode | Processing | Latency | Accuracy | Use case |
|---|---|---|---|---|
| Batch | Entire audio file at once | Result only after the full recording is available | Highest | Transcription, subtitles |
| Streaming | Chunks in real time | Partial results while audio is still arriving | Slightly lower | Live conversation, voice assistant |
Streaming requires:
- Chunked input (e.g., 1-second windows with overlap)
- Encoder state caching between chunks
- Partial hypothesis output with correction
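The chunked-input requirement can be sketched as a generator that yields overlapping windows. This is an illustrative helper, not part of any STT library; the function name and the 200 ms overlap are assumptions chosen for the example:

```python
def overlapping_chunks(samples, chunk_size, overlap):
    """Yield (start_index, chunk) pairs; consecutive chunks share `overlap` samples.

    The final chunk may be shorter than chunk_size.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if len(samples) == 0:
        return
    step = chunk_size - overlap
    # Stop once the remaining samples are already covered by the previous chunk.
    for start in range(0, max(len(samples) - overlap, 1), step):
        yield start, samples[start:start + chunk_size]

SAMPLE_RATE = 16_000
audio = [0.0] * int(2.5 * SAMPLE_RATE)   # 2.5 s of placeholder audio

chunk = SAMPLE_RATE                  # 1-second window
overlap = int(0.2 * SAMPLE_RATE)     # 200 ms shared with the previous window

for start, piece in overlapping_chunks(audio, chunk, overlap):
    print(f"start={start / SAMPLE_RATE:.2f}s length={len(piece) / SAMPLE_RATE:.2f}s")
```

The overlap gives the model acoustic context at chunk boundaries, at the cost of recomputing those samples; caching encoder state between chunks is the usual way to avoid that recomputation.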
Projects
- Run Whisper on a sample audio file. Measure inference time and WER (word error rate).
- Whisper on Jetson — deploy Whisper with TensorRT. Measure latency and compare with CPU.
- Streaming STT — implement chunked Whisper inference. Measure time-to-first-word.
- Whisper.cpp on edge — run quantized Whisper on Raspberry Pi or Jetson. Benchmark INT8 vs FP16.
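For the WER measurements in these projects, a minimal word-level edit-distance sketch is shown below. It assumes whitespace tokenization and lowercasing only; real evaluations usually also normalize punctuation and numbers, and the sentences here are invented:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"   # one substitution, one deletion

print(f"WER = {word_error_rate(reference, hypothesis):.3f}")  # WER = 0.333
```

WER can exceed 1.0 when the hypothesis contains many insertions, so it should not be read as a percentage of words that are wrong.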
2. Text-to-Speech (TTS)
How Modern TTS Works
Text Input → Text Analysis (phonemes, prosody) → Acoustic Model (mel spectrogram generation) → Vocoder (waveform synthesis) → Audio Output
Key Models and Architectures
| Model | Type | Quality | Speed | Use case |
|---|---|---|---|---|
| VITS | End-to-end (text → audio) | High | Fast | Production TTS |
| Bark (Suno) | GPT-style autoregressive | Very high, expressive | Slow | Creative, multi-language |
| Tortoise TTS | Autoregressive + diffusion | Highest quality | Very slow | Voice cloning |
| Piper | VITS-based, optimized | Good | Very fast | On-device, embedded |
| Coqui TTS | Multiple architectures | High | Medium | Open-source toolkit |
| F5-TTS | Flow matching | High, zero-shot | Fast | Voice cloning, multilingual |
| XTTS (Coqui) | GPT + VITS | High, voice cloning | Medium | Multi-speaker, multi-language |
TTS Pipeline Components
Text analysis:
- Text normalization (numbers, abbreviations, dates → words)
- Grapheme-to-phoneme (G2P) conversion
- Prosody prediction (duration, pitch, energy)

Acoustic model (mel generation):
- Generates mel spectrogram from phoneme sequence
- Autoregressive (Tacotron 2) or non-autoregressive (FastSpeech 2, VITS)
- Non-autoregressive is faster and therefore better suited to edge deployment

Vocoder (waveform synthesis):
- Converts mel spectrogram → audio waveform
- HiFi-GAN: fast, high-quality, lightweight
- WaveGlow / WaveNet: earlier neural vocoders; WaveNet is autoregressive and much slower, and WaveGlow is parallel but a much larger model than HiFi-GAN
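The text normalization step of text analysis can be illustrated with a toy normalizer. This is a deliberately small sketch: the abbreviation table is hypothetical, only integers from 0 to 999 are handled, and production front ends resolve ambiguous cases (for example "St." as street or saint) from context:

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]

# Hypothetical lookup table; real systems disambiguate by context.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "etc.": "et cetera"}

def int_to_words(n):
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0..999 supported in this sketch")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    words = f"{_ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} {int_to_words(rest)}"

def normalize(text):
    """Expand known abbreviations and short integers into words."""
    tokens = []
    for tok in text.split():
        low = tok.lower()
        if low in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[low])
        elif re.fullmatch(r"\d{1,3}", tok):
            tokens.append(int_to_words(int(tok)))
        else:
            tokens.append(tok)
    return " ".join(tokens)

print(normalize("Dr. Smith lives at 42 Elm St."))
# doctor Smith lives at forty two Elm street
```

This stage is string processing rather than tensor math, so it typically runs on the CPU regardless of which accelerator handles the acoustic model and vocoder.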
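# Piper TTS: fast on-device TTS
# NOTE: the Python API has changed between Piper releases; check the installed
# version for the exact signature and return type of synthesize()
import piper
voice = piper.PiperVoice.load("en_US-lessac-medium.onnx")
audio = voice.synthesize("Hello, I am running on edge hardware.")
# Runs on CPU, ~10x faster than real time on Raspberry Pi 4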
Projects
- Run Piper TTS on CPU. Measure real-time factor (RTF). Target: RTF < 0.1 (10x faster than real-time).
- VITS on Jetson — deploy VITS with ONNX Runtime or TensorRT. Measure latency per sentence.
- Voice cloning — use XTTS or F5-TTS to clone a voice from a 10-second sample.
- HiFi-GAN vocoder — run standalone, benchmark on GPU vs CPU. Understand the mel → waveform bottleneck.
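The real-time factor used as a target in these projects is synthesis time divided by the duration of the audio produced. A minimal measurement sketch follows; the `synthesize` callable returning a flat sequence of samples is an assumed interface for illustration, not Piper's actual API, and `fake_synthesize` is a stand-in so the example runs without a model:

```python
import time

def real_time_factor(elapsed_seconds, num_samples, sample_rate):
    """RTF = processing time / audio duration. Below 1.0 is faster than real time."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("no audio produced")
    return elapsed_seconds / audio_seconds

def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize` (hypothetical: text -> sequence of samples)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)

def fake_synthesize(text):
    """Stand-in for a TTS engine: 0.1 s of silence per character at 22.05 kHz."""
    time.sleep(0.01)   # pretend to do some work
    return [0.0] * (len(text) * 2205)

# 0.5 s of compute for 5 s of audio gives RTF 0.1, i.e. 10x faster than real time
print(real_time_factor(0.5, 110_250, 22_050))
print(measure_rtf(fake_synthesize, "hello", 22_050))
```

For a fair benchmark, run a warm-up call first and average over several sentences, since model loading and cache effects inflate the first measurement.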
3. Voice Activity Detection (VAD) & Keyword Spotting
VAD — Is Someone Speaking?
Always-on, ultra-low-power. Runs continuously to wake up the full STT pipeline.
| Model | Size | Latency | Power | Platform |
|---|---|---|---|---|
| Silero VAD | 1.5 MB | <1ms per frame | ~10mW on MCU | CPU, edge |
| WebRTC VAD | <100 KB | <0.1ms | ~1mW | Any CPU |
| Custom CNN VAD | 50–500 KB | <1ms | <5mW | MCU, DSP |
# Silero VAD: production-quality, lightweight
import torch
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
# utils is a tuple: (get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks)
(get_speech_timestamps, _, read_audio, _, _) = utils
wav = read_audio('speech.wav', sampling_rate=16000)
timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
# Returns speech segments as sample indices, e.g. [{'start': 1000, 'end': 15000}, ...]
# Divide by the sampling rate (16000) to convert to seconds
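For contrast with the neural model, a classic energy-threshold detector shows the frame-level decision that any VAD ultimately makes. This is a stand-in technique for illustration, not how Silero VAD works internally; the 30 ms frame, the threshold, and the hangover length are assumed values:

```python
import math

def frame_rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def energy_vad(samples, frame_len=480, threshold=0.02, hangover_frames=3):
    """Return one speech/non-speech boolean per full frame.

    A frame is speech if its RMS reaches the threshold. The hangover keeps
    the decision high for a few frames after energy drops, so short pauses
    inside a word do not end the segment.
    """
    decisions = []
    hang = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_rms(samples[start:start + frame_len]) >= threshold:
            hang = hangover_frames
            decisions.append(True)
        elif hang > 0:
            hang -= 1
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions

SAMPLE_RATE = 16_000
silence = [0.0] * 960   # two 30 ms frames
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(960)]
tail = [0.0] * 2400     # five 30 ms frames

print(energy_vad(silence + tone + tail))
# [False, False, True, True, True, True, True, False, False]
```

An energy detector costs a handful of multiply-accumulates per sample, but it fires on any loud sound; learned models spend more compute to separate speech from noise.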
Keyword Spotting — "Hey [Device]"
Always-on wake word detection. Must run at < 1mW for battery-powered devices.
- Models: Small CNNs (DS-CNN), RNNs, or attention-based
- Training: Typically custom-trained on the specific wake word
- Deployment: TFLite Micro on Cortex-M, dedicated audio DSP
- Connection to L4/L5: This is a workload you'd design custom silicon for (always-on NPU)
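The sub-milliwatt requirement makes more sense next to the rest of the power budget. A back-of-the-envelope sketch, where every number (stage powers, duty cycle, battery capacity) is hypothetical and chosen only to show the arithmetic:

```python
def average_power_mw(always_on_mw, active_mw, duty_cycle):
    """Average draw when a cheap stage runs constantly and an expensive
    stage runs only for `duty_cycle` (0..1) of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    return always_on_mw + active_mw * duty_cycle

def battery_life_hours(capacity_mwh, avg_power_mw):
    """Idealized battery life, ignoring conversion losses and self-discharge."""
    return capacity_mwh / avg_power_mw

# Hypothetical device: 1 mW wake-word stage, 2000 mW full STT pipeline,
# and a 3700 mWh battery (about 1000 mAh at 3.7 V).
for duty in (0.0, 0.01, 0.05):
    avg = average_power_mw(1.0, 2000.0, duty)
    hours = battery_life_hours(3700.0, avg)
    print(f"duty={duty:.0%}  avg={avg:.1f} mW  life={hours:.0f} h")
```

With these assumed figures, the full pipeline dominates the average even at a 1% duty cycle, so false wake-ups cost far more energy than the always-on detector itself.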
Noise Suppression / Enhancement
Real-time audio cleanup before STT.
| Model | Approach | Latency | Quality |
|---|---|---|---|
| RNNoise | GRU-based, handcrafted features | <5ms | Good |
| DTLN | Dual-signal transformation LSTM network | ~10ms | High |
| DeepFilterNet | ERB-band gains + deep filtering (GRU-based) | ~10ms | Very high |
| NSNet2 | Dense + GRU | <5ms | Good |
Projects
- Silero VAD — run on a continuous audio stream. Measure detection latency and false positive rate.
- Keyword spotter on MCU — train a small CNN for a custom wake word. Deploy with TFLite Micro on Cortex-M.
- RNNoise on FPGA — implement the GRU-based noise suppression in HLS (connection to Phase 4A).
4. End-to-End Voice Pipeline
Architecture for Edge Voice Assistant
┌─────────────────────────────────────────────────────────┐
│ Edge Device (Jetson / MCU+NPU) │
│ │
│ ┌─────────┐ ┌──────────┐ ┌──────────┐ │
│ │ VAD │──▶│ STT │──▶│ NLU / │ │
│ │(always │ │(Whisper │ │ LLM │ │
│ │ on) │ │ or │ │(on-device│ │
│ └─────────┘ │Conformer)│ │ or cloud)│ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ ┌─────────┐ │ TTS │ │
│ │ Speaker │◀───────────────│ (VITS/ │ │
│ │ │ │ Piper) │ │
│ └─────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
Latency budget for natural conversation:
| Stage | Target | Model |
|---|---|---|
| VAD detection | < 50ms | Silero VAD |
| STT (streaming) | < 300ms to first word | Whisper / Conformer |
| NLU / LLM response | < 500ms | On-device small LLM or cloud |
| TTS synthesis | < 200ms to first audio | VITS / Piper |
| Total round-trip | < 1 second | Full pipeline, with stages overlapping where possible |
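Adding up the per-stage targets in the table gives 1050 ms, which is above the 1 second round-trip target, so the budget only closes if stages overlap in time (for example, starting the next stage on partial output). A small sketch of that arithmetic; the 100 ms overlap is a hypothetical figure, not a measurement:

```python
# Per-stage upper bounds taken from the latency budget table, in milliseconds
BUDGET_MS = {
    "vad_detection": 50,
    "stt_first_word": 300,
    "llm_response": 500,
    "tts_first_audio": 200,
}
TARGET_MS = 1000

def round_trip_ms(stages, overlap_ms=0):
    """Serial sum of stage latencies minus any time hidden by overlap."""
    return sum(stages.values()) - overlap_ms

serial = round_trip_ms(BUDGET_MS)
overlapped = round_trip_ms(BUDGET_MS, overlap_ms=100)  # hypothetical overlap

print(f"serial:     {serial} ms, meets target: {serial < TARGET_MS}")
print(f"overlapped: {overlapped} ms, meets target: {overlapped < TARGET_MS}")
```

In practice each stage should also land below its bound rather than at it; the table entries are ceilings, not expected values.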
Projects
- Full pipeline on Jetson — VAD → Whisper STT → simple NLU → Piper TTS. Measure end-to-end latency.
- Optimize for latency — quantize STT to INT8, use streaming chunked inference, pre-warm TTS. Target < 800ms round-trip.
- Compare cloud vs edge — same pipeline on Jetson vs cloud API. Measure latency, accuracy, and privacy trade-off.
5. Voice AI for Hardware Design Context
| Voice workload | Compute pattern | Why hardware engineers care |
|---|---|---|
| Mel spectrogram | FFT + filterbank | Can be accelerated in DSP/FPGA — fixed-function vs general compute trade-off |
| Encoder (Conformer) | Attention + conv | Same matmul-heavy compute as vision — systolic arrays apply |
| Autoregressive decode | Sequential token generation | Memory-bound like LLM decode; memory bandwidth matters (HBM on datacenter parts, LPDDR on edge devices) |
| Vocoder (HiFi-GAN) | Transposed conv, upsampling | Unique compute pattern — not well-served by standard matmul accelerators |
| VAD / keyword | Tiny CNN/RNN | Target for always-on NPU design (< 1mW) — Phase 5F AI Chip Design |
| Noise suppression | GRU / filterbank | Real-time streaming with strict latency — FPGA or dedicated DSP |
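The "memory-bound" label on autoregressive decode can be made concrete with a bandwidth-only upper bound: if every weight must be read once per generated token, token rate cannot exceed bandwidth divided by model size. A rough sketch that ignores KV-cache traffic, activations, and compute time; the parameter count and bandwidth are hypothetical:

```python
def max_decode_tokens_per_second(param_count, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-only ceiling on decode rate.

    Assumes all weights are streamed from memory once per token and that
    nothing else limits throughput, so real rates will be lower.
    """
    model_bytes = param_count * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

PARAMS = 1_000_000_000   # hypothetical 1B-parameter autoregressive model
BANDWIDTH_GB_S = 100.0   # hypothetical memory bandwidth

for name, nbytes in (("FP16", 2), ("INT8", 1)):
    rate = max_decode_tokens_per_second(PARAMS, nbytes, BANDWIDTH_GB_S)
    print(f"{name}: at most {rate:.0f} tokens/s")
```

Under these assumptions, halving the bytes per weight doubles the ceiling, which is why quantization helps decode speed even when the arithmetic units are mostly idle.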
Resources
| Resource | What it covers |
|---|---|
| Whisper | Open-source STT model |
| Faster-Whisper | CTranslate2-based fast Whisper |
| Whisper.cpp | C/C++ port for edge deployment |
| Piper | Fast on-device TTS |
| Coqui TTS | Open-source TTS toolkit |
| Silero VAD | Production-quality VAD |
| RNNoise | Real-time noise suppression |
| ESPnet | End-to-end speech processing toolkit |
| SpeechBrain | PyTorch speech toolkit |
Next
→ Module 5A — Edge AI & Model Optimization — quantize and deploy these voice models on edge hardware.