[GH#540] [Future] Sherpa-ONNX Benchmarks: STT/TTS Latency, RAM, CPU on Windows Gaming Rig #17

Open

opened 2026-05-19 22:15:33 +02:00 by Max · 0 comments

Migrated from GitHub #540 (https://github.com/Bio1988/strategy-desktop/issues/540)
Originally created by @Bio1988 on 2026-05-15T07:43:44Z

Context

Sherpa ONNX is the definitive local speech pipeline (ADR 0012). The C.4 (STT) and C.5 (TTS) providers are integrated. This issue tracks the comprehensive benchmarks that were deferred from Batch 5C (#379).

Scope

Benchmark all Sherpa ONNX models planned for Strategy Desktop on a Windows iRacing rig:

STT Models (streaming, per-language)

Model	Size	Priority
zipformer-en-20M	~125 MB	Primary (EN)
zipformer-en-20M-mobile	~105 MB	Low-end fallback
zipformer-en-303M	~303 MB	High-accuracy

TTS Models

Model	Size	Priority
Kokoro-int8 (11 speakers)	~143 MB	Primary
VITS LJSpeech	~106 MB	Fast fallback
PocketTTS-int8 (voice cloning)	~96 MB	Future voice packs

Metrics

Latency: Transcription (STT) and generation (TTS) time in ms for 1 s, 3 s, and 5 s utterances
RAM: Resident set size (working set on Windows) during active STT/TTS
CPU: % utilization, E-Core vs P-Core impact
GPU: ONNX provider overhead (CUDA/DirectML vs CPU)
Gaming impact: FPS delta while iRacing runs
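
The latency metric above could be collected with a small timing harness along these lines. This is a minimal sketch, not the project's benchmark code: `benchmark` and `fake_model_call` are hypothetical names, and the stand-in workload would need to be replaced by the real STT decode or TTS generate call. It reports median and nearest-rank p95 latency plus the real-time factor (processing time divided by audio duration).

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def benchmark(run_once: Callable[[], None], audio_seconds: float,
              repeats: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Time `run_once` and summarize latency (ms) and real-time factor."""
    for _ in range(warmup):
        run_once()  # warm caches and lazy initialisation before timing

    samples_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    median_ms = statistics.median(samples_ms)
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return {
        "median_ms": median_ms,
        "p95_ms": samples_ms[p95_index],
        # Real-time factor: below 1.0 means faster than real time.
        "rtf": (median_ms / 1000.0) / audio_seconds,
    }


def fake_model_call() -> None:
    # Hypothetical stand-in for a real STT decode or TTS generate call.
    sum(i * i for i in range(50_000))


if __name__ == "__main__":
    for seconds in (1.0, 3.0, 5.0):
        print(seconds, benchmark(fake_model_call, audio_seconds=seconds))
```

Reporting p95 alongside the median is a suggestion rather than a requirement of this issue; it tends to expose the occasional slow run that matters during a race.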

Acceptance Criteria

Benchmark report with tables per model/metric
Recommendation for default model selection
FPS impact during active STT (PTT) and TTS (callout)
E-Core vs P-Core affinity recommendation
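
Two of the criteria above reduce to simple arithmetic, sketched here under stated assumptions. `fps_delta` assumes frame times in milliseconds exported by whatever frame-capture tool is used for the run. `affinity_mask` builds the kind of bitmask that Windows affinity settings take, with one bit per logical processor. Which logical indices map to E-Cores differs per CPU, so the index range in the example is purely illustrative. All function names are hypothetical.

```python
from typing import Iterable, Sequence


def average_fps(frame_times_ms: Sequence[float]) -> float:
    """Average FPS over a capture: frame count divided by elapsed seconds."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    return len(frame_times_ms) / (sum(frame_times_ms) / 1000.0)


def fps_delta(baseline_ms: Sequence[float], loaded_ms: Sequence[float]) -> float:
    """FPS change under load; negative means the speech pipeline cost frames."""
    return average_fps(loaded_ms) - average_fps(baseline_ms)


def affinity_mask(core_indices: Iterable[int]) -> int:
    """Bitmask with one bit set per logical core index (bit 0 = core 0)."""
    mask = 0
    for index in core_indices:
        if index < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << index
    return mask


if __name__ == "__main__":
    baseline = [10.0] * 100   # 10 ms per frame, i.e. 100 FPS
    with_tts = [12.5] * 100   # 12.5 ms per frame, i.e. 80 FPS
    print(fps_delta(baseline, with_tts))      # -20.0
    # Assumption: logical processors 16..23 are the E-Cores on this machine.
    print(hex(affinity_mask(range(16, 24))))  # 0xff0000
```

Comparing average FPS is the simplest possible summary; a report might also want 1% lows, which this sketch does not compute.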

Non-goals

Voice cloning quality benchmarks
Non-English model benchmarks (deferred to localization)
VAD/Keyword Spotting benchmarks (small models, negligible overhead)
