Developer Trains 9M-Parameter Mandarin Pronunciation Tutor Using CTC Loss

An independent developer built a specialized, on-device Computer-Assisted Pronunciation Training (CAPT) system for Mandarin tones without relying on commercial APIs. The system pairs a Conformer encoder with Connectionist Temporal Classification (CTC) loss and was trained on roughly 300 hours of speech data. The approach favors verbatim transcription over auto-correction, giving the granular feedback needed to master a tonal language.

La Era

February 23, 2026

An independent developer announced the release of a compact, nine-million-parameter Mandarin pronunciation tutor designed for on-device operation, according to a report published on simedw.com.

The developer built the system after an earlier heuristic-based pitch visualization tool proved too brittle to handle real-world audio complexities such as noise and coarticulation.

Treating the problem as a specialized Automatic Speech Recognition (ASR) task, the developer opted for a Conformer encoder trained with CTC loss, balancing the need for local spectral feature capture (via convolution) and global contextual understanding (via attention).

Standard sequence-to-sequence ASR models may silently correct pronunciation errors in order to produce the most likely text. A CTC-trained model instead emits token probabilities for every audio frame, which keeps its output much closer to what was actually spoken.
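The frame-by-frame behavior can be illustrated with a minimal greedy CTC decoder. This is a sketch under stated assumptions, not the developer's code: the report does not say how the tutor decodes or aligns its output, and the `<blank>` token name and the toy probabilities are made up for illustration.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank token


def ctc_greedy_decode(frame_probs):
    """Decode a list of per-frame {token: probability} dicts."""
    # Pick the most probable token in each frame independently.
    best_per_frame = [max(frame, key=frame.get) for frame in frame_probs]
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    collapsed = [token for token, _ in groupby(best_per_frame)]
    return [token for token in collapsed if token != BLANK]


# Toy posteriors: the speaker meant "ni3" but produced a rising tone.
frames = [
    {BLANK: 0.9, "ni3": 0.05, "ni2": 0.05},
    {BLANK: 0.1, "ni3": 0.3, "ni2": 0.6},
    {BLANK: 0.2, "ni3": 0.2, "ni2": 0.6},
    {BLANK: 0.8, "ni3": 0.1, "ni2": 0.1},
    {BLANK: 0.1, "hao3": 0.9},
]
decoded = ctc_greedy_decode(frames)
print(decoded)  # ['ni2', 'hao3']
```

Because no language model rescoring is applied, the mistaken tone survives in the output instead of being corrected to the more likely "ni3".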

The developer defined a vocabulary of 1,254 tokens, each representing a Pinyin syllable together with its tone, and avoided a Hanzi representation, which would obscure phonetic mistakes.
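A short sketch shows why syllable-plus-tone tokens make mistakes visible. The token spelling used here (`ni3`, with digits 1 to 5 and 5 standing for the neutral tone) is an assumption, as are the helper names; the report does not describe the actual token format.

```python
import re

# Assumed token shape: lowercase Pinyin letters followed by a tone digit.
SYLLABLE_RE = re.compile(r"^([a-zü]+)([1-5])$")


def split_token(token):
    """Split a token such as 'hao3' into ('hao', 3)."""
    match = SYLLABLE_RE.match(token)
    if match is None:
        raise ValueError(f"not a syllable+tone token: {token!r}")
    return match.group(1), int(match.group(2))


def compare(expected, heard):
    """Label each position as 'ok', 'tone' (right syllable, wrong tone) or 'syllable'."""
    results = []
    for exp, got in zip(expected, heard):
        exp_syllable, _ = split_token(exp)
        got_syllable, _ = split_token(got)
        if exp == got:
            verdict = "ok"
        elif exp_syllable == got_syllable:
            verdict = "tone"
        else:
            verdict = "syllable"
        results.append((exp, got, verdict))
    return results


feedback = compare(["ni3", "hao3"], ["ni2", "hao3"])
print(feedback)  # [('ni3', 'ni2', 'tone'), ('hao3', 'hao3', 'ok')]
```

A Hanzi-level output would map both "ni2" and "ni3" attempts onto the same character, so the tone error in the first position would disappear.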

After training on approximately 300 hours of combined AISHELL-1 and Primewords data, the initial 75M-parameter model was aggressively pruned down to the nine-million-parameter version, retaining high accuracy while achieving an 11 MB footprint post-quantization.
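The reported footprint can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes 4 bytes per parameter for 32-bit floats and 1 byte per parameter for 8-bit integers; the helper name is hypothetical, and the explanation of the gap to 11 MB is an assumption rather than something the report states.

```python
def model_size_mb(num_params, bytes_per_param):
    """Raw weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


fp32_mb = model_size_mb(9_000_000, 4)  # 32-bit float weights
int8_mb = model_size_mb(9_000_000, 1)  # 8-bit integer weights
print(fp32_mb, int8_mb)  # 36.0 9.0
```

Nine million 8-bit weights come to about 9 MB, so the reported 11 MB is plausible once quantization scales, any layers kept at higher precision, and file metadata are counted.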

Testing revealed a critical alignment bug related to leading silence. The developer resolved it by decoupling the UI highlighting spans from the frames used for scoring, so that blank tokens no longer dominate the confidence metrics.
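One way to picture the fix is a scoring function that ignores blank-dominated frames inside a highlighted span. This is a minimal sketch under assumptions: the report only says that highlighting spans and scoring frames were decoupled, so the function names, the `<blank>` token name, and the 0.5 threshold are all hypothetical.

```python
BLANK = "<blank>"  # hypothetical name for the CTC blank token


def naive_score(frame_probs, start, end, target):
    """Average the target probability over every frame in the UI span."""
    span = frame_probs[start:end]
    return sum(frame.get(target, 0.0) for frame in span) / len(span)


def voiced_score(frame_probs, start, end, target, blank_threshold=0.5):
    """Average the target probability over frames not dominated by blank."""
    span = frame_probs[start:end]
    voiced = [f for f in span if f.get(BLANK, 0.0) < blank_threshold]
    if not voiced:
        return 0.0  # nothing but silence in this span
    return sum(f.get(target, 0.0) for f in voiced) / len(voiced)


# Three frames of leading silence followed by two confident "ni3" frames.
frames = [{BLANK: 0.98, "ni3": 0.01}] * 3 + [{BLANK: 0.05, "ni3": 0.9}] * 2

naive = naive_score(frames, 0, 5, "ni3")
voiced = voiced_score(frames, 0, 5, "ni3")
print(round(naive, 3), round(voiced, 3))  # 0.366 0.9
```

The span shown to the user can still cover all five frames, while only the two voiced frames contribute to the score.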

Early beta testers report that the system is effective for improving pronunciation. Native speakers, however, noted that the model required them to over-enunciate, suggesting that future work should incorporate more conversational datasets to address domain shift.
