Hey folks,
I've embarked on quite the adventure building an ASR model for dialectal Arabic and could use some insights. I'm using PyTorch with a custom dataset of about 120 hours. The challenge? Getting a Conformer-Tiny encoder paired with an LSTM decoder to converge.
For the loss, I'm combining CTC and cross-entropy, weighted 0.4 for CTC and 0.6 for CE. Early in training, both losses drop sharply and then flatten into a plateau, a really stubborn one. The CTC loss bounces around the 50-70 range, and the CE loss is stuck in the 70s. The result? A stagnant validation CER that stays disappointingly high.
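For reference, a weighted CTC/CE objective like the one described above could be sketched roughly as follows. This is an illustration, not my exact training code: the function name `joint_ctc_ce_loss`, the tensor shapes, the blank index 0, and the padding value -100 are all assumptions, and the decoder logits are assumed to be already aligned with the targets (teacher forcing).

```python
import torch
import torch.nn.functional as F

def joint_ctc_ce_loss(ctc_logits, input_lengths, dec_logits, targets,
                      target_lengths, ctc_weight=0.4, blank_id=0, pad_id=-100):
    """Weighted sum of CTC (encoder branch) and CE (decoder branch).

    ctc_logits: (T, B, V) unnormalized encoder outputs
    dec_logits: (B, L, V) unnormalized decoder outputs
    targets:    (B, L) label ids, padded with pad_id (assumed layout)
    """
    # CTC expects log-probabilities shaped (T, B, V).
    log_probs = F.log_softmax(ctc_logits, dim=-1)
    # Padded positions lie beyond target_lengths; replace them only so that
    # every index in the tensor is a valid class id.
    ctc_targets = targets.masked_fill(targets == pad_id, blank_id)
    ctc = F.ctc_loss(log_probs, ctc_targets, input_lengths, target_lengths,
                     blank=blank_id, reduction="mean", zero_infinity=True)
    # Token-level CE over all non-padded decoder positions.
    ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                         targets.reshape(-1), ignore_index=pad_id)
    total = ctc_weight * ctc + (1.0 - ctc_weight) * ce
    return total, ctc, ce

# Toy usage with random tensors (shapes are arbitrary).
torch.manual_seed(0)
T, B, V, L = 50, 2, 10, 5
ctc_logits = torch.randn(T, B, V)
dec_logits = torch.randn(B, L, V)
targets = torch.tensor([[1, 2, 3, 4, 5], [2, 3, 4, -100, -100]])
input_lengths = torch.tensor([50, 40])
target_lengths = torch.tensor([5, 3])
total, ctc, ce = joint_ctc_ce_loss(ctc_logits, input_lengths, dec_logits,
                                   targets, target_lengths)
```

The reduction mode changes the scale of the reported numbers: with `reduction="mean"`, CTC is divided by target length before averaging over the batch, while a summed loss grows with utterance length, so raw loss values are only comparable under the same reduction.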
I've double-checked the basics: tried learning rates from 1e-3 down to 1e-5, experimented with different warm-up strategies, adjusted the number of epochs and the batch size, and tested smaller vocabularies, down to 3000 entries.
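To make "warm-up strategy" concrete, here is a minimal sketch of one common variant: linear warm-up followed by inverse-square-root decay. It is only an example of the kind of schedule meant, not necessarily one of the exact schedules tried; `warmup_inverse_sqrt`, the 10000-step warm-up, and the stand-in `Linear` model are all hypothetical.

```python
import torch

def warmup_inverse_sqrt(step, warmup_steps=10000):
    """Multiplier on the base LR: linear ramp up to 1.0, then 1/sqrt decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

# Stand-in model; the real encoder/decoder would go here.
model = torch.nn.Linear(80, 256)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # peak LR
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=warmup_inverse_sqrt)
# In the training loop: call optimizer.step(), then scheduler.step(),
# once per batch so that "step" counts updates rather than epochs.
```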
The data is somewhat of a riddle: it's not publicly released and comes with quite noisy labels. My validation corpora are MGB2-based and seemingly more reliable.
I'm scratching my head here. Has anyone wrestled with similar training snags, where the loss curves just won't budge even after extensive hyperparameter tuning? If you've tackled such issues, I'd love to hear what worked for you.
Any pointers or fresh directions would be immensely valuable!
Thanks in advance!
Interesting challenge! Have you thought about trying a Transformer-based decoder instead of the LSTM? In my experience, Transformer decoders make better use of attention over the encoder output and tend to converge faster. Also, when I worked on a dialectally rich dataset, I found that normalizing the input features helped, so it might be worth trying. I combined pitch features with MFCCs to better handle the variation across dialects. Since you mentioned noisy labels, have you tried label smoothing? It can sometimes help when the labels aren't entirely accurate.
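If you want to try label smoothing, a minimal sketch in PyTorch could look like this. It assumes a version of PyTorch whose `CrossEntropyLoss` accepts the `label_smoothing` argument (1.10 or later); the smoothing value 0.1 and the toy logits are just illustrative choices, not tuned for your data.

```python
import torch
import torch.nn as nn

# Plain CE versus CE with label smoothing. ignore_index=-100 is assumed
# to be the padding value used for decoder targets.
plain_ce = nn.CrossEntropyLoss(ignore_index=-100)
smooth_ce = nn.CrossEntropyLoss(ignore_index=-100, label_smoothing=0.1)

# Toy example: one very confident, correct prediction over 3 classes.
logits = torch.tensor([[10.0, 0.0, 0.0]])
target = torch.tensor([0])

plain = plain_ce(logits, target)
smooth = smooth_ce(logits, target)
# With smoothing, some target mass is spread over the other classes, so an
# over-confident prediction is penalized even when it is correct.
```

The idea is that the model is no longer pushed to put all probability on a label that may itself be wrong.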
Hey there! I've tackled a similar issue in the past with an ASR model for a low-resource language. One thing that helped was applying SpecAugment to the input features, which made the model more robust by effectively augmenting the training data. Also, have you tried adding dropout to your model? Sometimes a dropout layer can help avoid overfitting, especially in the early stages. Good luck! ASR for dialects can be tricky.
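In case it helps, here is a rough sketch of SpecAugment-style masking in plain PyTorch. It covers only the frequency and time masking parts (no time warping), and the function name `spec_augment`, the mask counts and widths, and the (T, F) feature layout are assumptions for illustration rather than recommended settings.

```python
import torch

def spec_augment(features, num_freq_masks=2, max_freq_width=10,
                 num_time_masks=2, max_time_width=40):
    """Return a copy of (T, F) features with random bands set to zero.

    Zero is used as the mask value, which assumes the features are
    roughly mean-normalized.
    """
    out = features.clone()
    T, F = out.shape
    for _ in range(num_freq_masks):
        # Mask a random band of consecutive frequency bins.
        width = torch.randint(0, min(max_freq_width, F) + 1, (1,)).item()
        start = torch.randint(0, F - width + 1, (1,)).item()
        out[:, start:start + width] = 0.0
    for _ in range(num_time_masks):
        # Mask a random span of consecutive frames.
        width = torch.randint(0, min(max_time_width, T) + 1, (1,)).item()
        start = torch.randint(0, T - width + 1, (1,)).item()
        out[start:start + width, :] = 0.0
    return out

# Toy usage: 100 frames of 80-dimensional features.
torch.manual_seed(0)
x = torch.ones(100, 80)
y = spec_augment(x)
```

Apply it only to training batches, after feature extraction and normalization, and leave validation data untouched.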