#low-resource-languages

from InfoQ
4 days ago

Hugging Face Introduces mmBERT, a Multilingual Encoder for 1,800+ Languages

Hugging Face has released mmBERT, a new multilingual encoder trained on more than 3 trillion tokens across 1,833 languages. The model builds on the ModernBERT architecture and is the first to significantly improve upon XLM-R, a long-time baseline for multilingual understanding tasks. mmBERT uses a progressive training schedule instead of training on all languages at once. It starts with 60 high-resource languages, expands to 110, and finally includes all 1,833 languages.
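The progressive schedule described above can be sketched as a simple step-dependent language pool. This is a minimal illustrative sketch, not mmBERT's actual training code: the phase cutoffs (60% and 90% of training) and the placeholder language IDs are assumptions; only the pool sizes (60, 110, 1,833) come from the announcement.

```python
# Sketch of a progressive language schedule, mmBERT-style:
# start with 60 high-resource languages, expand to 110,
# then include all 1,833. Phase boundaries are assumed.

def languages_for_phase(step, total_steps, all_languages):
    """Return the language pool active at a given training step."""
    progress = step / total_steps
    if progress < 0.6:        # phase 1 cutoff (assumed)
        return all_languages[:60]
    elif progress < 0.9:      # phase 2 cutoff (assumed)
        return all_languages[:110]
    return all_languages      # phase 3: all languages

# Placeholder language IDs for illustration
langs = [f"lang_{i}" for i in range(1833)]
print(len(languages_for_phase(100, 1000, langs)))  # early training
print(len(languages_for_phase(700, 1000, langs)))  # mid training
print(len(languages_for_phase(950, 1000, langs)))  # final phase
```

One design point: keeping the pool small early lets the model build stable representations from data-rich languages before rarer languages, which contribute far fewer tokens, enter the mix.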
Artificial intelligence
Digital life
from Fortune Asia
2 months ago

The world's best AI models operate in English. Other languages, even major ones like Cantonese, risk falling further behind

AI translation models struggle with languages that have limited online data, leading to mistranslations and inaccuracies.
Scala
from Hackernoon
11 months ago

Why Lua Is the Ideal Benchmark for Testing Quantized Code Models | HackerNoon

Lua presents unique challenges for quantized model performance due to its low-resource status and unconventional programming paradigms.