Darnoto, Brian Rizqi Paradisiaca and Firmawan, Dony Bahtera (2026) Language-Similarity-Guided Transfer Fine-Tuning of Pre-trained Transformer Models for Sentiment Analysis Across 12 Indonesian Regional Languages. Journal of Computing Theories and Applications, 3 (4). pp. 547-563. ISSN 3024-9104
15975-Article Text-56530-2-10-20260507.pdf - Published Version
Available under License Creative Commons Attribution.
Download (864kB)
Abstract
Sentiment analysis for Indonesian regional languages faces two persistent challenges: labeled training data is extremely limited for most regional varieties, and transformer models pre-trained on Bahasa Indonesia do not generalize reliably to languages with substantially different morphological structures. Prior work on the NusaX benchmark has primarily relied on direct fine-tuning, treating each regional language independently and without exploiting linguistic proximity between related languages as a transfer signal. This paper proposes Language-Similarity-Guided Transfer (LSGT), a sequential fine-tuning strategy that first adapts a pre-trained model to a pivot language selected using character trigram similarity, followed by fine-tuning on the target language. Four transformer models are evaluated across all 12 NusaX languages using the official train/validation/test splits: IndoBERT, NusaBERT, mBERT, and XLM-R. Performance is evaluated using four metrics: accuracy, macro F1, macro precision, and macro recall. Experimental results show that LSGT improves macro F1 in 44 of 48 model-language combinations, demonstrating that the fine-tuning strategy itself is a major factor in low-resource cross-lingual sentiment classification. XLM-R benefits most strongly from LSGT, achieving an average improvement of +0.137 macro F1 and a peak gain of +0.298 on Madurese. SHAP-based token attribution analysis further reveals that predictions rely heavily on named entities and domain-specific nouns rather than sentiment-bearing vocabulary, indicating a dataset-level bias inherited from the original SmSA corpus and propagated through the NusaX translation pipeline.
| Item Type: | Article |
|---|---|
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
| Depositing User: | dl fts |
| Date Deposited: | 08 May 2026 06:27 |
| Last Modified: | 08 May 2026 06:27 |
| URI: | https://dl.futuretechsci.org/id/eprint/181 |
