Language-Similarity-Guided Transfer Fine-Tuning of Pre-trained Transformer Models for Sentiment Analysis Across 12 Indonesian Regional Languages

Darnoto, Brian Rizqi Paradisiaca and Firmawan, Dony Bahtera (2026) Language-Similarity-Guided Transfer Fine-Tuning of Pre-trained Transformer Models for Sentiment Analysis Across 12 Indonesian Regional Languages. Journal of Computing Theories and Applications, 3 (4). pp. 547-563. ISSN 3024-9104

[thumbnail of 15975-Article Text-56530-2-10-20260507.pdf]

Text
15975-Article Text-56530-2-10-20260507.pdf - Published Version
Available under License Creative Commons Attribution.
Download (864kB)

Official URL: https://doi.org/10.62411/jcta.15975

Abstract

Sentiment analysis for Indonesian regional languages faces two persistent challenges: labeled training data is extremely limited for most regional varieties, and transformer models pre-trained on Bahasa Indonesia do not generalize reliably to languages with substantially different morphological structures. Prior work on the NusaX benchmark has primarily relied on direct fine-tuning, treating each regional language independently and without exploiting linguistic proximity between related languages as a transfer signal. This paper proposes Language-Similarity-Guided Transfer (LSGT), a sequential fine-tuning strategy that first adapts a pre-trained model to a pivot language selected using character trigram similarity, followed by fine-tuning on the target language. Four transformer models are evaluated across all 12 NusaX languages using the official train/validation/test splits: IndoBERT, NusaBERT, mBERT, and XLM-R. Performance is evaluated using four metrics: accuracy, macro F1, macro precision, and macro recall. Experimental results show that LSGT improves macro F1 in 44 of 48 model-language combinations, demonstrating that the fine-tuning strategy itself is a major factor in low-resource cross-lingual sentiment classification. XLM-R benefits most strongly from LSGT, achieving an average improvement of +0.137 macro F1 and a peak gain of +0.298 on Madurese. SHAP-based token attribution analysis further reveals that predictions rely heavily on named entities and domain-specific nouns rather than sentiment-bearing vocabulary, indicating a dataset-level bias inherited from the original SmSA corpus and propagated through the NusaX translation pipeline.

Item Type:	Article
Subjects:	Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Depositing User:	dl fts
Date Deposited:	08 May 2026 06:27
Last Modified:	08 May 2026 06:27
URI:	https://dl.futuretechsci.org/id/eprint/181

Actions (login required)

: View Item

Search for collections on FTS Digilib