Sundarreson, Pushpika and Kumarapathirage, Sapna (2024) SentiGEN: Synthetic Data Generator for Sentiment Analysis. Journal of Computing Theories and Applications, 1 (4). pp. 461-477. ISSN 3024-9104
10480-Article Text-33060-2-10-20240615.pdf - Published Version
Abstract
Obtaining high-quality, diverse, accurate datasets for sentiment analysis has always been a significant challenge. Traditional approaches rely on human annotators, who may introduce bias into datasets, and manual annotation is also time-consuming and expensive. Such datasets may also lack the variety needed to train robust and generalizable sentiment analysis models. This study introduces a novel combination of techniques to address the problem. The proposed system, SentiGEN, uses a transformer, T5, fine-tuned and optimized using an evolutionary algorithm, to generate high-quality, diverse, accurate data for sentiment analysis. The generated data is validated using XLNet to ensure high sentiment accuracy. This combination of technologies has proven successful based on the results derived from evaluating multiple models. From complex transformers such as BERT to more straightforward approaches like KNN, models trained using synthetic data demonstrated superior performance compared to their counterparts trained on real data. This enhancement in predictive accuracy was observed when evaluated on benchmark datasets such as SST-2 and Yelp. SentiGEN can generate high-quality, diverse, accurate, realistic data for sentiment analysis, and it increased the performance of models trained on synthetic data compared to the same models trained on real data.
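The abstract does not give implementation details, so the following is only a toy sketch of the generate-then-validate idea it describes: candidates are produced for a target sentiment, and a classifier keeps only those whose predicted label matches with sufficient confidence. Everything here is an assumption made for illustration: `toy_generate` stands in for the fine-tuned T5 generator, `toy_classify` stands in for the XLNet validator, and `filter_synthetic` and `min_confidence` are hypothetical names. The evolutionary optimization step is not modeled.

```python
from typing import Callable, List, Tuple


def toy_generate(label: str) -> List[str]:
    """Hypothetical stand-in for a fine-tuned generator such as T5."""
    candidates = {
        "positive": [
            "I loved this place.",
            "The food was fine.",
            "Great service and great food.",
        ],
        "negative": [
            "I hated the wait.",
            "The food was fine.",
            "Terrible service.",
        ],
    }
    return candidates[label]


def toy_classify(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a validator such as XLNet.

    Returns (predicted_label, confidence) from a tiny keyword lexicon.
    """
    positive_words = {"loved", "great"}
    negative_words = {"hated", "terrible"}
    words = {w.strip(".,!").lower() for w in text.split()}
    pos = len(words & positive_words)
    neg = len(words & negative_words)
    if pos == neg:
        # No clear signal: report a low-confidence guess.
        return ("positive", 0.5)
    return ("positive" if pos > neg else "negative", 0.9)


def filter_synthetic(
    label: str,
    generate: Callable[[str], List[str]],
    classify: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.8,
) -> List[str]:
    """Keep generated texts whose validated sentiment matches the target."""
    kept = []
    for text in generate(label):
        predicted, confidence = classify(text)
        if predicted == label and confidence >= min_confidence:
            kept.append(text)
    return kept


if __name__ == "__main__":
    for target in ("positive", "negative"):
        print(target, filter_synthetic(target, toy_generate, toy_classify))
```

In this sketch the ambiguous sentence "The food was fine." is dropped for both target labels, which illustrates how a validation step could remove samples with unclear sentiment before they reach a training set.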
Item Type: | Article |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Depositing User: | dl fts |
Date Deposited: | 24 Nov 2024 07:24 |
Last Modified: | 29 Nov 2024 15:21 |
URI: | https://dl.futuretechsci.org/id/eprint/31 |