
Evaluating Open-Source Machine Learning Project Quality Using SMOTE-Enhanced and Explainable ML/DL Models

Hamza, Ali and Hussain, Wahid and Iftikhar, Hassan and Ahmad, Aziz and Shamim, Alamgir Md (2025) Evaluating Open-Source Machine Learning Project Quality Using SMOTE-Enhanced and Explainable ML/DL Models. Journal of Computing Theories and Applications, 3 (2). pp. 206-222. ISSN 3024-9104

Full text: 14793-Article Text-52602-1-10-20251116.pdf - Published Version (1MB)
Available under License Creative Commons Attribution.
Abstract

The rapid growth of open-source software (OSS) in machine learning (ML) has intensified the need for reliable, automated methods to assess project quality, particularly as OSS increasingly underpins critical applications in science, industry, and public infrastructure. This study evaluates the effectiveness of a diverse set of machine learning and deep learning (ML/DL) algorithms for classifying GitHub OSS ML projects as engineered or non-engineered using a SMOTE-enhanced and explainable modeling pipeline. The dataset used in this research includes both numerical and categorical attributes representing documentation, testing, architecture, community engagement, popularity, and repository activity. After handling missing values, standardizing numerical features, encoding categorical variables, and addressing the inherent class imbalance using the Synthetic Minority Oversampling Technique (SMOTE), seven different classifiers—K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), XGBoost (XGB), Logistic Regression (LR), Support Vector Machine (SVM), and a Deep Neural Network (DNN)—were trained and evaluated. Results show that LR (84%) and DNN (85%) outperform all other models, indicating that both linear and moderately deep non-linear architectures can effectively capture key quality indicators in OSS ML projects. Additional explainability analysis using SHAP reveals consistent feature importance across models, with documentation quality, unit testing practices, architectural clarity, and repository dynamics emerging as the strongest predictors. These findings demonstrate that automated, explainable ML/DL-based quality assessment is both feasible and effective, offering a practical pathway for improving OSS sustainability, guiding contributor decisions, and enhancing trust in ML-based systems that depend on open-source components.
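The following is a minimal sketch of the kind of SMOTE-enhanced, explainable classification pipeline the abstract describes, using scikit-learn, imbalanced-learn, and SHAP. The synthetic dataset, feature counts, and hyperparameters below are illustrative assumptions standing in for the repository attributes (documentation, testing, architecture, community engagement, popularity, activity); they are not the authors' data or exact configuration.

```python
# Sketch of a SMOTE-enhanced pipeline with one of the seven evaluated
# classifiers (Logistic Regression) and SHAP-based explainability.
# The synthetic data is a placeholder, NOT the study's GitHub dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
import shap

# Imbalanced stand-in for "engineered vs. non-engineered" project labels.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize numerical features (fit on the training split only).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Oversample the minority class with SMOTE on the training split only,
# so the test set keeps its original class distribution.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Train and evaluate the classifier on the untouched test split.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_bal, y_train_bal)
print(classification_report(y_test, clf.predict(X_test)))

# SHAP explainability: per-feature contributions to individual predictions.
explainer = shap.LinearExplainer(clf, X_train_bal)
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test, show=False)
```

In the same spirit as the paper, the remaining classifiers (KNN, DT, RF, XGB, SVM, DNN) would be swapped in for `LogisticRegression`, with a model-appropriate SHAP explainer (e.g. a tree or kernel explainer) used for the non-linear models.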

Item Type: Article
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Depositing User: dl fts
Date Deposited: 17 Nov 2025 02:17
Last Modified: 17 Nov 2025 02:17
URI: https://dl.futuretechsci.org/id/eprint/136
