Big Data-Driven Health Risk Stratification: A Health Index-Based Approach Using Feature Importance and PySpark

Abioye, Oluwasegun Abiodun and Irhebhude, Martins Ekata (2025) Big Data-Driven Health Risk Stratification: A Health Index-Based Approach Using Feature Importance and PySpark. Journal of Computing Theories and Applications, 2 (4). pp. 456-469. ISSN 3024-9104

12327-Article Text-44309-1-10-20250324.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Health risk stratification is crucial for preventive healthcare, yet existing models often rely on binary classification and generalized disease prediction, neglecting personalized health indicators and graded risk levels. Many studies apply feature selection techniques such as Relief and Univariate Selection without quantifying the weighted impact of each feature. To address these gaps, this study introduces a Big Data-driven Health Index (HI) framework that uses PySpark for scalable health risk stratification. The HI is computed as a weighted sum of health-related features, with feature weights derived from SHAP analysis, XGBoost, Random Forest, and Correlation Analysis. PySpark enables efficient processing of large-scale health data, and individuals are classified as Low Risk or High Risk. Optimal classification thresholds are determined using the Youden Index from the ROC curve to balance sensitivity and specificity. Personalized health recommendations are generated for each risk category to guide preventive interventions. Performance evaluation shows that Correlation Analysis achieves 100% precision and 98.90% recall, outperforming the other methods. SHAP prioritizes recall but has low precision, while XGBoost and Random Forest improve precision but struggle with recall. By applying Big Data techniques with PySpark, this study improves computational efficiency, scalability, and classification accuracy, addressing limitations of prior research and providing a robust data-driven approach to personalized health monitoring.
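The two computational steps named in the abstract, a weighted-sum Health Index and a Youden Index threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than PySpark, and the function names, the normalization of importances to weights that sum to 1, the assumption that features are pre-scaled to [0, 1], and the convention that a higher HI means higher risk are all illustrative assumptions.

```python
def normalize_weights(importances):
    """Turn raw feature importances into non-negative weights that sum to 1.

    Assumption: importances come from some method (SHAP, XGBoost,
    Random Forest, or correlation); only their magnitudes are used here.
    """
    total = sum(abs(v) for v in importances.values())
    return {name: abs(v) / total for name, v in importances.items()}


def health_index(record, weights):
    """Weighted sum of one individual's features.

    Assumption: every feature in `record` is already scaled to [0, 1].
    """
    return sum(weights[name] * record[name] for name in weights)


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1.

    Assumption: a score >= threshold is predicted High Risk (label 1).
    Ties are resolved in favor of the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def risk_category(score, threshold):
    """Map a Health Index score to one of the two risk categories."""
    return "High Risk" if score >= threshold else "Low Risk"
```

In a PySpark setting the weighted sum would typically be expressed as a column expression over a DataFrame, while the threshold search could still run on a collected sample of scores and labels; that split is a design suggestion, not something stated in the abstract.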

Item Type: Article
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Depositing User: dl fts
Date Deposited: 31 Mar 2025 23:58
Last Modified: 01 Apr 2025 02:18
URI: https://dl.futuretechsci.org/id/eprint/105
