Abioye, Oluwasegun Abiodun and Irhebhude, Martins Ekata (2025) Big Data-Driven Health Risk Stratification: A Health Index-Based Approach Using Feature Importance and PySpark. Journal of Computing Theories and Applications, 2 (4). pp. 456-469. ISSN 3024-9104
12327-Article Text-44309-1-10-20250324.pdf - Published Version
Available under License Creative Commons Attribution.
Abstract
Health risk stratification is crucial for preventive healthcare, yet existing models often rely on binary classification or generalized disease prediction, neglecting personalized health indicators and graded risk levels. Many studies apply feature selection techniques such as Relief and Univariate Selection without quantifying the weighted impact of each feature. To address these gaps, this study introduces a Big Data-driven Health Index (HI) framework that uses PySpark for scalable health risk stratification. The HI is computed as a weighted sum of health-related features, with the weights derived from SHAP Analysis, XGBoost, Random Forest, and Correlation Analysis. PySpark enables efficient processing of large-scale health data, and individuals are classified as Low Risk or High Risk. Optimal classification thresholds are determined using the Youden Index from the ROC curve to balance sensitivity and specificity. Personalized health recommendations are generated based on risk category to guide preventive interventions. Performance evaluation shows that Correlation Analysis achieves 100% precision and 98.90% recall, outperforming the other methods. SHAP prioritizes recall but has low precision, while XGBoost and Random Forest improve precision but struggle with recall. By leveraging Big Data techniques with PySpark, this study enhances computational efficiency, scalability, and classification accuracy, addressing the limitations of prior research and providing a robust data-driven approach to personalized health monitoring.
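The two core steps named in the abstract, a weighted-sum Health Index and a Youden Index threshold, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python rather than PySpark, and the feature names, weights, records, and the convention that a higher HI means higher risk are all assumptions made for illustration.

```python
# Illustrative sketch only: feature names, weights, and records are hypothetical,
# and "higher HI = higher risk" is an assumed convention, not taken from the paper.

def health_index(features, weights):
    """Weighted sum of feature values; weights are normalized to sum to 1."""
    total = sum(weights.values())
    return sum(features[name] * w / total for name, w in weights.items())


def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A record is predicted High Risk (label 1) when its score >= threshold.
    Assumes both classes are present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):  # every observed score is a candidate cut-off
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical importance weights, standing in for the output of a
# feature-importance method such as those listed in the abstract.
weights = {"bmi": 0.5, "blood_pressure": 0.3, "glucose": 0.2}

# Hypothetical records with features already scaled to [0, 1]; label 1 = High Risk.
records = [
    ({"bmi": 0.2, "blood_pressure": 0.1, "glucose": 0.3}, 0),
    ({"bmi": 0.3, "blood_pressure": 0.4, "glucose": 0.2}, 0),
    ({"bmi": 0.7, "blood_pressure": 0.6, "glucose": 0.8}, 1),
    ({"bmi": 0.9, "blood_pressure": 0.8, "glucose": 0.7}, 1),
]

scores = [health_index(f, weights) for f, _ in records]
labels = [y for _, y in records]
threshold, j_stat = youden_threshold(scores, labels)
risk = ["High" if s >= threshold else "Low" for s in scores]
```

On this toy data the two classes are perfectly separable, so the selected threshold is the lowest score among the High Risk records. On real data the maximum of J generally falls below 1 and reflects a trade-off between sensitivity and specificity.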
Item Type: | Article |
---|---|
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Depositing User: | dl fts |
Date Deposited: | 31 Mar 2025 23:58 |
Last Modified: | 01 Apr 2025 02:18 |
URI: | https://dl.futuretechsci.org/id/eprint/105 |