
Heart Disease Classification


Project Overview

For this project, I predicted the presence of heart disease in patients.

I performed classification using logistic regression, based on information such as a patient’s age, sex, cholesterol levels, and resting blood pressure.


Project Content

Introduction

This project aims to develop a predictive model for detecting the presence of heart disease, utilizing key factors such as age, maximum heart rate, and cholesterol levels. Beyond its significance in advancing public health and cardiovascular research, the impact of detecting heart disease extends to individuals’ well-being on a micro level. Our exploration of various factors relies on data from the UCI Heart Disease Dataset, with the goal of developing a robust model for this purpose.

Logistic regression, a foundational statistical method in binary classification, plays a pivotal role in this project. By fitting a sigmoid function to the data, it transforms a combination of features into a probability scale ranging from 0 to 1. Typically, a threshold of 0.5 is set to predict whether an observation indicates the presence (1) or absence (0) of heart disease. The relative simplicity and interpretability of logistic regression make it a powerful and reliable tool for binary classification tasks.
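Concretely, for predictors $x_1, \ldots, x_k$ with fitted coefficients $\beta_1, \ldots, \beta_k$ and intercept $\beta_0$, the model estimates

$$P(\text{heart disease} \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}},$$

and an observation is classified as positive (1) when this probability exceeds the 0.5 threshold.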

Data

The data was split into training and testing datasets with 643 and 275 observations respectively by the Hackathon facilitators. Here is a look at the variables:

Heart Disease Variables

From this, we observe that there is an undocumented Unnamed: 0 variable that contains 0 in all observations. We disregard this variable in our analysis.
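As a brief sketch of this setup in Python, with heart_train.csv and heart_test.csv as stand-ins for the actual Hackathon file names:

```python
import pandas as pd

# Hypothetical file names; the actual Hackathon files may be named differently.
train = pd.read_csv("heart_train.csv")
test = pd.read_csv("heart_test.csv")

# Drop the undocumented, all-zero index column noted above.
train = train.drop(columns=["Unnamed: 0"])
test = test.drop(columns=["Unnamed: 0"])

print(train.shape, test.shape)  # expect 643 and 275 rows respectively
```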

Exploratory Data Analysis

Univariate Analysis

First, we explore the distribution of numerical variables.

Heart Disease Histograms

Upon visual inspection, there seem to be odd characteristics in some of the variables. For example, there is an empty bin in Age at 53. We also observe possibly unnatural peaks at 0 in Cholesterol and Oldpeak. A value of 0 for Oldpeak is plausible given that it is measured relative to RestingECG. On the other hand, we found that 0 mg/dl of cholesterol is not within a realistic range, as values below 70 mg/dl are reported as “abnormally low” even in secondary prevention settings (Tada et al., 2020). We suspect that there may have been measurement error or missing values.

However, given that 121 out of 643 observations in the training dataset have 0 for Cholesterol, it is difficult to remove these observations altogether. Furthermore, the testing dataset also contains a high proportion of observations with 0 for Cholesterol, so we note this abnormality as we go forward with our analysis.
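A quick tally of this pattern, reusing the train and test DataFrames from the sketch above:

```python
# Count the implausible zero readings in Cholesterol for both splits.
print((train["Cholesterol"] == 0).sum(), "of", len(train))  # 121 of 643 in training
print((test["Cholesterol"] == 0).sum(), "of", len(test))
```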

We also observe some outliers in these histograms, so we identify them using z-scores with a threshold of 3. This flags 13 outliers, which we remove from the analysis, reducing the training dataset from 643 to 630 observations.
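A minimal sketch of this filtering step, assuming the numeric columns listed below are the ones screened:

```python
import numpy as np
from scipy import stats

# Numeric columns screened for outliers (assumed from the dataset's variables).
num_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

# Keep only rows whose z-score stays within 3 in every numeric column.
z = np.abs(stats.zscore(train[num_cols]))
train = train[(z < 3).all(axis=1)]
print(len(train))  # 630 after dropping the 13 flagged rows
```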

Next, we explore the distribution of categorical variables.

Heart Disease Histograms

From the plots, we can see that the patients with heart disease have differing proportions in each categorical variable. This suggests that we can include these variables in the logistic regression model.

We also note that there are more males than females in the training dataset. There are also more people with heart disease than not, which raises slight concerns about the representativeness of this dataset.

Multivariate Analysis

We explore the correlations between the numerical predictor variables, using a scatter matrix and a correlation heatmap.

Heart Disease Scatter Matrix

Heart Disease Correlation Heatmap
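Plots along these lines can be produced with pandas and seaborn, for example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise scatter plots of the numeric predictors.
pd.plotting.scatter_matrix(train[num_cols], figsize=(10, 10))
plt.show()

# Correlation heatmap over the same columns.
sns.heatmap(train[num_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```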

In general, we observe little correlation between most of the predictors. However, we may be concerned about the negative correlation between Age and MaxHR.

Next, we check for multicollinearity using Variance Inflation Factors (VIF). We use a cutoff of 10 to determine whether the factor should be removed from our logistic regression model.
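One common way to compute these VIFs, sketched with statsmodels (the exact values depend on details such as whether an intercept column is included):

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Compute a VIF for each numeric predictor, with an intercept column added.
X = sm.add_constant(train[num_cols])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs)  # drop the predictor with the largest VIF above 10, then recompute
```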

Heart Disease VIF1

We see that RestingBP has a VIF of nearly 42, so we remove it and check again.

Heart Disease VIF2

We see that Age and MaxHR both have VIF around 13. We have a choice of which variable to remove, though both are associated with heart disease. We choose to remove MaxHR because of its slightly higher VIF and check for multicollinearity again.

Heart Disease VIF3

We see that all numerical variables now have VIF under 5. This significant decrease in the VIF for Age is in line with our earlier concern about the correlation between Age and MaxHR.

Finally, we encode the qualitative features before fitting a logistic regression model.
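A sketch of this encoding step with pandas (FastingBS is already a 0/1 indicator and is left as-is):

```python
# One-hot encode the string-valued predictors; drop_first avoids redundant dummies.
cat_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
train_enc = pd.get_dummies(train, columns=cat_cols, drop_first=True)
test_enc = pd.get_dummies(test, columns=cat_cols, drop_first=True)

# Align columns in case a category level appears in only one of the two splits.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)
```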

Logistic Regression

Naive Approach

Using the insights from our exploratory data analysis, we will fit a logistic regression model using the 630 observations in the training dataset with Age, Sex, ChestPainType, Cholesterol, FastingBS, RestingECG, ExerciseAngina, Oldpeak, and ST_Slope as predictors. We get a model with the following coefficients:

Heart Disease Logistic Regression Coefficients

The practical interpretation of these coefficients differs from that of linear regression: each coefficient represents the change in the log-odds of heart disease for a one-unit change in the corresponding predictor, holding the other variables constant. For example, a unit increase in Age is associated with a 0.028 increase in the log-odds of heart disease. The prediction accuracy of this model on the testing dataset is 0.8335081.
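A sketch of the fit-and-evaluate step with scikit-learn, assuming the target column is named HeartDisease and reusing the encoded frames from above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Keep all encoded predictors except the target and the two columns dropped
# for multicollinearity (RestingBP and MaxHR).
drop_cols = ["HeartDisease", "RestingBP", "MaxHR"]
X_train, y_train = train_enc.drop(columns=drop_cols), train_enc["HeartDisease"]
X_test, y_test = test_enc.drop(columns=drop_cols), test_enc["HeartDisease"]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each coefficient is a change in log-odds: e.g. exp(0.028) ≈ 1.03, so each
# additional year of Age multiplies the odds of heart disease by about 1.03.
print(accuracy_score(y_test, model.predict(X_test)))
```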

Alternative Approach I

We stated in the exploratory data analysis that the Cholesterol variable had many observations at 0, which is highly unlikely given the literature. However, it is difficult to remove these observations, as we would lose a high proportion of the training dataset. Moreover, the testing dataset also contains many observations with Cholesterol at 0, so we cannot ignore this problem altogether. Finally, we are uncertain about removing the Cholesterol variable given the literature about its link to heart disease.

As such, we utilize two separate logistic regression models to mitigate the effects of these observations. We separate the training and testing data into two groups each: Cholesterol = 0 or Cholesterol > 0.

The first logistic regression model is trained with all observations in the training data, but tested with only testing data with Cholesterol = 0. This model does not use Cholesterol as a predictor. In doing so, we are able to use all observations in the training data to fit a model, mitigating the effect of removing Cholesterol as a predictor.

The second logistic regression model is trained on observations in the training data with Cholesterol > 0 and also tested on observations in the testing data with Cholesterol > 0. This allows us to keep Cholesterol as a predictor in the model for observations that presumably did not have measurement error. The coefficient of Cholesterol in this model was 0.001, which indicates that it is not as strongly associated with heart disease occurrence as previously believed.
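A sketch of this two-model scheme, reusing the names from the previous sketches:

```python
# Model A: all training rows, Cholesterol excluded; used for zero-cholesterol test rows.
model_a = LogisticRegression(max_iter=1000).fit(
    X_train.drop(columns=["Cholesterol"]), y_train
)

# Model B: only rows with Cholesterol > 0, keeping Cholesterol as a predictor.
pos = train_enc["Cholesterol"] > 0
model_b = LogisticRegression(max_iter=1000).fit(X_train[pos], y_train[pos])

# Route each test row to the appropriate model and merge the predictions.
zero_test = test_enc["Cholesterol"] == 0
pred = pd.Series(0, index=test_enc.index)
pred[zero_test] = model_a.predict(X_test.drop(columns=["Cholesterol"])[zero_test])
pred[~zero_test] = model_b.predict(X_test[~zero_test])
print(accuracy_score(y_test, pred))
```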

After merging the predictions of the two models, the prediction accuracy on the testing dataset is 0.83570705, a marginal improvement over the naive approach.

Alternative Approach II

Given that Cholesterol is not strongly associated with heart disease once the other predictors are accounted for, we try removing Cholesterol as a predictor and fitting a logistic regression model on all observations in the training dataset. This allows us to fit a model with the largest possible number of observations that we have with our data. The prediction accuracy of this model on the testing dataset is 0.8290626.

Alternative Approach III

Finally, we observe that observations in the training dataset with Cholesterol = 0 have heart disease at a proportion of 0.89. Depending on the application of this model, overdiagnosing heart disease may be more beneficial than underdiagnosing it. This is especially true when preliminarily screening individuals for heart disease before extensive checks to confirm a diagnosis.

Thus, we label all observations in the testing dataset with Cholesterol = 0 as having heart disease, then fit a logistic regression model on the remaining observations, identical to the second model in Alternative Approach I. The prediction accuracy of this model on the testing dataset is 0.84085325, the highest of all the models so far.
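Reusing model_b and the masks from the Alternative Approach I sketch, this amounts to:

```python
# Zero-cholesterol test rows are labeled positive outright; model B from
# Alternative Approach I covers the remaining rows.
pred = pd.Series(0, index=test_enc.index)
pred[zero_test] = 1
pred[~zero_test] = model_b.predict(X_test[~zero_test])
print(accuracy_score(y_test, pred))
```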

Conclusion

We conducted a thorough exploration of various combinations of observations and predictors to create a predictive model for identifying the presence of heart disease. The complexity of selecting the right features and instances underscored the significance of both exploratory data analysis and domain knowledge in ensuring thoughtful data preprocessing and informed feature selection.

A notable limitation surfaced due to the relatively modest sample size of our training and testing datasets. This challenge was further compounded by the presence of problematic observations, either as outliers or as values inconsistent with the literature. Recognizing these constraints, it becomes evident that the success of machine learning models is intricately tied to the quality and integrity of the underlying data.

In future studies, it would be beneficial to incorporate additional data sources and try different machine learning models.

References

  1. Aha, D. W. (n.d.). Heart Disease Data. UCI Machine Learning Repository. Link
  2. Tada, H., Usui, S., Sakata, K., Takamura, M., & Kawashiri, M. (2020). Low-Density Lipoprotein Cholesterol Level cannot be too Low: Considerations from Clinical Trials, Human Genetics, and Biology. Journal of Atherosclerosis and Thrombosis, 27(6), 489–498. DOI