Skip to the content.

Analysis of Homocysteine as a Biomarker for Cardiovascular Disease Among Non-Smokers

By Amber Campbell

Introduction

General Introduction

Cardiovascular disease has been the leading cause of death for both men and women in the United States since 1950. There are many known risk factors for cardiovascular disease including high blood pressure, high cholesterol, and smoking. Homocysteine, a naturally-occuring amino acid in the body, is often used as a biomarker for cardiovascular disease. While homocysteine is broken down by folate, Vitamin B12, and Vitamin B6 to generate chemicals the body needs, high levels of homocysteine can damange artery linings in the body and blood clots, significantly increasing the risk of cardiovascular disease. Research has long supported the notion of homocysteine as a biomarker for cardiovascular disease, and recent advocates support the use of its monitoring and measurement as a guide for disease prevention(PubMed Abstract, Cleveland Clinic).

The dataset I am using comes from the National Health and Nutrition Examination Survey (NHANES) 2005-2006 cohort study, in which demographic, socioeconomic, dietary, and health-related questions are collected from a nationally representative sample. Because smoking is such a well-identified risk factor for cardiovascular disease, the qustion I am intending to focus on is how demographic and health factors relate to a person’s homocysteine levels, including diet quality, BMI, sex, race, blood pressure, and total cholesterol. Using such information, I created several prediction models to to predict a person’s homocysteine level vased on these metrics. Such a prediction model can be used as an investigative tool for the risks of cardiovascular disease, with homocysteine serving as a proxy measure of an individual’s risk for cardiovascular disease.

Introduction of Columns

This dataset was prefiltered in attempts to only include non-smokers by excluding any persons who had smoked over 100 cigarettes in their lifetime or had smoked within 5 days of taking the survey. There are 2033 rows in this dataset, representing health and demographic information for 2033 participants of the NHANES survey. The dataset contains the following relevant columns:

Column Description
id (SEQN) Participants sequence number in NHANES study
homocysteine (LBXHCY) Homocysteine measurement (umol/L)
cotinine (LBXCOT) Cotinine measurement - biomarker for tobacco smoke exposure (ng/mL)
sex (RIAGENDR) Sex (Male, Female)
age (RIDAGEYR) Age (years)
race (RIDRETH1) Race
education_level (DMDEDUC2) Highest Education Achieved (No/Some/All High School, Some/All College)
bmi (BMXBMI) BMI (kg/b^2^)
diet_level (DBQ700) Self-desribed Diet Quality (Poor, Fair, Good, Very Good, Excellent)
WkAlcDays Average number of alcoholic drinks per week
diabetes_status (DIQ010) Diabetes Status (Borderline, Yes, No)
systolic_bp (BPXSY1) Systolic Blood Pressure (mmHg)
cholesterol (LBXTC) Total Cholesterol (mg/dL)

Data Cleaning and Exploratory Analysis

Data Cleaning

  1. I started by renaming the columns (indicated in parentheses in the previous table) to be more easily-comprehendable.
  2. I then verified that each row in the dataset had a unique ID, and thus no duplicate information about each participant from the study. I then checked relevant columns homocysteine, age, bmi, systolic_bp, and cholesterol for values of 0, which are likely indicators of missing values, because they are all measurements that should have values above 0. While none of these columns had values of 0, I noticed some extreme values for BMI (such as 130). While acknowledging that BMI is a controversial and more recently viewed as a misused or inaccurate measurement, it is highly unlikely for BMI values to be above 50, and I those chose to remove any rows from the dataset that contained BMI measurements over 54 after looking into this more (CDC, Cleveland Clinic).
  3. I then inspected the categorial columns, making sure that the unique values in each column were comprehendable. The diabetes_status column had a range of values including ‘Yes’, ‘No’, ‘Borderline’, and ‘Borderlinederline’. I corrected any instances of ‘Borderlinederline’ to match ‘Borderline’.

My resulting dataframe contained 1099 rows. The first few rows of this cleaned dataset are shown below.

id homocysteine cotinine sex age race education_level bmi diet_level diabetes_status systolic_bp cholesterol WkAlcDays
31131 9.33 0.035 Female 44 Black SomeCollege 30.9 Good No 144 105 0
31132 8.96 0.021 Male 70 White College 24.74 VeryGood Yes 138 147 4
31134 8.2 0.065 Male 73 White HighSchool 30.63 Good No 130 186 2
31144 7.97 0.125 Male 21 OtherHispanic HighSchool 25.03 Excellent No 116 207 0
31149 10.29 0.011 Female 85 White SomeHighSchool 21.63 Good No 110 121 nan

Univariate Analysis

This plot shows the distribution of homocysteine across participants in the NHANES dataset. The distribution appears to be relatively normal with a slight right skew. This suggests that the data is well-behaved and that the average homocysteine level measurement ins around 6.5 ug/mol.

Bivariate Analysis

This plot shows the distribution of homocysteine by racial group. This plot suggests that there is a difference in distribution between diffrent racial groups, with Black and white racial group distributions appearing to have higher median homocysteine measurement than other groups.

Interesting Aggregates

Pivot table using diabetes status

diabetes_status bmi cholesterol cotinine homocysteine systolic_bp
Borderline 33.4398 200.412 0.262382 8.30324 133.549
No 28.6295 201.569 0.162658 7.64186 122.485
Yes 31.8165 192.099 0.120988 9.17927 132.996

Missing Value Imputation

Imputed Using mean for numerical columns and most common value for categorical columns.

There were 21 missing BMI measurements

There were 220 missing systolic blood pressure measurements

There were 13 missing total cholesterol measurements

There were 11 missing cotinine measurements

There were 2 missing diabetes status values

There were 4 missing diet quality values

Framing a Prediction Problem

Mmy prediction problem will be predicting a person’s homocysteine levels. This is a regression problem, with homocysteine as the prediction variable. Such a prediction model can be used as an investigative tool for the risks of cardiovascular disease, with homocysteine serving as a proxy measure of an individual’s risk for cardiovascular disease.

I chose MSE as my evaluation metric because it tends to provides an easily understandable ,measure of how similar the prediction levels are to the actual homocysteine levels, and it penalizes larger errors more heavily than smaller errors. All of the information presented in the dataset would be known at time of prediction, because they are all health measurements or demographic identifications that are independent of one another.

Baseline Model

For my baseline model, I created a linear regression model that include 2 nominal features, diabetes_status and race. These features were one-hot-encoded.

I added diabetes_status because insulin resistence is associated with increased homocysteine levels.

I also incorporated race because differences in genetic background, dietary patterns, and access to healthcare across racial groups may influence homocysteine levels.

I originally indentified these two features in my exploratory analyses and bivariate analyses, as I noted that there were distinct differences between the average homocysteine level for participants of different racial groups as well as participants with different diabetes statuses.

pipeline of baseline model

Performance of Baseline Model:

The baseline model achieved an MSE of 8.597 and an r-squared value of 0.07145

Final Model

For my final model, I experimented with different features and models before landing on a ridge regression model with 8 features.

I one-hot-encoded the categorical variables, and chose to use StandardScaler and apply polynomialFeatures in order attepts to capture any non-linear relationships there might be between homocysteine and the numerical features, while also noting that ridge regression is sensitive to scale, and applying polynomial transformations before scaling could have amplified differences in scale.

pipeline of final model

I used GridSearchCV to find the best parameters for the Ridge Regression Model. These were:

Performance of Final Model:

The final model achieved an MSE of 5.428 and an r-squared value of 0.413 While these measurements might not indicate an objectively “good” model, it is a signficant increase in accuracy from the baseline model.