Analysis of Homocysteine as a Biomarker for Cardiovascular Disease Among Non-Smokers

By Amber Campbell

Introduction

General Introduction

Cardiovascular disease has been the leading cause of death for both men and women in the United States since 1950. There are many known risk factors for cardiovascular disease including high blood pressure, high cholesterol, and smoking. Homocysteine, a naturally-occuring amino acid in the body, is often used as a biomarker for cardiovascular disease. While homocysteine is broken down by folate, Vitamin B12, and Vitamin B6 to generate chemicals the body needs, high levels of homocysteine can damange artery linings in the body and blood clots, significantly increasing the risk of cardiovascular disease. Research has long supported the notion of homocysteine as a biomarker for cardiovascular disease, and recent advocates support the use of its monitoring and measurement as a guide for disease prevention(PubMed Abstract, Cleveland Clinic).

The dataset I am using comes from the National Health and Nutrition Examination Survey (NHANES) 2005-2006 cohort study, in which demographic, socioeconomic, dietary, and health-related questions are collected from a nationally representative sample. Because smoking is such a well-identified risk factor for cardiovascular disease, the qustion I am intending to focus on is how demographic and health factors relate to a person’s homocysteine levels, including diet quality, BMI, sex, race, blood pressure, and total cholesterol. Using such information, I created several prediction models to to predict a person’s homocysteine level vased on these metrics. Such a prediction model can be used as an investigative tool for the risks of cardiovascular disease, with homocysteine serving as a proxy measure of an individual’s risk for cardiovascular disease.

Introduction of Columns

This dataset was prefiltered in attempts to only include non-smokers by excluding any persons who had smoked over 100 cigarettes in their lifetime or had smoked within 5 days of taking the survey. There are 2033 rows in this dataset, representing health and demographic information for 2033 participants of the NHANES survey. The dataset contains the following relevant columns:

Column	Description
`id` (SEQN)	Participants sequence number in NHANES study
`homocysteine` (LBXHCY)	Homocysteine measurement (umol/L)
`cotinine` (LBXCOT)	Cotinine measurement - biomarker for tobacco smoke exposure (ng/mL)
`sex` (RIAGENDR)	Sex (Male, Female)
`age` (RIDAGEYR)	Age (years)
`race` (RIDRETH1)	Race
`education_level` (DMDEDUC2)	Highest Education Achieved (No/Some/All High School, Some/All College)
`bmi` (BMXBMI)	BMI (kg/b^2^)
`diet_level` (DBQ700)	Self-desribed Diet Quality (Poor, Fair, Good, Very Good, Excellent)
`WkAlcDays`	Average number of alcoholic drinks per week
`diabetes_status` (DIQ010)	Diabetes Status (Borderline, Yes, No)
`systolic_bp` (BPXSY1)	Systolic Blood Pressure (mmHg)
`cholesterol` (LBXTC)	Total Cholesterol (mg/dL)

Data Cleaning and Exploratory Analysis

Data Cleaning

I started by renaming the columns (indicated in parentheses in the previous table) to be more easily-comprehendable.
I then verified that each row in the dataset had a unique ID, and thus no duplicate information about each participant from the study. I then checked relevant columns homocysteine, age, bmi, systolic_bp, and cholesterol for values of 0, which are likely indicators of missing values, because they are all measurements that should have values above 0. While none of these columns had values of 0, I noticed some extreme values for BMI (such as 130). While acknowledging that BMI is a controversial and more recently viewed as a misused or inaccurate measurement, it is highly unlikely for BMI values to be above 50, and I those chose to remove any rows from the dataset that contained BMI measurements over 54 after looking into this more (CDC, Cleveland Clinic).
I then inspected the categorial columns, making sure that the unique values in each column were comprehendable. The diabetes_status column had a range of values including ‘Yes’, ‘No’, ‘Borderline’, and ‘Borderlinederline’. I corrected any instances of ‘Borderlinederline’ to match ‘Borderline’.

My resulting dataframe contained 1099 rows. The first few rows of this cleaned dataset are shown below.

id	homocysteine	cotinine	sex	age	race	education_level	bmi	diet_level	diabetes_status	systolic_bp	cholesterol	WkAlcDays
31131	9.33	0.035	Female	44	Black	SomeCollege	30.9	Good	No	144	105	0
31132	8.96	0.021	Male	70	White	College	24.74	VeryGood	Yes	138	147	4
31134	8.2	0.065	Male	73	White	HighSchool	30.63	Good	No	130	186	2
31144	7.97	0.125	Male	21	OtherHispanic	HighSchool	25.03	Excellent	No	116	207	0
31149	10.29	0.011	Female	85	White	SomeHighSchool	21.63	Good	No	110	121	nan

Univariate Analysis

This plot shows the distribution of homocysteine across participants in the NHANES dataset. The distribution appears to be relatively normal with a slight right skew. This suggests that the data is well-behaved and that the average homocysteine level measurement ins around 6.5 ug/mol.

Bivariate Analysis

This plot shows the distribution of homocysteine by racial group. This plot suggests that there is a difference in distribution between diffrent racial groups, with Black and white racial group distributions appearing to have higher median homocysteine measurement than other groups.

Interesting Aggregates

Pivot table using diabetes status

diabetes_status	bmi	cholesterol	cotinine	homocysteine	systolic_bp
Borderline	33.4398	200.412	0.262382	8.30324	133.549
No	28.6295	201.569	0.162658	7.64186	122.485
Yes	31.8165	192.099	0.120988	9.17927	132.996

Missing Value Imputation

Imputed Using mean for numerical columns and most common value for categorical columns.

There were 21 missing BMI measurements

There were 220 missing systolic blood pressure measurements

There were 13 missing total cholesterol measurements

There were 11 missing cotinine measurements

There were 2 missing diabetes status values

There were 4 missing diet quality values

Framing a Prediction Problem

Mmy prediction problem will be predicting a person’s homocysteine levels. This is a regression problem, with homocysteine as the prediction variable. Such a prediction model can be used as an investigative tool for the risks of cardiovascular disease, with homocysteine serving as a proxy measure of an individual’s risk for cardiovascular disease.

I chose MSE as my evaluation metric because it tends to provides an easily understandable ,measure of how similar the prediction levels are to the actual homocysteine levels, and it penalizes larger errors more heavily than smaller errors. All of the information presented in the dataset would be known at time of prediction, because they are all health measurements or demographic identifications that are independent of one another.

Baseline Model

For my baseline model, I created a linear regression model that include 2 nominal features, diabetes_status and race. These features were one-hot-encoded.

I added diabetes_status because insulin resistence is associated with increased homocysteine levels.

I also incorporated race because differences in genetic background, dietary patterns, and access to healthcare across racial groups may influence homocysteine levels.

I originally indentified these two features in my exploratory analyses and bivariate analyses, as I noted that there were distinct differences between the average homocysteine level for participants of different racial groups as well as participants with different diabetes statuses.

pipeline of baseline model

Performance of Baseline Model:

The baseline model achieved an MSE of 8.597 and an r-squared value of 0.07145

Final Model

For my final model, I experimented with different features and models before landing on a ridge regression model with 8 features.

There were 4 were numerical features: age , bmi, systolic_bp, cholesterol
There were 3 were nominal features: sex, diabetes_status, race
There was 1 ordinal feature: diet_level

I one-hot-encoded the categorical variables, and chose to use StandardScaler and apply polynomialFeatures in order attepts to capture any non-linear relationships there might be between homocysteine and the numerical features, while also noting that ridge regression is sensitive to scale, and applying polynomial transformations before scaling could have amplified differences in scale.

I added age because I hypothesized that the older a person is, the less likely their body is to properly metabolize and break down amino acids like homocysteine when needed, so it would make sense that older people would have higher homocysteine levels.
I added bmi because body weight could affect the way a person metabolizes nutrients and thus, vitamins like B12 and B6.
I added systolic_bpand cholesterol because they are both associated with cardiovascular health issues, so including them in this model could correlate with higher homocysteine values
I added sex because there are physiological differences between males and females that could impact the rate of homocysteine breakdown.
I added diet_level because the quality of one’s diet could impact the nutrients and vitamins they are getting, and could impact the amount of folate, Vitamin B6, and Vitamin B12, affecting the subsequent amount of homocysteine in the body. (I.e. if you have a high amount of folate, your body will need to break down more homocysteine, so you will have lower homocysteine levels)

pipeline of final model

I used GridSearchCV to find the best parameters for the Ridge Regression Model. These were:

polynomial feature degree: 2
ridge alpha: 10

Performance of Final Model:

The final model achieved an MSE of 5.428 and an r-squared value of 0.413 While these measurements might not indicate an objectively “good” model, it is a signficant increase in accuracy from the baseline model.