This README provides an overview of a report on the data analysis of COVID-19. In future iterations of this project, the focus will shift toward examining cultural and personality traits within specific age groups that may make those groups more susceptible to contracting COVID-19.
This report outlines the comprehensive steps taken in the data exploration and mining process for Nexoid’s medical dataset, focusing on COVID-19 risk factors and infection statuses. The objective was to analyze, clean, and prepare the dataset to ensure reliable insights in subsequent analysis phases.
Each section in this report addresses one of the following tasks:
- Examine and Correct Data Types: Identify initial data types and correct mismatches for consistency.
- Prepare Data: Address skewness, missing values, and other quality issues using cleaning and transformation methods.
- Data Mining Task and Feature Selection: Explore relationships between variables, perform feature selection, and determine suitable data mining tasks.
Initial analysis of data types was conducted using `df.dtypes`. Key issues identified included incorrect formats for date fields and categorical variables, which could impede analysis.
- Convert Dates to DateTime: The `survey_date` column was converted from string to `datetime64[ns]` to enable time-based operations.
- Categorical Data: Columns like `gender`, `region`, and `smoking` were converted to categorical data types to improve performance and ensure consistency.
- Counts to Integers: Variables like `contacts_count` were converted to integers to reflect their nature as whole numbers.
- Boolean Data: Binary fields such as `covid19_positive` and `heart_disease` were set to Boolean types for clarity and storage efficiency.
- Age Data: Age ranges were standardized to numerical midpoints (e.g., "20-25" became 22.5) to enhance granularity in analysis.
- Height and Weight: These columns were converted to `float64` for precision.
Data types were reviewed after conversion to ensure alignment with intended formats, resolving any issues related to mixed data types.
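The conversions listed above could look roughly like the following in pandas. This is a minimal sketch, not the report's actual code: the sample values are invented, the raw age column is assumed to be called `age`, and `age_midpoint` is a hypothetical helper.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, column names follow the report.
df = pd.DataFrame({
    "survey_date": ["2020-03-25", "2020-03-26", "2020-03-27"],
    "gender": ["male", "female", "other"],
    "region": ["EU", "NA", "AS"],
    "smoking": ["never", "quit", "current"],
    "contacts_count": [5.0, 12.0, 3.0],
    "covid19_positive": [0, 1, 0],
    "heart_disease": [0, 0, 1],
    "age": ["20-25", "30-35", "60-65"],   # assumed name of the raw age column
    "height": [170, 182, 165],
    "weight": [70, 88, 59],
})


def age_midpoint(age_range: str) -> float:
    """Hypothetical helper: map a range such as "20-25" to its midpoint (22.5)."""
    low, high = age_range.split("-")
    return (float(low) + float(high)) / 2


# Dates: string -> datetime
df["survey_date"] = pd.to_datetime(df["survey_date"])

# Low-cardinality text columns -> category
for col in ["gender", "region", "smoking"]:
    df[col] = df[col].astype("category")

# Counts -> whole numbers (assumes missing values were handled beforehand)
df["contacts_count"] = df["contacts_count"].astype("int64")

# 0/1 flags -> Boolean
for col in ["covid19_positive", "heart_disease"]:
    df[col] = df[col].astype(bool)

# Age ranges -> numeric midpoints
df["age_mean"] = df["age"].map(age_midpoint)

# Body measurements -> float64
for col in ["height", "weight"]:
    df[col] = df[col].astype("float64")

print(df.dtypes)
```

Printing `df.dtypes` again after the conversions is a quick way to confirm that each column ended up with the intended type.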
Skewness in the dataset was measured using `.skew()`, revealing extreme positive and negative skewness in variables like `public_transport_count` and `risk_mortality`. Addressing these distributions was critical to ensure robust analysis.
Highly skewed columns were visualized to identify potential anomalies. Extreme skewness in variables like `cocaine` and `public_transport_count` was noted for further transformation.
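As a rough illustration of the skewness check, the sketch below computes per-column skewness and flags columns above a cutoff. The data are invented, and the threshold of 1.0 is an assumed rule of thumb rather than a value taken from the report.

```python
import pandas as pd

# Invented sample data; only the column names come from the report.
df = pd.DataFrame({
    "public_transport_count": [0, 0, 0, 1, 0, 2, 0, 40],
    "risk_mortality": [0.05, 0.05, 0.06, 0.05, 0.07, 0.05, 0.5, 2.1],
    "height": [160, 165, 170, 172, 175, 178, 180, 185],
})

# Sample skewness for every numeric column, largest first
skewness = df.skew(numeric_only=True).sort_values(ascending=False)

# Assumed cutoff: |skew| > 1 is treated as "highly skewed"
SKEW_THRESHOLD = 1.0
highly_skewed = skewness[skewness.abs() > SKEW_THRESHOLD].index.tolist()

print(skewness)
print("Candidates for transformation:", highly_skewed)
```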
A detailed examination of missing values across columns revealed significant gaps in some variables. The following approaches were used:
- Drop Columns with >50% Missing Data: Columns like `cocaine` were removed to avoid introducing bias.
- Impute Categorical Data: Missing categorical values were replaced with "Unknown" to maintain dataset integrity.
- Numerical Data Imputation: Median values were used to replace missing numerical entries, ensuring robustness in non-normally distributed data.
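The three rules above can be sketched as follows. The frame is a made-up example, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import numpy as np
import pandas as pd

# Invented sample data; only the column names come from the report.
df = pd.DataFrame({
    "cocaine": [np.nan, np.nan, np.nan, 1.0],      # 75% missing
    "blood_type": ["ap", None, "op", "bn"],
    "contacts_count": [5.0, np.nan, 3.0, 40.0],
})

# 1. Drop columns where more than half of the values are missing
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# Assumption: everything that is not numeric is treated as categorical
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# 2. Categorical gaps -> explicit "Unknown" label
#    (for columns already cast to `category`, add the category first)
df[cat_cols] = df[cat_cols].fillna("Unknown")

# 3. Numerical gaps -> column median, which is less sensitive to outliers than the mean
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

print(df)
```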
Inconsistencies were addressed by:
- Standardizing categorical entries (e.g., consolidating smoking responses).
- Ensuring uniform formats for columns like `gender` and `blood_type`.
- Resolving issues where "NA" in the `region` column was misinterpreted as missing data.
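The "NA" problem typically arises at load time, because `pandas.read_csv` treats the string "NA" as missing by default. One possible fix, shown here on an invented in-memory CSV, is to switch off the default markers and list the real ones explicitly. The report does not say how the issue was actually resolved, so this is only a sketch.

```python
import io

import pandas as pd

# Invented CSV: "NA" is a genuine region code, the empty field is truly missing.
csv_text = "region,gender\nNA, Male\nEU,female\nAS,\n"

# Default parsing: the region code "NA" silently becomes NaN
naive = pd.read_csv(io.StringIO(csv_text))

# Keep "NA" as text; treat only empty fields as missing
fixed = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

# Uniform formatting for a categorical column: trim whitespace, lower-case
fixed["gender"] = fixed["gender"].str.strip().str.lower()

print(naive["region"].tolist())
print(fixed["region"].tolist())
```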
A cube root transformation was applied to reduce skewness, stabilizing variance and minimizing the impact of extreme values. This transformation substantially reduced skewness in highly skewed variables like `BMI` and `risk_mortality`.
Key outcomes:
- Skewness in `BMI` was reduced from 1.99 to 1.05.
- Skewness in `alcohol` dropped from 1.84 to 0.16.
- While some columns retained moderate skewness, the transformations significantly improved data symmetry.
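A minimal sketch of the cube root transformation, using synthetic right-skewed data rather than the Nexoid columns, so the printed skewness values will not match the figures reported above:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for a column such as `alcohol`
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=1000), name="alcohol")

# np.cbrt is defined for zero and negative values, unlike a log transform
transformed = np.cbrt(s)

print(f"skew before: {s.skew():.2f}")
print(f"skew after:  {transformed.skew():.2f}")
```

Comparing `.skew()` before and after, as done here, is the same check that produces the before/after pairs listed in the key outcomes.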
A Chi-Square test was conducted to assess the association between smoking status and COVID-19 positivity. The results revealed a significant relationship, with a high Chi-Square value and a low p-value (< 0.001).
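A test of this kind can be run with `scipy.stats.chi2_contingency` on a contingency table built by `pandas.crosstab`. The counts below are invented for illustration and the two smoking labels are placeholders; the real dataset presumably has more categories and different numbers.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented counts: 60 never-smokers (10 positive), 40 current smokers (20 positive)
df = pd.DataFrame({
    "smoking": ["never"] * 60 + ["current"] * 40,
    "covid19_positive": [False] * 50 + [True] * 10 + [False] * 20 + [True] * 20,
})

# Rows: smoking status, columns: infection status
contingency = pd.crosstab(df["smoking"], df["covid19_positive"])

# For 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

A small p-value indicates an association between the two variables; it does not by itself indicate the direction or strength of the effect.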
A correlation matrix highlighted key relationships:
- Positive Correlations: Strong relationships were observed between variables like `BMI` and `weight`.
- Negative Correlations: Notable negative correlations included `cluster` and `bmi_risk_infection_interaction`.
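A correlation matrix like the one described can be produced with `DataFrame.corr()`. The sketch below uses synthetic height and weight values and derives a `bmi` column from them; the lower-case column name and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data; BMI is derived so that it correlates with weight by construction
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),   # cm
    "weight": rng.normal(75, 12, 200),    # kg
})
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

# Pearson correlation between every pair of numeric columns
corr = df.corr()

# Correlation of each other variable with BMI, weakest to strongest
bmi_corr = corr["bmi"].drop("bmi").sort_values()

print(corr.round(2))
print(bmi_corr.round(2))
```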
Given the dataset's structure and objectives:
- Classification: Selected to predict COVID-19 positivity based on health, demographic, and behavioral data.
- Clustering: Explored as an alternative for grouping individuals based on feature similarity.
The following features were selected for their relevance to predicting COVID-19 status:
- Health: `BMI`, `risk_infection`, `risk_mortality`.
- Behavioral: `contacts_count`, `public_transport_count`.
- Demographic: `age_mean`.
- Derived: `bmi_risk_infection_interaction`.
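To make the selection concrete, the sketch below assembles a feature matrix and target vector for the classification task. The sample values are invented, the lower-case `bmi` column name is an assumption, and so is the definition of `bmi_risk_infection_interaction` as the product of `bmi` and `risk_infection`; the report does not state how the derived feature was computed.

```python
import pandas as pd

# Invented rows; column names follow the feature list above.
df = pd.DataFrame({
    "bmi": [22.1, 31.4, 27.8, 35.0],
    "risk_infection": [5.0, 12.0, 8.0, 20.0],
    "risk_mortality": [0.05, 0.40, 0.10, 1.20],
    "contacts_count": [3, 10, 5, 21],
    "public_transport_count": [0, 2, 0, 6],
    "age_mean": [22.5, 47.5, 32.5, 62.5],
    "covid19_positive": [False, True, False, True],
})

# Assumed definition of the derived feature: a simple product interaction
df["bmi_risk_infection_interaction"] = df["bmi"] * df["risk_infection"]

FEATURES = [
    "bmi", "risk_infection", "risk_mortality",      # health
    "contacts_count", "public_transport_count",     # behavioral
    "age_mean",                                     # demographic
    "bmi_risk_infection_interaction",               # derived
]

X = df[FEATURES]                        # feature matrix
y = df["covid19_positive"].astype(int)  # binary target: 1 = positive

print(X.shape, y.tolist())
```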
This report details the comprehensive data preparation process undertaken to enhance the Nexoid dataset. Key outcomes include improved data quality, normalized distributions, and insights into variable relationships. These efforts provide a solid foundation for advanced analysis and reliable predictions in subsequent phases.