We use a binary GradientBoostingClassifier for our analysis. We selected this method for its robustness across heterogeneous data types and its effectiveness in binary classification tasks. Gradient boosting improves predictions by fitting decision trees sequentially, with each new tree correcting the residual errors of its predecessors, which makes it well suited to complex datasets with intricate patterns.
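The snippet below is a minimal sketch of this setup. The hyperparameter values (n_estimators, learning_rate, max_depth) are illustrative assumptions, not the tuned values from our submissions.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def fit_model(X, y):
    """Fit a binary gradient boosting classifier and report holdout accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = GradientBoostingClassifier(
        n_estimators=200,   # number of sequential trees (assumed value)
        learning_rate=0.1,  # shrinkage applied to each tree's contribution
        max_depth=3,        # shallow trees keep each learner weak
        random_state=42,
    )
    model.fit(X_train, y_train)
    print(f"Holdout accuracy: {model.score(X_test, y_test):.3f}")
    return model
```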
Our model incorporates 75 manually selected features, chosen on the basis of established fertility theories so that each contributes meaningfully to understanding and predicting fertility behaviors. The selection process involved a rigorous review of the literature and existing studies, ensuring that the variables are both theoretically sound and empirically validated. Of these features, 38 are dummy variables created from 4 categorical variables in the dataset; the dummies capture the nuances of the original categorical variables and give a more detailed representation of the data.
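The dummy-variable step can be expressed as a one-hot encoding, sketched below. The column names here are hypothetical placeholders; the original text does not name the 4 categorical variables.

```python
import pandas as pd

# Hypothetical names standing in for the 4 categorical variables.
CATEGORICAL_COLS = ["partnership_status", "education_level",
                    "urbanicity", "religiosity"]

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categorical variables into dummy columns."""
    return pd.get_dummies(df, columns=CATEGORICAL_COLS, dummy_na=False)
```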
A significant challenge during preprocessing was the prevalence of missing values in the dataset. Missing data can obscure the true relationships between variables and bias the model's outcomes. To address this, we imputed missing values using techniques chosen to preserve the underlying distributions and relationships in the data as far as possible, allowing for a more accurate and reliable model.
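As a sketch of one such approach, assuming median imputation for numeric columns and most-frequent imputation for the rest (the original text does not specify the exact strategy):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Impute numeric columns with the median, other columns with the mode."""
    numeric_cols = df.select_dtypes(include=np.number).columns
    other_cols = df.columns.difference(numeric_cols)
    out = df.copy()
    if len(numeric_cols):
        out[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])
    if len(other_cols):
        out[other_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[other_cols])
    return out
```

Simple per-column imputation keeps the marginal distributions of the observed data roughly intact, which is the property the paragraph above emphasizes.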
Given the nature of our feature selection and the robustness of the GradientBoostingClassifier, we anticipate that our model will perform better on larger, more complete datasets (e.g., large register data) than on smaller, survey-based datasets with a higher incidence of missing values. Larger datasets provide a richer set of information, allowing the model to learn and generalize effectively without being heavily biased by the absence of key information.
Throughout model development, we iteratively refined our approach based on insights from previous submissions. Each iteration aimed to improve predictive accuracy and adapt to the nuances of the evolving data features. This iterative process was invaluable for fine-tuning our feature selection and preprocessing strategies, ultimately yielding a more robust model that we believe aligns well with the objectives of this challenge.