This is an exploratory data analysis (EDA) project that I completed as my capstone for the Hack the Hood Fall 2023 Build Data Science bootcamp.
- I chose to analyze the Animal Conditions dataset by Grace Hephzibah M. and William Oliveira Gibin, available on Kaggle at this link.
- This dataset stood out to me during my search because I thought it would be interesting to find the symptoms, or combinations of symptoms, that make a condition dangerous as opposed to non-dangerous, although that question falls more in the domain of machine learning.
- I also dreamed of becoming a vet when I was in grade school, and I think this project is one way I can get closer to fulfilling that dream.
I sought to answer the following questions about the dataset:
- What kinds of animals are present in the dataset and in what quantities?
- What is the proportion of domestic animals to wild animals in the dataset? I suspect that there will be more observations on domestic animals than on wild animals and would like to test this hypothesis.
- What are the different symptoms and in what quantities do they occur?
- What symptom was most prevalent across the entire dataset?
- What was the rarest symptom across the entire dataset?
- What percentage of animals had their symptoms marked as dangerous?
- Which symptom was most prevalent in cats?
- Which symptoms appear more often in non-dangerous cases than in dangerous cases, and of those symptoms, which one has the highest ratio of non-dangerous to dangerous occurrences? This will help identify whether any symptoms are generally not dangerous.
- What percentage of animals died?
- A) Train a machine learning model to classify the condition of the animals as dangerous or not dangerous based on their symptoms.
- B) Visualize the performance of the trained machine learning model.
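
As a rough sketch of how a few of the questions above could be approached with pandas (not necessarily how the notebook does it), the example below counts symptoms and compares non-dangerous to dangerous occurrences on a small made-up table. The column names `AnimalName`, `symptom1`, `symptom2`, and `Dangerous`, and every value in the table, are assumptions for illustration only and are not taken from the actual dataset.

```python
import pandas as pd

# Toy data for illustration only; column names and values are assumptions,
# not rows from the real dataset.
df = pd.DataFrame({
    "AnimalName": ["cat", "cat", "dog", "dog", "cow", "cat"],
    "symptom1": ["fever", "cough", "fever", "cough", "cough", "sneezing"],
    "symptom2": ["vomiting", "sneezing", "lethargy", "lethargy", "fever", "lethargy"],
    "Dangerous": ["Yes", "No", "Yes", "No", "Yes", "No"],
})

symptom_cols = ["symptom1", "symptom2"]

# Reshape to one row per (animal, symptom) so a symptom is counted
# no matter which column it was recorded in.
long_df = df.melt(
    id_vars=["AnimalName", "Dangerous"],
    value_vars=symptom_cols,
    value_name="symptom",
).dropna(subset=["symptom"])

# How often each symptom occurs across the whole table.
symptom_counts = long_df["symptom"].value_counts()

# Occurrences of each symptom, split by the Dangerous label.
by_danger = pd.crosstab(long_df["symptom"], long_df["Dangerous"])

# Symptoms seen more often in non-dangerous cases, with their
# non-dangerous to dangerous ratio (inf if never seen in a dangerous case).
safer = by_danger[by_danger["No"] > by_danger["Yes"]]
ratio = safer["No"] / safer["Yes"]

# Percentage of observations marked as dangerous.
pct_dangerous = (df["Dangerous"] == "Yes").mean() * 100

print(symptom_counts)
print(ratio.sort_values(ascending=False))
print(pct_dangerous)
```

Reshaping to a long format first keeps the counting logic independent of how many symptom columns there are, so the same steps should carry over if the real data has more of them.
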
A Google Colab notebook for the project is available at this link.