AdvStatsCW

Coursework for Advanced Statistics

Brief

You are working with a team that has been tasked with creating a classification tool for use by a company doing chemical analysis. The company wishes to know whether it needs to measure all variables, or whether a subset of variables could achieve the same out-of-sample classification performance. Using a subset of variables would mean faster processing times, but the company would not wish to sacrifice classification performance for the sake of marginal speed gains.

Your Task

Your role in this team is to build a markdown document (using either Quarto or R Markdown) that will:

  1. Check for missing data;
  2. Check for outliers and perform exploratory data analysis;
  3. Create a training, test and validation split – with at least 15% of the observations to be in the validation set;
  4. Investigate a variety of classification approaches and recommend the optimal one for this dataset;
  5. Evaluate the performance of that approach on the validation dataset.
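As an illustration of step 3, the following is a minimal sketch of a stratified three-way split, written in Python purely for illustration (the coursework itself expects R in a .Rmd/.qmd file, where the same logic applies). The function name `stratified_split`, the 70/15/15 proportions and the seed are assumptions, not part of the brief; the only requirement taken from the brief is that validation holds at least 15% of the observations.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_pct=70, test_pct=15, seed=42):
    """Return (train_idx, test_idx, valid_idx), split within each class.

    Percentages are integers. Train and test sizes are rounded down, and
    validation receives everything left over, so it never falls below
    100 - train_pct - test_pct percent of each class.
    """
    if train_pct + test_pct > 85:
        raise ValueError("validation would hold less than 15% of the data")

    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group row indices by class label (A-E in the coursework data).
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx, valid_idx = [], [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_train = len(idx) * train_pct // 100
        n_test = len(idx) * test_pct // 100
        train_idx += idx[:n_train]
        test_idx += idx[n_train:n_train + n_test]
        valid_idx += idx[n_train + n_test:]  # remainder goes to validation
    return train_idx, test_idx, valid_idx


# Toy labels standing in for the real group column (made up, balanced).
toy_labels = [c for c in "ABCDE" for _ in range(500)]
train_idx, test_idx, valid_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx), len(valid_idx))
```

Splitting within each group keeps the class proportions similar across the three sets. The validation indices should be set aside at this point and not touched until the final evaluation in step 5.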

Your Report

Your short report (no more than 2000 words) should therefore concentrate on describing the process and how decisions were made, so that others in your team can explain it to the client.

Your Data

You will have an individual dataset from Blackboard (named STUDENTID.csv, based on your student ID) on which to build and test your model. The dataset contains 2500 observations in five different groups, labelled A-E.

Include

You should include full and appropriate exploratory data analysis and descriptive statistics to highlight what data checking, if any, should be carried out before conducting the analysis.
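The kind of data checking meant here can be sketched as follows, in Python for illustration only (the coursework itself expects R). The column names `x1`, `x2` and `group`, the strings treated as missing, and the 1.5 × IQR rule are all assumptions; the brief does not specify the dataset's columns or a particular outlier rule.

```python
import csv
import io
import statistics


def missing_counts(rows, missing=("", "NA")):
    """Count missing entries per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            is_missing = val is None or val.strip() in missing
            counts[col] = counts.get(col, 0) + int(is_missing)
    return counts


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Toy CSV text standing in for data/STUDENTID.csv (values are made up).
toy_csv = "x1,x2,group\n1.0,2.0,A\nNA,2.5,B\n1.2,,C\n"
rows = list(csv.DictReader(io.StringIO(toy_csv)))
print(missing_counts(rows))
print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

A flagged value is a prompt to look more closely, not an automatic reason to delete the observation; whichever rule is used, the report should explain the decision.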

You may select any appropriate combination of dimension reduction and classification techniques that we have covered.

You should report your in-sample training and test classification performance, then your out-of-sample validation performance, and discuss what differences, if any, are observed.
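One simple way to make that comparison is to compute the same metric on each set and look at the gap. The sketch below uses Python for illustration only (the coursework itself expects R), with overall accuracy as the metric; the helper names `accuracy` and `confusion_counts` are hypothetical, and the labels are made-up toy values, not results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Proportion of observations whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; off-diagonal pairs are misclassifications."""
    return Counter(zip(y_true, y_pred))


# Made-up labels for groups A-E, purely to show the calls.
train_true, train_pred = list("AABBCCDDEE"), list("AABBCCDDEE")
valid_true, valid_pred = list("ABCDE"), list("ABCDD")

train_acc = accuracy(train_true, train_pred)
valid_acc = accuracy(valid_true, valid_pred)
print("train accuracy:", train_acc)
print("validation accuracy:", valid_acc)
print("gap:", train_acc - valid_acc)
print(confusion_counts(valid_true, valid_pred))
```

A validation accuracy well below the training accuracy suggests overfitting, while the confusion counts show which groups are being confused with each other.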

Please structure your project so that the data is in a subdirectory called “data” and your .Rmd/.qmd file is in a subdirectory called “scripts”, so that your relative file locations are correct.
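With this layout, a document in “scripts” reaches the data by going up one level and into “data”. The sketch below shows how that relative path resolves, in Python for illustration only; the file name `report.qmd` and the helper `data_path_from_script` are hypothetical. It assumes paths are resolved relative to the document's own directory, which is the default behaviour when rendering R Markdown and Quarto documents.

```python
import posixpath


def data_path_from_script(script_path, filename):
    """Resolve ../data/<filename> relative to the directory holding the script."""
    script_dir = posixpath.dirname(script_path)
    return posixpath.normpath(posixpath.join(script_dir, "..", "data", filename))


# "report.qmd" is a placeholder name for your own document.
print(data_path_from_script("scripts/report.qmd", "STUDENTID.csv"))
```

In the document itself this corresponds to reading the file from the relative path `../data/STUDENTID.csv`, with STUDENTID replaced by your own student ID.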