High-dimensional-statistics

Statistical learning plays a key role in many fields: science, medicine, industry, marketing, finance, and others. Advances in data storage and computing resources have led to the production and storage of huge amounts of data, from which data scientists try to extract crucial information, either to better understand the underlying phenomena or to make predictions. Many fields are affected; here are some examples of learning problems:

  • Signals: The aerospace industry produces a huge amount of signal measurements from thousands of on-board sensors. For example, before a satellite is launched, many tests are performed to observe its behavior under various conditions, and it is particularly important to detect possible anomalies before launch. Similarly, planes carry many sensors; they are generally redundant, and it is important to detect abnormal behavior in any one of them. These examples involve curve clustering or anomaly detection in a set of curves.

  • Images: More and more images are collected and stored, for example medical images, Earth observation satellite images, photos, video surveillance images, and images of handwritten text. Each image is made up of a huge number of pixels. Examples of learning problems include handwritten digit recognition, tumor detection, and image classification.

  • Microarray data: DNA microarrays make it possible to measure the expression of thousands of genes simultaneously in a single individual. A typical challenge is to infer from this kind of data which genes are involved in a certain type of cancer by comparing expression levels between healthy and sick patients. The number p of genes measured on a microarray is generally much larger than the number n of individuals in the study.

  • Geolocation data: Machine learning based on geolocation data also has many potential applications, such as targeted advertising, road traffic forecasting, and monitoring the behavior of fishing vessels.

  • Consumer preference data: Websites and supermarkets collect a huge amount of data on consumer behavior. Machine learning algorithms are used to extract value from these data (sometimes combined with personal data such as age, sex, job, or address) for applications such as recommendation systems and personalized pricing.

The main references for this course are the following books:

- "Introduction to High-Dimensional Statistics" by C. Giraud
- "The Elements of Statistical Learning" by T. Hastie, R. Tibshirani and J. Friedman

This course mainly focuses on high-dimensional statistical problems, namely problems where many (sometimes thousands of) variables are observed for each individual. When studying a given phenomenon (presence or absence of a cancer, normal or abnormal behavior, interest in a certain product, and so on), the challenge is to determine which variables, among the huge number available, influence the phenomenon of interest, and to provide a prediction rule.
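As a rough illustration of the variable selection problem described above, the following Python sketch simulates a linear model with more variables than individuals (p = 50, n = 20) in which only three variables matter, then fits a Lasso estimator by coordinate descent. The choice of the Lasso, the simulated data, the penalty level, and all function names (`soft_threshold`, `lasso_coordinate_descent`, `lambda_max`, `simulate`) are assumptions made for this example; the text above does not prescribe a particular method.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; return 0.0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(X, y, beta, lam):
    """(1 / 2n) * ||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(y), len(beta)
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
              for i in range(n))
    return rss / (2 * n) + lam * sum(abs(b) for b in beta)


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize the Lasso objective by cycling over the coordinates."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # residual = y - X beta, kept up to date
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that ignores beta[j].
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_value = soft_threshold(rho, lam) / col_scale[j]
            delta = new_value - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_value
    return beta


def lambda_max(X, y):
    """Smallest penalty for which the all-zero vector solves the Lasso."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n)) / n)
               for j in range(p))


def simulate(n=20, p=50, seed=0):
    """Hypothetical data: only the first three variables influence y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    true_beta = [3.0, -2.0, 1.5] + [0.0] * (p - 3)
    y = [sum(X[i][j] * true_beta[j] for j in range(3)) + rng.gauss(0.0, 0.1)
         for i in range(n)]
    return X, y, true_beta


X, y, true_beta = simulate()
lam = 0.1 * lambda_max(X, y)  # arbitrary penalty level chosen for the demo
beta_hat = lasso_coordinate_descent(X, y, lam)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected variables:", selected)
```

The L1 penalty sets many coefficients exactly to zero, so the fitted vector doubles as a variable selector and a prediction rule. In practice the penalty level would be tuned, for example by cross-validation, rather than fixed as a fraction of `lambda_max`.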

We will focus on:

  • Model selection methods for high-dimensional linear regression models

  • Classical methods for supervised classification and Support Vector Machines

  • Regularization methods in nonparametric statistics and wavelet thresholding

  • An introduction to deep learning

  • Anomaly detection for functional data