You may be familiar with spreadsheets and pivot-table tools for comparing column behaviours such as sums and means. But what happens when you have about a thousand columns? You will need a more synthetic view of your data.
PCA (Principal Component Analysis) is a method from the Quantitative Analysis (QA) branch.
It performs multidimensional analysis (in R^k space), treating the columns of a dataset as "components".
Behaviours are calculated as covariance or correlation and represented as a 2D square matrix.
Many of these features already exist in Python modules, but Python may be slow on wide datasets.
The C++ code is a backend that handles large datasets with the best possible response time.
The Python part (Docker image) can be used to plot and/or cross-check results, either directly or from the backend.
The Matlab/Octave part is available for cross-checking; some of its scripts can be used to generate graphics.
- Purpose
- Lexical
- Features
- Requirements
- Pca explained
- Check
- Build
- Run
- Debug
- Sample output
- Testing
- Doc
Demystify PCA to make exploration as simple as possible for C/C++ devs.
Pre-processing
- The covariance matrix is the dispersion matrix of a dataset.
- The correlation matrix is a scaled covariance matrix (recognizable by its diagonal set to 1).
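As a rough illustration of these two definitions, here is a minimal sketch in plain C++. It is not the project's implementation: the `Matrix` alias and the `covariance` and `correlation` function names are assumptions made for this example, and the `n - 1` (sample) denominator is chosen because it appears consistent with the values shown in the sample output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical alias for this sketch: one inner vector per row.
using Matrix = std::vector<std::vector<double>>;

// Sample covariance (n - 1 denominator) of a row-wise dataset:
// one row per observation, one column per variable.
Matrix covariance(const Matrix &data)
{
    assert(data.size() > 1);
    const std::size_t n = data.size();
    const std::size_t k = data.front().size();

    // Column means.
    std::vector<double> mean(k, 0.0);
    for (const auto &row : data) {
        assert(row.size() == k);
        for (std::size_t j = 0; j < k; ++j)
            mean[j] += row[j];
    }
    for (double &m : mean)
        m /= static_cast<double>(n);

    // Sum of products of centered values, then divide by n - 1.
    Matrix cov(k, std::vector<double>(k, 0.0));
    for (const auto &row : data)
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]);
    for (auto &r : cov)
        for (double &v : r)
            v /= static_cast<double>(n - 1);
    return cov;
}

// Correlation: covariance scaled by the standard deviations of both
// variables, which brings the diagonal to 1.
// Assumes no column is constant (no zero variance on the diagonal).
Matrix correlation(const Matrix &cov)
{
    const std::size_t k = cov.size();
    Matrix cor(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i)
        for (std::size_t j = 0; j < k; ++j)
            cor[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
    return cor;
}
```

Applied to the iris covariance shown in the sample output, the scaling gives for instance -0.042434 / sqrt(0.685694 × 0.189979) ≈ -0.11757, which matches the corresponding correlation entry.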
SVD (Singular Value Decomposition) is the eigen process.
Consider 2 forms of PCA:
- covariance based (Svd on unscaled matrix).
- correlation based (Svd on scaled matrix).
As you may notice
- covariance is lossless with a wide dispersion.
- correlation is lossy with scaled dispersion.
So which should I use, cov or cor?
When the columns of the dataset hold values in the same units, use covariance; otherwise use correlation.
So the method to use depends on the nature of your dataset.
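Whichever matrix you pick, the eigen step that follows is the same. To give a feel for it, here is a minimal sketch that extracts only the first component with power iteration. This is a deliberately simpler technique than a full SVD and is not the routine the backend uses; `EigenPair` and `dominantEigenPair` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical alias for this sketch: one inner vector per row.
using Matrix = std::vector<std::vector<double>>;

struct EigenPair {
    double value;               // dominant eigenvalue
    std::vector<double> vector; // matching unit-length eigenvector
};

// Power iteration: repeatedly apply the matrix to a vector and renormalise.
// Assumes a symmetric matrix whose largest eigenvalue is positive and
// well separated, and a start vector not orthogonal to its eigenvector.
EigenPair dominantEigenPair(const Matrix &m, int iterations = 100)
{
    const std::size_t k = m.size();
    assert(k > 0);
    std::vector<double> v(k, 1.0 / std::sqrt(static_cast<double>(k)));
    std::vector<double> w(k, 0.0);

    for (int it = 0; it < iterations; ++it) {
        // w = M * v, accumulating the squared norm on the way.
        double norm = 0.0;
        for (std::size_t i = 0; i < k; ++i) {
            w[i] = 0.0;
            for (std::size_t j = 0; j < k; ++j)
                w[i] += m[i][j] * v[j];
            norm += w[i] * w[i];
        }
        norm = std::sqrt(norm);
        assert(norm > 0.0);
        for (std::size_t i = 0; i < k; ++i)
            v[i] = w[i] / norm;
    }

    // Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    double lambda = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        double mv = 0.0;
        for (std::size_t j = 0; j < k; ++j)
            mv += m[i][j] * v[j];
        lambda += v[i] * mv;
    }
    return EigenPair{lambda, v};
}
```

Fed with the iris covariance matrix from the sample output, this converges to roughly 4.2282 with the vector (0.3614, -0.0845, 0.8567, 0.3583), i.e. the first eigenvalue and the first column of the eigenvectors listed there. The remaining components would need deflation or a full decomposition, which this sketch does not cover.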
SpeciesDemo is a semi-generic starter class.
You can provide a filename as the 1st argument to work on a file other than "species.csv".
In that case you must provide a class label (an additional column in the csv headers, prefixed by #); see csv samples like bovinscat, winecat.
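To illustrate the `#` convention only, here is a small sketch that locates the class label column in a header line. It is an assumption-based example, not the parser used by the project: the function name `classLabelColumn`, the comma separator and the sample header are all hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the zero-based index of the first header cell starting with '#',
// or -1 when the header carries no class label column.
// The comma default is an assumption; pass another separator if needed.
int classLabelColumn(const std::string &headerLine, char separator = ',')
{
    assert(separator != '#'); // the marker cannot double as the separator
    std::istringstream in(headerLine);
    std::string cell;
    int index = 0;
    while (std::getline(in, cell, separator)) {
        if (!cell.empty() && cell.front() == '#')
            return index;
        ++index;
    }
    return -1;
}
```

With a hypothetical header such as `alcohol,ash,#cat`, the function returns 2; with no `#` cell it returns -1, which a caller could treat as "class label missing".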
- Covariance
- Correlation
- Pca basis
- Eigen values
- Eigen vectors
- Explained variance
- Projection
- Scatter
- Correlation circle
- Heatmap correlation
- Dataset boxes and whiskers
- Csv
- Json
Interpretation
Questions
Tools
- Online Statistics Calculator (en)
- Principal Component Analysis and Linear Discriminant Analysis with GNU Octave (en)
Preparing datasets is essential.
Consider using some ETL (Extract, Transform, Load) before running PCA.
I recommend reading the content related to csvkit (module included in the Docker image):
- Clean csv data to filter relevant data from a raw source.
- 10 csvkit commands you should know as data engineer.
Hereby
- 2x12 inline
- 4x12 pop(gender/salary/age/weight) csv
- 6x23 bovins(vif/carcasse/quality/total/gras/os) csv
- 4x150 iris species(sepallength/sepalwidth/petallength/petalwidth) csv
Sources
- CMake 3.22.1.
- C++ compiler (g++ here; it can be changed in CMakeLists.txt).
- Alglib included in src.
- Boost 1.78.0; check CMakeLists.txt so that cmake uses the correct PATHS in find_package.
- Gnuplot-iostream.
- Octave or Matlab.
- Doxygen for doc generation.
./check.sh
./build.sh
Help usage
./build/pca --help
Demo species
./build/pca
Demo wine, with options: dimension 1 on column 0, dimension 2 on column 5
build/pca script/matlab/winecat.csv --d1 0 --d2 5
GNUPLOT_IOSTREAM_CMD=">dbgscript.gp" ./build/pca
Related to
- iris species dataset input source.
- Python script species.py
Fixture csv iris species 4x150
Fixture Dataset
5.100000 3.500000 1.400000 0.200000
4.900000 3.000000 1.400000 0.200000
4.700000 3.200000 1.300000 0.200000
4.600000 3.100000 1.500000 0.200000
5.000000 3.600000 1.400000 0.200000
...
Covariance
0.685694 -0.042434 1.274315 0.516271
-0.042434 0.189979 -0.329656 -0.121639
1.274315 -0.329656 3.116278 1.295609
0.516271 -0.121639 1.295609 0.581006
Correlation
1.000000 -0.117570 0.871754 0.817941
-0.117570 1.000000 -0.428440 -0.366126
0.871754 -0.428440 1.000000 0.962865
0.817941 -0.366126 0.962865 1.000000
Eigen vectors
0.361387 -0.656589 0.582030 0.315487
-0.084523 -0.730161 -0.597911 -0.319723
0.856671 0.173373 -0.076236 -0.479839
0.358289 0.075481 -0.545831 0.753657
Eigen values
4.228242 0.242671 0.078210 0.023835
Explained variance
0.924619 0.053066 0.017103 0.005212
Projected matrix
2.818240 -5.646350 0.659768 -0.031089
2.788223 -5.149951 0.842317 0.065675
2.613375 -5.182003 0.613952 -0.013383
2.757022 -5.008654 0.600293 -0.108928
2.773649 -5.653707 0.541773 -0.094610
...
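The last two blocks of this output can be re-derived by hand from the earlier ones. The sketch below is only an illustration, with hypothetical function names (`explainedVariance`, `project`): each explained variance is an eigenvalue divided by the sum of all eigenvalues, and, judging from the numbers above, each projected row appears to be the raw dataset row multiplied by the eigenvector matrix (one eigenvector per column).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical alias for this sketch: one inner vector per row.
using Matrix = std::vector<std::vector<double>>;

// Share of the total variance carried by each component:
// eigenvalue / sum of all eigenvalues.
std::vector<double> explainedVariance(const std::vector<double> &eigenValues)
{
    const double total =
        std::accumulate(eigenValues.begin(), eigenValues.end(), 0.0);
    assert(total > 0.0);
    std::vector<double> ratios;
    ratios.reserve(eigenValues.size());
    for (double v : eigenValues)
        ratios.push_back(v / total);
    return ratios;
}

// Projects one observation onto the components: row vector times the
// eigenvector matrix, whose columns are the eigenvectors.
std::vector<double> project(const std::vector<double> &row,
                            const Matrix &eigenVectors)
{
    assert(!eigenVectors.empty());
    assert(row.size() == eigenVectors.size());
    const std::size_t k = eigenVectors.front().size();
    std::vector<double> out(k, 0.0);
    for (std::size_t i = 0; i < row.size(); ++i)
        for (std::size_t j = 0; j < k; ++j)
            out[j] += row[i] * eigenVectors[i][j];
    return out;
}
```

For example, 4.228242 / (4.228242 + 0.242671 + 0.078210 + 0.023835) ≈ 0.92462, and projecting the first fixture row (5.1, 3.5, 1.4, 0.2) with the eigenvectors above gives about 2.81824 and -5.64635 for the first two components, in line with the first projected row.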
How can I interpret the individual factor map?
How can I interpret the variable factor map?
How can I interpret the correlation heatmap?
What does a box plot tell you?
./test.sh
From the project root:
doxygen doc/pcasvd.doxygen
The doc will be generated in the doc/html folder.
- Make gplot thread safe.
- Improve gplotgeneric heatmap scale.
- Remove #cat class from matrices.
- Feel free to clone from master, then open a pull request from your brand new branch.
- Please do not forget to rebase from master before pushing.