
pierre-fromager/pcasvd


Pca

You may be familiar with spreadsheets and pivot-table tools for comparing column behaviours such as sums and means. With about a thousand columns, however, you need a more synthetic view of your data.

Pca (Principal Component Analysis) is a method belonging to the Quantitative Analysis (QA) branch.

It performs multidimensional analysis (Rk space), considering "Components" as the columns of a dataset.

Behaviours are calculated as covariance or correlation and represented as a 2D square matrix.
Many of these features already exist in Python modules, but Python may be slow on wide datasets.

The C++ code is a backend that handles large datasets with the best possible response time.
The Python part (Docker image) can be used to plot and/or to crosscheck results, either directly or from the backend.

A Matlab/Octave part is available for crosschecking; some of its scripts can be used to generate graphics.

Toc

Purpose

Demystify PCA to make exploration as simple as possible for C/C++ devs.

Lexical

Pre-processing

  • Covariance matrix is the dispersion matrix of a dataset.
  • Correlation matrix is a scaled covariance matrix (recognizable by its diagonal set to 1).

Svd (Singular Value Decomposition) is the Eigen process.

Consider 2 forms of Pca

  • covariance based (Svd on unscaled matrix).
  • correlation based (Svd on scaled matrix).

As you may notice

  • covariance is lossless: it keeps the original, possibly wide, dispersion of each column.
  • correlation is lossy: the dispersion is scaled, so the original per-column variances are discarded.

❓ So what should I use, cov or cor?
When the dataset columns share the same units, use covariance; otherwise use correlation.
So the method to use depends on the nature of your dataset.

◀️

Features

Demo

SpeciesDemo is a semi-generic starter class.
You can provide a filename as the 1st argument to work on a file other than "species.csv".
In that case you must provide a class label (an additional column whose csv header is prefixed by #); see csv samples such as bovinscat and winecat.

Calculus

Graphics

Exports

  • 📋 Csv
  • 📋 Json

Pca explained

Interpretation

Questions

Tools

◀️

Fixtures & datasets

Preparing datasets is essential.
Think about using an ETL (Extract, Transform, Load) step before running pca.
I recommend reading the content related to csvkit (a module included in the docker image):

Hereby

Sources

Requirements

◀️

Check

./check.sh

◀️

Build

./build.sh

◀️

Run

Help usage

./build/pca --help

Demo species

./build/pca

Demo wine with options: dimension 1 on column 0, dimension 2 on column 5

build/pca script/matlab/winecat.csv --d1 0 --d2 5

◀️

Debug

Gnuplot

GNUPLOT_IOSTREAM_CMD=">dbgscript.gp" ./build/pca

◀️

Sample output

Related to

Console

Fixture csv iris species 4x150
	Fixture Dataset

	5.100000	3.500000	1.400000	0.200000
	4.900000	3.000000	1.400000	0.200000
	4.700000	3.200000	1.300000	0.200000
	4.600000	3.100000	1.500000	0.200000
	5.000000	3.600000	1.400000	0.200000
	         ...
	Covariance

	0.685694	-0.042434	1.274315	0.516271
	-0.042434	0.189979	-0.329656	-0.121639
	1.274315	-0.329656	3.116278	1.295609
	0.516271	-0.121639	1.295609	0.581006

	Correlation

	1.000000	-0.117570	0.871754	0.817941
	-0.117570	1.000000	-0.428440	-0.366126
	0.871754	-0.428440	1.000000	0.962865
	0.817941	-0.366126	0.962865	1.000000

	Eigen vectors

	0.361387	-0.656589	0.582030	0.315487
	-0.084523	-0.730161	-0.597911	-0.319723
	0.856671	0.173373	-0.076236	-0.479839
	0.358289	0.075481	-0.545831	0.753657

	Eigen values

	4.228242	0.242671	0.078210	0.023835
	
	Explained variance
	
	0.924619	0.053066	0.017103	0.005212
	
	Projected matrix

	2.818240	-5.646350	0.659768	-0.031089
	2.788223	-5.149951	0.842317	0.065675
	2.613375	-5.182003	0.613952	-0.013383
	2.757022	-5.008654	0.600293	-0.108928
	2.773649	-5.653707	0.541773	-0.094610
	         ...

◀️

Graphics

Scatter

❓ How can I interpret the individual factor map?

Scatter ◀️

Correlation circle

❓ How can I interpret the variable factor map?

CorCircle ◀️

Heatmap correlation

❓ How can I interpret the correlation heatmap?

HeatMapCor ◀️

Dataset box and whiskers

❓ What does a box plot tell you?

BoxAndWiskers ◀️

Testing

./test.sh

◀️

Doc

From the project root:

doxygen doc/pcasvd.doxygen

The doc will be generated in the doc/html folder.

◀️

Todo

  • Make gplot thread safe.
  • Improve gplotgeneric heatmap scale.
  • Remove #cat class from matrices.

Contribute

  • Feel free to clone from master, then open a pull request from your brand new branch.
  • Please do not forget to rebase on master before pushing.

Licence

About

Principal Component Analysis
