Pca

Maybe you are familiar with spreadsheets and pivot table tools for comparing column behaviours such as sums and means. But what happens when you have about a thousand columns? You will need a more synthetic view of your data.

PCA (Principal Component Analysis) is a method belonging to the quantitative analysis (QA) branch.

It performs multidimensional analysis (R^k space), considering the columns of a dataset as "components".

Behaviours are calculated as covariance or correlation and represented as a 2D square matrix.
Many of these features already exist in Python modules, but Python may be slow on wide datasets.

The C++ code is a backend that handles large datasets with the best possible response time.
The Python part (Docker image) can be used to plot and/or crosscheck results, either directly or from the backend.

A Matlab/Octave part is available for crosschecking; some of its scripts can be used to generate graphics.

Purpose

Demystify PCA to make exploration as simple as possible for C/C++ devs.

Lexical

Pre-processing

The covariance matrix is the dispersion matrix of a dataset.
The correlation matrix is the scaled covariance matrix (recognisable by its diagonal set to 1).
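
As a minimal sketch of that scaling, the function below divides each covariance term by the product of the two matching standard deviations. The names `toCorrelation` and `irisCovariance` are illustrative and are not taken from the backend; the matrix values are copied from the sample output of this README.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Scale a covariance matrix into a correlation matrix:
// cor[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j]).
Matrix toCorrelation(const Matrix &cov)
{
    const std::size_t n = cov.size();
    Matrix cor(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        assert(cov[i].size() == n); // the matrix must be square
        assert(cov[i][i] > 0.0);    // each variance must be positive
    }
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            cor[i][j] = cov[i][j] / std::sqrt(cov[i][i] * cov[j][j]);
    return cor;
}

// Iris covariance matrix as printed in the sample output.
Matrix irisCovariance()
{
    return {
        {0.685694, -0.042434, 1.274315, 0.516271},
        {-0.042434, 0.189979, -0.329656, -0.121639},
        {1.274315, -0.329656, 3.116278, 1.295609},
        {0.516271, -0.121639, 1.295609, 0.581006},
    };
}
```

Calling `toCorrelation(irisCovariance())` should give a diagonal of 1 and off-diagonal terms that agree with the correlation matrix of the sample output, up to rounding.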

SVD (Singular Value Decomposition) is the eigen process: the step that yields the eigenvectors and eigenvalues.
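
To give a feel for what the eigen process produces, here is a sketch that recovers only the first component of a covariance matrix. It uses power iteration, a deliberately simpler technique swapped in for illustration; it is not the decomposition used by the backend. All names (`EigenPair`, `dominantEigenPair`, `irisCovariance`) are hypothetical, and the matrix is copied from the sample output of this README.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;

struct EigenPair {
    double value;  // dominant eigenvalue
    Vector vector; // matching unit-length eigenvector
};

// Multiply a square matrix by a vector.
Vector multiply(const Matrix &m, const Vector &v)
{
    Vector out(m.size(), 0.0);
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j)
            out[i] += m[i][j] * v[j];
    return out;
}

// Power iteration: apply the matrix repeatedly and renormalise each time.
EigenPair dominantEigenPair(const Matrix &m, int iterations = 100)
{
    assert(!m.empty() && m.size() == m.front().size());
    Vector v(m.size(), 1.0);
    for (int k = 0; k < iterations; ++k) {
        const Vector w = multiply(m, v);
        double squared = 0.0;
        for (double x : w)
            squared += x * x;
        const double length = std::sqrt(squared);
        assert(length > 0.0);
        for (std::size_t i = 0; i < w.size(); ++i)
            v[i] = w[i] / length;
    }
    // Rayleigh quotient v^T M v; v already has unit length.
    const Vector mv = multiply(m, v);
    double lambda = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        lambda += v[i] * mv[i];
    return {lambda, v};
}

// Iris covariance matrix as printed in the sample output.
Matrix irisCovariance()
{
    return {
        {0.685694, -0.042434, 1.274315, 0.516271},
        {-0.042434, 0.189979, -0.329656, -0.121639},
        {1.274315, -0.329656, 3.116278, 1.295609},
        {0.516271, -0.121639, 1.295609, 0.581006},
    };
}
```

On the iris covariance matrix, `dominantEigenPair(irisCovariance())` should land close to the first eigenvalue and the first eigenvector column listed in the sample output. A full decomposition returns every component at once, which is why a dedicated library routine is preferable in practice.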

Consider two forms of PCA:

covariance based (SVD on the unscaled matrix).
correlation based (SVD on the scaled matrix).

As you may notice

covariance is lossless with a wide dispersion.
correlation is lossy with scaled dispersion.

❓ So which should I use, cov or cor?
When the columns of the dataset share the same units, use covariance; otherwise use correlation.
The method to use therefore depends on the nature of your dataset.
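
The following sketch illustrates why units matter: converting one column from metres to centimetres multiplies its covariance with another column by 100, while the correlation stays the same. The helper names and the height/weight figures used in the usage note are made up for the example; the covariance divides by n - 1, which is an assumption about the convention in use.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Arithmetic mean of a column.
double mean(const Column &c)
{
    assert(!c.empty());
    double sum = 0.0;
    for (double x : c)
        sum += x;
    return sum / static_cast<double>(c.size());
}

// Sample covariance of two columns (divides by n - 1).
double covariance(const Column &a, const Column &b)
{
    assert(a.size() == b.size() && a.size() > 1);
    const double ma = mean(a);
    const double mb = mean(b);
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - ma) * (b[i] - mb);
    return sum / static_cast<double>(a.size() - 1);
}

// Correlation: covariance scaled by both standard deviations.
double correlation(const Column &a, const Column &b)
{
    return covariance(a, b) / std::sqrt(covariance(a, a) * covariance(b, b));
}

// Copy of a column expressed in another unit (for example m to cm).
Column rescale(const Column &c, double factor)
{
    Column out;
    out.reserve(c.size());
    for (double x : c)
        out.push_back(x * factor);
    return out;
}
```

With heights `{1.60, 1.70, 1.80, 1.90}` in metres and weights `{55, 65, 72, 90}` in kilograms, `covariance(rescale(heights, 100.0), weights)` is 100 times `covariance(heights, weights)`, whereas `correlation` returns the same value for both. A covariance-based PCA would therefore be dominated by whichever column happens to use the largest numbers.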

Features

Demo

SpeciesDemo is a semi-generic starter class.
You can provide a filename as the first argument to work on a file other than "species.csv".
In that case you must provide a class label (an additional column in the CSV headers, prefixed by #); see CSV samples such as bovinscat and winecat.
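
As a rough sketch of how such a class label could be located, the helpers below split a header line and look for the first column whose name starts with `#`. This is not the backend's parser: the comma delimiter, the function names and the header names used in the usage note are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

using Fields = std::vector<std::string>;

// Split one CSV line on a delimiter (no quoting support in this sketch).
Fields splitLine(const std::string &line, char delimiter = ',')
{
    Fields fields;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, delimiter))
        fields.push_back(field);
    return fields;
}

// Index of the first header starting with '#', or -1 when there is none.
int classColumn(const Fields &headers)
{
    for (std::size_t i = 0; i < headers.size(); ++i)
        if (!headers[i].empty() && headers[i].front() == '#')
            return static_cast<int>(i);
    return -1;
}

// Copy of a row without the given column, keeping only the other fields.
Fields dropColumn(const Fields &row, std::size_t index)
{
    assert(index < row.size());
    Fields kept;
    for (std::size_t i = 0; i < row.size(); ++i)
        if (i != index)
            kept.push_back(row[i]);
    return kept;
}
```

For a hypothetical header such as `sepal_length,sepal_width,#species`, `classColumn(splitLine(header))` returns 2, and `dropColumn` can then strip that column from each row before the numeric values go into a matrix.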

Calculus

Exports

📋 Csv
📋 Json

Pca explained

Interpretation

Questions

Best way to let pca be normalized (en)

Tools

Fixtures & datasets

Preparing datasets is essential.
Consider running an ETL (Extract, Transform, Load) step before applying PCA.
I recommend reading the following content about csvkit (a module included in the Docker image):

Clean csv data to filter relevant datas from a raw source.
10 csvkit commands you should know as data engineer.

Hereby

Sources

Requirements

CMake 3.22.1.
C++ compiler (g++ here; see CMakeLists.txt to change it).
Alglib (included in src).
Boost 1.78.0; check CMakeLists.txt so that cmake uses the correct PATHS in find_package.
Gnuplot-iostream.
Octave or Matlab.
Doxygen for doc generation.

Check

./check.sh

Build

./build.sh

Run

Help usage

./build/pca --help

Demo species

./build/pca

Demo wine, with options dimension 1 on column 0 and dimension 2 on column 5

build/pca script/matlab/winecat.csv --d1 0 --d2 5

Debug

Gnuplot

GNUPLOT_IOSTREAM_CMD=">dbgscript.gp" ./build/pca

Sample output

Related to

iris species dataset input source.
Python script species.py

Console

Fixture csv iris species 4x150
	Fixture Dataset

	5.100000	3.500000	1.400000	0.200000
	4.900000	3.000000	1.400000	0.200000
	4.700000	3.200000	1.300000	0.200000
	4.600000	3.100000	1.500000	0.200000
	5.000000	3.600000	1.400000	0.200000
	         ...
	Covariance

	0.685694	-0.042434	1.274315	0.516271
	-0.042434	0.189979	-0.329656	-0.121639
	1.274315	-0.329656	3.116278	1.295609
	0.516271	-0.121639	1.295609	0.581006

	Correlation

	1.000000	-0.117570	0.871754	0.817941
	-0.117570	1.000000	-0.428440	-0.366126
	0.871754	-0.428440	1.000000	0.962865
	0.817941	-0.366126	0.962865	1.000000

	Eigen vectors

	0.361387	-0.656589	0.582030	0.315487
	-0.084523	-0.730161	-0.597911	-0.319723
	0.856671	0.173373	-0.076236	-0.479839
	0.358289	0.075481	-0.545831	0.753657

	Eigen values

	4.228242	0.242671	0.078210	0.023835
	
	Explained variance
	
	0.924619	0.053066	0.017103	0.005212
	
	Projected matrix

	2.818240	-5.646350	0.659768	-0.031089
	2.788223	-5.149951	0.842317	0.065675
	2.613375	-5.182003	0.613952	-0.013383
	2.757022	-5.008654	0.600293	-0.108928
	2.773649	-5.653707	0.541773	-0.094610
	         ...
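
The last two parts of the console output can be recomputed from the earlier ones. In the sketch below, the explained variance is each eigenvalue divided by the sum of all eigenvalues, and a projected row is a dataset row multiplied by the eigenvector matrix (eigenvectors stored as columns). The printed numbers are consistent with the raw, uncentred row being used; that is an observation drawn from the output above, not a statement about the backend's code. Function names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;

// Share of the total variance carried by each component.
Vector explainedVariance(const Vector &eigenValues)
{
    const double total =
        std::accumulate(eigenValues.begin(), eigenValues.end(), 0.0);
    assert(total > 0.0);
    Vector ratios;
    ratios.reserve(eigenValues.size());
    for (double value : eigenValues)
        ratios.push_back(value / total);
    return ratios;
}

// Project one dataset row onto the eigenvectors (one eigenvector per column).
Vector project(const Vector &row, const Matrix &eigenVectors)
{
    assert(!eigenVectors.empty() && row.size() == eigenVectors.size());
    Vector out(eigenVectors.front().size(), 0.0);
    for (std::size_t i = 0; i < row.size(); ++i)
        for (std::size_t j = 0; j < out.size(); ++j)
            out[j] += row[i] * eigenVectors[i][j];
    return out;
}

// Eigenvalues and eigenvectors as printed in the console output.
Vector irisEigenValues()
{
    return {4.228242, 0.242671, 0.078210, 0.023835};
}

Matrix irisEigenVectors()
{
    return {
        {0.361387, -0.656589, 0.582030, 0.315487},
        {-0.084523, -0.730161, -0.597911, -0.319723},
        {0.856671, 0.173373, -0.076236, -0.479839},
        {0.358289, 0.075481, -0.545831, 0.753657},
    };
}
```

Feeding the first fixture row `{5.1, 3.5, 1.4, 0.2}` to `project` with `irisEigenVectors()` should reproduce the first row of the projected matrix, and `explainedVariance(irisEigenValues())` the explained variance row, both up to rounding.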

Graphics

Scatter
Correlation circle
Heatmap correlation
Dataset box and whiskers

Testing

./test.sh

Doc

From the project root:

doxygen doc/pcasvd.doxygen

The doc will be generated in the doc/html folder.

Todo

Make gplot thread safe.
Improve gplotgeneric heatmap scale.
Remove #cat class from matrices.

Contribute

Feel free to clone from master, then open a pull request from your brand new branch.
Please do not forget to rebase on master before pushing.

Licence

GPL 3.0
