-
Notifications
You must be signed in to change notification settings - Fork 1
/
index.tex
1361 lines (1193 loc) · 64.2 KB
/
index.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\PassOptionsToPackage{unicode=true}{hyperref} % options for packages loaded elsewhere
\PassOptionsToPackage{hyphens}{url}
\PassOptionsToPackage{dvipsnames,svgnames*,x11names*}{xcolor}
%
\documentclass[]{article}
\usepackage{lmodern}
\usepackage{amssymb,amsmath}
\usepackage{ifxetex,ifluatex}
\usepackage{fixltx2e} % provides \textsubscript
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{textcomp} % provides euro and other symbols
\else % if luatex or xelatex
\usepackage{unicode-math}
\defaultfontfeatures{Ligatures=TeX,Scale=MatchLowercase}
\fi
% use upquote if available, for straight quotes in verbatim environments
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
% use microtype if available
\IfFileExists{microtype.sty}{%
\usepackage[]{microtype}
\UseMicrotypeSet[protrusion]{basicmath} % disable protrusion for tt fonts
}{}
\IfFileExists{parskip.sty}{%
\usepackage{parskip}
}{% else
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
}
\usepackage{xcolor}
\usepackage{hyperref}
\hypersetup{
pdftitle={Introduction to ordination (PCA) in R},
pdfauthor={Brad Duthie},
colorlinks=true,
linkcolor=blue,
filecolor=Maroon,
citecolor=Blue,
urlcolor=Blue,
breaklinks=true}
\urlstyle{same} % don't use monospace font for urls
\usepackage[margin=1in]{geometry}
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.77,0.63,0.00}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\usepackage{longtable,booktabs}
% Fix footnotes in tables (requires footnote package)
\IfFileExists{footnote.sty}{\usepackage{footnote}\makesavenoteenv{longtable}}{}
\usepackage{graphicx,grffile}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\setlength{\emergencystretch}{3em} % prevent overfull lines
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\setcounter{secnumdepth}{0}
% Redefines (sub)paragraphs to behave more like sections
\ifx\paragraph\undefined\else
\let\oldparagraph\paragraph
\renewcommand{\paragraph}[1]{\oldparagraph{#1}\mbox{}}
\fi
\ifx\subparagraph\undefined\else
\let\oldsubparagraph\subparagraph
\renewcommand{\subparagraph}[1]{\oldsubparagraph{#1}\mbox{}}
\fi
% set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
\title{Introduction to ordination (PCA) in R}
\author{Brad Duthie}
\date{22 July 2020}
\begin{document}
\maketitle
\hypertarget{contents}{%
\section{Contents}\label{contents}}
\begin{center}\rule{0.5\linewidth}{0.5pt}\end{center}
\textbf{Ordination is a useful tool for visualising
\href{https://en.wikipedia.org/wiki/Multivariate_analysis}{multivariate
data}. These notes focus on explaining and demonstrating
\href{https://en.wikipedia.org/wiki/Principal_component_analysis}{principal
component analysis} in the R programming language. Examples will use
\href{https://github.com/StirlingCodingClub/ordination/master/wing_loadings.csv}{morphological
data} from
\href{http://bradduthie.github.io/DuthieEtAl2015_Appendix.pdf\#page=3}{six
species} of fig wasps in a community associated with the Sonoran Desert
Rock Fig (\emph{Ficus petiolaris}). These notes are also available as
\href{https://stirlingcodingclub.github.io/ordination/index.pdf}{PDF}
and
\href{https://stirlingcodingclub.github.io/ordination/index.docx}{DOCX}
documents.}
\begin{center}\rule{0.5\linewidth}{0.5pt}\end{center}
\begin{itemize}
\tightlist
\item
\protect\hyperlink{intro}{Introduction: What is ordination?}
\item
\protect\hyperlink{pca}{Principal Component Analysis: key concepts}
\item
\protect\hyperlink{wasps}{Fig wasp morphological data}
\item
\protect\hyperlink{Rcode}{Principal Component Analysis in R}
\item
\protect\hyperlink{maths}{Principal Component Analysis: matrix
algebra}
\item
\protect\hyperlink{conclusions}{Conclusions}
\item
\protect\hyperlink{refs}{Literature Cited}
\end{itemize}
\begin{center}\rule{0.5\linewidth}{0.5pt}\end{center}
\hypertarget{introduction-what-is-ordination}{%
\section{Introduction: What is
ordination?}\label{introduction-what-is-ordination}}
Ordination is a method for investigating and visualising multivariate
data. It allows us to look at and understand variation in a data set
when there are too many dimensions to see everything at once. Imagine a
data set with four different variables. I have
\href{https://StirlingCodingClub.github.io/simulating_data/index.html}{simulated
one} below.
\begin{longtable}[]{@{}lrrrr@{}}
\toprule
& Variable\_1 & Variable\_2 & Variable\_3 & Variable\_4\tabularnewline
\midrule
\endhead
Sample\_1 & 3.8432410 & 2.6407594 & -0.6976496 &
-3.8173266\tabularnewline
Sample\_2 & 5.0907125 & 6.0683544 & 1.8097752 & 0.8234129\tabularnewline
Sample\_3 & -3.2575704 & -3.6965361 & 4.4399886 &
0.3118165\tabularnewline
Sample\_4 & -1.7029267 & -0.7446813 & 4.9935894 &
-0.3506784\tabularnewline
Sample\_5 & -1.4924450 & -1.9229800 & 3.4279558 &
7.8473583\tabularnewline
Sample\_6 & 1.9690167 & -0.0193453 & -0.2280392 &
-0.7550955\tabularnewline
Sample\_7 & 4.4895141 & 4.3336861 & 2.2381424 & 4.7782675\tabularnewline
Sample\_8 & 1.1811029 & -0.1939233 & -2.4153756 &
4.1086135\tabularnewline
Sample\_9 & 2.3024895 & 0.9478533 & 4.4690578 &
-0.3843029\tabularnewline
Sample\_10 & 7.0521002 & 7.0297369 & 4.4969840 &
7.3068450\tabularnewline
Sample\_11 & -0.8589422 & -1.6083472 & 5.2195623 &
3.8069426\tabularnewline
Sample\_12 & 5.0304025 & 7.0224731 & -2.4648239 &
-1.7609266\tabularnewline
\bottomrule
\end{longtable}
We could look at the distribution of each variable in columns 1-4
individually using a histogram. Or we could use a scatterplot to look at
two variables at the same time, with one variable plotted in one
dimension (horizontal x-axis) and a second variable plotted
\href{https://en.wikipedia.org/wiki/Orthogonality}{orthogonally} (i.e.,
at a right angle) in another dimension (y-axis). I have done this below
with a histogram for \texttt{Variable\_1}, and with a scatterplot of
\texttt{Variable\_1} versus \texttt{Variable\_2}. The numbers in the
scatterplot points correspond to the sample number (i.e., rows 1-12).
\includegraphics{index_files/figure-latex/unnamed-chunk-2-1.pdf}
Since we do not have four dimensions of space for plotting, we cannot
put all four variables on a scatterplot that we can visualise with an
axis for each variable. Hence, in the scatterplot to the right above, we
are ignoring \texttt{Variable\_3} and \texttt{Variable\_4}.
For variables 1 and 2, we can see that Sample\_1 is similar to Sample\_7
because the points are close together, and that Sample\_1 is different
from Sample\_3 because the points are relatively far apart. But it might
be useful to see the distance between Sample\_1, Sample\_7, and
Sample\_3 while simultaneously taking into account Variables 3 and 4.
More generally, it would be useful to see how distant different points
are from one another given all of the dimensions in the data set.
Ordination methods are a tools that try to do this, allowing us to
visualise high-dimensional variation in data in a reduced number of
dimensions (usually two).
It is important to emphasise that ordination is a tool for exploring
data and reducing dimensionality; it is not a method of hypothesis
testing (i.e., not a type of
\href{https://stirlingcodingclub.github.io/linear_modelling/index.html}{linear
model}). Variables in ordinations are interpreted as response variables
(i.e., \(y\) variables in models where
\(y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\)). Ordinations
visualise the total variation in these variables, but do not test
relationships between dependent (\(x\)) and independent (\(y\))
variables.
These notes will focus entirely on Principal Component Analysis (PCA),
which is just one of many ordination methods (perhaps the most commonly
used). After reading these notes, you should have a clear
\protect\hyperlink{pca}{conceptual understanding} of how a PCA is
constructed, and how to read one when you see one in the literature. You
should also be able to \protect\hyperlink{Rcode}{build your own PCA} in
R using the code below. For most biologists and environmental sciences,
this should be enough knowledge to allow you use PCA effectively in your
own research. I have therefore kept the maths to a minimum when
explaining the key ideas, putting all of the
\protect\hyperlink{maths}{matrix algebra} underlying PCA into a single
section toward the end.
\hypertarget{principal-component-analysis-key-concepts}{%
\section{Principal Component Analysis: key
concepts}\label{principal-component-analysis-key-concepts}}
Imagine a scatterplot of data, with each variable getting its own axis
representing some kind of measurement. If there are only two variables,
as with the scatterplot above, then we would only need an x-axis and a
y-axis to show the \emph{exact} position of each data point in data
space. If we add a third variable, then we would need a z-axis, which
would be orthogonal to the x-axis and y-axis (i.e., three dimensions of
data space). If we need to add a fourth variable, then we would need yet
another axis along which our data points could vary, making the whole
data space extremely difficult to visualise. As we add yet more
variables and more axes, the position that our data points occupy in
data space becomes impossible to visualise.
Principal Component Analysis (PCA) can be interpreted as a rotation of
data in this multi-dimensional space. The distance between data points
does not change at all; the data are just moved around so that the total
variation in the data set is easier to see. If this verbal explanation
is confusing, that is okay; a visual example should make the idea easier
to understand. To make everything easy to see, I will start with only
two dimensions using Variable\_1 and Variable\_2 from earlier (for now,
just ignore the existence of Variables 3 and 4). Notice from the
scatterplot that there is variation in both dimensions of the data; the
variance of Variable\_1 is 10.58, and the variance of Variable\_2 is
13.63. But these variables are also clearly correlated. A sample for
which Variable\_1 is measured to be high is also very likely measure a
high value of Variable\_2 (perhaps these are measurements of animal
length and width).
What if we wanted to show as much of the total variation as possible
just on the x-axis? In other words, rotate the data so that the
maximimum amount of variance in the full data set (i.e., in the
scatterplot) falls along the x-axis, with any variation remaining being
left to fall along the y-axis? To do this, we need to draw a line that
cuts through our two dimensions of space in the direction where the data
are most stretched out (i.e., have the highest variance); \textbf{this
direction is our first Principal component, PC1}. I have drawn it below
in red (left panel).
\includegraphics{index_files/figure-latex/unnamed-chunk-4-1.pdf}
To build our PCA, all that we need to do is take this red line and drag
it to the x-axis so that it overlaps with \(y = 0\) (right panel). As we
move Principal Component 1 (PC1) to the x-axis, we bring all of the data
points with it, preserving their distances from PC1 and each other.
Notice in the panel to the right above that the data have the same shape
as they do in the panel to the left. The distances between points have
not changed at all; everything has just been moved. Sample\_1 and
Sample\_7 are just as close to each other as they were in the original
scatterplot, and Sample\_1 and Sample\_3 are just as far away.
Principal Component 1 shows that maximum amount of variation that is
possible to show in one dimension while preserving these distances
between points. What little variation that remains is in PC2. Since we
have maximised the amount of variation on PC1, there can be no
additional covariation left between the x-axis and the y-axis. If any
existed, then we would need to move the line again because it would mean
that more variation could be still placed along the x-axis (i.e., more
variation could be squeezed out of the covariation between variables).
Hence, PC1 and PC2 are, by definition, uncorrelated. Note that this does
not mean that Variable 1 and Variable 2 are not correlated (they
obviously are!), but PC1 and PC2 are not. This is possible because PC1
and PC2 represent a mixture of Variables 1 and 2. They no longer
represent a specific variable, but a linear combination of multiple
variables.
This toy example with two variables can be extended to any number of
variables and principal components. PCA finds the axis along which
variation is maximised in any number of dimensions and makes that line
PC1. In our example with two variables, this completed the PCA because
we were only left with one other dimension, so all of the remaining
variation had to go in PC2. But when there are more variables and
dimensions, the data are rotated again around PC1 so that the maximum
amount of variation is shown in PC2 after PC1 is fixed. For three
variables, imagine a cloud of data points; first put a line through the
most elonated part of the cloud and reorient the whole thing so that
this most elongated part is the width of the cloud (x-axis), then spin
it again along this axis so that the next widest part is the height of
the cloud (y-axis). The same idea applies for data in even higher
dimensional space; the data are rotated orthogonally around each
additional Principal Component so that the amount of variation explained
with each gets progressively smaller (in practice, however, all of this
rotating is done simultaneously by the \protect\hyperlink{maths}{matrix
algebra}).
There are a few additional points to make before moving on to some real
data. First, PCA is not useful if the data are entirely uncorrelated. To
see why, take a look at the plot of two hypothetical (albeit ridiculous)
variables in the lower left panel. The correlation between the two
variables is zero, and there is no way to rotate the data to increase
the amount of variation shown on the x-axis. The PCA on the lower right
is basically the same figure, just moved so that the centre is on the
origin.
\includegraphics{index_files/figure-latex/unnamed-chunk-5-1.pdf}
Second, if two variables are completely correlated, the maths underlying
PCA does not work because a
\href{https://mathworld.wolfram.com/SingularMatrix.html}{singluar
matrix} is created (this is the matrix algebra equivalent of dividing by
zero). Here is what it looks like visually if two variables are
completely correlated.
\includegraphics{index_files/figure-latex/unnamed-chunk-6-1.pdf}
The panel on the left shows two perfectly correlated variables. The
panel on the right shows what the PCA would look like. Note that there
is no variation on PC2. One hundred percent of the variation can be
described using PC1, meaning that if we know the value of Variable\_1,
then we can certain about the value of Variable\_2.
I will now move on to introduce a morphological data set collected in
2010 from a fig wasp community surrounding the Sonoran Desert rock fig
(\emph{Ficus petiolaris}). These data were originally published in
Duthie, Abbott, and Nason (\protect\hyperlink{ref-Duthie2015b}{2015}),
and they are publicly available
\href{https://github.com/StirlingCodingClub/ordination/tree/master/data}{on
GitHub}.
\hypertarget{fig-wasp-morphological-data}{%
\section{Fig wasp morphological
data}\label{fig-wasp-morphological-data}}
The association between fig trees and their pollinating and seed-eating
wasps is a classic example of mutualism. The life-histories of figs and
pollinating fig wasps are well-known and fascinating (Janzen
\protect\hyperlink{ref-Janzen1979}{1979}; Weiblen
\protect\hyperlink{ref-Weiblen2002}{2002}), but less widely known are
the species rich communities of non-pollinating exploiter species of fig
wasps (Borges \protect\hyperlink{ref-Borges2015}{2015}). These
non-pollinating fig wasps make use of resources within figs in a variety
of ways; some lay eggs into developing fig ovules without pollinating,
while others are parasitoids of other wasps. All of these wasps develop
alongside the pollinators and seeds within the enclosed inflorescence of
figs, and emerge as adults typically after several weeks. Part of my
doctoral work focused on the community of fig wasps that lay eggs in the
figs of \emph{F. petiolaris} in Baja, Mexico (Duthie, Abbott, and Nason
\protect\hyperlink{ref-Duthie2015b}{2015}; Duthie and Nason
\protect\hyperlink{ref-Duthie2016}{2016}).
\includegraphics{images/fig_tree.jpg}
The community of non-pollinators associated with \emph{F. petiolaris}
includes five species of the genera \emph{Idarnes} (3) and
\emph{Heterandrium} (2). Unlike pollinators, which climb into a small
hole of the enclosed infloresence and pollinate and lay eggs from
inside, these non-pollinators drill their ovipositors directly into the
wall of the fig. The left panel below shows a fig sliced in half (which
often results in hundreds of fig wasps emerging from the inside). The
right panel shows a fig on a branch.
\includegraphics{images/fruitbranch.png}
The whole fig wasp community is shown below. The pollinator is in panel
`a', while, panels `b-h' show the non-pollinators (panels `g' and `h'
show a host-parasitoid pair).
\includegraphics[width=0.5\textwidth,height=\textheight]{images/wasps.jpg}
As part of a test for evidence of a dispersal-fecundity tradeoff in the
fig wasps in panels `b-f', I estimated the mean wing loading of each
species using body volume (body volume divided by wing surface area).
This included 11 morphological measurements in total.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{fig_wasps <-}\StringTok{ }\KeywordTok{read.csv}\NormalTok{(}\StringTok{"data/wing_loadings.csv"}\NormalTok{, }\DataTypeTok{header =} \OtherTok{TRUE}\NormalTok{);}
\NormalTok{fig_cols <-}\StringTok{ }\KeywordTok{colnames}\NormalTok{(fig_wasps);}
\KeywordTok{print}\NormalTok{(fig_cols);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "Species" "Site" "Tree"
## [4] "Fruit" "Head_Length_mm" "Head_Width_mm"
## [7] "Thorax_Length_mm" "Thorax_Width_mm" "Abdomen_Length_mm"
## [10] "Abdomen_Width_mm" "Ovipositor_Length_mm" "Ovipositor_Width_mm"
## [13] "Forewing_Area_mm.2" "Hindwing_Area_mm.2" "Forewing_Length_mm"
\end{verbatim}
Below, I will run a PCA in R to look at the total variation in
non-pollinating fig wasp morphology. To keep things simple, I will focus
just on the two species of \emph{Heterandrium} (panels `e, f' in the
image above).
\hypertarget{principal-component-analysis-in-r}{%
\section{Principal Component Analysis in
R}\label{principal-component-analysis-in-r}}
First, I will trim down the data set in \texttt{fig\_wasps} that I read
in above to remove all species that are not in the genus
\emph{Heterandrium}. There are two species of \emph{Heterandrium} that I
will put into a single data set.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{het1 <-}\StringTok{ }\NormalTok{fig_wasps[fig_wasps[,}\DecValTok{1}\NormalTok{] }\OperatorTok{==}\StringTok{ "Het1"}\NormalTok{,] }\CommentTok{# Species in panel 'f' above}
\NormalTok{het2 <-}\StringTok{ }\NormalTok{fig_wasps[fig_wasps[,}\DecValTok{1}\NormalTok{] }\OperatorTok{==}\StringTok{ "Het2"}\NormalTok{,] }\CommentTok{# Species in panel 'e' above}
\NormalTok{het <-}\StringTok{ }\KeywordTok{rbind}\NormalTok{(het1, het2);}
\end{Highlighting}
\end{Shaded}
This leaves us with a data frame with 32 rows and 15 columns. In total,
there are 16 measurements from samples of `Het1' and 16 measurements
from samples of `Het2'.
Because PCA requires matrix manipulations, we need the R object holding
the data to be a matrix. To do this, we can get rid of the first four
columns of data. These columns include the species names, and the site,
tree, and fruit from which the fig wasp was collected.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{dat <-}\StringTok{ }\NormalTok{het[,}\OperatorTok{-}\NormalTok{(}\DecValTok{1}\OperatorTok{:}\DecValTok{4}\NormalTok{)]; }\CommentTok{# Remove columns 1 through 4}
\end{Highlighting}
\end{Shaded}
We now need to define \texttt{dat} as a matrix.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{dat <-}\StringTok{ }\KeywordTok{as.matrix}\NormalTok{(dat);}
\end{Highlighting}
\end{Shaded}
This leaves us with the final version of the data set that we will use.
It includes 11 columns of morphological measurements.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{head}\NormalTok{(dat);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Head_Length_mm Head_Width_mm Thorax_Length_mm Thorax_Width_mm
## [1,] 0.566 0.698 0.767 0.494
## [2,] 0.505 0.607 0.784 0.527
## [3,] 0.511 0.622 0.769 0.511
## [4,] 0.479 0.601 0.766 0.407
## [5,] 0.545 0.707 0.828 0.561
## [6,] 0.525 0.651 0.852 0.590
## Abdomen_Length_mm Abdomen_Width_mm Ovipositor_Length_mm
## [1,] 1.288 0.504 0.376
## [2,] 1.059 0.430 0.274
## [3,] 1.107 0.504 0.359
## [4,] 1.242 0.446 0.331
## [5,] 1.367 0.553 0.395
## [6,] 1.408 0.618 0.350
## Ovipositor_Width_mm Forewing_Area_mm.2 Hindwing_Area_mm.2
## [1,] 0.098 1.339 0.294
## [2,] 0.072 0.890 0.195
## [3,] 0.062 0.975 0.214
## [4,] 0.065 0.955 0.209
## [5,] 0.081 1.361 0.361
## [6,] 0.085 1.150 0.308
## Forewing_Length_mm
## [1,] 2.122
## [2,] 1.765
## [3,] 1.875
## [4,] 1.794
## [5,] 2.130
## [6,] 1.981
\end{verbatim}
These 11 columns are our variables 1-11. Our fly measurements thereby
occupy some position in 11-dimensional space on 11 orthogonal axes,
which is way too complex to visualise all at once. We can first take a
look at all of the possible scatterplots using \texttt{pairs}.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{pairs}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ dat, }\DataTypeTok{gap =} \DecValTok{0}\NormalTok{, }\DataTypeTok{cex.labels =} \FloatTok{0.5}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-13-1.pdf}
There is not much that we can infer from this, except that most of the
variables appear to be highly correlated. A PCA is therefore likely to
be useful. Before building the PCA, it is probably a good idea to scale
all of our variables. This would be especially important if we had
variables measured in different units. If we do not scale the variables
to have the same mean and standard deviation, then the PCA could
potentially be affected by the units in which variables were measured,
which does not make sense. If, for example, we had measured thorax
length and width in \(mm^2\), but abdomen length and width in
\(cm^{2}\), then we would get thorax measurements contributing more to
PC1 just because the measured values would be larger (if you do not
believe me, try multiplying all values in one column of the data by 10
or 100, then re-run the PCA below). For data sets that include
measurements with much different unit scales (e.g., pH versus elevation
in metres above sea level), this is especially problematic. I will
therefore scale the data below.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{dat <-}\StringTok{ }\KeywordTok{scale}\NormalTok{(dat);}
\end{Highlighting}
\end{Shaded}
Now every variable has a mean of zero and a standard deviation of one.
We are finally ready to build our PCA, using only one line of code.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pca_dat <-}\StringTok{ }\KeywordTok{prcomp}\NormalTok{(dat);}
\end{Highlighting}
\end{Shaded}
That is it. Here is what the output looks like when printed out. Do not
worry about the details. I will come back to explain it later.
\begin{verbatim}
## Standard deviations (1, .., p=11):
## [1] 2.5247440 1.3801063 1.0554983 0.7327957 0.5851569 0.5002464 0.4027246
## [8] 0.3451656 0.3368968 0.2424040 0.1538419
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## Head_Length_mm -0.34140280 0.163818591 -0.28767452 0.12186387
## Head_Width_mm -0.37329512 0.009539983 -0.18750714 -0.02689984
## Thorax_Length_mm -0.31043223 0.212467054 -0.34643613 0.07575953
## Thorax_Width_mm -0.32914664 -0.026143927 -0.12608860 0.52180693
## Abdomen_Length_mm -0.35080667 0.215792603 -0.12967361 -0.02261804
## Abdomen_Width_mm -0.09942693 0.628750876 0.20976550 -0.20080986
## Ovipositor_Length_mm 0.18174394 0.587847182 -0.03746881 -0.21192676
## Ovipositor_Width_mm -0.26652653 -0.260592830 -0.19455961 -0.77915529
## Forewing_Area_mm.2 -0.34255769 -0.133512606 0.39772561 -0.03554531
## Hindwing_Area_mm.2 -0.25340037 0.132093495 0.65445796 0.05921015
## Forewing_Length_mm -0.34758221 -0.191325254 0.24411693 -0.09382015
## PC5 PC6 PC7 PC8
## Head_Length_mm -0.08112505 0.505176595 -0.01398414 0.364561108
## Head_Width_mm -0.17070577 0.120110305 0.03774836 0.474486398
## Thorax_Length_mm 0.54292261 -0.176608183 -0.42642232 -0.304654846
## Thorax_Width_mm -0.56028055 -0.113250947 0.14497291 -0.500898934
## Abdomen_Length_mm 0.22638163 -0.196597233 0.12804621 0.005214056
## Abdomen_Width_mm -0.25555292 -0.542657091 0.10605929 0.215945914
## Ovipositor_Length_mm 0.01296918 0.530782226 0.27984207 -0.397709709
## Ovipositor_Width_mm -0.32473973 -0.002741201 -0.17249250 -0.261382659
## Forewing_Area_mm.2 0.21024273 0.065140892 0.23416076 -0.001060866
## Hindwing_Area_mm.2 -0.10607059 0.258514312 -0.57477991 -0.078216349
## Forewing_Length_mm 0.27922395 0.020406736 0.52403579 -0.137734836
## PC9 PC10 PC11
## Head_Length_mm -0.225828822 0.54868461 -0.109825693
## Head_Width_mm 0.053630473 -0.71589314 0.197911013
## Thorax_Length_mm -0.328182916 -0.15625131 -0.002758449
## Thorax_Width_mm -0.022217714 -0.03937269 -0.045797240
## Abdomen_Length_mm 0.815797870 0.20281032 -0.007815979
## Abdomen_Width_mm -0.285403216 0.10473716 -0.019244867
## Ovipositor_Length_mm 0.051323247 -0.23210236 -0.014135669
## Ovipositor_Width_mm -0.003649773 0.09266248 -0.063094898
## Forewing_Area_mm.2 -0.094481435 -0.16839713 -0.751543357
## Hindwing_Area_mm.2 0.141382511 0.02013088 0.227054769
## Forewing_Length_mm -0.243686582 0.13097688 0.570684702
\end{verbatim}
If we want to plot the first two principle components, we can do so with
the following code. Note that \texttt{pca\_dat\$x} is a matrix of the
same size as our original data set, but with values that plot the data
on the PC axes.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{par}\NormalTok{(}\DataTypeTok{mar =} \KeywordTok{c}\NormalTok{(}\DecValTok{5}\NormalTok{, }\DecValTok{5}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DecValTok{1}\NormalTok{));}
\KeywordTok{plot}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{1}\NormalTok{], }\DataTypeTok{y =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{2}\NormalTok{], }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.25}\NormalTok{, }
\DataTypeTok{cex.axis =} \FloatTok{1.25}\NormalTok{, }\DataTypeTok{xlab =} \StringTok{"PC1"}\NormalTok{, }\DataTypeTok{ylab =} \StringTok{"PC2"}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-17-1.pdf}
Ignore the arguments that start with \texttt{cex}; these are just
plotting preferences. But the argument \texttt{asp\ =\ 1} is important;
it ensures that one unit along the x-axis is the same distance as one
unit along the y-axis. If this were not set, then the distance between
any two points might be misleading. If, for example, moving 100 pixels
on the x-axis corresponded to one unit on PC1, but moving 100 pixels on
the y-axis corresponded to two units in PC2, then the variation in PC1
would look larger relative to PC2 than it should.
Note that we are not at all restricted to comparing PC1 and PC2. We can
look at any of the 11 PCs that we want. Below compares PC1 with PC3.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{par}\NormalTok{(}\DataTypeTok{mar =} \KeywordTok{c}\NormalTok{(}\DecValTok{5}\NormalTok{, }\DecValTok{5}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DecValTok{1}\NormalTok{));}
\KeywordTok{plot}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{1}\NormalTok{], }\DataTypeTok{y =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{3}\NormalTok{], }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.25}\NormalTok{, }
\DataTypeTok{cex.axis =} \FloatTok{1.25}\NormalTok{, }\DataTypeTok{xlab =} \StringTok{"PC1"}\NormalTok{, }\DataTypeTok{ylab =} \StringTok{"PC3"}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-18-1.pdf}
Note that the points are in the same locations on the x-axis (PC1) as
before, but not the y-axis (now PC3). We have just rotated 90 degrees in
our 11-dimensional space and are therefore looking at our cloud of data
from a different angle.
Recall that we had two groups in these data; two species of the genus
\emph{Heterandrium}. Now that we have plotted the position of all wasp
measurements, we can also fill in the PCA with colours representing each
species. This allows us to visualise how individuals in different groups
are separated in our data.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{par}\NormalTok{(}\DataTypeTok{mar =} \KeywordTok{c}\NormalTok{(}\DecValTok{5}\NormalTok{, }\DecValTok{5}\NormalTok{, }\DecValTok{1}\NormalTok{, }\DecValTok{1}\NormalTok{));}
\KeywordTok{plot}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{1}\NormalTok{], }\DataTypeTok{y =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[,}\DecValTok{2}\NormalTok{], }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.25}\NormalTok{, }
\DataTypeTok{cex.axis =} \FloatTok{1.25}\NormalTok{, }\DataTypeTok{xlab =} \StringTok{"PC1"}\NormalTok{, }\DataTypeTok{ylab =} \StringTok{"PC2"}\NormalTok{);}
\NormalTok{h1_rw <-}\StringTok{ }\KeywordTok{which}\NormalTok{(het[,}\DecValTok{1}\NormalTok{] }\OperatorTok{==}\StringTok{ "Het1"}\NormalTok{); }\CommentTok{# Recall the original data set}
\NormalTok{h2_rw <-}\StringTok{ }\KeywordTok{which}\NormalTok{(het[,}\DecValTok{1}\NormalTok{] }\OperatorTok{==}\StringTok{ "Het2"}\NormalTok{); }\CommentTok{# Recall the original data set}
\KeywordTok{points}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[h1_rw,}\DecValTok{1}\NormalTok{], }\DataTypeTok{y =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[h1_rw,}\DecValTok{2}\NormalTok{], }\DataTypeTok{pch =} \DecValTok{20}\NormalTok{, }\DataTypeTok{col =} \StringTok{"red"}\NormalTok{);}
\KeywordTok{points}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[h2_rw,}\DecValTok{1}\NormalTok{], }\DataTypeTok{y =}\NormalTok{ pca_dat}\OperatorTok{$}\NormalTok{x[h2_rw,}\DecValTok{2}\NormalTok{], }\DataTypeTok{pch =} \DecValTok{20}\NormalTok{, }\DataTypeTok{col =} \StringTok{"blue"}\NormalTok{);}
\KeywordTok{legend}\NormalTok{(}\StringTok{"topleft"}\NormalTok{, }\DataTypeTok{fill =} \KeywordTok{c}\NormalTok{(}\StringTok{"red"}\NormalTok{, }\StringTok{"blue"}\NormalTok{), }\DataTypeTok{cex =} \FloatTok{1.5}\NormalTok{, }
\DataTypeTok{legend =} \KeywordTok{c}\NormalTok{( }\KeywordTok{expression}\NormalTok{(}\KeywordTok{paste}\NormalTok{(}\KeywordTok{italic}\NormalTok{(}\StringTok{"Heterandrium "}\NormalTok{), }\DecValTok{1}\NormalTok{)), }
\KeywordTok{expression}\NormalTok{(}\KeywordTok{paste}\NormalTok{(}\KeywordTok{italic}\NormalTok{(}\StringTok{"Heterandrium "}\NormalTok{), }\DecValTok{2}\NormalTok{))));}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-19-1.pdf}
From the groups overlaid onto the PCA, it is clear that these two
species of \emph{Heterandrium} differ in their morphological
measurements. Often groups are not so clearly distinguished (i.e., data
are usually more messy), and it is important to again remember that the
PCA is not actually testing any statistical hypothesis. It is just
plotting the data. If we wanted to test whether or not the difference
between species measurements was statistically significant, then we
would need to use some sort of appropriate multivariate test, such as a
\href{https://en.wikipedia.org/wiki/Multivariate_analysis_of_variance}{MANOVA}
(that is, a
\href{https://stirlingcodingclub.github.io/linear_modelling/}{linear
model} with multiple continuous dependent variables and a single
categorical independent variable), or an appropriate
\href{https://stirlingcodingclub.github.io/randomisation/randomisation_notes.html}{randomisation
approach}.
We can look at the amount of variation explained by each PC by examining
\texttt{pca\_dat\$sdev}, which reports the standard deviations of the
principal components.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pca_variance <-}\StringTok{ }\NormalTok{(pca_dat}\OperatorTok{$}\NormalTok{sdev) }\OperatorTok{*}\StringTok{ }\NormalTok{(pca_dat}\OperatorTok{$}\NormalTok{sdev); }\CommentTok{# Get variance}
\KeywordTok{print}\NormalTok{(pca_variance);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] 6.37433235 1.90469347 1.11407673 0.53698949 0.34240864 0.25024647
## [7] 0.16218711 0.11913928 0.11349942 0.05875972 0.02366733
\end{verbatim}
To make it more easy to interpret the variances along each PC, we can
use a \href{https://en.wikipedia.org/wiki/Scree_plot}{screeplot}. There
is a function for this in base R, which takes the direct output of the
\texttt{prcomp} function.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{screeplot}\NormalTok{(pca_dat, }\DataTypeTok{npcs =} \DecValTok{11}\NormalTok{, }\DataTypeTok{main =} \StringTok{""}\NormalTok{, }\DataTypeTok{type =} \StringTok{"lines"}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.5}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-21-1.pdf}
The screeplot above provides a useful indication of how much variation
explained decreases per PC (or how much variation is explained by the
first \(n\) axes). To make things even easier to visualise, it might
help to present the above in terms of per cent of the variance explained
on each axis.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pca_pr_explained <-}\StringTok{ }\NormalTok{pca_variance }\OperatorTok{/}\StringTok{ }\KeywordTok{sum}\NormalTok{(pca_variance);}
\KeywordTok{print}\NormalTok{(pca_pr_explained);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] 0.579484759 0.173153952 0.101279703 0.048817227 0.031128058 0.022749679
## [7] 0.014744282 0.010830844 0.010318129 0.005341792 0.002151575
\end{verbatim}
Note that all of the values above sum to 1. It appears that about 75.26
per cent of the total variation is explained by PC1 and PC2, so we are
looking at a lot of variation in the PCA showed above. We can plot the
proportion of the variation explained by each PC in a bar plot.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{pc_names <-}\StringTok{ }\KeywordTok{paste}\NormalTok{(}\StringTok{"PC"}\NormalTok{, }\DecValTok{1}\OperatorTok{:}\KeywordTok{length}\NormalTok{(pca_pr_explained), }\DataTypeTok{sep =} \StringTok{""}\NormalTok{);}
\KeywordTok{barplot}\NormalTok{(}\DataTypeTok{height =}\NormalTok{ pca_pr_explained }\OperatorTok{*}\StringTok{ }\DecValTok{100}\NormalTok{, }\DataTypeTok{names =}\NormalTok{ pc_names, }\DataTypeTok{cex.names =} \FloatTok{0.8}\NormalTok{,}
\DataTypeTok{ylab =} \StringTok{"Per cent of total variation explained"}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.25}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-23-1.pdf}
Notice that the general pattern of the barplot above is the same as the
screeplot.
We can also look at a biplot (i.e., a loading plot). The biplot below
shows the first two principal components (data points are now numbers 1
to 32, indicating the sample or row number of the fig wasp from the
original scaled data set, \texttt{dat}), but also the direction of each
of our original variables (morphological measurements) in relation to
these principal components.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{biplot}\NormalTok{(pca_dat, }\DataTypeTok{cex =} \FloatTok{0.8}\NormalTok{, }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-24-1.pdf}
The direction of the red arrows shows the relationships among the 11
different variables. Intuitively, variables with arrows pointing in
similar directions are positively correlated, while those pointing in
opposite directions are negatively correlated (more technically, the
cosine of the angle between arrows, in radians, actually equals the
correlation between them; e.g., cos(0) = 1 and cos(\(\pi\)) = -1; recall
that \(\pi\) radians equals 180 degrees). In the above example, wasp
head width and thorax width are highly correlated because the two arrows
are pointing in very similar directions. In contrast, ovipositor length
and ovipositor width appear to be pointing in very different directions
on PC1 and PC2, suggesting that the two variables are negatively
correlated.
The direction of these arrows is determined by the loading of each
variable on PC1 and PC2. The loading is the value that you would need to
multiply each variable by to get the score of a data point on the
principal component. The loadings can be found in the \texttt{pca\_dat}
results as \texttt{pca\_dat\$rotation}. I will just show the loadings
for the first two principle components below.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{print}\NormalTok{(pca_dat}\OperatorTok{$}\NormalTok{rotation[,}\DecValTok{1}\OperatorTok{:}\DecValTok{2}\NormalTok{]); }\CommentTok{# Note: 11 total columns; one for each PC}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## PC1 PC2
## Head_Length_mm -0.34140280 0.163818591
## Head_Width_mm -0.37329512 0.009539983
## Thorax_Length_mm -0.31043223 0.212467054
## Thorax_Width_mm -0.32914664 -0.026143927
## Abdomen_Length_mm -0.35080667 0.215792603
## Abdomen_Width_mm -0.09942693 0.628750876
## Ovipositor_Length_mm 0.18174394 0.587847182
## Ovipositor_Width_mm -0.26652653 -0.260592830
## Forewing_Area_mm.2 -0.34255769 -0.133512606
## Hindwing_Area_mm.2 -0.25340037 0.132093495
## Forewing_Length_mm -0.34758221 -0.191325254
\end{verbatim}
We can see how the loadings of each variable match up with each arrows
by going back to the output of \texttt{pca\_dat}. To get the direction
of the arrows, we need to multiply the loadings in
\texttt{pca\_dat\$rotation} above by the standard deviation of each
principle component in \texttt{pca\_dat\$sdev}. I do this below to show
how we can manually reproduce the position of the red arrows from the
\texttt{pca\_dat} output.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{biplot}\NormalTok{(pca_dat, }\DataTypeTok{cex =} \FloatTok{0.8}\NormalTok{, }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{);}
\CommentTok{# Note that the arrows are scaled by a factor of ca 4.5, hence the scaling below}
\KeywordTok{points}\NormalTok{(}\DataTypeTok{x =} \FloatTok{4.5} \OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{sdev[}\DecValTok{1}\NormalTok{] }\OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{rotation[,}\DecValTok{1}\NormalTok{],}
\DataTypeTok{y =} \FloatTok{4.5} \OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{sdev[}\DecValTok{2}\NormalTok{] }\OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{rotation[,}\DecValTok{2}\NormalTok{],}
\DataTypeTok{col =} \StringTok{"blue"}\NormalTok{, }\DataTypeTok{pch =} \DecValTok{20}\NormalTok{, }\DataTypeTok{cex =} \FloatTok{1.5}\NormalTok{);}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-26-1.pdf}
I hope that this clarifies how to produce and interpret a PCA, a
screeplot, and a biplot in R. I also hope that the output that I showed
earlier generated generated from \texttt{prcomp(dat)} is now clear. For
one last illustration to further connect this output with the PCA, I
will reproduce a PCA just from the original data points in \texttt{dat}
and the loadings for PC1 and PC2 found in the output of
\texttt{prcomp(dat)}, \texttt{pca\_dat}. For each fly, we take the sum
of each measurement times its corresponding loading on PC1 and PC2.
Hence, the position of the first wasp in \texttt{dat} (first row) on PC1
would be found by multipling \texttt{dat{[}1,{]}} times
\texttt{pca\_dat\$rotation{[},1{]}}. Here is a reminder of what
\texttt{dat{[}1,{]}} looks like.
\begin{verbatim}
## Head_Length_mm Head_Width_mm Thorax_Length_mm
## 1.3692840 1.7033618 0.3158512
## Thorax_Width_mm Abdomen_Length_mm Abdomen_Width_mm
## 0.2036821 0.8331580 -0.4356985
## Ovipositor_Length_mm Ovipositor_Width_mm Forewing_Area_mm.2
## -0.7493047 2.4779288 1.5324901
## Hindwing_Area_mm.2 Forewing_Length_mm
## 0.1659505 1.6125831
\end{verbatim}
Here are the loadings for PC1 (\texttt{pca\_dat\$rotation{[},1{]}}).
\begin{verbatim}
## Head_Length_mm Head_Width_mm Thorax_Length_mm
## -0.34140280 -0.37329512 -0.31043223
## Thorax_Width_mm Abdomen_Length_mm Abdomen_Width_mm
## -0.32914664 -0.35080667 -0.09942693
## Ovipositor_Length_mm Ovipositor_Width_mm Forewing_Area_mm.2
## 0.18174394 -0.26652653 -0.34255769
## Hindwing_Area_mm.2 Forewing_Length_mm
## -0.25340037 -0.34758221
\end{verbatim}
We calculate the sum of each element multiplied together: (1.369284
\(\times\) -0.3414028) + (1.7033618 \(\times\) -0.3732951) + \(...\) +
(1.6125831 \(\times\) -0.3475822). This gives a value of -3.4415217,
which is the location of the first data point on PC1. When we do this
calculation for PC1 and PC2 for all of the data, we can get the
coordinates to plot on our PCA. I do this with the code below. Note that
the \texttt{for}
\href{https://stirlingcodingclub.github.io/using_loops/loop_notes.html}{loop}
is just cycling through each individual wasp sample (do not worry about
this if the code is unfamiliar).
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{plot}\NormalTok{(}\DataTypeTok{x =} \DecValTok{0}\NormalTok{, }\DataTypeTok{y =} \DecValTok{0}\NormalTok{, }\DataTypeTok{type =} \StringTok{"n"}\NormalTok{, }\DataTypeTok{xlim =} \KeywordTok{c}\NormalTok{(}\OperatorTok{-}\FloatTok{6.2}\NormalTok{, }\FloatTok{4.2}\NormalTok{), }\DataTypeTok{ylim =} \KeywordTok{c}\NormalTok{(}\OperatorTok{-}\DecValTok{4}\NormalTok{, }\DecValTok{3}\NormalTok{), }\DataTypeTok{asp =} \DecValTok{1}\NormalTok{,}
\DataTypeTok{xlab =} \StringTok{"PC1"}\NormalTok{, }\DataTypeTok{ylab =} \StringTok{"PC2"}\NormalTok{, }\DataTypeTok{cex.axis =} \FloatTok{1.25}\NormalTok{, }\DataTypeTok{cex.lab =} \FloatTok{1.25}\NormalTok{);}
\ControlFlowTok{for}\NormalTok{(i }\ControlFlowTok{in} \DecValTok{1}\OperatorTok{:}\KeywordTok{dim}\NormalTok{(dat)[}\DecValTok{1}\NormalTok{])\{ }\CommentTok{# Take the sum of the measurements times PC loadings}
\NormalTok{ pt_PC1 <-}\StringTok{ }\KeywordTok{sum}\NormalTok{(dat[i,] }\OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{rotation[,}\DecValTok{1}\NormalTok{]); }\CommentTok{# measurements times loadings 1}
\NormalTok{ pt_PC2 <-}\StringTok{ }\KeywordTok{sum}\NormalTok{(dat[i,] }\OperatorTok{*}\StringTok{ }\NormalTok{pca_dat}\OperatorTok{$}\NormalTok{rotation[,}\DecValTok{2}\NormalTok{]); }\CommentTok{# measurements times loadings 2}
\KeywordTok{points}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pt_PC1, }\DataTypeTok{y =}\NormalTok{ pt_PC2, }\DataTypeTok{cex =} \DecValTok{4}\NormalTok{, }\DataTypeTok{col =} \StringTok{"black"}\NormalTok{, }\DataTypeTok{pch =} \DecValTok{20}\NormalTok{);}
\KeywordTok{text}\NormalTok{(}\DataTypeTok{x =}\NormalTok{ pt_PC1, }\DataTypeTok{y =}\NormalTok{ pt_PC2, }\DataTypeTok{col =} \StringTok{"white"}\NormalTok{, }\DataTypeTok{labels =}\NormalTok{ i, }\DataTypeTok{cex =} \FloatTok{0.8}\NormalTok{);}
\NormalTok{\}}
\end{Highlighting}
\end{Shaded}
\includegraphics{index_files/figure-latex/unnamed-chunk-29-1.pdf}
Those who are satisfied with their conceptual understanding of PCA, and
the ability to use PCA in R, can stop here. In the next section, I will
explain how PCA works mathematically.
\hypertarget{principal-component-analysis-matrix-algebra}{%
\section{Principal Component Analysis: matrix
algebra}\label{principal-component-analysis-matrix-algebra}}
A Principal Component Analysis is basically just an
\href{https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors}{eigenanalysis}
of a \href{https://en.wikipedia.org/wiki/Covariance_matrix}{covariance
matrix}. I will present the maths underlying this and show the
calculations in some detail, but a full explanation of matrix algebra,
and
\href{https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors}{eigenvalues
and eigenvectors} is beyond the scope of these notes. Following my
approach in the \protect\hyperlink{pca}{key concepts} section above, I
am going to try to present the key calculations and some of the key
mathematical ideas using two dimensions to make everything easier to
calculate and visualise. First, a reminder of what our simulated data
look like from the first section.
\begin{verbatim}
## Variable_1 Variable_2 Variable_3 Variable_4
## Sample_1 3.8432410 2.64075944 -0.6976496 -3.8173266
## Sample_2 5.0907125 6.06835438 1.8097752 0.8234129
## Sample_3 -3.2575704 -3.69653611 4.4399886 0.3118165
## Sample_4 -1.7029267 -0.74468125 4.9935894 -0.3506784
## Sample_5 -1.4924450 -1.92297999 3.4279558 7.8473583
## Sample_6 1.9690167 -0.01934525 -0.2280392 -0.7550955
## Sample_7 4.4895141 4.33368612 2.2381424 4.7782675
## Sample_8 1.1811029 -0.19392334 -2.4153756 4.1086135
## Sample_9 2.3024895 0.94785329 4.4690578 -0.3843029
## Sample_10 7.0521002 7.02973687 4.4969840 7.3068450
## Sample_11 -0.8589422 -1.60834723 5.2195623 3.8069426
## Sample_12 5.0304025 7.02247311 -2.4648239 -1.7609266
\end{verbatim}
Again, I am just going to use the first two columns of data, so I will
create a truncated data set \texttt{eg\_m} using just two of the columns
from above.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{eg_m <-}\StringTok{ }\NormalTok{eg_mat[,}\DecValTok{1}\OperatorTok{:}\DecValTok{2}\NormalTok{]; }\CommentTok{# Now we have an R object with just two columns}
\end{Highlighting}
\end{Shaded}
We can start by just running a PCA as we did in the
\protect\hyperlink{Rcode}{section above} with \texttt{pr\_comp}.
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{prcomp}\NormalTok{(eg_m);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Standard deviations (1, .., p=2):
## [1] 4.8614634 0.7539536
##
## Rotation (n x k) = (2 x 2):
## PC1 PC2
## Variable_1 -0.6586816 0.7524218
## Variable_2 -0.7524218 -0.6586816
\end{verbatim}
Here is our starting point, with standard deviations and loadings for
each PC that we know are correct. Now we can attempt to do this whole
analysis manually, without the \texttt{prcomp} function. To start, we
need to get the covariance matrix of the data. A covariance matrix is
just a square matrix in which off-diagonal elements hold the covariance
between variable \(X_{i}\) and \(X_{j}\) in row \(i\) and column \(j\).
Diagonal elements (upper left to lower right) hold the variance of
variable \(X_{i}\) (i.e., the covariance between \(X_{i}\) and itself,
where \(i\) is both the row and column),
\[
V = \begin{pmatrix}
Var(X_{1}), & Cov(X_{1}, X_{2}) \\
Cov(X_{2}, X_{1}), & Var(X_{2}) \\
\end{pmatrix}.
\]
\href{https://en.wikipedia.org/wiki/Covariance}{Covariance} and
\href{https://en.wikipedia.org/wiki/Variance}{variance} are calculated
as usual. In R, we can easily get the covariance matrix using the
\texttt{cov} function.
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{V <-}\StringTok{ }\KeywordTok{cov}\NormalTok{(eg_m); }\CommentTok{# Variance covariance matrix of two simulated measurements}
\KeywordTok{print}\NormalTok{(V);}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Variable_1 Variable_2