-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathworkflow_expressions.tex
5331 lines (4587 loc) · 297 KB
/
workflow_expressions.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[9pt,a4paper,]{extarticle}
\usepackage{f1000_styles}
\usepackage[pdfborder={0 0 0}]{hyperref}
\usepackage[numbers]{natbib}
\bibliographystyle{unsrtnat}
%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
\ifdim\Gin@nat@width>\linewidth
\linewidth
\else
\Gin@nat@width
\fi
}
\makeatother
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
% disable code chunks background
%\renewenvironment{Shaded}{}{}
% disable section numbers
\setcounter{secnumdepth}{0}
%% added by MLS, this is not in the F1000 style by default %%
\hypersetup{unicode=true,
pdftitle={A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data},
pdfkeywords={Bioconductor, QFeatures, proteomics, shotgun proteomics, bottom-up proteomics,
differential expression, mass spectrometry, quality control, data processing,
limma},
pdfborder={0 0 0},
breaklinks=true}
%% End added by MLS %%
\setlength{\parindent}{0pt}
\setlength{\parskip}{6pt plus 2pt minus 1pt}
\begin{document}
\pagestyle{front}
\title{A Bioconductor workflow for processing, evaluating, and interpreting expression proteomics data}
\author[1]{Charlotte Hutchings}
\author[1]{Charlotte S. Dawson}
\author[2]{Thomas Krueger}
\author[1]{Kathryn S. Lilley}
\author[1]{Lisa M. Breckels}
\affil[1]{Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, UK}
\affil[2]{Department of Biochemistry, University of Cambridge, UK}
\maketitle
\thispagestyle{front}
\begin{abstract}
\hfill\break
\textbf{Background:} Expression proteomics involves the global evaluation of protein
abundances within a system. In turn, differential expression analysis can be
used to investigate changes in protein abundance upon perturbation to such a
system.
\textbf{Methods:} Here, we provide a workflow for the processing, analysis and
interpretation of quantitative mass spectrometry-based expression proteomics
data. This workflow utilises open-source R software packages from the
Bioconductor project and guides users end-to-end and step-by-step through every
stage of the analyses. As a use-case we generated expression proteomics data
from HEK293 cells with and without a treatment. Of note, the experiment included
cellular proteins labelled using tandem mass tag (TMT) technology and secreted
proteins quantified using label-free quantitation (LFQ).
\textbf{Results:} The workflow explains the software infrastructure before focusing
on data import, pre-processing and quality control. This is done individually
for TMT and LFQ datasets. The application of statistical differential
expression analysis is demonstrated, followed by interpretation via gene
ontology enrichment analysis.
\textbf{Conclusions:} A comprehensive workflow for the processing, analysis and
interpretation of expression proteomics is presented. The workflow is a
valuable resource for the proteomics community and specifically beginners who
are at least familiar with R who wish to understand and make data-driven
decisions with regards to their analyses.
\end{abstract}
\section*{Keywords}
Bioconductor, QFeatures, proteomics, shotgun proteomics, bottom-up proteomics,
differential expression, mass spectrometry, quality control, data processing,
limma
\clearpage
\pagestyle{main}
\textbf{R version}: R version 4.4.0 (2024-04-24)
\textbf{Bioconductor version}: 3.19
\section{Introduction}\label{introduction}
Proteins are responsible for carrying out a multitude of biological tasks,
implementing cellular functionality and determining phenotype. Mass spectrometry
(MS)-based expression proteomics allows protein abundance to be quantified and
compared between samples. In turn, differential protein abundance can be used to
explore how biological systems respond to a perturbation. Many research groups
have applied such methodologies to understand mechanisms of disease, elucidate
cellular responses to external stimuli, and discover diagnostic biomarkers (see
\citep{PinaJimnez2021, AmiriDashatan2021, Anitua2018} for recent examples). As the
potential of proteomics continues to be realised, there is a clear need for
resources demonstrating how to deal with expression proteomics data in a robust
and standardised manner.
The data generated during an expression proteomics experiment are complex, and
unfortunately there is no one-size-fits-all method for the processing and analysis
of such data. The reason for this is two-fold. Firstly, there are a wide range
of experimental methods that can be used to generate expression proteomics data.
Researchers can analyze full-length proteins (top-down proteomics) or complete
an enzymatic digestion and analyze the resulting peptides. This proteolytic
digestion can be either partial (middle-down proteomics) or complete (bottom-up
proteomics). The latter approach is most commonly used as peptides have a more
favourable ionization capacity, predictable fragmentation patterns, and can be
separated via reversed phase liquid chromatography, ultimately making them more
compatible with MS \citep{Dupree2020}. Within bottom-up proteomics, the relative
quantitation of peptides can be determined using one of two approaches: (1)
label-free or (2) label-based quantitation. Moreover, the latter can be
implemented with a number of different peptide labelling chemistries, for example,
using tandem mass tag (TMT), stable-isotope labelling by amino acids in cell
culture (SILAC), isobaric tags for relative and absolute quantitation (iTRAQ),
among others \citep{Obermaier2015}. MS analysis can also be used in either data-dependent
or data-independent acquisition (DDA or DIA) mode \citep{FernndezCosta2020, Hu2016}.
Although all of these experimental methods typically result in a similar output,
a matrix of quantitative values, the data are different and must be treated as such.
Secondly, data processing is dependent upon the experimental goal and biological
question being asked.
Here, we provide a step-by-step workflow for processing, analysing and
interpreting expression proteomics data derived from a bottom-up experiment
using DDA and either LFQ or TMT label-based peptide quantitation. We outline how
to process the data starting from a peptide spectrum match (PSM)- or peptide-level
\texttt{.txt} file. Such files are the outputs of most major third-party search
software (e.g., Proteome Discoverer, MaxQuant, FragPipe). As an example, we use
output files generated by Proteome Discoverer. Nonetheless, the key processing
steps remain the same for files generated using alternative third party software
and this workflow provides a framework for and discussion of such common steps.
We begin with data import and then guide users through the stages of data processing
including data cleaning, quality control filtering, management of missing values,
imputation, and aggregation to protein-level. Finally, we finish with how to discover
differentially abundant proteins and carry out biological interpretation of the
resulting data. The latter will be achieved through the application of gene
ontology (GO) enrichment analysis. Hence, users can expect to generate lists of
proteins that are significantly up- or downregulated in their system of interest,
as well as the GO terms that are significantly over-represented in these proteins.
Using the R statistical programming environment \citep{Rstat} we make use of several
state-of-the-art packages from the open-source, open-development Bioconductor
project \citep{Huber2015} to analyze use-case expression proteomics datasets
\citep{HutchingsData} from both LFQ and label-based technologies.
\subsection{Package installation}\label{package-installation}
In this workflow we make use of open-source software from the
\href{https://www.r-project.org}{R} \href{https://bioconductor.org}{Bioconductor}
\citep{Huber2015} project. The \href{https://bioconductor.org}{Bioconductor initiative}
provides \href{https://www.r-project.org}{R} software packages dedicated
to the processing of high-throughput complex biological data. Packages are
open-source, well-documented and benefit from an active community of developers.
We recommend users to download the
\href{https://posit.co/download/rstudio-desktop/}{RStudio} integrated development
environment (IDE) which provides a graphical interface to
\href{https://www.r-project.org}{R} programming language.
Detailed instructions for the installation of Bioconductor packages are
documented on the \href{http://bioconductor.org/install/}{Bioconductor Installation page}.
The main packages required for this workflow are installed using the code below.
\begin{Shaded}
\begin{Highlighting}[]
\ControlFlowTok{if}\NormalTok{ (}\SpecialCharTok{!}\FunctionTok{require}\NormalTok{(}\StringTok{"BiocManager"}\NormalTok{, }\AttributeTok{quietly =} \ConstantTok{TRUE}\NormalTok{)) \{}
\FunctionTok{install.packages}\NormalTok{(}\StringTok{"BiocManager"}\NormalTok{)}
\NormalTok{\}}
\NormalTok{BiocManager}\SpecialCharTok{::}\FunctionTok{install}\NormalTok{(}\FunctionTok{c}\NormalTok{(}\StringTok{"QFeatures"}\NormalTok{,}
\StringTok{"ggplot2"}\NormalTok{,}
\StringTok{"stringr"}\NormalTok{,}
\StringTok{"corrplot"}\NormalTok{,}
\StringTok{"Biostrings"}\NormalTok{,}
\StringTok{"limma"}\NormalTok{,}
\StringTok{"impute"}\NormalTok{,}
\StringTok{"dplyr"}\NormalTok{,}
\StringTok{"tibble"}\NormalTok{,}
\StringTok{"org.Hs.eg.db"}\NormalTok{,}
\StringTok{"clusterProfiler"}\NormalTok{,}
\StringTok{"enrichplot"}\NormalTok{))}
\end{Highlighting}
\end{Shaded}
After installation, each package must be loaded before it can be used in the R
session. This is achieved via the \texttt{library} function. For example, to load the
\texttt{QFeatures} package one would type \texttt{library("QFeatures")} after installation.
Here we load all packages included in this workflow.
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(}\StringTok{"QFeatures"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Warning: package 'QFeatures' was built under R version 4.4.1
\end{verbatim}
\begin{verbatim}
## Warning: package 'MultiAssayExperiment' was built under R version 4.4.1
\end{verbatim}
\begin{verbatim}
## Warning: package 'matrixStats' was built under R version 4.4.1
\end{verbatim}
\begin{verbatim}
## Warning: package 'S4Vectors' was built under R version 4.4.1
\end{verbatim}
\begin{verbatim}
## Warning: package 'IRanges' was built under R version 4.4.1
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(}\StringTok{"ggplot2"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"stringr"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"dplyr"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"tibble"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"corrplot"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"Biostrings"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"limma"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Warning: package 'limma' was built under R version 4.4.1
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(}\StringTok{"org.Hs.eg.db"}\NormalTok{)}
\FunctionTok{library}\NormalTok{(}\StringTok{"clusterProfiler"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Warning: package 'clusterProfiler' was built under R version 4.4.1
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\FunctionTok{library}\NormalTok{(}\StringTok{"enrichplot"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Warning: package 'enrichplot' was built under R version 4.4.1
\end{verbatim}
\section{Methods}\label{methods}
\subsection{The use-case: exploring changes in protein abundance in HEK293 cells upon perturbation}\label{the-use-case-exploring-changes-in-protein-abundance-in-hek293-cells-upon-perturbation}
As a use-case, we analyze two quantitative proteomics datasets derived from a
single experiment. The aim of the experiment was to reveal the differential
abundance of proteins in HEK293 cells upon a particular treatment, the exact
details of which are anonymised for the purpose of this workflow. An outline of
the experimental method is provided in Figure \ref{fig:experimental-method}.
Briefly, HEK293 cells were either (i) left untreated, or (ii) provided with the
treatment of interest. These two conditions are referred to as `control' and
`treated', respectively. Each condition was evaluated in triplicate. At 96-
hours post-treatment, samples were collected and separated into cell pellet and
supernatant fractions containing cellular and secreted proteins, respectively.
Both fractions were denatured, alkylated and digested to peptides using trypsin.
The supernatant fractions were de-salted and analysed over a two-hour gradient
in an Orbitrap Fusion\textsuperscript{TM} Lumos\textsuperscript{TM} Tribrid\textsuperscript{TM} mass spectrometer coupled to an
UltiMate\textsuperscript{TM} 3000 HPLC system (Thermo Fisher Scientific). LFQ was achieved at the
MS1 level based on signal intensities. Cell pellet fractions were labelled using
TMT technology before being pooled and subjected to high pH reversed-phase peptide
fractionation giving a total of 8 fractions. As before, each fraction was
analysed over a two-hour gradient in an Orbitrap Fusion\textsuperscript{TM} Lumos\textsuperscript{TM} Tribrid\textsuperscript{TM}
mass spectrometer coupled to an UltiMate\textsuperscript{TM} 3000 HPLC system (Thermo Fisher
Scientific). To improve the accuracy of the quantitation of TMT-labelled peptides,
synchronous precursor selection (SPS)-MS3 data acquisition was employed
\citep{McAlister2014, Ting2011}. Of note, TMT labelling of cellular proteins was
achieved using a single TMT6plex. Hence, this workflow will not include guidance
on multi-batch TMT effects or the use of internal reference scaling. For more
information about the use of multiple TMTplexes users are directed to \citep{Plubell2017, Brenes2019}.
\begin{figure}
{\centering \includegraphics[width=1\linewidth]{Images/experimental-method}
}
\caption{A schematic summary of the experimental protocol used to generate the use-case data.}\label{fig:experimental-method}
\end{figure}
The cell pellet and supernatant datasets were handled independently and we take
advantage of this to discuss the processing of TMT-labelled and LFQ proteomics
data. In both cases, the raw MS data were processed using Proteome Discoverer
v2.5 (Thermo Fisher Scientific). As a result, this workflow uses the specific
columns and files output by Proteome Discoverer. Nevertheless, most third-party
software output similar data structures and this workflow can be adapted. As an
example, information on how to adapt this workflow for data processed by MaxQuant
is provided in the \href{https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics}{appendix}.
While the focus in the workflow presented below is differential protein expression
analysis, the data processing and quality control steps described here are
applicable to any TMT or LFQ proteomics dataset. Importantly, however, the
experimental aim will influence data-guided decisions and the considerations
discussed here likely differ from those of spatial proteomics, for example.
\subsection{Downloading the data}\label{downloading-the-data}
The files required for this workflow can be found deposited to the ProteomeXchange
Consortium via the PRIDE \citep{PerezRiverol2021, Deutsch2022} partner repository
with the dataset identifier PXD041794, Zenodo at \url{http://doi.org/10.5281/zenodo.11196770}
and at the Github repository \url{https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics/}.
Users are advised to download these files into their current working directory.
In R the \texttt{setwd} function can be used to specify a working directory, or if
using RStudio one can use the Session -\textgreater{} Set Working Directory menu.
\subsection{\texorpdfstring{The infrastructure: \texttt{QFeatures} and \texttt{SummarizedExperiment}s}{The infrastructure: QFeatures and SummarizedExperiments}}\label{the-infrastructure-qfeatures-and-summarizedexperiments}
To be able to conveniently track each step of this workflow, users should make
use of the Quantitative features for mass spectrometry package, or
\href{https://www.bioconductor.org/packages/release/bioc/html/QFeatures.html}{\texttt{QFeatures}, Bioconductor package} \citep{QFeat}. Prior to utilizing the \texttt{QFeatures} infrastructure, it is first necessary to
understand the structure of a \href{https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html}{\texttt{SummarizedExperiment}} \citep{SumExp} object as \texttt{QFeatures} objects are based on the \texttt{SummarizedExperiment}
class. A \texttt{SummarizedExperiment}, often referred to as an SE, is a data container
and S4 object comprised of three components: (1) the \texttt{colData} (column data)
containing sample metadata, (2) the \texttt{rowData} containing data features, and (3)
the \texttt{assay} storing quantitation data, as illustrated in Figure
\ref{fig:summarized-experiment}. The sample metadata includes annotations such
as condition and replicate, and can be accessed using the \texttt{colData} function.
Data features, accessed via the \texttt{rowData} function, represent information
derived from the identification search. Examples include peptide sequence,
master protein accession, and confidence scores. Finally, quantitative data is
stored in the \texttt{assay} slot. These three independent data structures are neatly
stored within a single \texttt{SummarizedExperiment} object.
\begin{figure}
{\centering \includegraphics[width=1\linewidth]{Images/summarized-experiment}
}
\caption{A graphic representation of the SummarizedExperiment (SE) object structure. Figure reproduced from the SummarizedExperiment package \citep{SumExp} vignette with permission.}\label{fig:summarized-experiment}
\end{figure}
A \texttt{QFeatures} object holds each level of quantitative proteomics data, namely
(but not limited to) the PSM, peptide and protein-level data. Each level of the
data is stored as its own \texttt{SummarizedExperiment} within a single \texttt{QFeatures}
object. The lowest level data e.g.~PSM, is first imported into a \texttt{QFeatures}
object before aggregating upward towards protein-level (Figure
\ref{fig:qfeatures}). During this process of aggregation, \texttt{QFeatures} maintains
the hierarchical links between quantitative levels whilst allowing easy access
to all data levels for individual proteins of interest. This key aspect of
\texttt{QFeatures} will be exemplified throughout this workflow. Additional guidance on
the use of \texttt{QFeatures} can be found in \citep{QFeat}. For visualisation of the data,
all plots are generated using standard \texttt{ggplot} functionality, but could equally
be produced using base R.
\begin{figure}
{\centering \includegraphics[width=1\linewidth]{Images/qfeatures}
}
\caption{A graphic representation of the \texttt{QFeatures} object structure showing the relationship between assays. Figure modified from the \texttt{QFeatures} \citep{QFeat} vignette with permission.}\label{fig:qfeatures}
\end{figure}
\section{Processing and analysing quantitative TMT data}\label{processing-and-analysing-quantitative-tmt-data}
First, we provide a workflow for the processing and quality control of
quantitative TMT-labelled data. As outlined above, the cell pellet fractions of
triplicate control and treated HEK293 cells were labelled using a TMT6plex.
Labelling was as outlined in Table \ref{tab:table1}.
\begin{table}
\caption{\label{tab:table1}TMT labelling strategy in the use-case experiment}
\centering
\begin{tabular}[t]{l|l|r|l}
\hline
Sample Name & Condition & Replicate & Tag\\
\hline
S1 & Treated & 1 & TMT128\\
\hline
S2 & Treated & 2 & TMT127\\
\hline
S3 & Treated & 3 & TMT131\\
\hline
S4 & Control & 1 & TMT129\\
\hline
S5 & Control & 2 & TMT126\\
\hline
S6 & Control & 3 & TMT130\\
\hline
\end{tabular}
\end{table}
\subsection{Identification search of raw data}\label{identification-search-of-raw-data}
The first processing step in any MS-based proteomics experiment involves an
identification search using the raw data. The aim of this search is to identify
which peptide sequences, and therefore proteins, correspond to the raw spectra
output from the mass spectrometer. Several third-party software exist to
facilitate identification searches of raw MS data but ultimately the output of
any search is a list of PSMs, peptides and protein identifications along with
their corresponding quantification data.
The use-case data presented here was processed using Proteome Discoverer 2.5 and
additional information about this search is provided in the appendix. Further,
we provide template workflows for both the processing and consensus steps of the
Proteome Discoverer identification run in the supplementary information. We also
provide details on how to adapt this workflow for data processed with MaxQuant in
the \href{https://github.com/CambridgeCentreForProteomics/f1000_expression_proteomics}{appendix}.
It is also possible to determine several of the key parameter settings during the
preliminary data exploration. This step will be particularly important for those
using publicly available data without detailed knowledge of the identification
search parameters. For now, we simply export the PSM-level \texttt{.txt} file from the
Proteome Discoverer output.
\subsection{Data import, housekeeping and exploration}\label{data-import-housekeeping-and-exploration}
\subsubsection{Importing data into R and creating a QFeatures object}\label{importing-data-into-r-and-creating-a-qfeatures-object}
Data cleaning, exploration and filtering at the PSM-level is performed in R
using \texttt{QFeatures}. First, data is imported from our PSM-level \texttt{.txt} file into a
\texttt{data.frame} using the base R \texttt{read.delim} function (the equivalent for \texttt{.csv}
files would be \texttt{read.csv}). This file should be stored within the user's working
directory. Next, we use the \texttt{names} function to inspect the column names and
identify which columns contain quantitative data.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Locate the PSM .txt file}
\NormalTok{cp\_psm }\OtherTok{\textless{}{-}} \StringTok{"cell\_pellet\_tmt\_results\_psms.txt"}
\DocumentationTok{\#\# Import into a dataframe}
\NormalTok{cp\_df }\OtherTok{\textless{}{-}} \FunctionTok{read.delim}\NormalTok{(cp\_psm)}
\DocumentationTok{\#\# Identify columns containing quantitative data}
\NormalTok{cp\_df }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{names}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "PSMs.Workflow.ID" "PSMs.Peptide.ID"
## [3] "Checked" "Tags"
## [5] "Confidence" "Identifying.Node.Type"
## [7] "Identifying.Node" "Search.ID"
## [9] "Identifying.Node.No" "PSM.Ambiguity"
## [11] "Sequence" "Annotated.Sequence"
## [13] "Modifications" "Number.of.Proteins"
## [15] "Master.Protein.Accessions" "Master.Protein.Descriptions"
## [17] "Protein.Accessions" "Protein.Descriptions"
## [19] "Number.of.Missed.Cleavages" "Charge"
## [21] "Original.Precursor.Charge" "Delta.Score"
## [23] "Delta.Cn" "Rank"
## [25] "Search.Engine.Rank" "Concatenated.Rank"
## [27] "mz.in.Da" "MHplus.in.Da"
## [29] "Theo.MHplus.in.Da" "Delta.M.in.ppm"
## [31] "Delta.mz.in.Da" "Ions.Matched"
## [33] "Matched.Ions" "Total.Ions"
## [35] "Intensity" "Activation.Type"
## [37] "NCE.in.Percent" "MS.Order"
## [39] "Isolation.Interference.in.Percent" "SPS.Mass.Matches.in.Percent"
## [41] "Average.Reporter.SN" "Ion.Inject.Time.in.ms"
## [43] "RT.in.min" "First.Scan"
## [45] "Last.Scan" "Master.Scans"
## [47] "Spectrum.File" "File.ID"
## [49] "Abundance.126" "Abundance.127"
## [51] "Abundance.128" "Abundance.129"
## [53] "Abundance.130" "Abundance.131"
## [55] "Quan.Info" "Peptides.Matched"
## [57] "XCorr" "Number.of.Protein.Groups"
## [59] "Percolator.q.Value" "Percolator.PEP"
## [61] "Percolator.SVMScore"
\end{verbatim}
We see that our data contains six quantitative columns in positions 49 to 54.
Now that the quantitative data columns have been identified, we can pass our
\texttt{data.frame} to the \texttt{readQFeatures} function and specify where our quantitative
columns are. The latter is done by passing either a numeric or character vector
to an argument called \texttt{quantCols} (\texttt{ecols} in previous package versions).
Of note, the \texttt{readQFeatures} function can also take \texttt{fnames} as an argument to
specify a column to be used as the row names of the imported
object. Whilst previous \texttt{QFeatures} vignettes used the ``Sequence'' or
``Annotated.Sequence'' as row names, we advise against this because of the presence
of PSMs matched to the same peptide sequence with different modifications. In such
cases, multiple rows would have the same name forcing the \texttt{readQFeatures} function
to output a ``making assay row names unique'' message and add an identifying number
to the end of each duplicated row name. These sequences would then be considered
as unique during the aggregation of PSM to peptide, thus resulting in two independent
peptide- level quantitation values rather than one. Therefore, we do not pass a
\texttt{fnames} argument and the row names automatically become indices. Finally, we
pass the \texttt{name} argument to indicate the type of data added.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Create QFeatures }
\NormalTok{cp\_qf }\OtherTok{\textless{}{-}} \FunctionTok{readQFeatures}\NormalTok{(}\AttributeTok{assayData =}\NormalTok{ cp\_df,}
\AttributeTok{quantCols =} \DecValTok{49}\SpecialCharTok{:}\DecValTok{54}\NormalTok{,}
\AttributeTok{name =} \StringTok{"psms\_raw"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Checking arguments.
\end{verbatim}
\begin{verbatim}
## Loading data as a 'SummarizedExperiment' object.
\end{verbatim}
\begin{verbatim}
## Formatting sample annotations (colData).
\end{verbatim}
\begin{verbatim}
## Formatting data as a 'QFeatures' object.
\end{verbatim}
\subsubsection{\texorpdfstring{Accessing the \texttt{QFeatures} infrastructure}{Accessing the QFeatures infrastructure}}\label{accessing-the-qfeatures-infrastructure}
As outlined above, a \texttt{QFeatures} data object is a list of \texttt{SummarizedExperiment}
objects. As such, an individual \texttt{SummarizedExperiment} can be accessed using
the standard double bracket nomenclature, as demonstrated in the code chunk
below.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Index using position}
\NormalTok{cp\_qf[[}\DecValTok{1}\NormalTok{]]}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## class: SummarizedExperiment
## dim: 48832 6
## metadata(0):
## assays(1): ''
## rownames(48832): 1 2 ... 48831 48832
## rowData names(55): PSMs.Workflow.ID PSMs.Peptide.ID ... Percolator.PEP
## Percolator.SVMScore
## colnames(6): Abundance.126 Abundance.127 ... Abundance.130
## Abundance.131
## colData names(0):
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Index using name}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]]}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## class: SummarizedExperiment
## dim: 48832 6
## metadata(0):
## assays(1): ''
## rownames(48832): 1 2 ... 48831 48832
## rowData names(55): PSMs.Workflow.ID PSMs.Peptide.ID ... Percolator.PEP
## Percolator.SVMScore
## colnames(6): Abundance.126 Abundance.127 ... Abundance.130
## Abundance.131
## colData names(0):
\end{verbatim}
A summary of the data contained in the slots is printed to the screen.
To retrieve the \texttt{rowData}, \texttt{colData} or \texttt{assay} data from a particular
\texttt{SummarizedExperiment} within a \texttt{QFeatures} object users can make use of the
\texttt{rowData}, \texttt{colData} and \texttt{assay} functions. For plotting or data transformation
it is necessary to convert to a \texttt{data.frame} or \texttt{tibble}.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Access feature information with rowData}
\DocumentationTok{\#\# The output should be converted to data.frame/tibble for further processing}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{summarise}\NormalTok{(}\AttributeTok{mean\_intensity =} \FunctionTok{mean}\NormalTok{(Intensity))}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## # A tibble: 1 x 1
## mean_intensity
## <dbl>
## 1 13281497.
\end{verbatim}
To ensure that quantitative data has been correctly imported as a numeric
variable, we can look at the \texttt{assay} slot.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Look at top 6 rows of the assay (quantitative data)}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{assay}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{head}\NormalTok{() }
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Abundance.126 Abundance.127 Abundance.128 Abundance.129 Abundance.130
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 14.1 18.1 11.8 9.4 18.9
## 5 NA 2.5 NA 3.4 3.2
## 6 16.7 26.2 18.7 22.4 25.3
## Abundance.131
## 1 NA
## 2 NA
## 3 NA
## 4 13.5
## 5 NA
## 6 17.2
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Confirm numeric assay}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{assay}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{type}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "double"
\end{verbatim}
Our \texttt{assay} contains data of type \texttt{double}, a sub-type of numeric data in R. If
users find that their quantitative data is of type \texttt{character}, it is necessary
to correct this before moving on with the rest of the workflow.
\subsubsection{Adding metadata}\label{adding-metadata}
Having imported the data, each sample is first annotated with its TMT label,
sample reference and condition. As this information is experimental metadata, it
is added to the \texttt{colData} slot. It is important for users check that the \texttt{colData}
annotations correspond to their correct samples, particularly if the column order
in the imported results file does not align with the sample order. This is the
case for the use-case data since TMT labels were randomised in an attempt to
minimze the effect of TMT channel leakage.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Add sample info as colData to QFeatures object}
\NormalTok{cp\_qf}\SpecialCharTok{$}\NormalTok{label }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"TMT126"}\NormalTok{,}
\StringTok{"TMT127"}\NormalTok{,}
\StringTok{"TMT128"}\NormalTok{,}
\StringTok{"TMT129"}\NormalTok{,}
\StringTok{"TMT130"}\NormalTok{,}
\StringTok{"TMT131"}\NormalTok{)}
\NormalTok{cp\_qf}\SpecialCharTok{$}\NormalTok{sample }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"S5"}\NormalTok{, }\StringTok{"S2"}\NormalTok{, }\StringTok{"S1"}\NormalTok{, }\StringTok{"S4"}\NormalTok{, }\StringTok{"S6"}\NormalTok{, }\StringTok{"S3"}\NormalTok{)}
\NormalTok{cp\_qf}\SpecialCharTok{$}\NormalTok{condition }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"Control"}\NormalTok{, }\StringTok{"Treated"}\NormalTok{, }\StringTok{"Treated"}\NormalTok{,}
\StringTok{"Control"}\NormalTok{, }\StringTok{"Control"}\NormalTok{, }\StringTok{"Treated"}\NormalTok{)}
\DocumentationTok{\#\# Verify}
\FunctionTok{colData}\NormalTok{(cp\_qf)}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## DataFrame with 6 rows and 3 columns
## label sample condition
## <character> <character> <character>
## Abundance.126 TMT126 S5 Control
## Abundance.127 TMT127 S2 Treated
## Abundance.128 TMT128 S1 Treated
## Abundance.129 TMT129 S4 Control
## Abundance.130 TMT130 S6 Control
## Abundance.131 TMT131 S3 Treated
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Assign the colData to first assay as well}
\FunctionTok{colData}\NormalTok{(cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]]) }\OtherTok{\textless{}{-}} \FunctionTok{colData}\NormalTok{(cp\_qf)}
\end{Highlighting}
\end{Shaded}
It is also useful to clean up sample names such that they are short, intuitive
and informative. This is done by editing the \texttt{colnames}. These steps may not
always be necessary depending upon the identification search output.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Clean sample names}
\FunctionTok{colnames}\NormalTok{(cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]]) }\OtherTok{\textless{}{-}} \FunctionTok{c}\NormalTok{(}\StringTok{"S5"}\NormalTok{, }\StringTok{"S2"}\NormalTok{, }\StringTok{"S1"}\NormalTok{, }\StringTok{"S4"}\NormalTok{, }\StringTok{"S6"}\NormalTok{, }\StringTok{"S3"}\NormalTok{)}
\end{Highlighting}
\end{Shaded}
\subsubsection{Preliminary data exploration}\label{preliminary-data-exploration}
As well as cleaning and annotating the data, it is always advisable to check
that the import worked and that the data looks as expected. Further, preliminary
exploration of the data can provide an early sign of whether the experiment and
subsequent identification search were successful. Importantly, however, the
names of key parameters will vary depending on the software used, and will
likely change over time. Users will need to be aware of this and modify the code
in this workflow accordingly.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Check what information has been imported}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{colnames}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] "PSMs.Workflow.ID" "PSMs.Peptide.ID"
## [3] "Checked" "Tags"
## [5] "Confidence" "Identifying.Node.Type"
## [7] "Identifying.Node" "Search.ID"
## [9] "Identifying.Node.No" "PSM.Ambiguity"
## [11] "Sequence" "Annotated.Sequence"
## [13] "Modifications" "Number.of.Proteins"
## [15] "Master.Protein.Accessions" "Master.Protein.Descriptions"
## [17] "Protein.Accessions" "Protein.Descriptions"
## [19] "Number.of.Missed.Cleavages" "Charge"
## [21] "Original.Precursor.Charge" "Delta.Score"
## [23] "Delta.Cn" "Rank"
## [25] "Search.Engine.Rank" "Concatenated.Rank"
## [27] "mz.in.Da" "MHplus.in.Da"
## [29] "Theo.MHplus.in.Da" "Delta.M.in.ppm"
## [31] "Delta.mz.in.Da" "Ions.Matched"
## [33] "Matched.Ions" "Total.Ions"
## [35] "Intensity" "Activation.Type"
## [37] "NCE.in.Percent" "MS.Order"
## [39] "Isolation.Interference.in.Percent" "SPS.Mass.Matches.in.Percent"
## [41] "Average.Reporter.SN" "Ion.Inject.Time.in.ms"
## [43] "RT.in.min" "First.Scan"
## [45] "Last.Scan" "Master.Scans"
## [47] "Spectrum.File" "File.ID"
## [49] "Quan.Info" "Peptides.Matched"
## [51] "XCorr" "Number.of.Protein.Groups"
## [53] "Percolator.q.Value" "Percolator.PEP"
## [55] "Percolator.SVMScore"
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Find out how many PSMs are in the data}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{dim}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] 48832 6
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\NormalTok{original\_psms }\OtherTok{\textless{}{-}}\NormalTok{ cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{nrow}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as.numeric}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
We can see that the original data includes 48832 PSMs
across the 6 samples. It is also useful to make note of how many peptides and
protein groups the raw PSM data corresponds to, and to track how many we remove during
the subsequent filtering steps. This can be done by checking how many unique entries
are located within the ``Sequence'' and ``Master.Protein.Accessions'' for peptides
and protein groups, respectively. Of note, searching for unique peptide sequences means
that the number of peptides does not include duplicated sequences with different
modifications.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Find out how many peptides and master proteins are in the data}
\NormalTok{original\_peps }\OtherTok{\textless{}{-}}\NormalTok{ cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Sequence) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{unique}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{length}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as.numeric}\NormalTok{()}
\NormalTok{original\_prots }\OtherTok{\textless{}{-}}\NormalTok{ cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Master.Protein.Accessions) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{unique}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{length}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as.numeric}\NormalTok{()}
\FunctionTok{print}\NormalTok{(}\FunctionTok{c}\NormalTok{(original\_peps, original\_prots))}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## [1] 25969 5040
\end{verbatim}
Hence, the output of the identification search contains
48832 PSMs corresponding to
25969 peptide sequences and
5040 master proteins or protein groups.
Finally, we confirm that the identification search was carried out as expected.
For this, we print summaries of the key search parameters using the \texttt{table} function
for discrete parameters and \texttt{summary} for those which are continuous. This is also
helpful for users who are analyzing publicly available data and have limited
knowledge about the identification search parameters.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Check missed cleavages}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Number.of.Missed.Cleavages) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{table}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## .
## 0 1 2
## 46164 2592 76
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Check precursor mass tolerance}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Delta.M.in.ppm) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{summary}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.9300 -0.6000 0.3700 0.6447 1.3100 9.6700
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Check fragment mass tolerance}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Delta.mz.in.Da) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{summary}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0110400 -0.0004100 0.0002500 0.0006812 0.0010200 0.0135100
\end{verbatim}
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Check PSM confidence allocations}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{pull}\NormalTok{(Confidence) }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{table}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{verbatim}
## .
## High
## 48832
\end{verbatim}
\subsection{Experimental quality control checks}\label{experimental-quality-control-checks}
Experimental quality control of TMT-labelled quantitive proteomics data takes
place in two steps: (1) assessment of the raw mass spectrometry data, and (2)
evaluation of TMT labelling efficiency.
\subsubsection{Quality control of the raw mass spectrometry data}\label{quality-control-of-the-raw-mass-spectrometry-data}
Having taken an initial look at the output of the identification search, it is
possible to create some simple plots to inspect the raw mass spectrometry data.
Such plots are useful in revealing problems that may have occurred during
the mass spectrometry run but are far from extensive. Users who wish to carry
out a more in-depth evaluation of the raw mass spectrometry data may benefit
from use of the
\href{https://bioconductor.org/packages/release/bioc/html/Spectra.html}{\texttt{Spectra} Bioconductor package}
which allows for visualisation and exploration of raw chromatograms and spectra,
among other features \citep{Rainer2022}.
The first plot we generate looks at the delta precursor mass, that is the
difference between observed and estimated precursor mass, across retention time.
Importantly, exploration of this raw data feature can only be done when using
the raw data prior to recalibration. For users of Proteome Discoverer, this
means using the spectral files node rather than the spectral files recalibration
node.
\begin{Shaded}
\begin{Highlighting}[]
\DocumentationTok{\#\# Generate scatter plot of mass accuracy}
\NormalTok{cp\_qf[[}\StringTok{"psms\_raw"}\NormalTok{]] }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{rowData}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{as\_tibble}\NormalTok{() }\SpecialCharTok{\%\textgreater{}\%}
\FunctionTok{ggplot}\NormalTok{(}\FunctionTok{aes}\NormalTok{(}\AttributeTok{x =}\NormalTok{ RT.in.min, }\AttributeTok{y =}\NormalTok{ Delta.M.in.ppm)) }\SpecialCharTok{+}
\FunctionTok{geom\_point}\NormalTok{(}\AttributeTok{size =} \FloatTok{0.5}\NormalTok{, }\AttributeTok{shape =} \DecValTok{4}\NormalTok{) }\SpecialCharTok{+}
\FunctionTok{geom\_hline}\NormalTok{(}\AttributeTok{yintercept =} \DecValTok{5}\NormalTok{, }\AttributeTok{linetype =} \StringTok{"dashed"}\NormalTok{, }\AttributeTok{color =} \StringTok{"red"}\NormalTok{) }\SpecialCharTok{+}
\FunctionTok{geom\_hline}\NormalTok{(}\AttributeTok{yintercept =} \SpecialCharTok{{-}}\DecValTok{5}\NormalTok{, }\AttributeTok{linetype =} \StringTok{"dashed"}\NormalTok{, }\AttributeTok{color =} \StringTok{"red"}\NormalTok{) }\SpecialCharTok{+}
\FunctionTok{labs}\NormalTok{(}\AttributeTok{x =} \StringTok{"RT (min)"}\NormalTok{, }\AttributeTok{y =} \StringTok{"Delta precursor mass (ppm)"}\NormalTok{) }\SpecialCharTok{+}
\FunctionTok{scale\_x\_continuous}\NormalTok{(}\AttributeTok{limits =} \FunctionTok{c}\NormalTok{(}\DecValTok{0}\NormalTok{, }\DecValTok{120}\NormalTok{), }\AttributeTok{breaks =} \FunctionTok{seq}\NormalTok{(}\DecValTok{0}\NormalTok{, }\DecValTok{120}\NormalTok{, }\DecValTok{20}\NormalTok{)) }\SpecialCharTok{+}
\FunctionTok{scale\_y\_continuous}\NormalTok{(}\AttributeTok{limits =} \FunctionTok{c}\NormalTok{(}\SpecialCharTok{{-}}\DecValTok{10}\NormalTok{, }\DecValTok{10}\NormalTok{), }\AttributeTok{breaks =} \FunctionTok{c}\NormalTok{(}\SpecialCharTok{{-}}\DecValTok{10}\NormalTok{, }\SpecialCharTok{{-}}\DecValTok{5}\NormalTok{, }\DecValTok{0}\NormalTok{, }\DecValTok{5}\NormalTok{, }\DecValTok{10}\NormalTok{)) }\SpecialCharTok{+}
\FunctionTok{ggtitle}\NormalTok{(}\StringTok{"PSM retention time against delta precursor mass"}\NormalTok{) }\SpecialCharTok{+}
\FunctionTok{theme\_bw}\NormalTok{()}
\end{Highlighting}
\end{Shaded}
\begin{center}\includegraphics[height=0.3\textheight]{workflow_expressions_files/figure-latex/tmt_mass_accuracy-1} \end{center}
Since we applied a precursor mass tolerance of 10 ppm during the identification
search, all of the PSMs are within ±10 ppm. Ideally, however, we want the
majority of the data to be within ±5 ppm since smaller delta masses correspond
to a greater rate of correct peptide identifications. From the graph we have
plotted we can see that indeed the majority of PSMs are within ±5 ppm. If users
find that too many PSMs are outside of the desired ±5 ppm, it is advisable to
check the calibration of the mass spectrometer.
The second quality control plot of raw data is that of MS2 ion inject time
across the retention time gradient. Here, it is desirable to achieve an average
MS2 injection time of 50 ms or less, although the exact target threshold will
depend upon the sample load. If the average ion inject time is longer than
desired, then the ion transfer tube and/or front end optics of the instrument may
require cleaning.