# More details about tables and graphs {#SummariseComments}
<!-- Introductions; easier to separate by format -->
```{r, child = if (knitr::is_html_output()) {'./introductions/17-SummarisingData-HTML.Rmd'} else {'./introductions/17-SummarisingData-LaTeX.Rmd'}}
```
<!-- Define colours as appropriate -->
```{r, child = if (knitr::is_html_output()) {'./children/coloursHTML.Rmd'} else {'./children/coloursLaTeX.Rmd'}}
```
## Introduction {#MoreTablesGraphsIntro}
A summary of the data is important for understanding the data, and for planning the direction of the analysis.
In this chapter, we make some general comments about constructing graphs and tables.
Always remember:
::: {.importantBox .important data-latex="{iconmonstr-warning-8-240.png}"}
The purpose of a graph and a table is to display the information in the clearest, simplest possible way, to facilitate understanding the message(s) in the data.
:::
## More details about preparing graphs {#GraphsConstructing}
\index{Graphs!preparing}\index{Software output!graphs}\index{Graphs!using software}
Helping readers to understand the data is the goal of producing a graph.
You should be able to sketch graphs by hand, but *usually software is used to produce graphs*.\index{Computers and software}
Using a computer makes it easy to try different graphs, to change features of graphs, and to produce the best graph possible.
When creating graphs, ensure you:
* *do* make graphs clear and well-labelled.
* *do* add titles and axis labels.
* *do* add units of measurement where necessary.
* *do* add informative captions *below* the figure.
* *do* make sure text and details are easy to read.
* *do* ensure the axis scales are appropriate.
* *do* add any necessary explanations.
* *do* make it easy for readers to make the important comparisons, as far as possible.
* *do not* add optical illusions (such as an artificial third dimension) or other 'chart junk' [@su2008s]; see Sect.\ \@ref(ThirdDimensions).
* *do not* use distracting colours and fonts; only use different colours and fonts if necessary, and explain that purpose if it is not clear.
* *do not* make errors.
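
Several of these guidelines can be illustrated with a short sketch in base R. The example is only an illustration: it uses R's built-in `cars` data rather than any data from this book, and the wording of the title and axis labels is an assumption chosen for the example.

```r
# A minimal sketch of a clearly labelled scatterplot, using the
# built-in 'cars' data (not a dataset from this book).
data(cars)

plot(dist ~ speed,
     data = cars,
     las = 1,                                # Horizontal tick labels are easier to read
     pch = 19,                               # One plotting symbol only
     xlim = c(0, 26),                        # Axis scales that cover all the data
     ylim = c(0, 125),
     xlab = "Speed (in miles per hour)",     # Axis label, with units
     ylab = "Stopping distance (in feet)",   # Axis label, with units
     main = "Stopping distance and speed for 50 cars")
```

In an R Markdown chunk, the caption would usually be supplied separately through the `fig.cap` chunk option rather than drawn on the plot itself.
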
Some specific problems to be aware of are discussed in the subsections that follow.
### Avoid unnecessary third dimensions {#ThirdDimensions}
Graphs should focus on clear communication.
One barrier to clear communication is using an unnecessary third dimension.
This is poor: such graphs can be misleading and hard to read [@siegrist1996use].
::: {.example #Bar2D3D name="Two- and three-dimensional plots"}
In the <span style="font-variant:small-caps;">nhanes</span> study [@data:NHANES3:Data], the age and sex of each participant were recorded.
Using Fig.\ \@ref(fig:Bar3D2D) (left panel), can you easily determine if more females or more males are present in each age group?
The artificial third dimension makes determining the heights of the bars hard.
In contrast, a side-by-side bar graph (Fig.\ \@ref(fig:Bar3D2D), right panel) makes it clear whether each age group has more females or more males.\index{Graphs!side-by-side bar chart}
:::
(ref:NHANES) Two plots of the <span style="font-variant:small-caps;">nhanes</span> participants, divided by age group and sex. Left: a three-dimensional bar chart. Right: a side-by-side bar chart.
```{r Bar3D2D, out.width='100%', fig.width=8.5, fig.height=3.75, fig.align="center", fig.cap="(ref:NHANES)"}
data(NHANES) # NHANES package
AgeD <- NHANES$AgeDecade
AgeD.levels <- levels(NHANES$AgeDecade)
num.levels <- length(AgeD.levels)
levels(AgeD) <- c( AgeD.levels[1:(num.levels - 2)],
rep(" 60+", 2))
AgeD.Gender <- xtabs( ~ AgeD + NHANES$Gender)
TwoDData <- data.frame(
Counts = c(
AgeD.Gender[, 1],
AgeD.Gender[, 2]),
Gender = c( rep("F", 7),
rep("M", 7)),
Age = levels(AgeD)
)
TwoDData.mat <- array( dim = c(2, 7))
TwoDData.mat[1, ] <- TwoDData$Counts[1:7]
TwoDData.mat[2, ] <- TwoDData$Counts[8:14]
rownames(TwoDData.mat) <- c("Female", "Male")
colnames(TwoDData.mat) <- unique(TwoDData$Age)
col.mat <- TwoDData.mat
col.mat[1, ] <- rep(1, 7)
col.mat[2, ] <- rep(2, 7)
cols <- c(ResponseColour,
IndividualColour)
#par( mfrow = c(1, 2))
layout( matrix(1:2,
nrow = 1),
widths = c(1.4, 1))
par( mar = c(2, 4, 4, 2) + 0.1 )
pmat <- plot3D::hist3D(z = TwoDData.mat,
x = c(0, 1),
y = 1:7,
col = cols,
colvar = col.mat,
border = "black",
colkey = FALSE,
facets = TRUE,
xlab = "Sex",
ylab = "Age group",
zlab = "Number",
#main = "Participants by group",
space = 0.4,
axes = FALSE,
ticktype = "detailed",
zlim = c(0, 900),
ylim = c(0.5, 7.5),
xlim = c(-0.5, 1.5),
expand = 1,
d = 25,
mar = c(2, 2, 0, 0.5))
# Set vars
x.axis <- 0:1
min.x <- -0.5
max.x <- 1.5
y.axis <- 1:7
min.y <- 0.5
max.y <- 7.5
z.axis <- seq(0, 900,
by = 100)
min.z <- 0
max.z <- 900
# LINES
lines(trans3d(x = x.axis,
y = min.y,
z = min.z,
pmat) ,
col = "black")
lines(trans3d(x = max.x,
y = y.axis,
z = min.z,
pmat) ,
col = "black")
lines(trans3d(x = min.x,
y = min.y,
z = z.axis,
pmat) ,
col = "black")
# TICKS
### See: http://entrenchant.blogspot.com/2014_03_01_archive.html
tick.start <- trans3d(x = x.axis,
y = min.y,
z = min.z,
pmat)
tick.end <- trans3d(x = x.axis,
y = (min.y - 0.25),
z = min.z,
pmat)
segments(x0 = tick.start$x,
y0 = tick.start$y,
x1 = tick.end$x,
y1 = tick.end$y)
tick.start <- trans3d(x = max.x,
y = y.axis,
z = min.z,
pmat)
tick.end <- trans3d(x = max.x + 0.20,
y = y.axis,
z = min.z,
pmat)
segments(x0 = tick.start$x,
y0 = tick.start$y,
x1 = tick.end$x,
y1 = tick.end$y)
tick.start <- trans3d(x = min.x,
y = min.y,
z = z.axis,
pmat)
tick.end <- trans3d(x = min.x,
y = (min.y - 0.20),
z = z.axis,
pmat)
segments(x0 = tick.start$x,
y0 = tick.start$y,
x1 = tick.end$x,
y1 = tick.end$y)
# LABELS
### See: http://entrenchant.blogspot.com/2014_03_01_archive.html
labels <- c("F",
"M")
label.pos <- trans3d(x = x.axis,
y = (min.y - 0.95),
z = min.z,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = labels,
adj = c(0, NA),
#srt = 270,
cex = 0.65)
labels <- levels(factor(TwoDData$Age))
label.pos <- trans3d(x = (max.x + 0.25),
y = y.axis,
z = min.z,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = labels,
adj = c(0, NA),
srt = 0,
cex = 0.65)
labels <- as.character(z.axis)
label.pos <- trans3d(x = min.x,
y = (min.y - 0.5),
z = z.axis,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = labels,
adj = c(1, NA),
cex = 0.65)
### AXIS NAMES
label.pos <- trans3d(x = mean(x.axis),
y = (min.y - 1.25),
z = min.z,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = "Sex",
cex = 0.85)
label.pos <- trans3d(x = (max.x + 0.80),
y = 4.5,
z = min.z,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = "Age group",
cex = 0.85,
srt = 42)
label.pos <- trans3d(x = min.x - 0.9,
y = (min.y + 0.5),
z = 250,
pmat)
text(x = label.pos$x,
y = label.pos$y,
labels = "Number",
srt = 90,
cex = 0.85)
title("Participants by group",
cex.main = 1.2)
######################################################################################################
# Restore defaults
par( mar = c(5, 4, 4, 2) + 0.1 )
locatn <- barplot(t( AgeD.Gender ),
beside = TRUE,
legend = FALSE,
las = 2,
cex.names = 0.8,
ylim = c(0, 920),
args.legend = list(x = "bottom",
bg = "white",
cex = 0.75),
ylab = "Number",
xlab = "Age group",
main = "Participants by group",
col = cols)
#text(locatn[1, 4], 700, "F")
#text(locatn[2, 4], 700, "M")
legend("top",
legend = c("Females ",
"Males"),
#levels(NHANES$Gender),
bty = "n",
xpd = TRUE,
horiz = TRUE,
cex = 0.9, # For the *text* in the legend
fill = cols )
```
### Avoid overplotting {#Overplotting}
Some plots, such as dot charts and scatterplots, may suffer from *overplotting*:\index{Overplotting} when multiple observations have the same (or nearly the same) values, and these cannot be distinguished on the graph.
Overplotting can especially be a problem when plotting *discrete* quantitative data.
In many cases (such as dot plots), points can be *jittered*\index{Overplotting!jittering} by adding a small amount of randomness to the observations, or *stacked*; see Example\ \@ref(exm:Dotchart2DGorillas).\index{Overplotting!stacking}
Jittering is the best option for scatterplots.
Overplotted points can change readers' impression of the data, since some observations are obscured and are effectively 'lost' to the reader.
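
The effect of jittering can be sketched using simulated data. The data below are hypothetical (randomly generated discrete values, not data from this book), and the amount of jitter ($0.15$) is an arbitrary choice for illustration.

```r
# A minimal sketch of jittering, using simulated (hypothetical) discrete data.
set.seed(1)                        # For reproducibility
x <- sample(1:5, size = 100, replace = TRUE)
y <- sample(1:5, size = 100, replace = TRUE)

# Only 25 different (x, y) pairs are possible, so with 100 observations
# many points must coincide
numCoincident <- sum( duplicated( data.frame(x, y) ) )

# jitter() adds a small amount of uniform random noise (here, at most 0.15)
xJit <- jitter(x, amount = 0.15)
yJit <- jitter(y, amount = 0.15)

par( mfrow = c(1, 2) )
plot(y ~ x,
     las = 1,
     main = "Overplotted")
plot(yJit ~ xJit,
     las = 1,
     xlab = "x (jittered)",
     ylab = "y (jittered)",
     main = "Jittered")
```

The left panel can show at most $25$ distinct points, while the right panel shows a separate point for each observation. The jitter should be small enough that readers can still tell which value each point represents.
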
### Take care when truncating axes {#TruncatingAxes}
One common optical illusion occurs when the frequency (or percentage) axis does not start at zero.
This is a problem in graphs where the distance represented visually implies the frequency of those observations, as with the count (or percentage) axis in bar charts, dot charts, or histograms.
This is *not* a problem in, for example, boxplots and scatterplots, where the distance of points from zero does not visually imply any quantity of interest.
::: {.example #VerticalTruncationOK name="Truncating is not appropriate"}
Consider data recording the number of lung cancer cases in Fredericia in various age groups [@data:andersen:1977].
`r if (knitr::is_latex_output()) {
'Figure\\ \\@ref(fig:BarchartTruncated1) (left panel) shows a good bar chart with the count (vertical) axis starting at zero; the counts in each age group look similar. In contrast, if the vertical axis starts at\\ $9$, the counts look very different (Fig.\\ \\@ref(fig:BarchartTruncated1), right panel) for two age categories, suggesting a large difference between the number of lung cancer cases.'
} else {
'The animation below shows a bar chart with the count (vertical) axis starting at zero (when the counts in each age group look similar), and then gradually changing where the vertical axis starts, so that the final bar chart makes the number of cases in each age group look very different.'
}`
The graph is visually misleading when the graph does not start at a count of zero, since the height of the bars from the axis visually implies the frequency of those observations.
:::
```{r animation.hook='gifski', interval=0.5, dev=if (is_latex_output()){"pdf"}else{"png"}}
data(DanishLC)
ylim.lower.vec <- c( rep(0, 10),
rep(5, 10),
rep(7, 10),
rep(8.5, 10),
rep(9.75, 10) )
# seq(0, 9.75, by = 0.25) )
ylim.lower.vec <- c( ylim.lower.vec,
rep( max(ylim.lower.vec), 10))
draw.bar <- function( ylim.lower) {
egdata <- subset( DanishLC,
City == "Fredericia")
upperLimitYAxis <- 11 + (11 - ylim.lower) * 0.05 # Try to keep the gap between top bar and graph edge constant
out <- barplot( egdata$Cases,
las = 1,
col = plot.colour,
xlab = "Age groups",
ylab = "Number of cases",
main = paste("Number of lung cancer cases in Fredericia\n",
ifelse(ylim.lower == 0,
"(Vertical axis starting at zero)",
paste0("(Vertical axis truncated at ", ylim.lower,")")
)
),
xpd = FALSE, # BARS **DON'T** GO outside plotting region
ylim = c( ylim.lower, upperLimitYAxis),
axes = FALSE)
# ADD WHITE DOT to ensure many plots are created...
for (i in (1:10)){
points(2 + (i / 10),
mean( c(upperLimitYAxis, 11) ), # Midway between the tallest bar and the top of the axis
pch = 19,
cex = 0.2,
col = "white")
}
axis(side = 1,
at = out[, 1],
labels = egdata$Age,
las = 2)
axis(side = 2,
at = seq( floor(ylim.lower), 12, by = 1),
las = 1)
box()
}
if (knitr::is_html_output()){
for (i in seq_along(ylim.lower.vec) ){
ylim.lower <- ylim.lower.vec[i]
draw.bar( ylim.lower)
}
}
```
```{r BarchartTruncated1, fig.align="center", fig.width=10, fig.height=3.65, out.width='90%', fig.cap="The same data presented in two bar charts, without truncating the vertical axis (left) and truncating the vertical axis (right)." }
if (knitr::is_latex_output()){
len.ylv <- length(ylim.lower.vec)
par( mfrow = c(1, 2))
draw.bar( ylim.lower.vec[1] )
draw.bar( 9 )
}
```
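
The distortion caused by truncating a count axis can be quantified with a little arithmetic. The sketch below uses two hypothetical counts (not the Fredericia data) and a hypothetical helper function, `drawnRatio()`, which compares the heights of the bars *as drawn* for a given starting value of the count axis.

```r
# Hypothetical counts for two categories (not the Fredericia data)
counts <- c(A = 10, B = 11)

# Ratio of the *drawn* bar heights (B relative to A) when the
# count axis starts at 'axisStart'
drawnRatio <- function(counts, axisStart) {
  drawn <- counts - axisStart      # Visible height of each bar
  unname(drawn["B"] / drawn["A"])
}

drawnRatio(counts, axisStart = 0)  # 1.1: bar B is drawn 10% taller than bar A
drawnRatio(counts, axisStart = 9)  # 2: bar B is drawn twice as tall as bar A
```

The counts differ by only $10$%, yet starting the axis at\ $9$ makes one bar appear twice as tall as the other.
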
::: {.example #VerticalTruncation name="Truncating is appropriate"}
Consider data recording the body temperature of $n = 130$ people [@data:mackowiak:bodytemp; @data:Shoemaker1996:Temperature].
A histogram of the data (Fig.\ \@ref(fig:HistosTemp), left panel) clearly shows the distribution of body temperatures.
The vertical axis, displaying the counts, must start at zero since the bar heights visually imply a quantity of interest.
However, the horizontal axis starts at\ $35.5$^o^C, which does not create any problems as the *distances* from a temperature of\ $0$^o^C do not visually imply any quantity of interest.\index{Graphs!histogram}
In contrast, starting the horizontal axis at a temperature of\ $0$^o^C (Fig.\ \@ref(fig:HistosTemp), right panel) makes any details in the histogram difficult to see; the histogram is pointless.
:::
```{r HistosTemp, fig.align="center", fig.width=10, fig.height=3.5, out.width='100%', fig.cap="Two histograms displaying the body temperature of $130$ people. Left: a well-constructed histogram. Right: a poorly-constructed histogram." }
data(BodyTemp)
par(mfrow = c(1, 2))
hist(BodyTemp$BodyTempC,
las = 1,
xlab = "Body Temp. (in degrees C)",
ylab = "Frequency",
main = "Histogram of body temp.:\nTruncation used")
hist(BodyTemp$BodyTempC,
xlim = c(0, 38.5),
las = 1,
xlab = "Body Temp. (in degrees C)",
ylab = "Frequency",
main = "Histogram of body temp.:\nNo truncation used")
```
<iframe src="https://learningapps.org/watch?v=pgn3q7ptv22" style="border:0px;width:100%;height:500px" allowfullscreen="true" webkitallowfullscreen="true" mozallowfullscreen="true"></iframe>
### Take care when using pie charts {#PieChartProblems}
\index{Graphs!bar chart}\index{Graphs!pie chart}
As noted in Sect.\ \@ref(PieCharts), pie charts may be hard to read: humans compare *lengths* (bar and dot charts) better than *angles* (pie charts) [@data:Friel:Graphs].
Pie charts are also difficult to use with levels having zero or small counts.
::: {.example #PieSmallCounts name="Pie charts with small counts"}
@data:Solomon2002:ginkgo studied the use of ginkgo for memory enhancement.
Caregivers blinded\index{Blinding} to the treatment (ginkgo or placebo)\index{Placebo} reported the impact on subjects' memory.
The bar chart (Fig.\ \@ref(fig:PieSmallCounts), left panel), for subjects on the placebo, shows that four of the available categories had zero responses, and one had a very small number of responses (two).
The pie chart (right panel) makes the small category difficult to see, and the categories with zero counts impossible to see.
:::
```{r PieSmallCounts, fig.width=8.5, fig.height=4, out.width="100%", fig.align="center", fig.cap="Data with zeros and small counts are easy to see in a bar chart (left panel) and dot chart, but difficult to see in a pie chart (right panel)."}
if( knitr::is_latex_output() ) {
colList <- grey.colors(n = 4,
start = 0,
end = 1)
} else {
colList <- viridis::viridis(4)
}
colList <- c(colList,
colList[1:3] )
Responses <- c(0, 2, 31, 77, 0, 0, 0)
Response.Names <- c("Very much improved",
"Much improved",
"Minimally improved",
"No change",
"Minimally worse",
"Much worse",
"Very much worse")
par( mfrow = c(1, 2),
oma = c(1, 1, 1, 1) )
nf <- layout(
matrix(c(1, 2),
ncol = 2,
byrow = TRUE),
widths = c(1.5, 2),
)
Response.Names.Pie <- c("",
"Much improved",
"Minimally\nimproved",
"No change",
"",
"",
"")
###
par( mar = c(10, 4, 2, 1))
barplot(Responses,
names.arg = Response.Names,
col = colList,
ylim = c(0, 80),
main = "Bar chart of responses",
ylab = "Number",
las = 2)
#box()
###
par( mar = c(1, 0, 2.4, 0))
pie(Responses,
labels = Response.Names.Pie,
init.angle = 100,
radius = 0.4,
main = "Pie chart of responses",
col = colList)
#box()
```
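
One way to see why small counts are hard to read in a pie chart is to compute the angle of each wedge. The sketch below uses the counts plotted in Fig.\ \@ref(fig:PieSmallCounts); the variable name `wedgeAngle` is introduced here only for illustration.

```r
# Counts and labels as plotted in the bar chart and pie chart
Responses <- c(0, 2, 31, 77, 0, 0, 0)
names(Responses) <- c("Very much improved",
                      "Much improved",
                      "Minimally improved",
                      "No change",
                      "Minimally worse",
                      "Much worse",
                      "Very much worse")

# Angle (in degrees) of each wedge: the share of the full 360 degrees
wedgeAngle <- 360 * Responses / sum(Responses)

round(wedgeAngle, 1)   # 'Much improved' gets a wedge of about 6.5 degrees
```

The four categories with zero counts have wedges of zero degrees, so they cannot appear in the pie chart at all; in the bar chart, they still have a labelled position on the axis.
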
## More details about preparing tables {#TablesConstructing}
\index{Tables!preparing}
A computer is helpful for constructing tables.
Using a computer also makes it easy to try different orientations or layouts.
As with graphs, the purpose of tables is to help readers understand the data.
When creating numerical summary tables, ensure you:
* *do* make tables clear and well-labelled.
* *do* use clear row and column labels (as necessary).
* *do* add units of measurement and value labels where necessary.
* *do* add informative captions *above* the table.
* *do* make sure text and details are easy to read.
* *do* round numbers appropriately (don't necessarily use all significant figures provided by software).
* *do* align numbers in the table by decimal point if possible, for easier reading and comparing.
* *do* construct the table to allow readers to easily make the important comparisons, as far as possible (space restrictions may take precedence, for example).
* *do not* use distracting colours and fonts; only use different colours and fonts if necessary, and explain that purpose if it is not clear.
* *do not* use vertical lines (in general), and use *very few* horizontal lines.
Horizontal lines can be used to group columns (for example, see Table\ \@ref(tab:WaterAccessSummaryCommentsTable)).
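
Rounding and alignment can be sketched in base R. The example is only an illustration: it uses the built-in `mtcars` data rather than data from this book, and the column names and caption are assumptions. The call to `knitr::kable()` only runs if the **knitr** package is installed.

```r
# A minimal sketch using the built-in 'mtcars' data (not a dataset from this book)
data(mtcars)

# Sample size, median and IQR of fuel economy (in miles per gallon),
# by number of cylinders
nMPG   <- tapply(mtcars$mpg, mtcars$cyl, length)
medMPG <- tapply(mtcars$mpg, mtcars$cyl, median)
iqrMPG <- tapply(mtcars$mpg, mtcars$cyl, IQR)

summaryTable <- data.frame(
  Cylinders = names(medMPG),
  n         = as.vector(nMPG),
  # Round to one decimal place; format() pads to a common width,
  # so the decimal points line up when the column is right-aligned
  Median    = format( round( as.vector(medMPG), 1), nsmall = 1),
  IQR       = format( round( as.vector(iqrMPG), 1), nsmall = 1)
)

print(summaryTable, row.names = FALSE)

# In an R Markdown document, knitr::kable() could be used instead
if ( requireNamespace("knitr", quietly = TRUE) ) {
  print( knitr::kable(summaryTable,
                      row.names = FALSE,
                      align = "r",
                      caption = "Fuel economy (in miles per gallon), by number of cylinders") )
}
```

Formatting the numbers as text before building the table keeps trailing zeros, so every value in a column shows the same number of decimal places.
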
## Example: water access {#WaterAcessSummary}
@lopez2022farmers recorded data about access to water in three rural communities in Cameroon (see Sects.\ \@ref(WaterAccessQual) and\ \@ref(WaterAccessQuant)).
The study could be used to determine contributors to the incidence of diarrhoea in young children ($85$ households had children under\ $5$ years of age).
Relationships between the incidence of diarrhoea and some other variables appear in Figs.\ \@ref(fig:WaterAccessQualCompare) and\ \@ref(fig:WaterAccessCompareQuantFigs).
A summary table of information can also be constructed (Table\ \@ref(tab:WaterAccessSummaryCommentsTable)).
In this table, note that:
* quantitative and qualitative variables are summarised differently, but appropriately.
* units of measurement are given where appropriate (i.e., only for age).
* numbers in columns are aligned for easier reading and comparing.
```{r WaterAccessSummaryCommentsTable}
data(WaterAccess)
WAKids <- subset(WaterAccess,
HouseholdUnder5s > 0)
WAkidsTable <- array(dim = c(3, 6))
colnames(WAkidsTable) <- c("$n$",
"Summary",
"$n$",
"Summary",
"$n$",
"Summary")
quantStuff <- function(x, digits = c(1, 1)){
c( pad(realLength(x),
decDigits = 0,
surroundMaths = TRUE,
targetLength = 2),
paste( pad( median(x, na.rm = TRUE),
targetLength = 4,
decDigits = digits[1]),
ifelse( is_latex_output(),
"\\",
" "),
"$(",
pad( IQR(x, na.rm = TRUE),
targetLength = 4,
surroundMaths = FALSE,
decDigits = digits[2]),
")$"),
#
pad(realLength(x[WAKids$Diarrhea == "N"]),
decDigits = 0,
targetLength = 2),
paste( pad( median(x[WAKids$Diarrhea == "N"], na.rm = TRUE),
decDigits = digits[1],
targetLength = 4),
ifelse( is_latex_output(),
"\\",
" "),
"$(",
pad( IQR(x[WAKids$Diarrhea == "N"], na.rm = TRUE),
targetLength = 4,
surroundMaths = FALSE,
decDigits = digits[2]),
")$"),
#
pad( realLength(x[WAKids$Diarrhea == "Y"]),
surroundMaths = TRUE,
decDigits = 0,
targetLength = 2),
paste( pad( median(x[WAKids$Diarrhea == "Y"], na.rm = TRUE),
decDigits = digits[1],
targetLength = 4),
"\\ $(",
pad( IQR(x[WAKids$Diarrhea == "Y"], na.rm = TRUE),
decDigits = digits[2],
surroundMaths = FALSE,
targetLength = 4),
")$")
)
}
WAkidsTable[1, ]<- quantStuff( WAKids$Age)
WAkidsTable[2, ]<- quantStuff( WAKids$HouseholdPeople)
WAkidsTable[3, ]<- quantStuff( WAKids$HouseholdUnder5s)
######
qualStuff <- function(x){
if ( !is.factor(x) ) x <- factor(x)
numLevels <- nlevels( x )
temp <- array( dim = c(numLevels, 6))
WAtab <- xtabs( ~ x + WAKids$Diarrhea)
temp[1:numLevels, 1] <- pad(rowSums(WAtab),
decDigits = 0,
targetLength = 2)
tabx <- table(x)
temp[1:numLevels, 2] <- paste0(pad( round(tabx/sum(tabx) * 100, 1),
decDigits = 1,
targetLength = 4),
"\\%")
temp[1:numLevels, c(3, 5, 4, 6)] <- array( c( pad(WAtab,
decDigits = 0,
targetLength = 2),
paste0( pad( round( prop.table(WAtab, 1) * 100, 1),
targetLength = 4,
decDigits = 1),
"\\%")),
dim = c(numLevels, 4) )
temp
}
###
WAkidsTable2 <- array(dim = c(11, 6))
rownames(WAkidsTable2) <- c( levels(factor(WAKids$Region)),
levels(WAKids$WaterSource),
levels(WAKids$Education),
"No", "Yes")
WAkidsTable2[1:3, ] <- qualStuff(WAKids$Region)
WAkidsTable2[4:7, ] <- qualStuff(WAKids$WaterSource)
WAkidsTable2[8:9, ] <- qualStuff(WAKids$Education)
WAkidsTable2[10:11, ] <- qualStuff(WAKids$HasLivestock)
######
if( knitr::is_latex_output() ) {
rownames(WAkidsTable) <- c("\\textbf{Age (in years)}$^a$",
"\\textbf{Household size}$^a$",
"\\textbf{Under $5$s in household}$^a$")
kable( (rbind(WAkidsTable,
WAkidsTable2) ),
format = "latex",
longtable = FALSE,
booktabs = TRUE,
escape = FALSE,
linesep = c("", "", "\\addlinespace",
"", "", "\\addlinespace",
"", "", "", "\\addlinespace",
"", "\\addlinespace",
"", ""),
align = "c",
digits = 2,
caption = "Numerical summary of the water-access data in $85$ households with children. `All households' are broken into those that reported, and did not report, diarrhoea in children under $5$ years of age in the last two weeks.") %>%
#row_spec(0, bold = TRUE) %>%
kable_styling(font_size = 8) %>%
#row_spec(1:3, bold = TRUE) %>%
add_header_above( c(" " = 1,
"All households" = 2,
"Diarrhoea" = 2,
"No diarrhoea" = 2),
bold = TRUE) %>%
pack_rows( "Region$^b$",
start_row = 4,
end_row = 6,
escape = FALSE) %>%
pack_rows( "Water source$^b$",
start_row = 7,
end_row = 10,
escape = FALSE) %>%
pack_rows( "Education$^b$",
start_row = 11,
end_row = 12,
escape = FALSE) %>%
pack_rows( "Has livestock$^b$",
start_row = 13,
end_row = 14,
escape = FALSE) %>%
add_footnote(c("Quantitative variables are summarised using medians and IQR.", "Qualitative variables are summarised using counts and percentages."),
notation = "alphabet")
}
if( knitr::is_html_output() ) {
rownames(WAkidsTable) <- c("Age (in years)$^a$",
"Household size$^a$",
"Under 5s in household$^a$")
kable(rbind(WAkidsTable,
WAkidsTable2),
format = "html",
longtable = FALSE,
booktabs = TRUE,
align = "c",
digits = 2,
caption = "Numerical summary of the water-access data in $85$ households with children, according to whether children under $5$ years of age had reported diarrhoea in the last two weeks.") %>%
row_spec(0, bold = TRUE) %>%
row_spec(1:3, bold = TRUE) %>%
add_header_above( c(" " = 1,
"All households" = 2,
"Reported diarrhoea" = 2,
"No reported diarrhoea" = 2),
bold = TRUE) %>%
pack_rows( "Region$^b$",
start_row = 4,
end_row = 6,
escape = FALSE) %>%
pack_rows( "Water source$^b$",
start_row = 7,
end_row = 10,
escape = FALSE) %>%
pack_rows( "Education$^b$",
start_row = 11,
end_row = 12,
escape = FALSE) %>%
pack_rows( "Has livestock$^b$",
start_row = 13,
end_row = 14,
escape = FALSE) %>%
add_footnote(c("Quantitative variables are summarised using medians and IQR.", "Qualitative variables are summarised using counts and percentages."),
notation = "alphabet")
}
```
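
The two kinds of summary used in the table can be sketched with a small example. The data frame below is hypothetical (ten made-up households, not the water-access data), and the variable names are assumptions for illustration.

```r
# Hypothetical data for ten households (not the water-access data)
households <- data.frame(
  Diarrhoea = c("Y", "N", "Y", "Y", "N", "Y", "N", "Y", "Y", "N"),
  Livestock = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"),
  Size      = c(5, 8, 4, 10, 6, 7, 9, 5, 6, 12)
)

# Quantitative variable: median and IQR within each group
aggregate(Size ~ Diarrhoea,
          data = households,
          FUN = function(x) c(Median = median(x), IQR = IQR(x)) )

# Qualitative variable: counts, and row percentages, within each group
counts <- xtabs( ~ Livestock + Diarrhoea,
                 data = households)
rowPercents <- round( prop.table(counts, margin = 1) * 100, 1)

counts
rowPercents
```

Using `margin = 1` gives percentages within each row, so each row of `rowPercents` sums to $100$%. This is the comparison of interest when asking whether the percentage of households reporting diarrhoea differs between the rows.
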
The table summarises the *sample*, but RQs are about the *population*.
For example, one RQ could be:
> Is the percentage of households with children under\ $5$ years of age having diarrhoea the same for households that do and do not keep livestock?
Since the observed sample is one of countless possible samples that may have been selected, answering RQs about the population is not straightforward.
In the observed sample, $85.0$% of households that *did not* keep livestock reported diarrhoea in children under\ $5$, while $64.6$% of households that *did* keep livestock reported diarrhoea in children under\ $5$.
That is, a difference is seen *in the sample*; but RQs ask about the *population*.
Broadly, two possible reasons could explain why the *sample* percentages of households reporting diarrhoea in children are different:
1. *The **population** percentages are the same*.
The *sample* percentages are *different* simply because of the households selected in this particular sample.
Another sample, with different households, might produce different sample percentages.
*Sampling variation explains the difference in the sample percentages*.
2. *The **population** percentages are different*.
The difference between the *sample* percentages reflects this difference between the *population* percentages.
The difficulty is knowing which of these two reasons ('hypotheses')\index{Hypotheses} is the more likely explanation for the difference between the sample percentages.
This question is of prime importance as it answers the RQ.\spacex
Tools for answering these questions are considered later in this book.
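
The first explanation (sampling variation) can be illustrated with a small simulation. This is only a sketch, and not necessarily the approach taken later in this book. All the numbers are assumptions: the population proportion ($0.70$) is the same in both groups by construction, and the group sizes ($20$ and $65$) are hypothetical values chosen to total $85$ households.

```r
# A minimal simulation sketch; all numbers here are assumptions for illustration
set.seed(2024)          # For reproducibility
popProportion <- 0.70   # Assumed: the SAME population proportion in both groups
nNoLivestock  <- 20     # Hypothetical group sizes
nLivestock    <- 65
numSamples    <- 1000   # Number of simulated samples

# For each simulated sample, find the difference between the
# two sample percentages (in percentage points)
diffs <- replicate(numSamples, {
  pctNo  <- 100 * rbinom(1, size = nNoLivestock, prob = popProportion) / nNoLivestock
  pctYes <- 100 * rbinom(1, size = nLivestock,   prob = popProportion) / nLivestock
  pctNo - pctYes
})

summary(diffs)
hist(diffs,
     las = 1,
     xlab = "Difference between sample percentages (percentage points)",
     main = "Differences when the population percentages are equal")
```

Even though the population percentages are identical in this simulation, the sample percentages usually differ, sometimes by many percentage points. A difference between sample percentages does not, on its own, show that the population percentages differ.
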
## Quick revision questions {#SummaryCommentsQuickReview}
::: {.webex-check .webex-box}
Are the following statements *true* or *false*?
1. Graphs usually have their captions *under* the figure. \tightlist
`r if( knitr::is_html_output() ) {torf(answer = TRUE)}`
1. Graphs should use as many colours as possible.
`r if( knitr::is_html_output() ) {torf(answer = FALSE)}`
1. Graphs should usually be carefully created using computer software.
`r if( knitr::is_html_output() ) {torf(answer = TRUE)}`
1. Tables should have plenty of horizontal and vertical lines.
`r if( knitr::is_html_output() ) {torf(answer = FALSE)}`
1. Tables usually have their captions *under* the table.
`r if( knitr::is_html_output() ) {torf(answer = FALSE)}`
:::
## Exercises {#SummariseCommentsExercises}
[Answers to odd-numbered exercises] are given at the end of the book.
`r if( knitr::is_latex_output() ) "\\captionsetup{font=small}"`
::: {.exercise #Graphs123}
What would be the best graph for displaying the data for these situations?
1. Researchers record the pH of water and the temperature of the water, in various creeks, to explore the relationship.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
"A histogram",
"A boxplot",
"A histogram of the differences",
answer = "A scatterplot",
"A side-by-side bar chart"))}`
2. Researchers measure the difference between each swimmer's fastest $100\ms$ time and their fastest $200\ms$ time.
The researchers were interested in the average time *difference*.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
"A histogram",
"A boxplot",
answer = "A histogram of the differences",
"A scatterplot",
"A side-by-side bar chart"))}`
3. A research study examined the way in which students usually came to university (bus; private car; carpooling; etc.) and their program of study.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
"A histogram",
"A boxplot",
"A histogram of the differences",
"A scatterplot",
answer = "A side-by-side bar chart"))}`
:::
::: {.exercise #Graphs456}
What would be the best graph for displaying the data for these situations?
1. Researchers record the number of times a specific recycling bin is used each day at a shopping centre, over many days.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
answer = "A histogram",
"A boxplot",
"A histogram of the differences",
"A scatterplot",
"A side-by-side bar chart"))}`
2. Researchers measure the difference between heart rate before and two hours after drinking a cup of coffee.
The researchers were interested in the average increase in heart rate.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
"A histogram",
"A boxplot",
answer = "A histogram of the differences",
"A scatterplot",
"A side-by-side bar chart"))}`
3. A research study recorded the diet of students (vegan; vegetarian; other) and the cost of groceries in the previous week, for many students.
The researchers were exploring if there was any relationship between diet and cost of groceries.
`r if( knitr::is_html_output() ) {longmcq( c(
"A bar chart",
"A histogram",
answer = "A boxplot",
"A histogram of the differences",
"A scatterplot",
"A side-by-side bar chart"))}`
:::
::: {.exercise #GraphsLimeTrees}
@data:ForestBiomass2017 recorded these variables for $385$\ lime trees in Russia:
the foliage biomass (in\ kg; the response variable);
the tree diameter (in\ cm; the explanatory variable);
the age of the tree (in\ years); and
the origin of the tree (one of Coppice, Natural, or Planted).
The purpose of the study is to estimate the foliage biomass from the tree diameter, and perhaps the other extraneous variables.
What graphs would be useful?
:::
::: {.exercise #GraphNitrogenInSoil}
@data:Lane2002:GLMsoilscience recorded the soil nitrogen after applying different fertilizer doses.
The researchers recorded:
* the fertilizer dose, in kilograms of nitrogen per hectare;
* the soil nitrogen, in kilograms of nitrogen per hectare; and
* the fertilizer source; one of 'inorganic' or 'organic'.
What graphs would be useful for understanding the data?
:::
::: {.exercise #GraphNoisyMiners}
@data:Maron:eucthreshold counted the number of noisy miners (an Australian bird) and eucalyptus trees in random quadrats.
Critique the graph of the data (Fig.\ \@ref(fig:MinerCrabPlot), left panel).
:::
```{r, MinerCrabPlot, fig.cap="Left: the number of noisy miners and the number of eucalyptus trees. Right: a scatterplot of the colour of female horseshoe crabs and the condition of their spines.", fig.align="center", fig.width=8, fig.height=2.5, out.width='95%'}
par(mfrow = c(1, 2))
data(NMiner) ### Exercise
par( mar = c(4, 4, 3, 2) + 0.1 )
plot(Minerab ~ Eucs,
data = NMiner,
ylim = c(0, 35),
xlab = "E_trees",
ylab = "num_miners",
pch = 1:9)
###
data(HCrabs) ### Exercise
plot( as.numeric( factor(Col) ) ~ as.numeric( factor(Spine) ),
data = HCrabs,
xlab = "spine",
ylab = "colour",
main = "Dot chart",
las = 1)
```
::: {.exercise #GraphHorseshoeCrabs}
@data:brockmann:crabs recorded, among other things, the colour of the carapace ('Light medium', 'Medium', 'Dark medium' or 'Dark') and the condition of the spines ('Both OK', 'One OK', 'None OK') of $n = 173$ female horseshoe crabs.
Critique the scatterplot (Fig.\ \@ref(fig:MinerCrabPlot), right panel) used to explore the data.
:::
::: {.exercise #GraphsMADRS}
@data:Danielsson2014:Depression examined the change in <span style="font-variant:small-caps;">madrs</span> (a *quantitative* scale measuring level of depression) and treatment group (whether each person was treated using: exercise; body awareness; or advice).
1. What is the response variable?
1. What is the explanatory variable?
1. What graphs would be useful for exploring the data and the relationships of interest?
:::
::: {.exercise #GraphsSkewBar}
A study of high-performance athletes at the *Australian Institute of Sport* (AIS) [@data:Telford1991:sexsportsize] recorded numerous variables about athletes.
A plot for the sports played by the athletes is shown in Fig.\ \@ref(fig:AISSportBarchart).
How would you describe the data: left skewed, right skewed, approximately symmetrical?
Or something else?
:::
```{r AISSportBarchart, fig.cap="Sports played by athletes in the AIS study.", fig.align="center", fig.width=5, fig.height=3.25}
data(AISsub) ### Exercise
SPT <- sort( table(AISsub$Sport) ) # Sorted from smallest to largest category
par(mar = c(6, 4, 4, 2) + 0.1 ) # DEFAULT: c(5, 4, 4, 2) + 0.1.
barplot(SPT,
las = 2,
ylab = "Number of AIS athletes",
xlab = "Sport",
main = "Sports played by AIS athletes",
col = plot.colour)
```
::: {.exercise #GraphsTyping}