---
title: "German Uboat Attacks in WWII data project"
author: "adenoz"
date: "2023-04-21"
output:
  pdf_document: default
  html_document: default
---
# Introduction
This document aims to provide an example of data analysis that is conducted against an *event* dataset. This event dataset contains curated and carefully structured data on German Uboat attacks during WWII.
This analysis was conducted using the R programming language, in an R Markdown notebook using the RStudio Integrated Development Environment (IDE). While this analysis was conducted using R, it could just as easily be conducted using Python or Julia in Jupyter Notebooks.
Data science notebooks allow analysts to write prose while also including *code blocks* that perform some action on the imported data. Typically, this can include data ingestion, wrangling, tidying, transformation, plotting and graphing, as well as modelling.
The PDF you are reading is an example of what a notebook can be exported as. The actual code blocks can all be included in the output, a selection can be included, or none can be included. This will depend on the audience. If you are sharing an analysis with other technical users, you will probably want to include the code so they can see what has been done. For a non-technical reader, no code can be included.
In this example, all code is included, as the intent is to show analysts the code used to generate the various outputs. Be aware, however, that if the purpose were to communicate knowledge about the data, the 'so what', none of the code would be needed and it would actually detract. Note that in the code blocks, lines beginning with # are known as comments. A comment is not code; it explains what the code is doing. R ignores everything from a # to the end of the line, so comments are there purely for human readers. Lines of code can also be "commented out" so they are not run, and they are easy to use again by deleting the #.
This document shows the progress, thinking and exploring as we get deeper into a dataset. This sort of analysis is known as *exploratory data analysis*. In this form, not all of it would need to appear in a final report. Typically, you would go through this exploratory phase, find interesting things, then produce a separate, more focused and concise report (or dashboard) that doesn't take a reader on a journey but just shows them the 'so what'. However, this example is aimed at analysts for educational purposes, so it does show the exploratory journey.
The following code block contains instructions for setting up and importing the required packages used in this analysis. It also shows how to set the working directory, although those lines are commented out here.
```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
#install.packages("dplyr")
#install.packages("ggplot2")
#install.packages("lubridate")
#install.packages("faraway")
#install.packages("tidyr")
#install.packages("MASS")
#library(faraway)
library(MASS)
library(lubridate)
library(tidyr)
library(knitr)
library(dplyr)
library(ggplot2)
# Set working directory
#workd <- "path/to/directory"
#setwd(workd)
```
# The data
The dataset contains details of all attacks (events) conducted by German Uboats during WWII. This dataset contains data collected from [uboats.net](http://www.uboats.net/index.html). The uboat data was downloaded from [this Github repository](https://github.com/kadenhendron/uboat-data/tree/master/data).
The uboat data is an interesting dataset. It contains records of the WWII activities of over a thousand German uboats. It contains not only data about the boats themselves but also about their operational activities, including the number of ships they sank, the tonnage sunk and the number of human fatalities from their attacks. It also includes details of their commanders and the time periods they were active.
This dataset offers numerous opportunities to answer questions that would have been valuable during the actual conflict to inform planning and operations.
There are two CSV files for this dataset. The following files were downloaded from the repository:
- uboat-data.csv
- uboat-target-data.csv
```{r importing data}
uboat.df = read.csv("uboat_data.csv")
uboat.target.df = read.csv("uboat_target_data.csv")
# View(uboat.df) # View opens the data in a new tab to visually explore.
# View(uboat.target.df) # This can be handy while you are getting used to a new dataset
# head(uboat.df)
# head(uboat.target.df, 20)
```
### Uboat dataframe
The uboats data has `r nrow(uboat.df)` rows, or observations, and `r ncol(uboat.df)` columns, or variables. The dataframe contains basic records of each uboat from the date it was commissioned through to its fate. This dataframe also includes a count of the total ships each uboat sank. There is one line per uboat. Note that some uboats are not included. For example, there are no records for uboats 112 to 115. Further research could explore why this is, or whether uboats with those designations ever existed. Gaps in data are important to know about, along with their implications.
Here are the names of the variables.
```{r uboat variables}
names(uboat.df)
```
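Gaps like the missing uboats 112 to 115 can also be found programmatically rather than by eye. The following is a minimal sketch on made-up values. It assumes the `name` column holds designations of the form `U-111`, which is an assumption about the data and should be checked against the real column first.

```r
# Hypothetical designations; the real `name` column may be formatted differently.
uboat_names <- c("U-109", "U-110", "U-111", "U-116", "U-117")

# Strip everything that is not a digit and convert what is left to integers.
uboat_numbers <- as.integer(gsub("[^0-9]", "", uboat_names))

# Any number in the full range that does not appear in the data is a gap.
full_range <- seq(min(uboat_numbers), max(uboat_numbers))
missing_numbers <- setdiff(full_range, uboat_numbers)
missing_numbers
```

A list of missing designations is only a starting point. It does not say whether the boats never existed or were simply not recorded.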
### Uboat target dataframe
The uboat target data has `r nrow(uboat.target.df)` observations and `r ncol(uboat.target.df)` variables. This dataframe contains a single observation for each attack from a uboat. It includes details of each attack including the uboat, date, time and location as well as numerous details of the target including type, nationality, tonnage, complement of personnel, survivors, killed and the uboat commander for each attack.
Here are the names of the variables.
```{r uboat target variables}
names(uboat.target.df)
```
# Research questions
By analysing this data, we seek to gain an insight into German uboat operations and answer the following questions:
- Who was the highest threat uboat commander?
- Who was the least effective uboat commander?
- What were the preferred targets of the highest threat uboat commanders?
- When were attacks most successful?
- How did uboat operations change over time?
- Where were uboats most active or lethal? (don't know if I'll get to this one)
I haven't focused on it in this analysis, but there are coordinates for both dataframes which could facilitate interesting geo-spatial analysis.
# Data wrangling
## Data cleaning
Before diving into the data and our analysis, we first need to do some data cleaning. The year variables are entered inconsistently (sometimes the year is written in full like 1941 and sometimes abbreviated like 41). The time variable is entered as a character string, so it also needs converting. We'll change these variables to the proper data types, which will enable us to sort and filter as desired.
```{r data cleaning}
# DATES========================
# NOTE: I could almost do this all in one go, doing all the date columns at once.
# HOWEVER: I could not work it out when using the ifelse statement to fix the year formats
# SO: I am doing this manually for each date column, which isn't nice
# ordered
# Note the ifelse statement. This is needed as years in data are inconsistent using
# short (42) and long (1942)
# so need the ifelse to fix both formats.
# start with the long format date change. Then the short years will look like year 0042.
# so the if < "1900-01-01" will pick up years like "0042" and change the year to "19%y".
# note also below code works when day or month have zero in front or not like 7 or 07.
uboat.df$ordered = as.Date(uboat.df$ordered, format = "%m/%d/%Y")
uboat.df$ordered = as.Date(ifelse(uboat.df$ordered < "1900-01-01",
format(uboat.df$ordered, "19%y-%m-%d"), format(uboat.df$ordered)))
# laid_down
uboat.df$laid_down = as.Date(uboat.df$laid_down, format = "%m/%d/%Y")
uboat.df$laid_down = as.Date(ifelse(uboat.df$laid_down < "1900-01-01",
format(uboat.df$laid_down, "19%y-%m-%d"), format(uboat.df$laid_down)))
# commissioned
uboat.df$commissioned = as.Date(uboat.df$commissioned, format = "%m/%d/%Y")
uboat.df$commissioned = as.Date(ifelse(uboat.df$commissioned < "1900-01-01",
format(uboat.df$commissioned, "19%y-%m-%d"), format(uboat.df$commissioned)))
# launched
uboat.df$launched = as.Date(uboat.df$launched, format = "%m/%d/%Y")
uboat.df$launched = as.Date(ifelse(uboat.df$launched < "1900-01-01",
format(uboat.df$launched, "19%y-%m-%d"), format(uboat.df$launched)))
# fate
uboat.df$fate = as.Date(uboat.df$fate, format = "%m/%d/%Y")
uboat.df$fate = as.Date(ifelse(uboat.df$fate < "1900-01-01",
format(uboat.df$fate, "19%y-%m-%d"), format(uboat.df$fate)))
# also for uboat.target.df for attack_date variable
uboat.target.df$attack_date = as.Date(uboat.target.df$attack_date, format = "%m/%d/%Y")
uboat.target.df$attack_date = as.Date(ifelse(uboat.target.df$attack_date < "1900-01-01",
format(uboat.target.df$attack_date, "19%y-%m-%d"),
format(uboat.target.df$attack_date)))
#head(uboat.target.df)
# TIMES=========================
# Note that time is linked to a date. so we need to link the attack_time variable to the
# attack_date variable
# so we'll combine those columns
# BUT. Even before that, note we have notation for AM and PM. parsing dates won't recognise that.
# SO, we need to separate that data out into a separate column so we can fix the dates later.
# ALSO, some of the observations do not have entries for attack_time so these will be
# empty or have weird values
# This separates (using tidyr) the AM and PM out from the attack_time variable
# the AM and PM will be in a new variable called 'when'
uboat.target.df = uboat.target.df %>%
separate(attack_time, c('attack_timez', 'when'), sep=" ")
# Combine date and time columns into one (so we can relate the time to a date)
uboat.target.df$dateandtime <- as.character(paste(uboat.target.df$attack_date,
uboat.target.df$attack_timez, sep = ' '))
# Then format the new variable to include date and time using as.POSIXct
uboat.target.df$dateandtime = as.POSIXct(uboat.target.df$dateandtime, format = "%Y-%m-%d %H.%M.%S")
# now we need to add 12hrs to the time column if the 'when' column has PM
# adding 12hrs involves using 12 * 60 * 60 to the time column
# This is how we add 12hrs need to multiply 60 seconds and 60 minutes and 12 hours
hourz = 12 * 60 * 60
# this is basically an if statement. There's probably a nicer way to do this in R...
# The times are on a 12-hour clock, so only PM times from 1 to 11 o'clock are shifted.
# 12:xx PM is already correct (adding 12 hours would push it into the next day)
# and 12:xx AM is really 00:xx, so it has 12 hours subtracted instead.
has_time = !is.na(uboat.target.df$dateandtime)
is_pm = has_time & uboat.target.df$when %in% "PM" & hour(uboat.target.df$dateandtime) != 12
is_12am = has_time & uboat.target.df$when %in% "AM" & hour(uboat.target.df$dateandtime) == 12
uboat.target.df$dateandtime[is_pm] = uboat.target.df$dateandtime[is_pm] + hourz
uboat.target.df$dateandtime[is_12am] = uboat.target.df$dateandtime[is_12am] - hourz
# and check it has worked correctly. Refer to the View and confirm correct
#head(uboat.target.df)
# NAs===================
# We also want to deal with NA entries for numbers where we want to do calculations later.
# We don't want silent errors that may not be evident. So in this case we'll change
# NA values to zero.
# Other times it may be more appropriate to change NA to the previous value or the
# mean or some other value.
#sum(is.na(uboat.target.df$dead))
#sum(is.na(uboat.target.df$complement))
uboat.target.df$dead[is.na(uboat.target.df$dead)] = 0
uboat.target.df$complement[is.na(uboat.target.df$complement)] = 0
```
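The date cleaning above repeats the same two steps for every date column. One way to do all columns in one go is to wrap the fix in a small function and apply it with `lapply()`. The sketch below is a minimal example on made-up rows. It uses a different technique from the chunk above: it pads two-digit years with a regular expression *before* parsing, rather than parsing first and repairing afterwards. The function name `fix_short_year` is hypothetical, and it assumes every two-digit year belongs to the 1900s.

```r
# Parse m/d/Y strings where the year may be written as 42 or 1942.
# Assumption: every two-digit year belongs to the 1900s.
fix_short_year <- function(x) {
  # Turn a trailing "/42" into "/1942"; four-digit years are left untouched.
  padded <- sub("/([0-9]{2})$", "/19\\1", x)
  as.Date(padded, format = "%m/%d/%Y")
}

# Made-up rows that mimic the mixed formats described above.
toy <- data.frame(
  ordered  = c("7/21/42", "12/01/1941", ""),
  launched = c("3/5/43", "1/15/1942", "6/30/42"),
  stringsAsFactors = FALSE
)

# Apply the same fix to every date column at once.
date_cols <- c("ordered", "launched")
toy[date_cols] <- lapply(toy[date_cols], fix_short_year)
str(toy)
```

Empty strings come out as `NA`, so missing dates stay visibly missing rather than being given a made-up value.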
```{r run this before knitting, include=FALSE}
#==========THIS NEEDS TO BE RUN BEFORE KNITTING TO PDF==============
# This section was added here later in the analysis when these issues arose.
# UNICODE problems====================
# Some of the uboat commander names use unicode values that do not render properly when knitting
# a notebook to PDF. So we'll change only the names that do appear in our analyses here up front.
# technically these names will now be incorrect but it simply allows for easy use of text
uboat.target.df$commander[uboat.target.df$commander == "G�nther Lorentz"] = "Gunther Lorentz"
uboat.target.df$commander[uboat.target.df$commander == "Wolfgang L�th"] = "Wolfgang Luth"
uboat.target.df$commander[uboat.target.df$commander == "G�nther Prien"] = "Gunther Prien"
uboat.target.df$commander[uboat.target.df$commander == "Karl-J�rg W�chter"] = "Karl-Jurg Wachter"
uboat.target.df$commander[uboat.target.df$commander == "Viktor Sch�tze"] = "Viktor Schutze"
uboat.target.df$commander[uboat.target.df$commander == "Otto von B�low"] = "Otto von Bulow"
uboat.target.df$commander[uboat.target.df$commander == "Horst H�ltring"] = "Horst Holtring"
uboat.target.df$commander[uboat.target.df$commander == "Burckhard Hackl�nder"] = "Burckhard Hacklander"
uboat.target.df$commander[uboat.target.df$commander == "Hans Rudolf R�sing"] = "Hans Rudolf Rosing"
uboat.target.df$commander[uboat.target.df$commander == "G�nter Hessler"] = "Gunter Hessler"
```
That concludes the initial data wrangling. Dates and times are now all fixed. We can now begin thinking about the data we have, our questions and what variables we may want to generate ourselves.
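Times written with AM and PM are on a 12-hour clock, where 12:xx PM is midday and 12:xx AM is just after midnight. An alternative to splitting off the marker and shifting by 12 hours is to let the parser handle it, using `%I` (hour on a 12-hour clock) together with `%p` (the AM/PM marker). The following is a minimal sketch on made-up strings. The dot-separated time layout mirrors the format string used in the cleaning chunk and is an assumption about the raw data.

```r
# Made-up attack times on a 12-hour clock (assumed layout, not real records).
stamps <- c("1942-07-21 9.30.00 PM",
            "1942-07-21 12.15.00 PM",
            "1942-07-21 12.15.00 AM")

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
# A fixed time zone keeps the result independent of the machine's settings.
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %I.%M.%S %p", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M")
```

Note that `%p` depends on the locale. It works in English and C locales, but some locales have no AM/PM marker, in which case the manual approach is the safer one.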
## Variable creation
We can now start to consider ways we can deconstruct our data and variables and generate new potentially interesting variables we may like to explore our data with.
### Lethality ratio
What would constitute the most lethal commander? Should total tonnage sunk be the metric? What about personnel killed? Both of these values would skew the results in favour of those who targeted the largest ships. That *could* be useful, but is it the best measure? An alternative is a ratio of those killed to the total personnel on board, which lies between zero and one. A figure approaching one means almost all personnel on board were killed. A figure closer to zero means almost all personnel survived. This provides another indication of how devastating an attack was. A ship that eventually sank may have sunk so slowly that everyone on board was able to escape, while a more devastating attack may not have allowed the time, killing more personnel. No single metric is complete on its own, but each enables different questions to be answered.
```{r lethality ratio}
# create new variable
uboat.target.df = uboat.target.df %>%
mutate(lethality = round(dead / complement, 2))
# 0/0 gives NaN and x/0 gives Inf (complement was set to 0 where unknown), so reset both to 0
uboat.target.df$lethality[!is.finite(uboat.target.df$lethality)] = 0
# look at distribution of new variable
hist(uboat.target.df$lethality, main = "Distribution of Lethality Ratio",
xlab = "Scale of lethality ratio",
sub = "Lethality ratio = deaths / complement")
```
The above is an interesting distribution. Most attacks killed only a small proportion of the personnel on board, but a subset of attacks was very lethal. We'll see if we can build our understanding of this as we go. We could easily filter and view our data based on lethality ratio > 0.8 but we'll keep going for now.
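Because the ratio divides by `complement`, it is worth being explicit about what happens when the complement is zero or unknown. In R, `0 / 0` is `NaN` and any positive number divided by zero is `Inf`. The sketch below shows one alternative treatment on made-up numbers: marking undefined ratios as `NA` rather than zero, so that attacks with unknown complements do not pull averages down. The function name `lethality_ratio` is hypothetical and this is not the approach used in the analysis itself.

```r
# Ratio of deaths to complement, with undefined results marked as NA.
lethality_ratio <- function(dead, complement) {
  ratio <- round(dead / complement, 2)
  # NaN (0/0), Inf (x/0) and NA inputs all end up as NA.
  ratio[!is.finite(ratio)] <- NA_real_
  ratio
}

# Made-up attacks: a normal case, 0/0, 3/0 and a missing death count.
dead       <- c(5, 0, 3, NA)
complement <- c(10, 0, 0, 20)
ratios <- lethality_ratio(dead, complement)
ratios

# Averages then need na.rm = TRUE and reflect only attacks with usable data.
mean(ratios, na.rm = TRUE)
```

Which treatment is right depends on the question. Zero says "nobody died", while `NA` says "we don't know", and the two lead to different monthly means.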
### Time of day factor
We may also want to see how attacks fared based on the time of day. We now have proper times set up, so partitioning our data based on some categories for time of day may prove insightful. We already have AM and PM which may be interesting, so we'll keep that variable and just enrich it further with a supplemental variable.
```{r time period of day, include=FALSE}
# this didn't quite work for some reason though leaving here for now.
# The next code block worked great and is much simpler in my opinion
#tt = uboat.target.df
#breaks=hour(hm("00:00", "05:00", "07:00", "11:00", "13:00","17:00","19:00", "23:59"))
#labels = c("small_hours", "sunrise", "morning", "midday", "afternoon", "sunset", "evening")
#tt$day_period = cut(x = hour(tt$dateandtime), breaks = breaks, labels = labels, include.lowest = TRUE)
#tail(tt, 30)
```
```{r time period of day mutate}
uboat.target.df = uboat.target.df %>%
mutate(day_period = case_when(
hour(dateandtime) < 6 ~ "small_hours",
hour(dateandtime) < 8 ~ "sunrise",
hour(dateandtime) < 11 ~ "morning",
hour(dateandtime) < 13 ~ "midday",
hour(dateandtime) < 16 ~ "afternoon",
hour(dateandtime) < 20 ~ "sunset",
hour(dateandtime) < 24 ~ "evening",
TRUE ~ "nil"
))
#head(uboat.target.df, 30)
kable(uboat.target.df %>%
count(day_period) %>%
arrange(desc(n)))
```
We can see that the small hours (midnight to sunrise) and evening (sunset to midnight) are the preferred time periods to conduct attacks overall, with the small hours being the most popular. So it would appear that uboats were most dangerous and active at night. While the above is sorted by the count, the time of day is considered ordinal data, not nominal. This means there is an inherent order which shouldn't be changed. So we'll now plot the data in time of day order to get a different sense / perspective of the time of day counts.
```{r time of day bar}
uboat.target.df %>%
filter(day_period != "nil") %>%
group_by(day_period) %>%
summarise(count = n()) %>%
mutate(day_period = factor(day_period, levels = c("small_hours", "sunrise", "morning", "midday", "afternoon", "sunset", "evening", "nil"))) %>%
ggplot(aes(day_period, count)) +
geom_bar(stat = 'identity') +
labs(title = "Attack count by time of day",
x = "Time of day",
y = "Count") +
theme_minimal()
```
The above plot represents the count of attacks by time of day in a more logical order. This shows some nuance around plotting and presenting ordinal data. Ordinal data *can* be re-ordered, but only after deliberate consideration. It's generally bad practice and should be discouraged.
Let's quickly count AM and PM. AM and PM will only give us a fairly crude metric but it could improve our understanding of our data. AM will be for attacks that happen in the first half of the day, and PM for the latter half of the day.
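The same time-of-day binning can also be done with base R's `cut()`, which returns a factor whose levels are already in time-of-day order, so no separate re-levelling step is needed before plotting. The following is a minimal sketch on made-up hours. The break points are copied from the `case_when()` chunk, and `right = FALSE` is what makes the bins match its `<` comparisons.

```r
# Hours of the day for a few made-up attacks; NA stands for a missing attack time.
attack_hour <- c(0, 5, 6, 7, 10, 12, 15, 19, 23, NA)

period_labels <- c("small_hours", "sunrise", "morning", "midday",
                   "afternoon", "sunset", "evening")

# right = FALSE makes each bin include its lower bound and exclude its upper
# bound, e.g. [0, 6) for small_hours, matching the `<` comparisons in case_when.
day_period <- cut(attack_hour,
                  breaks = c(0, 6, 8, 11, 13, 16, 20, 24),
                  labels = period_labels,
                  right = FALSE)

# Counts come out in time-of-day order; missing times are reported separately.
table(day_period, useNA = "ifany")
```

Missing times stay as `NA` here rather than becoming a "nil" category, which is a design choice and not necessarily better for every table.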
```{r when table}
kable(uboat.target.df %>%
count(when) %>%
arrange(desc(n)))
```
We can see that there seem to be more attacks in the first half of the day, likely due to the attacks in the small_hours, sunrise and morning. The difference doesn't seem large but it may still be a worthwhile metric to keep in the dataset.
We can also see there is some crappy data in the `when` variable which we should clean up while we're passing through. These are likely remnants from when we created the `when` variable by pulling it out of the initial `attack_time` variable, which contained a string with AM or PM at the end, separated by a space. There's only a fairly small number of these so we won't go back and refine our initial cleaning. However, if this were a live dataset or we were aiming to disseminate our findings, we would go back and explore each individual case above to refine our initial data wrangling.
```{r cleaning up when variable}
uboat.target.df$when[uboat.target.df$when != "AM" & uboat.target.df$when != "PM"
| is.na(uboat.target.df$when)] = "Nil"
#is.na(tt$when)
kable(uboat.target.df %>%
count(when) %>%
arrange(desc(n)))
```
### Day of week
As with many other event datasets, we may want to look at days of the week. In this example, we'll use the sum of tonnage sunk rather than just counting the attacks.
```{r weekdays}
# create weekday variable
uboat.target.df$weekday <- weekdays(uboat.target.df$dateandtime)
# group tonnage sunk by day and sum the amounts
uboat.day = uboat.target.df %>%
group_by(weekday) %>%
summarize(tonz = sum(tonnage))
uboat.day = na.omit(uboat.day) # Need this otherwise there'll be a blank day for the NAs
# organise days into logical order
uboat.day = uboat.day %>%
arrange(factor(weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
barplot(uboat.day$tonz, names.arg = uboat.day$weekday, horiz=TRUE, las=1,
main="Tonnage sunk by day")
```
There are no especially strong patterns here for now, though we could also measure the daily amount for other variables at a later point.
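Two details of a weekday summary are easy to trip over: `weekdays()` returns names in the language of the current locale, and `sum()` returns `NA` as soon as one value in a group is missing. The sketch below is a minimal base R example on made-up rows that sidesteps both. It derives the weekday from the ISO day number (`%u`, where Monday is 1) and sums with `na.rm = TRUE`. Whether missing tonnage should be ignored in this way is an assumption that depends on the analysis.

```r
# Made-up attacks; 1942-07-20 was a Monday.
attacks <- data.frame(
  attack_date = as.Date(c("1942-07-20", "1942-07-21", "1942-07-27", "1942-07-26")),
  tonnage = c(5000, 7000, 3000, NA)
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# %u gives the ISO weekday number (1 = Monday), independent of locale.
day_number <- as.integer(format(attacks$attack_date, "%u"))
attacks$weekday <- factor(day_names[day_number], levels = day_names)

# Sum tonnage per weekday; days with no attacks at all come back as NA.
tonz <- tapply(attacks$tonnage, attacks$weekday, sum, na.rm = TRUE)
tonz
```

Because `weekday` is a factor with its levels in calendar order, the result is already sorted Monday to Sunday.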
### Time period served
We'll now create a new variable that measures the time a uboat served, from the time it was launched into the water until its ultimate fate. This metric may or may not contribute to its overall threat. Because dates are in the proper date format, R will automatically count the days. We'll simply print the six longest serving uboats for now. We may use this new variable later.
```{r time served}
# create new variable for uboat.df dataframe
# (inside mutate, columns can be referred to by name without the uboat.df$ prefix)
uboat.df = uboat.df %>%
  mutate(total_time = fate - launched)
# print out table for production (total_time already exists, so no second mutate is needed)
kable(uboat.df %>%
  select(name, type, launched, fate, total_time) %>%
  arrange(desc(total_time)) %>%
  head())
#head(uboat.df)
```
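Subtracting one `Date` from another gives a `difftime` object rather than a plain number. That prints nicely, but it can be worth converting explicitly when the value will be summed, averaged or plotted. The following is a minimal sketch on made-up dates.

```r
# Made-up launch and fate dates for two boats.
launched <- as.Date(c("1940-03-01", "1941-06-15"))
fate     <- as.Date(c("1941-03-01", "1941-07-15"))

# Asking for days explicitly and converting to a plain number avoids
# surprises about units later on.
total_days <- as.numeric(difftime(fate, launched, units = "days"))
total_days
```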
### Quickest to attack
We may want to find which uboats were the quickest from the time of launch to their first attack. While elements of luck and the tempo of the war are likely to be at play here, it may also identify competent crews and captains.
This will be the first variable where we need data from both dataframes so we'll need to join the elements we need as well.
```{r quickest to attack}
# First attack date
earliest_attack = uboat.target.df[c("name", "attack_date", "commander")]
earliest_attack = earliest_attack %>%
group_by(name, commander) %>%
summarize(first_attack = min(attack_date))
# date of launch
launched = uboat.df[c("name", "launched")]
# inner join
quickest = earliest_attack %>%
inner_join(launched, by = "name")
# Show sorted results
kable(quickest %>%
mutate(time_to_attack = first_attack - launched) %>% # calculate new variable
select(name, launched, commander, first_attack, time_to_attack) %>% # simply re-organising
arrange(time_to_attack) %>%
head(10))
#head(quickest)
```
The above may or may not be of interest. Besides ordering by the time to the first attack, we've included the commanders as well as the uboat names. These names may appear again later in our analysis. There may or may not be a relationship between the `time_to_attack` variable and a commander's effectiveness.
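An inner join keeps only the rows whose key appears in *both* tables, so boats without a launch record or without any attacks silently drop out. The sketch below shows this in base R on made-up records, using `merge()` in place of `inner_join()`. All names and dates are illustrative only.

```r
# Made-up attack and launch records.
attacks <- data.frame(
  name = c("U-1", "U-1", "U-2", "U-3"),
  attack_date = as.Date(c("1940-05-10", "1940-02-01", "1941-01-20", "1942-03-03")),
  stringsAsFactors = FALSE
)
boats <- data.frame(
  name = c("U-1", "U-2", "U-4"),
  launched = as.Date(c("1939-12-01", "1940-11-01", "1941-01-01")),
  stringsAsFactors = FALSE
)

# Earliest attack per uboat: sort by date, then keep the first row for each name.
attacks <- attacks[order(attacks$attack_date), ]
first_attack <- attacks[!duplicated(attacks$name), ]
names(first_attack)[names(first_attack) == "attack_date"] <- "first_attack"

# merge() does an inner join by default: U-3 (no launch record) and
# U-4 (no attacks) do not appear in the result.
quickest <- merge(first_attack, boats, by = "name")
quickest$time_to_attack <- quickest$first_attack - quickest$launched
quickest[order(quickest$time_to_attack), ]
```

Comparing the number of rows before and after a join is a quick way to see how many records were dropped.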
# Data analyses
Now that we have our data in a reasonable condition, we can start to actually analyse it with minimal stopping and starting. It's likely we'll want to do more with the data we have, but we can start by exploring the data as it is now. We should already have enough to glean some interesting insights.
## Overall uboat attack tempo
To start with, let's take a big picture view of the tempo of attacks conducted by the uboats during WWII. Note that rather than use ggplot, we use the base R plot. We'll use both base R and ggplot throughout this project, just for fun and to show how they both work with different applications.
```{r overall uboat timelines}
all_uboat_attacks = uboat.target.df %>%
group_by(month=floor_date(dateandtime, "month")) %>%
count(month)
# Note the below is exploring putting the grid lines behind the plot, using par(new = TRUE) and two plots...
plot(all_uboat_attacks, type = 'n', xlab = "", ylab="", ylim = range(c(0,150)))
grid(0, NULL, lty=1, lwd=0.5)
par(new = TRUE)
plot(all_uboat_attacks, type='l', col = 'red', lwd = 2,
main = "Overall Attack Tempo of Uboat Operations WWII",
xlab = "Time",
ylab = "Number of attacks per month",
ylim = range(c(0,150)))
```
In the above plot, we see an increase in uboat operations from early 1942 until around early to mid 1943, when the tempo drops off to below pre-1942 levels. This gives us a broad baseline understanding of uboat operations before diving into some specifics.
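The monthly counts behind a tempo plot come from flooring each date to the first of its month and counting. The sketch below does this in base R on made-up dates. It also fills in months with no attacks as explicit zeros, so a line plot would drop to zero rather than jump straight across the gap. Whether the real data contains any empty months is not known here, so this is an illustration rather than a correction.

```r
# Made-up attack dates with nothing recorded in March.
attack_dates <- as.Date(c("1942-01-03", "1942-01-28", "1942-02-14",
                          "1942-04-01", "1942-04-02", "1942-04-30"))

# Floor each date to the first day of its month, the same idea as
# floor_date(x, "month").
month_start <- as.Date(format(attack_dates, "%Y-%m-01"))

# Build the full run of months so that empty months are counted as zero.
all_months <- seq(min(month_start), max(month_start), by = "month")
attacks_per_month <- table(factor(format(month_start), levels = format(all_months)))
attacks_per_month
```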
## Tonnage sunk by nationality over time
Let's now have a look at the tonnage of ships sunk over time, looking at only the countries with the five biggest losses overall.
```{r ggplot of targets over time by nationality}
# use ggplot to plot the above but separate by target nationality
most_sunk = uboat.target.df %>%
count(nationality) %>%
arrange(desc(n)) %>%
head(10)
uboat.target.df %>%
#group_by(nationality) %>%
filter(nationality %in% most_sunk$nationality[1:5]) %>% # %in% tests membership; == would recycle the five names
filter(loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month"), nationality) %>%
summarize(tonz = sum(tonnage)) %>%
ggplot(aes(month, tonz, col=nationality)) +
#geom_smooth(se = FALSE) +
geom_line(linewidth = 0.7) +
#scale_colour_viridis_d()+
scale_color_brewer(palette = "Dark2") +
labs(title = "Tonnage sunk by uboats over time",
subtitle = "Top 5 nationalities most losses",
x = "Time",
y = "Tonnage") +
theme_minimal()
# theme_classic()
```
In the above plot we can see the raw values. It is quite messy, though we can see that uboats sank British shipping the most, by tonnage. Second is the US. Comparing the middle section gets a little messy, so it can be useful to fit smoothed lines to look at broader trends over time. Such an approach loses detail but can be more informative.
The below plot contains the exact same data, but now the raw tonnage is smoothed using the *lowess* method (local regression, which `geom_smooth()` fits by default for data of this size).
```{r tonnage over time lowess}
uboat.target.df %>%
#group_by(nationality) %>%
filter(nationality %in% most_sunk$nationality[1:5]) %>% # %in% tests membership; == would recycle the five names
filter(loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month"), nationality) %>%
summarize(tonz = sum(tonnage)) %>%
ggplot(aes(month, tonz, col=nationality)) +
geom_smooth(se = FALSE) +
#geom_line() +
#scale_colour_viridis_d()+
scale_color_brewer(palette = "Dark2") +
labs(title = "Tonnage sunk by uboats over time (lowess smoothed)",
subtitle = "Top 5 nationalities with the most losses",
x = "Time",
y = "Tonnage (lowess smoothed)") +
theme_minimal()
```
In the above plot, we can more clearly see the trends of the tonnage sunk over time and more easily compare the differences between nationalities. This plot actually shows that towards the end of our data, British tonnage lost drops below the US and Norwegian.
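Keeping only the nationalities with the biggest losses is a set-membership test. In R that is what `%in%` does. Comparing with `==` against a vector of several values instead recycles that vector and compares position by position, which quietly keeps only some of the matching rows. The following is a minimal sketch on made-up values showing the difference.

```r
# Made-up target nationalities and a made-up "top" list.
nationality <- c("British", "American", "Norwegian", "British", "Greek", "American")
top <- c("British", "American")

# `==` recycles `top` along `nationality` and compares position by position,
# so a match depends on where a value happens to sit in the data.
kept_with_equals <- nationality[nationality == top]
kept_with_equals

# `%in%` asks, for each element, whether it appears anywhere in `top`.
kept_with_in <- nationality[nationality %in% top]
kept_with_in
```

Here `==` misses the second "British" row because it happens to line up with "American" in the recycled vector, while `%in%` keeps all four matching rows.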
## Attacks by day period over time
We'll now look at the time of day when the most tonnage tended to be sunk, throughout the WWII period.
```{r uboat tonnage sunk by time of day}
uboat.target.df %>%
filter(loss_type == "Sunk" & day_period != "nil") %>%
group_by(month=floor_date(dateandtime, "month"), day_period) %>%
summarize(tonz = sum(tonnage)) %>%
ggplot(aes(month, tonz, col=day_period)) +
geom_line(linewidth = 0.7) +
scale_color_brewer(palette = "Dark2") +
ylim(0,225000) +
labs(title = "Tonnage sunk by uboats by time of day",
subtitle = "When did uboats sink the most tonnage?",
x = "Time",
y = "Tonnage") +
theme_minimal()
```
The above plot is quite messy. It does appear that the small hours tended to be the time of day when the most tonnage was sunk, but it's not really clear from the raw data. Like before, let's smooth it and see if we can glean some clearer insights.
```{r most tonnage by time of day smoothed}
uboat.target.df %>%
filter(loss_type == "Sunk" & day_period != "nil") %>% # same filter as the raw plot above
group_by(month=floor_date(dateandtime, "month"), day_period) %>%
summarize(tonz = sum(tonnage)) %>%
ggplot(aes(month, tonz, col=day_period)) +
geom_smooth(se = FALSE, span = 2/3) +
scale_color_brewer(palette = "Dark2") +
ylim(0,225000) +
labs(title = "Tonnage sunk by uboats by time of day (lowess smoothed)",
subtitle = "When did uboats sink the most tonnage?",
x = "Time",
y = "Tonnage (lowess smoothed)") +
theme_minimal()
```
We can now see more clearly that the most tonnage tended to be sunk in the small hours of the morning. This became a strong trend early in the conflict. The second highest threat time of day was the evening. Besides being quite high early in the conflict, the afternoon tended to be the safest time of day for maritime vessels facing the uboat threat. For the first half of the conflict, more tonnage tended to be lost at sunset than at sunrise. Looking at this data over time helps to identify changes, which may indicate a change in tactics, techniques and procedures (TTPs) in either the attacking force (uboats in this case) or the defending forces, perhaps in response to successful attacks.
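To get a feel for what the smoothing is doing, the same kind of fit can be run directly with `loess()` from base R's stats package, which is what `geom_smooth()` uses for small datasets. The sketch below is a minimal example on made-up monthly values: a steady upward trend with alternating noise added. The `span` of 2/3 matches the plots above and means each local fit uses roughly two thirds of the points.

```r
# Made-up monthly values: a steady upward trend plus alternating noise.
month_index <- 1:24
tonnage <- 10 * month_index + rep(c(-40, 40), times = 12)

# Local regression; a larger span gives a smoother, less detailed line.
fit <- loess(tonnage ~ month_index, span = 2/3)
smoothed <- predict(fit)

# Compare the raw and smoothed values side by side.
round(head(cbind(month_index, tonnage, smoothed)), 1)
```

The smoothed values follow the underlying trend and largely ignore the month-to-month zigzag, which is why smoothed plots are easier to compare but hide short-lived spikes.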
## Lethality of uboats over time
Let's now look at a similar smoothed lowess line chart but for the mean lethality rather than the time of day.
```{r lethality over time}
uboat.target.df %>%
filter(loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month")) %>%
summarize(Lethality = mean(lethality)) %>%
ggplot(aes(month, Lethality)) +
#geom_line() +
geom_smooth(span = 2/3, na.rm = TRUE) +
#scale_color_brewer(palette = "Dark2") +
ylim(0,1) +
labs(title = "Lethality of uboats over time (lowess smoothed with confidence intervals)",
subtitle = "When were uboats most lethal? Mean lethality by month.",
x = "Time",
y = "Mean lethality (lowess smoothed)") +
theme_minimal()
```
We can see that uboat lethality roughly doubled between pre-1940 and the end of 1941, reaching almost 0.5. It then fell back to around 0.25 towards the end of 1943 before beginning to increase again. These changes could warrant additional research. Perhaps this plot is showing us how Allied responses to the uboat threat mitigated some of its lethality. Interestingly, as lethality began to drop into 1942, the rate of tonnage sunk increased. So this could also be the result of deliberate targeting choices made by uboats around that time, with priority placed on larger vessels for tonnage rather than on killing personnel. Likewise, the upward tick in lethality through 1944 came at roughly the time that tonnage sunk was low.
These plots gave us a broad understanding of how uboat operations progressed over time at a high level.
Note that the two previous plots used the *ggplot2* graphics package, which offers very rich customisation. These plots only used some basic functionality. Much more can be done, including building an entire custom theme, which lets teams quickly make good-looking, on-brand plots with minimal coding. Once a theme is built, all that is required is adding the theme function to the plot code.
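As a rough illustration of the custom theme idea, the sketch below wraps `theme_minimal()` and a few overrides in a reusable function. The name `theme_team`, the specific styling choices and the toy data are all assumptions made for this example, not part of the original analysis.

```r
library(ggplot2)

# Hypothetical team theme: theme_minimal() plus a few house-style overrides.
theme_team <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      plot.subtitle = element_text(colour = "grey40"),
      legend.position = "bottom",
      panel.grid.minor = element_blank()
    )
}

# Toy data standing in for a monthly tonnage summary (values are made up).
toy <- data.frame(
  month = as.Date(c("1940-01-01", "1940-02-01", "1940-03-01")),
  tonz = c(100, 150, 120)
)

# Using the theme is a one-line change at the end of the plot code.
p <- ggplot(toy, aes(month, tonz)) +
  geom_line(linewidth = 0.7) +
  labs(title = "Toy example", subtitle = "Made-up values",
       x = "Time", y = "Tonnage") +
  theme_team()
```

Any plot in this document could swap `theme_minimal()` for a function like this, so styling changes only need to be made in one place.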
## Most devastating attacks
We'll now begin exploring our data in more detail and seek to answer some of our questions. The first area we'll explore is to understand what were the most devastating attacks conducted by uboats. There are a few ways to measure this so we'll go through a few of them.
## Top 10 sunk by biggest tonnage
We'll start by simply seeking to understand what were the biggest ships sunk (not just damaged) by uboats, by tonnage. We'll include a few other potentially interesting variables to assist us in building a picture of the uboat threat as we go through our analyses.
```{r top ten biggest tonnage sunk}
kable(uboat.target.df %>%
select(name, loss_type, nationality, tonnage, commander, ship_name, dead, lethality, day_period) %>%
filter(loss_type == "Sunk") %>%
arrange(desc(tonnage)) %>%
select(-loss_type) %>%
head(10))
```
Before we start interpreting the table, we can see what appears to be a duplicate entry for the U-331 sinking of HMS Barham. I looked through the full dataframe and all details are identical. This is something to be mindful of as we start summing over larger sets of rows that we can't check entirely by eye. We may later conduct some deliberate searches for duplicates. In the meantime, let's look for interesting findings.
The attacks on HMS Barham and HMS Royal Oak seem especially devastating. Not only are the tonnages large, we can see high numbers of killed personnel and fairly high lethality for such large vessels. Somewhat interestingly, or not, the largest ships sunk were British. This could be more a reflection of the size of the British navy and maritime industries than anything else. Or it could show a preference or attractiveness for striking British vessels. While this is only our first top ten list, we see no standout uboats, commanders or preferred attack day period as yet. But this table allows us to start building our understanding of the uboat threat.
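A deliberate search for duplicates like the HMS Barham entry could be sketched with base R's `duplicated()`. The toy data frame below is only a stand-in for `uboat.target.df`, and the choice to compare whole rows (rather than a subset of key columns) is an assumption.

```r
# Toy stand-in for uboat.target.df; a small illustrative subset of columns.
attacks <- data.frame(
  name = c("U-331", "U-331", "U-47"),
  ship_name = c("HMS Barham", "HMS Barham", "HMS Royal Oak"),
  attack_date = as.Date(c("1941-11-25", "1941-11-25", "1939-10-14")),
  stringsAsFactors = FALSE
)

# Flag every row that has an identical twin, including the first occurrence.
is_dup <- duplicated(attacks) | duplicated(attacks, fromLast = TRUE)
dupes <- attacks[is_dup, ]

# Drop the repeats, keeping the first copy of each row.
deduped <- attacks[!duplicated(attacks), ]
```

Inspecting `dupes` before dropping anything is worthwhile, since two genuinely separate attacks could share the same values in a small set of columns.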
## Top 10 by most personnel killed
We'll now do something very similar, though will rank by the most personnel killed.
```{r top ten by most killed}
kable(uboat.target.df %>%
select(name, nationality, tonnage, commander, ship_name, dead, lethality, day_period) %>%
arrange(desc(dead)) %>%
head(10))
```
The attack on HMS Barham actually resulted in the most personnel killed as well. We also start to see some targets from other nationalities here. These attacks resulted in significant loss of life. This simple table, and the one before it, does show the significant effect uboats were having in the battlespace around Europe during WWII.
For the first time, we see one uboat and commander feature twice in the top ten. U-47 and Commander Gunther Prien sank HMS Royal Oak and the Arandora Star, which together resulted in over 1,000 deaths. Interestingly, there does seem to be a higher proportion of attacks around sunset or sunrise compared to the other times of day.
## Top 10 most lethal devastating attacks
We'll now look at the top ten most devastating attacks by the highest lethality. There are many attacks in the dataset where all crew were lost, so filtering by a lethality of 1 would produce a very long list. Instead, we kept raising the minimum number of dead in the filter until lethality rates lower than 1 started to appear in our list. We got up to 150 killed before we saw fewer than ten results with a lethality of 1. Remember, lethality is simply the ratio of those killed to the total number of personnel on board. So a lethality of 1 means all personnel were killed, and a lethality of 0 means all survived.
```{r most lethal devastating attack}
kable(uboat.target.df %>%
select(name, nationality, tonnage, commander, ship_name, dead, lethality, day_period) %>%
filter(dead > 150) %>%
arrange(desc(lethality)) %>%
head(10))
```
The first thing that stands out here is there is a German vessel sunk by a uboat. A quick search online confirmed our suspicions that the [Doggerbank was sunk by U-43 by mistake](https://en.wikipedia.org/wiki/German_ship_Doggerbank). The Doggerbank was a former UK cargo ship that was captured by the German Navy in 1941 and converted to an auxiliary minelayer and blockade runner. This is likely why it was sunk. There was perhaps some vulnerability in the uboat targeting and authorisation process that could have been exploited here.
There do seem to be far more attacks here that occurred in the evening, especially if combined with the attacks in the small hours. Perhaps attacks at these times caused more confusion on the vessel, delaying evacuation and hampering the ability of rescue vessels to assist during the night. It's also possible that night allowed the uboats to use stealth to maximum effect, achieving better target acquisition and launch conditions and therefore a more effective hit on the targets.
## Understanding the uboat types
There are different uboat types, as shown by the different designations. These types are not included in the target dataset, only in the base uboat dataset.
```{r uboat types}
unique(uboat.df$type)
```
However the uboat dataset does have a metric for the total ships sunk for each uboat. We could extract these types and join them with the attack dataset, but for now we'll compare the uboat types by counting the ships sunk.
```{r}
uboat.df %>%
group_by(type) %>%
summarize(sunk = sum(ships_sunk)) %>%
arrange(desc(sunk)) %>%
head(10) %>%
ggplot(aes(reorder(type, sunk), sunk)) +
coord_flip() +
geom_bar(stat = 'identity') +
labs(title = "Uboat types by total ships sunk (top ten only)",
y = "Total Ships Sunk",
x = "Uboat type") +
theme_minimal()
```
To complement the above barplot, [Wikipedia's list of U-boat types](https://en.wikipedia.org/wiki/List_of_U-boat_types) describes the VIIC as the German attack submarine 'workhorse' for the majority of WWII. While that may be the case, the IXC also presented a significant threat. The IXC is identified as an upgraded ocean boat used in the opening years of WWII. The IXB and VIIB were identified as 'standard' fleet ocean boats or 'standard' attack submarines, though both types sank hundreds of vessels. Regardless, the VIIC type represented the most significant threat if we take the count of ships sunk as the key metric.
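The join mentioned earlier, bringing the uboat type across to the attack data, might look something like the sketch below. It uses toy data frames, and it assumes the base uboat dataset has a `name` column matching `name` in the target dataset, which has not been confirmed here.

```r
library(dplyr)

# Toy stand-ins; the join key `name` in the uboat table is an assumption.
boats <- data.frame(name = c("U-47", "U-331"), type = c("VIIB", "VIIC"))
attacks <- data.frame(
  name = c("U-47", "U-47", "U-331", "U-X"),  # "U-X" is a placeholder with no match
  tonnage = c(100, 200, 300, 50)
)

# left_join keeps every attack, even when the uboat has no type on record.
attacks_typed <- attacks %>%
  left_join(boats, by = "name")

# Attacks and tonnage per type; unmatched boats fall into an NA group.
type_summary <- attacks_typed %>%
  group_by(type) %>%
  summarize(attacks = n(), tonnz = sum(tonnage))
```

Checking the size of the `NA` group after the join is a quick way to see how many attacks could not be matched to a type.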
## Most effective uboat commander
Let's now move on to analyse the uboat commanders in more detail. There are a few different ways we could identify the most effective uboat commanders. We could look at those who conducted the most attacks, those who sank the most tonnage, those who killed the most personnel or those who scored the highest on our lethality ratio. We could also build an index of sorts combining each of these, but we won't do that here. Perhaps you could try to develop an index considering each of the metrics?
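For readers who want to try the index idea, one possible starting point is to rescale each metric to the range 0 to 1 and take an unweighted average. This is only one of many ways to build such an index; the equal weights, the helper `rescale01` and the made-up values below are all assumptions for illustration.

```r
library(dplyr)

# Made-up per-commander metrics; names and values are placeholders.
metrics <- data.frame(
  commander = c("A", "B", "C"),
  attacks = c(10, 40, 25),
  tonnage = c(50000, 200000, 125000),
  killed = c(100, 200, 500),
  lethality = c(0.2, 0.4, 0.6)
)

# Min-max scale a vector to [0, 1]; assumes the values are not all identical.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Scale each metric, then average the four scaled values with equal weight.
index <- metrics %>%
  mutate(across(c(attacks, tonnage, killed, lethality), rescale01)) %>%
  mutate(score = (attacks + tonnage + killed + lethality) / 4) %>%
  arrange(desc(score))
```

Different weights would encode different views of what "effective" means, so the ranking from an index like this should be read alongside the individual metrics rather than instead of them.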
We could also plot all of these figures in one graphic and compare the various results and look for the commanders who feature in more than one metric. Note that for the lethality ratio, because it would be fairly easy to score highly with one attack that may have killed 5 out of 5 personnel, we've filtered this plot to only include commanders who conducted more than 15 attacks. This filter allows us to ensure we are only capturing skilled consistent lethality and not a one-off fluke.
```{r most effective commanders}
# most attacks
comd_attacks = uboat.target.df %>%
count(commander) %>%
arrange(desc(n)) %>%
head(10) %>%
arrange(n)
# most tonnage
most_tonnage = uboat.target.df %>%
group_by(commander) %>%
filter(loss_type == "Sunk") %>%
summarize(tonnagez = sum(tonnage)) %>%
arrange(desc(tonnagez)) %>%
head(10) %>%
arrange(tonnagez)
# most kills
most_killed = uboat.target.df %>%
group_by(commander) %>%
summarize(killed = sum(dead)) %>%
arrange(desc(killed)) %>%
head(10) %>%
arrange(killed)
# commanders with more than 15 attacks
five_attacks = uboat.target.df %>%
count(commander) %>%
filter(n > 15)
# mean of lethality
most_lethal = uboat.target.df %>%
filter(lethality < 1.1) %>%
group_by(commander) %>%
summarize(lethal = mean(lethality))
# left join lethality onto at least n
five = five_attacks %>%
left_join(most_lethal, by = "commander") %>%
arrange(desc(lethal)) %>%
head(10) %>%
arrange(lethal)
#quickest = earliest_attack %>%
# inner_join(launched, by = "name")
# start plotting
par(mfrow = c(2,2), mar = c(4,13.5,1,2))
barplot(comd_attacks$n, names.arg = comd_attacks$commander, horiz = TRUE, las=1,
xlim = c(0,60), cex.names = 0.8,
main = "Most Attacks",
xlab = "Number of attacks")
barplot(most_tonnage$tonnagez, names.arg = most_tonnage$commander, horiz = TRUE, las=1,
cex.names = 0.8,
main = "Most Tonnage Sunk",
xlab = "Sum of tonnage sunk")
barplot(most_killed$killed, names.arg = most_killed$commander, horiz = TRUE, las=1,
xlim = c(0,2000), cex.names = 0.8,
main = "Most Personnel Killed",
xlab = "Sum of personnel killed")
barplot(five$lethal, names.arg = five$commander, horiz = TRUE, las=1, cex.names = 0.8,
xlim = c(0,0.8),
main = "Most Lethal",
xlab = "Average (mean) lethality")
```
From the above we can see that Otto Kretschmer appeared to be a very effective uboat commander. Not only did he conduct the most attacks, he also sank the most tonnage, by some margin. Wolfgang Luth was second for both the number of attacks and tonnage sunk. Erich Topp was the next most effective commander by these two metrics. Gunther Prien killed the most personnel, followed by Hans-Diedrich Freiherr von Tiesenhausen. These two killed the most personnel by some margin. Gunther Prien is noteworthy for featuring in three of the four plots.
Remember that the most lethal plot only includes commanders who conducted more than 15 attacks. Heinrich Lehmann-Willenbrock ranked not only second for most lethal, but also fourth for the most personnel killed and sixth for most tonnage sunk. He is one of the few commanders who featured on at least three of these plots.
Wolfgang Luth is the standout here as he features in all four plots and ranks second in two of them, behind Otto Kretschmer.
### Wolfgang Luth
Let's now take a closer look at Wolfgang Luth to see if we can better understand his *modus operandi*.
```{r wolfgang activity}
# count the activity of wolfgang by month
# this will give only months and n values
wolf = uboat.target.df %>%
filter(commander == "Wolfgang Luth") %>%
group_by(month=floor_date(dateandtime, "month")) %>%
count(month)
# wolf tonnage
wolf_tonnage = uboat.target.df %>%
filter(commander == "Wolfgang Luth" & loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month")) %>%
summarize(tonnz = sum(tonnage))
# wolf killed
wolf_killed = uboat.target.df %>%
filter(commander == "Wolfgang Luth" & loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month")) %>%
summarize(killed = sum(dead))
# wolf lethality
wolf_lethality = uboat.target.df %>%
filter(commander == "Wolfgang Luth" & loss_type == "Sunk") %>%
group_by(month=floor_date(dateandtime, "month")) %>%
summarize(lethal = mean(lethality))
# plotting
par(mfrow= c(2,2), oma=c(0,1,1,0))
#xl = c("1940", "1944")
plot(wolf, type = 'l', col = 'red', lwd = 2, main = "Attack tempo",
xlab = "Time",
ylab = "Attacks per month")
grid(0, NULL, lty=1, lwd=0.5)
plot(wolf_tonnage$month, wolf_tonnage$tonnz, type = 'l', col='red', lwd= 2,
main = "Tonnage sunk over time",
xlab = "Time",
ylab = "Tonnage sunk")
grid(0, NULL, lty=1, lwd=0.5)
plot(wolf_killed$month, wolf_killed$killed, type = 'l', col='red', lwd= 2,
main = "Personnel killed over time",
xlab = "Time",
ylab = "Personnel killed")
grid(0, NULL, lty=1, lwd=0.5)
plot(wolf_lethality$month, wolf_lethality$lethal, type = 'l', col='red', lwd= 2,
main = "Lethality over time",
xlab = "Time",
ylab = "Lethality")
grid(0, NULL, lty=1, lwd=0.5)
mtext("Analysis of Wolfgang Luth Uboat Activities over time", outer = TRUE, side=3)
```
The attack tempo plot simply shows the number of attacks conducted each month. The tonnage sunk plot shows the sum of tonnage sunk for each month. The personnel killed plot also sums the personnel killed for each month. The lethality plot shows the mean lethality rate for each month.
There are some significant gaps in the activities of Wolfgang Luth, likely corresponding to time spent between patrols. There is about an eleven month gap between Jan 1942 and Nov 1942. However, in Nov 1942 Wolfgang conducted 10 attacks, all resulting in the targets being sunk. The lethality ratios for these individual attacks ranged from zero to 0.83 (note that the lethality plot shows monthly mean values). The attack in 1943 where he killed the most personnel was the following:
```{r most killed wolfgang}
kable(uboat.target.df %>%
filter(commander == "Wolfgang Luth" & dead > 100) %>%
select(name, loss_type, attack_date, nationality, ship_name, dead, lethality, day_period))
```
Throughout his career, he captained the following uboats:
- U-9
- U-43
- U-138
- U-181
We'll now look into some of the other particulars regarding how Wolfgang Luth conducted attacks.
```{r attack times of Luth}
uboat.target.df %>%
filter(commander == "Wolfgang Luth") %>%
count(day_period) %>%
arrange(desc(n)) %>%
ggplot(aes(reorder(day_period, n), n)) +
coord_flip() +
geom_bar(stat = 'identity') +
labs(title = "Preferred Attack Times of Wolfgang Luth",
x = "Time of day",
y = "Count of attacks") +
theme_minimal()
```
Wolfgang Luth was a night hunter. Most of his attacks were either in the evening or the small hours of the morning before sunrise. Be aware that the bars in the above plot are ordered by count, not by the chronological order of the day periods.
```{r nationality of targets Luth}
uboat.target.df %>%
filter(commander == "Wolfgang Luth") %>%
count(nationality) %>%
arrange(desc(n)) %>%
ggplot(aes(reorder(nationality, n), n)) +
coord_flip() +
geom_bar(stat = 'identity') +
labs(title = "Preferred Nationality of Targets of Wolfgang Luth",
x = "Nationalities",
y = "Count of attacks") +
theme_minimal()
```
Wolfgang struck mostly British vessels. This could simply be due to the sheer number of British vessels in his AO. It might be interesting to measure the proportion of targets that were British across all of the data and then compare that with individual commanders. But for now, we'll leave that.
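The comparison described above could be sketched as follows: compute the overall share of British targets, then each commander's share, and look at the difference. The data below is a toy example with placeholder commanders; only the column names `commander` and `nationality` come from the dataset used in this document.

```r
library(dplyr)

# Toy attack records; commanders "A" and "B" are placeholders.
attacks <- data.frame(
  commander = c("A", "A", "A", "A", "B", "B", "B", "B"),
  nationality = c("British", "British", "British", "Norwegian",
                  "British", "American", "Greek", "Dutch")
)

# Share of all attacks that were against British vessels.
overall_share <- mean(attacks$nationality == "British")

# Per-commander share, and how far it sits from the overall figure.
by_commander <- attacks %>%
  group_by(commander) %>%
  summarize(n = n(), british_share = mean(nationality == "British")) %>%
  mutate(diff_from_overall = british_share - overall_share)
```

A commander with a large positive difference and a reasonable number of attacks would be a candidate for the kind of follow-up research suggested above.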
```{r Luth by day}
uboat.target.df %>%
filter(commander == "Wolfgang Luth") %>%
count(weekday) %>%
mutate(weekday = factor(weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"))) %>%
ggplot(aes(weekday, n)) +
coord_flip() +
geom_bar(stat = 'identity') +
labs(title = "Count of Attacks by Wolfgang Luth, by day of week",
x = "Day of week",
y = "Count") +
theme_minimal()
```
There may not be any significance to the plot above, but Wolfgang Luth conducted the most attacks on Fridays and the fewest on Wednesdays. This is quite different to the overall uboat attacks by day of the week. So while it may justify additional qualitative research, it may also be a coincidence.
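One rough way to gauge whether a weekday pattern like this could be coincidence is a chi-squared goodness-of-fit test against a reference distribution. The counts below are invented for illustration and are not Luth's actual figures, and the uniform reference distribution is an assumption; in practice the overall uboat weekday shares would be the more natural comparison.

```r
# Invented weekday counts for one commander (not taken from the dataset).
luth_counts <- c(Monday = 6, Tuesday = 7, Wednesday = 3, Thursday = 6,
                 Friday = 12, Saturday = 8, Sunday = 7)

# Reference shares to compare against; uniform here purely as an assumption.
overall_share <- rep(1 / 7, 7)

# Goodness-of-fit test: do the counts differ from the reference shares
# by more than chance would explain?
fit <- chisq.test(x = luth_counts, p = overall_share)
fit$p.value
```

A large p-value would be consistent with the coincidence explanation, though with only a few dozen attacks the test has limited power either way.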
In a possibly fitting conclusion to the Wolfgang Luth story, he was accidentally shot and killed by a German soldier standing guard who asked Luth for the password three times without response - so the guard fired and killed him. More on Luth can be found at the [wikipedia page](https://en.wikipedia.org/wiki/Wolfgang_L%C3%BCth).
## Analysing lethality in some more depth
We'll now quickly look at lethality in some other ways to understand the uboat threat in more detail.
### Nationalities targeted
We'll start by attempting to understand which targets uboats had the most success against. We'll exclude nations whose vessels were only rarely struck by uboats. For example, one Greenlandic vessel was attacked by a uboat and this one attack resulted in a lethality ratio of 1, as all personnel were killed. A one-off attack doesn't indicate any particular vulnerability to uboats, so we'll only include nations whose vessels were attacked by uboats at least three times.
```{r most successful targets for uboats}
# create data of interest, which is when nation was attacked more than twice
three_attacks = uboat.target.df %>%
count(nationality) %>%
filter(n > 2)
# create actual data we want to plot
most_lethal_nation = uboat.target.df %>%
filter(lethality < 1.1) %>%
group_by(nationality) %>%
summarize(lethal = mean(lethality))
# combine both, joining onto data of interest
three = three_attacks %>%
left_join(most_lethal_nation, by = "nationality") %>%
arrange(desc(lethal)) %>%
head(10) %>%
arrange(lethal)
# plot
par(mar = c(5.5,7,3,2))
barplot(three$lethal, names.arg = three$nationality, horiz = TRUE, las = 1, xlim = c(0,0.8),
main= "Nations where uboats found great targeting successes",
xlab = "Mean lethality score by nationality",
sub = "Only includes nations that were attacked by uboats at least three times")
grid(NULL, 0, lty = 1, lwd = 0.5, col = 'grey')
```
In the above plot, we can see that Icelandic vessels appeared especially vulnerable to uboat attacks as attacks tended to result in large proportions of their personnel being killed. It is unclear why this may be but the data would indicate additional research would be warranted. It is possible that the water temperature could be a factor, or remoteness. This is where perhaps some geo-spatial visualisations may provide additional insights.
Note also that German vessels tended to be vulnerable to uboat attacks. There were actually four occasions where uboats accidentally sank other German vessels. We've already mentioned the Doggerbank being one of these. Only one of these attacks, on the Steinbek, resulted in no personnel being killed. Each case saw a different commander in charge. Research into these may identify opportunities that may have been exploitable.
### Time of day and lethality
We'll conduct a similar analysis to see how the time of day of uboat attacks may have affected their lethality.
```{r uboat time of day lethality}
uboat.target.df %>%
group_by(day_period) %>%
filter(lethality < 1.1 & day_period != "nil") %>%
summarize(lets = mean(lethality)) %>%
mutate(day_period = factor(day_period, levels = c("sunrise", "morning", "midday",
"afternoon", "sunset", "evening", "small_hours"))) %>%
ggplot(aes(day_period, lets)) +
coord_flip() +
geom_bar(stat = 'identity') +
ylim(0, 0.8) +
labs(title = "Time of day where Uboats were most lethal",
x = "Time of day",
y = "Mean Lethality") +
theme_minimal()
```
In the above plot we can see that the small hours of the day were when uboats were most lethal. We have already seen that the small hours were the preferred attack times of uboats generally and also of Wolfgang Luth. Interestingly, midday tended to be the second most lethal time of attack for uboats. This is somewhat surprising as we didn't see much preference to attack at that time. This could possibly be due to uboats only attacking at this more vulnerable time when they had a great targeting opportunity that was worth the risk. There could also be other reasons, which could justify more research.
## Vulnerabilities of uboats
We've looked at the targeting and operations conducted by uboats. We'll now see if we can better understand their vulnerabilities.
Firstly, we'll have a look at what the final fates were of all uboats.
```{r all fates for uboats table}
kable(uboat.df %>%
count(fate_type) %>%
arrange(desc(n)))
```
From the above we can see that the largest proportion of them were sunk. This is perhaps a higher proportion than would be expected. We'll now compare the number of uboats being sunk with the number of uboats being launched into service. This may be interesting.
```{r uboats sunk and launched}
# uboats sunk
sunk_time = uboat.df %>%
filter(fate_type == "Sunk") %>%
group_by(month=floor_date(fate, "month")) %>%
summarize(count = n())
#summarize(uboats_sunk = count(fate_type == "Sunk"))
# uboats launched
launched_time = uboat.df %>%
group_by(month=floor_date(launched, "month")) %>%
summarize(count = n())
# plotting
plot(sunk_time$month, sunk_time$count, type = 'l', col = 'red', lwd = 2,
main = "Uboats launched and sunk over time",
xlab = "Time",
ylab = "Uboats per month")
grid(0, NULL, col = 'grey', lty = 1, lwd = 0.5)
lines(launched_time$month, launched_time$count, type = 'l', col = 'blue', lwd = 2)
legend("topleft", c("Sunk", "Launched"), lty = 1, lwd = 2, col = c('red', 'blue'))
```
The above plot is quite interesting. Initially, the Germans were able to field more uboats than were being sunk. However, from around mid 1943, it appears that manufacturing peaked and losses of uboats exceeded the German ability to replace them. That 1943 period would have hit the uboat capability quite hard. Not only was there a sudden increase in the number of uboats being sunk, but the number being launched plateaued and then dropped quite significantly for some time, into the middle of 1945. There was some reversal of those trends, but it wasn't long lasting.
Let's now look at how the fate of the uboats evolved over time. We'll omit the three smallest fate types which were captured (5), grounded (4) and destroyed (3) so we can keep the plot clean and easier to understand.
```{r fate of uboats through the war}
uboat.df %>%
group_by(month=floor_date(fate, "month"), fate_type) %>%
filter(fate_type != "Captured" & fate_type != "Grounded" & fate_type != "Destroyed") %>%
summarize(num = n()) %>%
ggplot(aes(month, num, col = fate_type))+
geom_line(linewidth = 0.7) +
labs(title = "Fate of uboats over time",
subtitle = "What happened to the uboats through the war?",
x = "Time",
y = "Number of uboats")+
scale_color_brewer(palette = "Dark2")+
theme_minimal()
#scale_colour_viridis_d()
```
So it seems our data has captured the final fate of Germany's uboats post-WWII as we see the large spikes for Scuttled and Surrendered. We can also see that numbers for Sunk dwarfed the other fate types so we might leave those fates here for now.
```{r smaller fates of uboats}
uboat.df %>%
group_by(month=floor_date(fate, "month"), fate_type) %>%
filter(fate_type == "Captured" | fate_type == "Grounded" | fate_type == "Destroyed"
| fate_type == "Damaged" | fate_type == "Missing") %>%
summarize(num = n()) %>%
ggplot(aes(month, num, fill = fate_type))+
geom_bar(stat = 'identity') +
labs(title = "Fate of uboats over time",
subtitle = "How did the less common fates evolve over time?",
x = "Time",
y = "Number of uboats") +
theme_minimal()
#scale_color_brewer(palette = "Dark2")
```
We don't really gain any great insights from the above plot of the less common fates. Note we've resorted to using bars instead of lines because many months have a count of zero and there are only a small number of categories. It is interesting that some uboats went missing and that this seemed to increase from the end of 1942, but without more qualitative knowledge of whether this is important, there's not much else to do for now. Without further knowledge, we'd expect that many of the missing uboats went missing during attacks and were actually sunk. But this may be more interesting than it appears and could warrant additional research.
We'll see if there are any interesting insights to be gleaned by looking at uboats sunk over time by type. We'll select just the top five most dangerous types we identified earlier.
```{r fate by uboat type}
uboat.df %>%
group_by(month=floor_date(fate, "month"), type) %>%
filter(fate_type == "Sunk" & type %in% c("VIIC", "IXC", "IXB", "VIIB", "IX")) %>%
summarize(num = n()) %>%
ggplot(aes(month, num, fill = type))+
geom_bar(stat = 'identity') +
#geom_line(linewidth = 0.7) +
labs(title = "Sunk uboat types over time",
x = "Time",
y = "Number of uboats sunk, per month")+
#scale_color_brewer(palette = "Dark2")+
theme_minimal()
```
We don't really see anything of particular interest here. We know the VIIC was the main attack workhorse and we know the number of attacks they conducted was high, so we expect their numbers to be well represented here. An avenue potentially worth exploring would be to compare the overall proportions of uboat types (from earlier in this analysis) with the data in the above plot. But we won't do that here.
## Understanding the most efficient uboats and crews
We'll now look at the most deadly uboats. These may or may not differ from the insights we gleaned analysing the uboat commanders. Perhaps the effectiveness of some uboats was at least partly due to competent, high-performing crews. A good commander could make the most of a good crew, but could also be constrained by a poorer performing one. Some commanders also commanded multiple uboats rather than staying with one throughout the war.
### Number of attacks
We'll start off by looking at the uboats that conducted the most attacks.
```{r uboat attacks}
uboat.target.df %>%
group_by(name) %>%