---
title: "Checking the progress of site analysis"
author: "Simon Goring"
date: "July 16, 2017"
output:
  html_document:
    output: index.html
    self_contained: true
    code_folding: hide
    toc: true
    toc_float: true
    css: css/sjg_markdown.css
---
```{r runSetup, cache=FALSE, message=FALSE, warning=FALSE, results='hide', echo=FALSE}
version <- '1.6'
source('R/setup.R')

all_ds <- neotoma::bind(neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Canada'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'United States'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Mexico'))

sitedataset <- data.frame(stid = sapply(all_ds, function(x) x$site.data$site.id),
                          dsid = sapply(all_ds, function(x) x$dataset.meta$dataset.id),
                          lat  = sapply(all_ds, function(x) x$site.data$lat),
                          long = sapply(all_ds, function(x) x$site.data$long),
                          name = sapply(all_ds, function(x) x$site.data$site.name),
                          stringsAsFactors = FALSE)

control_output <- pull_chron_table(version, all_ds)

existing_lakes <- neotoma_lakes("connect_remote.json") %>%
  filter(dsid %in% sapply(all_ds, function(x) x$dataset.meta$dataset.id))

if (nrow(existing_lakes) == 1) {
  warning("You are not connected to the Database.")
}
```
## Status of Lake Records in Neotoma

Neotoma contains records from a large number of lakes and other depositional environments. Coverage for Canada, the United States and Mexico is good, with `r nrow(sitedataset)` datasets, representing `r length(unique(sitedataset$stid))` unique sites. Lake area is critical for understanding pollen source area [@dawson2016quantifying]. Within this domain a number of lakes in the Neotoma Database have recorded lake areas (*n = `r nrow(existing_lakes)`*), but many sites do not. Many of the lakes with missing area parameters are reported from legacy publications, meaning primary information is not directly available from the principal investigators. In addition to missing areas, examination of site coordinates indicates that a large number of records have rounded coordinate values (due to conversion from DMS coordinates) that may not intersect with the waterbodies from which sedimentary or water chemistry records were obtained.
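To see why DMS conversion produces these rounded values, here is a minimal sketch (the helper `dms_to_decimal` is illustrative, not part of this repository): a site reported only to the nearest degree or minute converts to a coarsely rounded decimal coordinate that can fall outside the lake polygon it was meant to describe.

```{r dmsSketch, eval=FALSE}
# Illustration only: converting degrees-minutes-seconds (DMS) coordinates
# to decimal degrees. When minutes or seconds were never recorded, the
# converted value is rounded, e.g. 45 deg 30 min becomes exactly 45.5.
dms_to_decimal <- function(deg, min = 0, sec = 0) {
  sign(deg) * (abs(deg) + min / 60 + sec / 3600)
}

dms_to_decimal(45, 30)    # 45.5 -- seconds were dropped
dms_to_decimal(-97)       # -97  -- a whole-degree report
```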
## Load in the records

Bailey Zak and Claire Rubblike have been processing a number of lake sites from Neotoma to align them with existing records in the Canadian and United States hydrological databases. This provides an important service: it allows us to check reported lake areas and deposition types, and it provides an opportunity to correct the geopositioning of sites whose latitude or longitude coordinates were reported with rounding.
### Set up

The first step was time-consuming. All pollen cores from Canada, Mexico and the United States were pulled from Neotoma, along with the chronological controls for each record. The chronological controls are used to help prioritize reconstruction, since the current effort is largely focused on the reconstruction of "high quality" sites.

The result is a `data.frame` called `control_output` that contains each site ID, dataset ID and a sum of all the chron control types used in reconstruction, for every pollen record (regardless of whether or not it comes from a lake). In principle this code could be run at this stage, and it is displayed here; in practice these numbers are used earlier in the report, so the code is actually evaluated in the setup chunk of this RMarkdown document.
```{r lakedata, cache=FALSE, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
version <- '1.6'
source('R/setup.R')

all_ds <- neotoma::bind(neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Canada'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'United States'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Mexico'))

sitedataset <- data.frame(stid = sapply(all_ds, function(x) x$site.data$site.id),
                          dsid = sapply(all_ds, function(x) x$dataset.meta$dataset.id),
                          lat  = sapply(all_ds, function(x) x$site.data$lat),
                          long = sapply(all_ds, function(x) x$site.data$long),
                          name = sapply(all_ds, function(x) x$site.data$site.name),
                          stringsAsFactors = FALSE)

control_output <- pull_chron_table(version, all_ds)
```
This code block provides a list of all pollen datasets in Neotoma (`all_ds`), along with the chronological controls used to develop their chronologies. It results in a data frame containing `r nrow(control_output)` rows and `r ncol(control_output)` columns.

We can pull all existing lake data from Neotoma and add it to our estimates using a direct connection to the database. This currently requires access to the Neotoma Database; however, the script is written to tolerate a missing connection file, so an individual can still run the program locally.
```{r getQuery, cache=FALSE, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
existing_lakes <- neotoma_lakes("connect_remote.json") %>%
  filter(dsid %in% sapply(all_ds, function(x) x$dataset.meta$dataset.id))

if (nrow(existing_lakes) == 1) {
  warning("You are not connected to the Database.")
}
```
This provides us with `r nrow(existing_lakes)` datasets with lake information in the Neotoma Paleoecology Database. This is approximately `r round(sum(sitedataset$stid %in% existing_lakes$stid) / nrow(sitedataset) * 100, 1)`% of the pollen datasets we have obtained within North America.
```{r, echo=FALSE, cache=FALSE, message=FALSE, warning=FALSE, fig.cap="All pollen site locations in North America from lacustrine environments."}
sitedataset %>%
  distinct(stid, .keep_all = TRUE) %>%
  leaflet() %>%
  addTiles() %>%
  addMarkers()
```
### Geographic Co-Location

For lakes without reported areas we can access external data resources. Lakes are checked against data in the Canadian [CanVec – Hydro Features](http://open.canada.ca/data/en/dataset/8ba2aa2a-7bb9-4448-b4d7-f164409fe056) dataset and the [United States National Hydrography Dataset](https://nhd.usgs.gov/NHD_High_Resolution.html) to obtain "best fit" matches for lakes. The script, which uses the [`sp` package](https://cran.r-project.org/web/packages/sp/index.html) to overlay the points on lake polygons, is contained in the files [`R/GetLakeAreas_usa.R`](https://github.com/NeotomaDB/neotoma_lakes/blob/master/R/GetLakeAreas_usa.R) and [`R/GetLakeAreas_canada.R`](https://github.com/NeotomaDB/neotoma_lakes/blob/master/R/GetLakeAreas_canada.R). State-level shapefiles are very large for some states, so this is a time- and resource-intensive process. We run this step on the Pennsylvania State University's Center for Environmental Informatics (CEI) server. The result is a `data.frame` with columns that are indexed using the `stid`.

Given the network intensity of the operation we run the scripts on the CEI servers and then use `scp` (secure copy) to move the files back to the end user's computer. This script is not run as part of the current RMarkdown document. The scripts provide us with the unique identifier of any pollen site that overlays a lake in the hydrological datasets, and provide matches for any similarly named lakes proximate to a Neotoma lake.
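As a rough illustration of that overlay step (a sketch, not the code in `GetLakeAreas_usa.R`; `lake_polys` and its `COMID` column are assumed objects standing in for the hydrographic data):

```{r overlaySketch, eval=FALSE}
library(sp)

# Sketch: promote the Neotoma site coordinates to spatial points and find
# the lake polygon (if any) beneath each one. `lake_polys` is assumed to
# be a SpatialPolygonsDataFrame read from the CanVec or NHD shapefiles,
# in the same (WGS84) coordinate reference system as the points.
site_pts <- sitedataset
coordinates(site_pts) <- ~ long + lat
proj4string(site_pts) <- CRS("+proj=longlat +datum=WGS84")

# sp::over() returns, for each point, the attribute row of the polygon it
# falls within (all NA when a point misses every lake polygon).
matches <- over(site_pts, lake_polys)
site_pts$lake_id <- matches$COMID  # hypothetical lake-identifier column
```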
#### Direct Matches

Direct matches between the hydrological databases and Neotoma occur when a Neotoma site lies directly on top of a waterbody record in the Canadian or United States dataset. These records are tagged with the unique identifier for the lake. Here, records with positive matches are indicated in red and those without are indicated in black:
```{r, loadLakeData, echo=FALSE, message=FALSE, results='hide'}
ca_lakes <- readr::read_csv('data/ca_lakes.csv') %>%
  filter(!is.na(lake_area_ha)) %>%
  distinct()

us_lakes <- readr::read_csv('data/usa_lakes.csv') %>%
  filter(!is.na(AREAHA)) %>%
  distinct()

assertthat::assert_that(!any(duplicated(us_lakes$SiteID) &
                             !duplicated(us_lakes$AREAHA)),
                        msg = "There are US lakes with duplicated site IDs but without duplicated lake areas.")

assertthat::assert_that(!any(duplicated(ca_lakes$SiteID) &
                             !duplicated(ca_lakes$lake_area_ha)),
                        msg = "There are Canadian lakes with duplicated site IDs but without duplicated lake areas.")

recorded_lakes <- data.frame(sitename = c(ca_lakes$site.name, us_lakes$site.name),
                             dsid = c(ca_lakes$DatasetID, rep(NA, nrow(us_lakes))),
                             stid = c(ca_lakes$SiteID, us_lakes$SiteID),
                             lat  = c(ca_lakes$lat, us_lakes$lat),
                             long = c(ca_lakes$long, us_lakes$long),
                             area = c(ca_lakes$lake_area_ha, us_lakes$AREAHA),
                             source = "Geographic matching",
                             notes = "From hydro database.",
                             stringsAsFactors = FALSE)

area_lakes <- recorded_lakes %>%
  full_join(existing_lakes) %>%
  select(-units) %>%
  clean_matches(keep = "Geographic matching", sitedataset = sitedataset) %>%
  distinct(stid, .keep_all = TRUE)
```
```{r plotLakes, echo=FALSE}
plot_linked(area_lakes)
```
A total of **`r nrow(ca_lakes)`** sites were matched directly to lakes within the Canadian hydrographic database and **`r nrow(us_lakes)`** were matched within the US dataset. Of these, `r sum(existing_lakes$stid %in% recorded_lakes$stid)` also had lake records in Neotoma, resulting in a combined dataset of `r nrow(area_lakes)` records.
### Reanalysis
<img src="images/map_schema.svg" style="float:right;" width="200px">
From here we passed all lakes without direct matches to a workflow in which individuals examined regional maps, the original publications and secondary information (including Alwynne Beaudoin's [Lake Files](http://www.scirpus.ca/lakes/lakes.php)) to determine whether a site had a match with a nearby lake or other depositional environment.
For each site we examined the location and then either edited the site coordinates or left the site in place, recording the decision using a numeric key and a descriptive element (a sketch of how these codes might be labelled follows the table):
| Value | Interpretation |
| ----- | -------------- |
| 0 | This is a data artifact from ArcGIS editing, which assigned a 0 to records edited using ArcMap |
| 1 | Site located and moved. |
| 2 | Site is in the right place, not moved. |
| 3 | Site not moved, no reasonable match. |
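As referenced above, a minimal sketch of attaching readable labels to the numeric key, applied to the `edited_lakes` table built in the next chunk (the `edit_labels` vector and `edit_status` column are illustrative, not part of the pipeline):

```{r recodeSketch, eval=FALSE}
# Sketch: map the numeric `edited` key to human-readable labels for
# summaries or plot legends. Label strings paraphrase the table above.
edit_labels <- c(`0` = "ArcGIS editing artifact",
                 `1` = "Site located and moved",
                 `2` = "Correct location, not moved",
                 `3` = "Not moved, no reasonable match")

edited_lakes %>%
  dplyr::mutate(edit_status = dplyr::recode(as.character(edited),
                                            !!!edit_labels))
```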
All lakes that have been edited are displayed here.
```{r, loadData, message=FALSE, warning=FALSE, results='hide'}
source('R/load_entered_data.R')

linked_lakes <- area_lakes %>%
  mutate(edited = 2)

edited_lakes <- suppressMessages(load_lakes()) %>%
  filter(!is.na(edited)) %>%
  dplyr::union_all(linked_lakes) %>%
  distinct(stid, notes, area, .keep_all = TRUE) %>%
  arrange(stid)

dup_sites <- edited_lakes$stid[duplicated(edited_lakes$stid)] %>% unique()

# There are currently some duplicate sites. These don't vary too much, so
# we take either the value associated with the hydrological database, or
# the first of the two entered into the database.
for (i in dup_sites) {
  set <- edited_lakes$stid == i
  hydro_set <- edited_lakes$source %in% area_lakes$source
  if (any(hydro_set[set])) {
    # Keep the row that came from the hydrological database.
    drop <- !((1:nrow(edited_lakes)) %in% which(set & !hydro_set))
    edited_lakes <- edited_lakes[drop, ]
  } else {
    # Otherwise keep only the first of the duplicated rows.
    drop <- !((1:nrow(edited_lakes)) %in% which(set)[-1])
    edited_lakes <- edited_lakes[drop, ]
  }
}

edited_lakes <- edited_lakes %>%
  mutate(area = signif(area, 3))

assertthat::assert_that(!any(duplicated(edited_lakes$stid)),
                        msg = "Some duplicated site IDs are present.")

# assertthat::assert_that(all(edited_lakes$stid %in% sitedataset$stid),
#                         msg = "There are site IDs that are not present in the
#                                full North American dataset")

ste_match <- match(sitedataset$stid, edited_lakes$stid)
est_match <- match(edited_lakes$stid, sitedataset$stid)

assertthat::assert_that(all(edited_lakes$sitename == sitedataset$name[est_match],
                            na.rm = TRUE),
                        msg = "Some assigned sitenames do not match as expected.")

sitedataset[, c('area', 'source', 'edited', 'notes')] <-
  edited_lakes[ste_match, c("area", "source", "edited", "notes")]

sitedataset <- sitedataset %>%
  left_join(control_output %>%
              select(avg_int, max_int, controls, dsid, stid))

readr::write_csv(sitedataset,
                 paste0('data/version_', version, '/full_output_', version, '.csv'))
readr::write_csv(sitedataset %>% filter(!is.na(area)),
                 paste0('data/version_', version, '/assigned_areas_', version, '.csv'))
```
```{r, editedSites, echo = FALSE, warning=FALSE}
plot_edited(sitedataset)
```
This results in a dataset of `r nrow(edited_lakes)` sites with areas, of which `r sum(edited_lakes$edited %in% c(1,2))` were matched to reported locations. A total of `r sum(edited_lakes$edited == 1)` lake positions were moved, and `r sum(edited_lakes$edited == 3)` could not be properly positioned. There remain `r sum(!sitedataset$stid %in% edited_lakes$stid)` lakes to work through.
```{r, generateMissing, echo=FALSE, warning=FALSE, results='hide'}
new_output <- sitedataset %>%
  dplyr::left_join(control_output) %>%
  filter(is.na(area) & long < 0) %>%
  arrange(controls) %>%
  distinct(stid, .keep_all = TRUE) %>%
  st_as_sf(coords = c('long', 'lat'), crs = 4326)

new_output[is.na(new_output)] <- ''

new_output <- new_output %>%
  select(stid, dsid, name, edited, notes, area, avg_int, max_int, controls, geometry)

leaflet(new_output) %>%
  addTiles() %>%
  addCircleMarkers(popup = ~name)

sf::write_sf(new_output,
             paste0('data/version_', version, '/dataset_', version, '.shp'),
             delete_layer = TRUE)
```
## Conclusion

This process progressively aligns Neotoma sites with existing records in the national hydrographic databases, and acts as a check that can be used to update lake records in Neotoma.

You can explore the dataset further by sorting, filtering, or clicking on the links to the Neotoma Explorer in the table below:
```{r echo=FALSE}
sitedataset$area[is.na(sitedataset$area)] <- -999

sitedataset <- sitedataset %>%
  mutate(area = round(area, 1),
         lat  = round(lat, 3),
         long = round(long, 3),
         stid = paste0("<a target=_blank href='https://apps.neotomadb.org/explorer/?siteids=",
                       stid, "'>", stid, "</a>"),
         dsid = paste0("<a target=_blank href='https://apps.neotomadb.org/explorer/?datasetid=",
                       dsid, "'>", dsid, "</a>"))

DT::datatable(sitedataset, escape = FALSE,
              rownames = FALSE, filter = "bottom",
              options = list(scrollX = '600px'))
```
If you sort by `area` you will see that, for ease of use, all sites without assigned areas have the value -999. Using the links for site and dataset IDs it is possible to open the Neotoma Explorer and view these datasets directly. Please contact us if you have information about these sites. You can email Simon Goring at <a href="mailto:goring@wisc.edu">goring@wisc.edu</a> if you feel that sites ought to be corrected, or if you have amended information.

Note that many sites along marine margins (e.g., salt marshes) may have an artificially high area value, since the effective basin is in fact the ocean.

If you wish to edit some of the missing sites by hand, please check the folder of the latest posted version (e.g., [version 1.4](https://github.com/NeotomaDB/neotoma_lakes/tree/master/data/version_1.4)). The shapefile posted in that folder can be used to edit the sites by overlaying a basemap that includes lakes (e.g., Google Maps) and tracing the polygons of the lakes of interest to estimate area (in hectares).
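If you do trace a polygon, its area in hectares can be computed directly; a minimal sketch, assuming the traced outline was saved as `traced_lake.shp` (a hypothetical filename):

```{r areaSketch, eval=FALSE}
library(sf)

# Sketch: read a hand-traced lake polygon and report its area in hectares.
# st_area() measures geographic coordinates geodesically and returns square
# metres; one hectare is 10,000 square metres.
traced <- sf::st_read("traced_lake.shp")
area_ha <- as.numeric(sf::st_area(traced)) / 1e4
```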
## Version Updates
```{r, eval=FALSE}
# This is not implemented yet.
vers <- list.files('data', pattern = '^version_')

summ_tab <- data.frame(filename = rep(NA, length(vers)),
                       areas = rep(NA, length(vers)))

for (i in 1:length(vers)) {
  num <- stringr::str_match(vers[i], '\\d\\.\\d')
  filename <- paste0('data/', vers[i], '/area_lakes_', num, '.csv')
  version_table <- readr::read_csv(filename)
  summ_tab$filename[i] <- filename
  summ_tab$areas[i] <- version_table %>% filter(!is.na(area)) %>% nrow()
}
```
# References