---
title: "Checking the progress of site analysis"
author: "Simon Goring"
date: "July 16, 2017"
output:
  html_document:
    output: index.html
    self_contained: true
    code_folding: hide
    toc: true
    toc_float: true
    css: css/sjg_markdown.css
---
```{r runSetup, cache=FALSE, message=FALSE, warning=FALSE, results='hide', echo=FALSE}
version <- '1.6'
source('R/setup.R')

all_ds <- neotoma::bind(neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Canada'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'United States'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Mexico'))

sitedataset <- data.frame(stid = sapply(all_ds, function(x) x$site.data$site.id),
                          dsid = sapply(all_ds, function(x) x$dataset.meta$dataset.id),
                          lat  = sapply(all_ds, function(x) x$site.data$lat),
                          long = sapply(all_ds, function(x) x$site.data$long),
                          name = sapply(all_ds, function(x) x$site.data$site.name),
                          stringsAsFactors = FALSE)

control_output <- pull_chron_table(version, all_ds)

existing_lakes <- neotoma_lakes("connect_remote.json") %>%
  filter(dsid %in% sapply(all_ds, function(x) x$dataset.meta$dataset.id))

if (nrow(existing_lakes) == 1) {
  warning("You are not connected to the Database.")
}
```
## Status of Lake Records in Neotoma

Neotoma contains records from a large number of lakes and other depositional environments. Coverage for Canada, the United States and Mexico is good, with `r nrow(sitedataset)` datasets, representing `r length(unique(sitedataset$stid))` unique sites. Lake area is critical for understanding pollen source area [@dawson2016quantifying]. Within this domain a number of lakes in the Neotoma Database have recorded lake areas (*n = `r nrow(existing_lakes)`*), but many sites do not. Many of the lakes with missing area parameters are reported from legacy publications, meaning primary information is not directly available from the principal investigators. In addition to missing areas, examination of site coordinates indicates that a large number of records have rounded coordinate values (due to conversion from DMS coordinates) that may not intersect with the waterbodies from which sedimentary or water chemistry records were obtained.
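To see why DMS conversion produces these rounded values, here is a minimal sketch (the helper `dms_to_decimal` is illustrative, not part of this repository): a site reported only to the nearest degree or minute converts to a coarsely rounded decimal coordinate that can fall outside the lake polygon it was meant to describe.

```{r dmsSketch, eval=FALSE}
# Illustration only: converting degrees-minutes-seconds (DMS) coordinates
# to decimal degrees. When minutes or seconds were never recorded, the
# converted value is rounded, e.g. 45 deg 30 min becomes exactly 45.5.
dms_to_decimal <- function(deg, min = 0, sec = 0) {
  sign(deg) * (abs(deg) + min / 60 + sec / 3600)
}

dms_to_decimal(45, 30)    # 45.5 -- seconds were dropped
dms_to_decimal(-97)       # -97  -- a whole-degree report
```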
## Load in the records

Bailey Zak and Claire Rubblike have been processing a number of lake sites from Neotoma to align them with existing records in the Canadian and United States hydrological databases. This provides an important service: it allows us to check reported lake areas and deposition types, and it provides an opportunity to correct the geopositioning of sites whose latitude or longitude coordinates were reported with rounding.
### Set up

The first step was time-consuming. All pollen cores from Canada, Mexico and the United States were pulled from Neotoma, along with the chronological controls for each record. The chronological controls are used to help prioritize reconstruction, since the current effort is largely focused on the reconstruction of "high quality" sites.

The result is a `data.frame` called `control_output` that contains each site ID, dataset ID and a sum of all the chron control types used in reconstruction, for every pollen record (regardless of whether or not it comes from a lake). In principle this code could be run at this stage, and it is displayed here; in practice these numbers are used earlier in the report, so the code is actually evaluated in the setup chunk of this RMarkdown document.
```{r lakedata, cache=FALSE, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
version <- '1.6'
source('R/setup.R')

all_ds <- neotoma::bind(neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Canada'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'United States'),
                        neotoma::get_dataset(datasettype = 'pollen',
                                             gpid = 'Mexico'))

sitedataset <- data.frame(stid = sapply(all_ds, function(x) x$site.data$site.id),
                          dsid = sapply(all_ds, function(x) x$dataset.meta$dataset.id),
                          lat  = sapply(all_ds, function(x) x$site.data$lat),
                          long = sapply(all_ds, function(x) x$site.data$long),
                          name = sapply(all_ds, function(x) x$site.data$site.name),
                          stringsAsFactors = FALSE)

control_output <- pull_chron_table(version, all_ds)
```
This code block provides a list of all pollen datasets in Neotoma (`all_ds`), along with the chronological controls used to develop their chronologies. It results in a data frame containing `r nrow(control_output)` rows and `r ncol(control_output)` columns.

We can pull all existing lake data from Neotoma and add it to our estimates using a direct connection to the database. This currently requires access to the Neotoma Database; however, the script is written to tolerate a missing connection file, so an individual can still run the program locally.
```{r getQuery, cache=FALSE, message=FALSE, warning=FALSE, results='hide', eval=FALSE}
existing_lakes <- neotoma_lakes("connect_remote.json") %>%
  filter(dsid %in% sapply(all_ds, function(x) x$dataset.meta$dataset.id))

if (nrow(existing_lakes) == 1) {
  warning("You are not connected to the Database.")
}
```
This provides us with `r nrow(existing_lakes)` datasets with lake information in the Neotoma Paleoecology Database. This is approximately `r round(sum(sitedataset$stid %in% existing_lakes$stid) / nrow(sitedataset) * 100, 1)`% of the pollen datasets we have obtained within North America.
```{r, echo=FALSE, cache=FALSE, message=FALSE, warning=FALSE, fig.cap="All pollen site locations in North America from lacustrine environments."}
sitedataset %>%
  distinct(stid, .keep_all = TRUE) %>%
  leaflet() %>%
  addTiles() %>%
  addMarkers()
```
### Geographic Co-Location

For lakes without reported areas we can access external data resources. Lakes are checked against data in the Canadian [CanVec – Hydro Features](http://open.canada.ca/data/en/dataset/8ba2aa2a-7bb9-4448-b4d7-f164409fe056) dataset and the [United States National Hydrography Dataset](https://nhd.usgs.gov/NHD_High_Resolution.html) to obtain "best fit" matches for lakes. The script, which uses the [`sp` package](https://cran.r-project.org/web/packages/sp/index.html) to overlay the points on lake polygons, is contained in the files [`R/GetLakeAreas_usa.R`](https://github.com/NeotomaDB/neotoma_lakes/blob/master/R/GetLakeAreas_usa.R) and [`R/GetLakeAreas_canada.R`](https://github.com/NeotomaDB/neotoma_lakes/blob/master/R/GetLakeAreas_canada.R). State-level shapefiles are very large for some states, so this is a time- and resource-intensive process. We run this step on the Pennsylvania State University's Center for Environmental Informatics (CEI) server. The result is a `data.frame` with columns that are indexed using the `stid`.

Given the network intensity of the operation we run the scripts on the CEI servers and then use `scp` (secure copy) to move the files back to the end user's computer. This script is not run as part of the current RMarkdown document. The scripts provide us with the unique identifier of any pollen site that overlays a lake in the hydrological datasets, and provide matches for any similarly named lakes proximate to a Neotoma lake.
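As a rough illustration of that overlay step (a sketch, not the code in `GetLakeAreas_usa.R`; `lake_polys` and its `COMID` column are assumed objects standing in for the hydrographic data):

```{r overlaySketch, eval=FALSE}
library(sp)

# Sketch: promote the Neotoma site coordinates to spatial points and find
# the lake polygon (if any) beneath each one. `lake_polys` is assumed to
# be a SpatialPolygonsDataFrame read from the CanVec or NHD shapefiles,
# in the same (WGS84) coordinate reference system as the points.
site_pts <- sitedataset
coordinates(site_pts) <- ~ long + lat
proj4string(site_pts) <- CRS("+proj=longlat +datum=WGS84")

# sp::over() returns, for each point, the attribute row of the polygon it
# falls within (all NA when a point misses every lake polygon).
matches <- over(site_pts, lake_polys)
site_pts$lake_id <- matches$COMID  # hypothetical lake-identifier column
```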
#### Direct Matches

Direct matches between the hydrological databases and Neotoma occur when a Neotoma site lies directly on top of a waterbody record in the Canadian or United States dataset. These records are tagged with the unique identifier for the lake. Here, records with positive matches are indicated in red and those without are indicated in black:
```{r, loadLakeData, echo=FALSE, message=FALSE, results='hide'}
ca_lakes <- readr::read_csv('data/ca_lakes.csv') %>%
  filter(!is.na(lake_area_ha)) %>%
  distinct()

us_lakes <- readr::read_csv('data/usa_lakes.csv') %>%
  filter(!is.na(AREAHA)) %>%
  distinct()

assertthat::assert_that(!any(duplicated(us_lakes$SiteID) &
                             !duplicated(us_lakes$AREAHA)),
                        msg = "There are US lakes with duplicated site IDs but without duplicated lake areas.")

assertthat::assert_that(!any(duplicated(ca_lakes$SiteID) &
                             !duplicated(ca_lakes$lake_area_ha)),
                        msg = "There are Canadian lakes with duplicated site IDs but without duplicated lake areas.")

recorded_lakes <- data.frame(sitename = c(ca_lakes$site.name, us_lakes$site.name),
                             dsid = c(ca_lakes$DatasetID, rep(NA, nrow(us_lakes))),
                             stid = c(ca_lakes$SiteID, us_lakes$SiteID),
                             lat  = c(ca_lakes$lat, us_lakes$lat),
                             long = c(ca_lakes$long, us_lakes$long),
                             area = c(ca_lakes$lake_area_ha, us_lakes$AREAHA),
                             source = "Geographic matching",
                             notes = "From hydro database.",
                             stringsAsFactors = FALSE)

area_lakes <- recorded_lakes %>%
  full_join(existing_lakes) %>%
  select(-units) %>%
  clean_matches(keep = "Geographic matching", sitedataset = sitedataset) %>%
  distinct(stid, .keep_all = TRUE)
```
```{r plotLakes, echo=FALSE}
plot_linked(area_lakes)
```
A total of **`r nrow(ca_lakes)`** sites were matched directly to lakes within the Canadian hydrographic database and **`r nrow(us_lakes)`** were matched within the US dataset. Of these, `r sum(existing_lakes$stid %in% recorded_lakes$stid)` also had lake records in Neotoma, resulting in a combined dataset of `r nrow(area_lakes)` records.
### Reanalysis
<img src="images/map_schema.svg" style="float:right;" width="200px">
From here we passed all lakes without direct matches to a workflow in which individuals examined regional maps, the original publications and secondary information (including Alwynne Beaudoin's [Lake Files](http://www.scirpus.ca/lakes/lakes.php)) to determine whether a site had a match with a nearby lake or other depositional environment.
For each site we examined the location and then either edited the site coordinates or left the site in place, recording the decision using a numeric key and a descriptive element (a sketch of how these codes might be labelled follows the table):
| Value | Interpretation |
| ----- | -------------- |
| 0 | This is a data artifact from ArcGIS editing, which assigned a 0 to records edited using ArcMap |
| 1 | Site located and moved. |
| 2 | Site is in the right place, not moved. |
| 3 | Site not moved, no reasonable match. |
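As referenced above, a minimal sketch of attaching readable labels to the numeric key, applied to the `edited_lakes` table built in the next chunk (the `edit_labels` vector and `edit_status` column are illustrative, not part of the pipeline):

```{r recodeSketch, eval=FALSE}
# Sketch: map the numeric `edited` key to human-readable labels for
# summaries or plot legends. Label strings paraphrase the table above.
edit_labels <- c(`0` = "ArcGIS editing artifact",
                 `1` = "Site located and moved",
                 `2` = "Correct location, not moved",
                 `3` = "Not moved, no reasonable match")

edited_lakes %>%
  dplyr::mutate(edit_status = dplyr::recode(as.character(edited),
                                            !!!edit_labels))
```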
All lakes that have been edited are displayed here.
```{r, loadData, message=FALSE, warning=FALSE, results='hide'}
source('R/load_entered_data.R')

linked_lakes <- area_lakes %>%
  mutate(edited = 2)

edited_lakes <- suppressMessages(load_lakes()) %>%
  filter(!is.na(edited)) %>%
  dplyr::union_all(linked_lakes) %>%
  distinct(stid, notes, area, .keep_all = TRUE) %>%
  arrange(stid)

dup_sites <- edited_lakes$stid[duplicated(edited_lakes$stid)] %>% unique()

# There are currently some duplicate sites. These don't vary too much, so
# we take either the value associated with the hydrological database, or
# the first of the two entered into the database.
for (i in dup_sites) {
  set <- edited_lakes$stid == i
  hydro_set <- edited_lakes$source %in% area_lakes$source
  if (any(hydro_set[set])) {
    # Keep the row that came from the hydrological database.
    drop <- !((1:nrow(edited_lakes)) %in% which(set & !hydro_set))
    edited_lakes <- edited_lakes[drop, ]
  } else {
    # Otherwise keep only the first of the duplicated rows.
    drop <- !((1:nrow(edited_lakes)) %in% which(set)[-1])
    edited_lakes <- edited_lakes[drop, ]
  }
}

edited_lakes <- edited_lakes %>%
  mutate(area = signif(area, 3))

assertthat::assert_that(!any(duplicated(edited_lakes$stid)),
                        msg = "Some duplicated site IDs are present.")

# assertthat::assert_that(all(edited_lakes$stid %in% sitedataset$stid),
#                         msg = "There are site IDs that are not present in the
#                                full North American dataset")

ste_match <- match(sitedataset$stid, edited_lakes$stid)
est_match <- match(edited_lakes$stid, sitedataset$stid)

assertthat::assert_that(all(edited_lakes$sitename == sitedataset$name[est_match],
                            na.rm = TRUE),
                        msg = "Some assigned sitenames do not match as expected.")

sitedataset[, c('area', 'source', 'edited', 'notes')] <-
  edited_lakes[ste_match, c("area", "source", "edited", "notes")]

sitedataset <- sitedataset %>%
  left_join(control_output %>%
              select(avg_int, max_int, controls, dsid, stid))

readr::write_csv(sitedataset,
                 paste0('data/version_', version, '/full_output_', version, '.csv'))
readr::write_csv(sitedataset %>% filter(!is.na(area)),
                 paste0('data/version_', version, '/assigned_areas_', version, '.csv'))
```
```{r, editedSites, echo = FALSE, warning=FALSE}
plot_edited(sitedataset)
```
This results in a dataset of `r nrow(edited_lakes)` sites with areas, of which `r sum(edited_lakes$edited %in% c(1,2))` were matched to reported locations. A total of `r sum(edited_lakes$edited == 1)` lake positions were moved, and `r sum(edited_lakes$edited == 3)` could not be properly positioned. There remain `r sum(!sitedataset$stid %in% edited_lakes$stid)` lakes to work through.
```{r, generateMissing, echo=FALSE, warning=FALSE, results='hide'}
new_output <- sitedataset %>%
  dplyr::left_join(control_output) %>%
  filter(is.na(area) & long < 0) %>%
  arrange(controls) %>%
  distinct(stid, .keep_all = TRUE) %>%
  st_as_sf(coords = c('long', 'lat'), crs = 4326)

new_output[is.na(new_output)] <- ''

new_output <- new_output %>%
  select(stid, dsid, name, edited, notes, area, avg_int, max_int, controls, geometry)

leaflet(new_output) %>%
  addTiles() %>%
  addCircleMarkers(popup = ~name)

sf::write_sf(new_output,
             paste0('data/version_', version, '/dataset_', version, '.shp'),
             delete_layer = TRUE)
```
## Conclusion

This process progressively aligns Neotoma sites with existing records in the national hydrographic databases, and acts as a check that can be used to update lake records in Neotoma.

You can explore the dataset further by sorting, filtering, or clicking on the links to the Neotoma Explorer in the table below:
```{r echo=FALSE}
sitedataset$area[is.na(sitedataset$area)] <- -999

sitedataset <- sitedataset %>%
  mutate(area = round(area, 1),
         lat  = round(lat, 3),
         long = round(long, 3),
         stid = paste0("<a target=_blank href='https://apps.neotomadb.org/explorer/?siteids=",
                       stid, "'>", stid, "</a>"),
         dsid = paste0("<a target=_blank href='https://apps.neotomadb.org/explorer/?datasetid=",
                       dsid, "'>", dsid, "</a>"))

DT::datatable(sitedataset, escape = FALSE,
              rownames = FALSE, filter = "bottom",
              options = list(scrollX = '600px'))
```
If you sort by `area` you will see that, for ease of use, all sites without assigned areas have the value -999. Using the links for site and dataset IDs it is possible to open the Neotoma Explorer and view these datasets directly. Please contact us if you have information about these sites. You can email Simon Goring at <a href="mailto:goring@wisc.edu">goring@wisc.edu</a> if you feel that sites ought to be corrected, or if you have amended information.

Note that many sites along marine margins (e.g., salt marshes) may have an artificially high area value, since the effective basin is in fact the ocean.

If you wish to edit some of the missing sites by hand, please check the folder of the latest posted version (e.g., [version 1.4](https://github.com/NeotomaDB/neotoma_lakes/tree/master/data/version_1.4)). The shapefile posted in that folder can be used to edit the sites by overlaying a basemap that includes lakes (e.g., Google Maps) and tracing the polygons of the lakes of interest to estimate area (in hectares).
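If you do trace a polygon, its area in hectares can be computed directly; a minimal sketch, assuming the traced outline was saved as `traced_lake.shp` (a hypothetical filename):

```{r areaSketch, eval=FALSE}
library(sf)

# Sketch: read a hand-traced lake polygon and report its area in hectares.
# st_area() measures geographic coordinates geodesically and returns square
# metres; one hectare is 10,000 square metres.
traced <- sf::st_read("traced_lake.shp")
area_ha <- as.numeric(sf::st_area(traced)) / 1e4
```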
## Version Updates
```{r, eval=FALSE}
# This is not implemented yet.
vers <- list.files('data', pattern = '^version_')

summ_tab <- data.frame(filename = rep(NA, length(vers)),
                       areas = rep(NA, length(vers)))

for (i in 1:length(vers)) {
  num <- stringr::str_match(vers[i], '\\d\\.\\d')
  filename <- paste0('data/', vers[i], '/area_lakes_', num, '.csv')
  version_table <- readr::read_csv(filename)
  summ_tab$filename[i] <- filename
  summ_tab$areas[i] <- version_table %>% filter(!is.na(area)) %>% nrow()
}
```
# References