Implement URL composition feature for Beacon files
jonatansteller committed Oct 8, 2023
1 parent a1b2059 commit 77aeaf0
Showing 5 changed files with 48 additions and 28 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,3 +13,4 @@
- Rename `-source_url_type` to `-content_type`
- Add option to harvest from file dump
- Bring back option to compile CSV table from scraped data
- Implement URL composition feature for Beacon files
17 changes: 8 additions & 9 deletions README.md
@@ -2,10 +2,10 @@

# Hydra Scraper

**Comprehensive scraper for APIs with Hydra pagination as well as for file dumps**
**Comprehensive scraper for Hydra-paginated APIs, Beacon files, and RDF file dumps**

This scraper provides a command-line toolset to pull data from various sources,
such as Hydra paginated APIs, beacon files, or local file dumps. The tool
such as Hydra paginated APIs, Beacon files, or local file dumps. The tool
differentiates between resource lists and individual resource files in
RDF-compatible formats such as JSON-LD or Turtle, but it can also handle, for
example, LIDO files. Command-line calls can be combined and adapted to build
@@ -42,12 +42,12 @@ run the script without interaction.
- `-download '<string list>'`: comma-separated list of what you need, possible values:
- `lists`: all Hydra-paginated lists (requires `-source_url`)
- `list_triples`: all RDF triples in a Hydra API (requires `-source_url`)
- `beacon`: beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or beacon (requires `-source_url`/`_file`)
- `beacon`: Beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or Beacon (requires `-source_url`/`_file`)
- `resource_triples`: all RDF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_table`: CSV table of data in resources (requires `-source_url`/`_file`/`_folder`)
- `-source_url '<url>'`: use this entry-point URL to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this beacon file to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this Beacon file to scrape content (default: none)
- `-source_folder '<name of folder>'`: use this folder (default: none, requires `-content_type`)
- `-content_type '<string>'`: request/use this content type when scraping content (default: none)
- `-target_folder '<name of folder>'`: download to this subfolder of `downloads` (default: timestamp)
@@ -79,7 +79,7 @@ Get **CGIF data** from an API entry point:
python go.py -download 'list_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
```

Get **CGIF data from a beacon** file:
Get **CGIF data from a Beacon** file:

```
python go.py -download 'resource_triples' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
@@ -153,7 +153,7 @@ This package has three main areas:

1. The file `go.py` provides the main logic of a scraping run.
2. It relies on several `helpers` that provide basic functions such as cleaning up request arguments, saving files, printing status updates, or providing configuration options throughout the package.
3. The two classes `Hydra` and `Beacon` do the heavy lifting of paging through an API entry point or a (beacon) list of individual resources, respectively. In addition to a standard initialisation, both classes have a `populate()` function that retrieves and saves data. Additional functions may then carry out further tasks such as saving a beacon list or saving collected triples.
3. The two classes `Hydra` and `Beacon` do the heavy lifting of paging through an API entry point or a (Beacon) list of individual resources, respectively. In addition to a standard initialisation, both classes have a `populate()` function that retrieves and saves data. Additional functions may then carry out further tasks such as saving a Beacon list or saving collected triples.
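
For orientation, a rough usage sketch of how these pieces fit together follows. Only the class names and the `populate()` method (with its `save_original_files` parameter) are taken from the code in this repository; the module paths, constructor arguments, and the follow-up calls are assumptions for illustration, not the actual signatures.

```
# Hypothetical sketch only: constructor arguments and follow-up calls are assumed.
from classes.hydra import Hydra      # module path assumed by analogy with classes/beacon.py
from classes.beacon import Beacon

# Page through a Hydra API entry point and store what was found
hydra = Hydra('https://corpusvitrearum.de/cvma-digital/bildarchiv.html')
hydra.populate()
# ... a further call could then write a Beacon list of all listed resources ...

# Retrieve every resource listed in a Beacon file
beacon = Beacon('downloads/sample-cgif/beacon.txt')
beacon.populate(save_original_files=True)
# ... a further call could then save the collected triples ...
```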

If you change the code, please remember to document each function and walk other users through significant steps. This package is governed by the [Contributor Covenant](https://www.contributor-covenant.org/de/version/1/4/code-of-conduct/) code of conduct. Please keep this in mind in all interactions.

@@ -168,11 +168,10 @@ Use GitHub to make the release. Use semantic versioning once the scraper has rea

## Roadmap

- Add URL composition feature of the Beacon standard
- Enable checking `schema:dateModified` when collating paged results
- Implement a JSON return (including dateModified, number of resources, errors)
- Add conversion routines, i.e. for LIDO to CGIF or for the RADAR version of DataCite/DataVerse to CGIF
- Allow filtering triples for CGIF, align triples produced by lists and by resources, add any quality assurance that is needed
- Allow usage of OAI-PMH APIs to produce beacon lists
- Allow usage of OAI-PMH APIs to produce Beacon lists
- Re-add the interactive mode
- Properly package the script and use the system's download folder
30 changes: 15 additions & 15 deletions classes/beacon.py
@@ -154,26 +154,26 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
self.non_rdf_resources_list.append(resource_url)
continue

# Report any failed state
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
elif self.missing_resources > 0 and self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['missing'] = self.missing_resources_list
status_report['non_rdf'] = self.non_rdf_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
elif self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['non_rdf'] = self.non_rdf_resources_list

# Delay next retrieval to avoid a server block
echo_progress('Retrieving individual resources', number, self.number_of_resources)
if self.resources_from_folder == False:
sleep(config['download_delay'])

# Report any failed state
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
elif self.missing_resources > 0 and self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['missing'] = self.missing_resources_list
status_report['non_rdf'] = self.non_rdf_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
elif self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['non_rdf'] = self.non_rdf_resources_list

# Notify object that it is populated
self.populated = True

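For readers skimming the diff: the block that compiles the failure report is moved out of the per-resource loop, so it now runs once after all retrievals instead of on every iteration. A simplified sketch of the resulting control flow in `populate()` (the loop header and body are heavily abridged and partly assumed; the report conditions are taken from the diff):

```
# Simplified control flow after this change; the real loop also saves files,
# collects triples, and cleans resource URLs.
for number, resource_url in enumerate(self.resources, start=1):
    # ... retrieve resource_url, recording missing or non-RDF resources ...
    echo_progress('Retrieving individual resources', number, self.number_of_resources)
    if self.resources_from_folder == False:
        sleep(config['download_delay'])

# Report any failed state once, after the loop has finished
if self.missing_resources >= self.number_of_resources:
    status_report['success'] = False
    status_report['reason'] = 'All resources were missing.'
```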
22 changes: 21 additions & 1 deletion helpers/fileio.py
@@ -9,6 +9,7 @@
# Import libraries
from os import makedirs
from glob import glob
from re import search

# Import script modules
from helpers.clean import clean_lines
@@ -79,13 +80,32 @@ def read_list(file_path:str) -> list:
f = open(file_path, 'r')
content = f.read()

# Optionally identify an ID pattern
pattern = search(r"(?<=#TARGET: ).*(?<!\n)", content)
if pattern != None:
pattern = pattern.group()
if pattern.find('{ID}') == -1:
pattern = None

# Clean empty lines and comments
content = clean_lines(content)
lines = iter(content.splitlines())

# Add each line to list
# Go through each line
entries = []
for line in lines:

# Remove additional Beacon features
line_option1 = line.find(' |')
line_option2 = line.find('|')
if line_option1 != -1:
line = line[:line_option1]
elif line_option2 != -1:
line = line[:line_option2]

# Add complete line to list
if pattern != None:
line = pattern.replace('{ID}', line)
entries.append(line)

# Return list
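To illustrate the URL composition feature added here: `read_list()` now looks for a `#TARGET:` line containing an `{ID}` placeholder, strips any additional Beacon features after a `|` separator, and expands each remaining identifier into a full URL. A minimal example Beacon file (the target URL, identifiers, and file path are made up for illustration):

```
#FORMAT: BEACON
#TARGET: https://www.example.org/resource/{ID}

10001
10002 | some additional Beacon annotation
10003
```

Reading such a file would yield a plain list of composed URLs:

```
read_list('downloads/sample/beacon.txt')
# ['https://www.example.org/resource/10001',
#  'https://www.example.org/resource/10002',
#  'https://www.example.org/resource/10003']
```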
6 changes: 3 additions & 3 deletions helpers/status.py
@@ -64,17 +64,17 @@ def echo_help():
list_triples: all RDF triples in a Hydra API (requires -source_url)
beacon: beacon file of all resources listed in an API (requires -source_url)
beacon: Beacon file of all resources listed in an API (requires -source_url)
resources: all resources of an API or beacon (requires -source_url/_file)
resources: all resources of an API or Beacon (requires -source_url/_file)
resource_triples: all RDF triples of resources (requires -source_url/_file/_folder)
resource_table: CSV table of data in resources (requires -source_url/_file/_folder)
-source_url '<url>': use this entry-point URL to scrape content (default: none)
-source_file '<path to file>': use the URLs in this beacon file to scrape content (default: none)
-source_file '<path to file>': use the URLs in this Beacon file to scrape content (default: none)
-source_folder '<name of folder>': use this folder (default: none, requires -content_type)
