Implement URL composition feature for Beacon files
jonatansteller committed Oct 8, 2023
1 parent a1b2059 commit 77aeaf0
Showing 5 changed files with 48 additions and 28 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,3 +13,4 @@
- Rename `-source_url_type` to `-content_type`
- Add option to harvest from file dump
- Bring back option to compile CSV table from scraped data
- Implement URL composition feature for Beacon files
17 changes: 8 additions & 9 deletions README.md
@@ -2,10 +2,10 @@

# Hydra Scraper

**Comprehensive scraper for APIs with Hydra pagination as well as for file dumps**
**Comprehensive scraper for Hydra-paginated APIs, Beacon files, and RDF file dumps**

This scraper provides a command-line toolset to pull data from various sources,
such as Hydra paginated APIs, beacon files, or local file dumps. The tool
such as Hydra paginated APIs, Beacon files, or local file dumps. The tool
differentiates between resource lists and individual resource files in
RDF-compatible formats such as JSON-LD or Turtle, but it can also handle, for
example, LIDO files. Command-line calls can be combined and adapted to build
@@ -42,12 +42,12 @@ run the script without interaction.
- `-download '<string list>'`: comma-separated list of what you need, possible values:
- `lists`: all Hydra-paginated lists (requires `-source_url`)
- `list_triples`: all RDF triples in a Hydra API (requires `-source_url`)
- `beacon`: beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or beacon (requires `-source_url`/`_file`)
- `beacon`: Beacon file of all resources listed in an API (requires `-source_url`)
- `resources`: all resources of an API or Beacon (requires `-source_url`/`_file`)
- `resource_triples`: all RDF triples of resources (requires `-source_url`/`_file`/`_folder`)
- `resource_table`: CSV table of data in resources (requires `-source_url`/`_file`/`_folder`)
- `-source_url '<url>'`: use this entry-point URL to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this beacon file to scrape content (default: none)
- `-source_file '<path to file>'`: use the URLs in this Beacon file to scrape content (default: none)
- `-source_folder '<name of folder>'`: use this folder (default: none, requires `-content_type`)
- `-content_type '<string>'`: request/use this content type when scraping content (default: none)
- `-target_folder '<name of folder>'`: download to this subfolder of `downloads` (default: timestamp)
@@ -79,7 +79,7 @@ Get **CGIF data** from an API entry point:
python go.py -download 'list_triples' -source_url 'https://corpusvitrearum.de/cvma-digital/bildarchiv.html' -target_folder 'sample-cgif'
```

Get **CGIF data from a beacon** file:
Get **CGIF data from a Beacon** file:

```
python go.py -download 'resource_triples' -source_file 'downloads/sample-cgif/beacon.txt' -target_folder 'sample-cgif'
@@ -153,7 +153,7 @@ This package has three main areas:

1. The file `go.py` provides the main logic of a scraping run.
2. It relies on several `helpers` that provide basic functions such as cleaning up request arguments, saving files, printing status updates, or providing configuration options throughout the package.
3. The two classes `Hydra` and `Beacon` do the heavy lifting of paging through an API entry point or a (beacon) list of individual resources, respectively. In addition to a standard initialisation, both classes have a `populate()` function that retrieves and saves data. Additional functions may then carry out further tasks such as saving a beacon list or saving collected triples.
3. The two classes `Hydra` and `Beacon` do the heavy lifting of paging through an API entry point or a (Beacon) list of individual resources, respectively. In addition to a standard initialisation, both classes have a `populate()` function that retrieves and saves data. Additional functions may then carry out further tasks such as saving a Beacon list or saving collected triples.
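
For orientation, a rough usage sketch of how these pieces fit together follows. Only the class names and the `populate()` method (with its `save_original_files` parameter) are taken from the code in this repository; the module paths, constructor arguments, and the follow-up calls are assumptions for illustration, not the actual signatures.

```
# Hypothetical sketch only: constructor arguments and follow-up calls are assumed.
from classes.hydra import Hydra      # module path assumed by analogy with classes/beacon.py
from classes.beacon import Beacon

# Page through a Hydra API entry point and store what was found
hydra = Hydra('https://corpusvitrearum.de/cvma-digital/bildarchiv.html')
hydra.populate()
# ... a further call could then write a Beacon list of all listed resources ...

# Retrieve every resource listed in a Beacon file
beacon = Beacon('downloads/sample-cgif/beacon.txt')
beacon.populate(save_original_files=True)
# ... a further call could then save the collected triples ...
```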

If you change the code, please remember to document each function and walk other users through significant steps. This package is governed by the [Contributor Covenant](https://www.contributor-covenant.org/de/version/1/4/code-of-conduct/) code of conduct. Please keep this in mind in all interactions.

@@ -168,11 +168,10 @@ Use GitHub to make the release. Use semantic versioning once the scraper has rea

## Roadmap

- Add URL composition feature of the Beacon standard
- Enable checking `schema:dateModified` when collating paged results
- Implement a JSON return (including dateModified, number of resources, errors)
- Add conversion routines, i.e. for LIDO to CGIF or for the RADAR version of DataCite/DataVerse to CGIF
- Allow filtering triples for CGIF, align triples produced by lists and by resources, add any quality assurance that is needed
- Allow usage of OAI-PMH APIs to produce beacon lists
- Allow usage of OAI-PMH APIs to produce Beacon lists
- Re-add the interactive mode
- Properly package the script and use the system's download folder
30 changes: 15 additions & 15 deletions classes/beacon.py
@@ -154,26 +154,26 @@ def populate(self, save_original_files:bool = True, clean_resource_urls:list = [
self.non_rdf_resources_list.append(resource_url)
continue

# Report any failed state
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
elif self.missing_resources > 0 and self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['missing'] = self.missing_resources_list
status_report['non_rdf'] = self.non_rdf_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
elif self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['non_rdf'] = self.non_rdf_resources_list

# Delay next retrieval to avoid a server block
echo_progress('Retrieving individual resources', number, self.number_of_resources)
if self.resources_from_folder == False:
sleep(config['download_delay'])

# Report any failed state
if self.missing_resources >= self.number_of_resources:
status_report['success'] = False
status_report['reason'] = 'All resources were missing.'
elif self.missing_resources > 0 and self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing and ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['missing'] = self.missing_resources_list
status_report['non_rdf'] = self.non_rdf_resources_list
elif self.missing_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.missing_resources) + ' were missing.'
status_report['missing'] = self.missing_resources_list
elif self.non_rdf_resources > 0:
status_report['reason'] = 'Resources retrieved, but ' + str(self.non_rdf_resources) + ' were not RDF-compatible.'
status_report['non_rdf'] = self.non_rdf_resources_list

# Notify object that it is populated
self.populated = True

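For readers skimming the diff: the block that compiles the failure report is moved out of the per-resource loop, so it now runs once after all retrievals instead of on every iteration. A simplified sketch of the resulting control flow in `populate()` (the loop header and body are heavily abridged and partly assumed; the report conditions are taken from the diff):

```
# Simplified control flow after this change; the real loop also saves files,
# collects triples, and cleans resource URLs.
for number, resource_url in enumerate(self.resources, start=1):
    # ... retrieve resource_url, recording missing or non-RDF resources ...
    echo_progress('Retrieving individual resources', number, self.number_of_resources)
    if self.resources_from_folder == False:
        sleep(config['download_delay'])

# Report any failed state once, after the loop has finished
if self.missing_resources >= self.number_of_resources:
    status_report['success'] = False
    status_report['reason'] = 'All resources were missing.'
```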
22 changes: 21 additions & 1 deletion helpers/fileio.py
@@ -9,6 +9,7 @@
# Import libraries
from os import makedirs
from glob import glob
from re import search

# Import script modules
from helpers.clean import clean_lines
@@ -79,13 +80,32 @@ def read_list(file_path:str) -> list:
f = open(file_path, 'r')
content = f.read()

# Optionally identify an ID pattern
pattern = search(r"(?<=#TARGET: ).*(?<!\n)", content)
if pattern != None:
pattern = pattern.group()
if pattern.find('{ID}') == -1:
pattern = None

# Clean empty lines and comments
content = clean_lines(content)
lines = iter(content.splitlines())

# Add each line to list
# Go through each line
entries = []
for line in lines:

# Remove additional Beacon features
line_option1 = line.find(' |')
line_option2 = line.find('|')
if line_option1 != -1:
line = line[:line_option1]
elif line_option2 != -1:
line = line[:line_option2]

# Add complete line to list
if pattern != None:
line = pattern.replace('{ID}', line)
entries.append(line)

# Return list
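To illustrate the URL composition feature added here: `read_list()` now looks for a `#TARGET:` line containing an `{ID}` placeholder, strips any additional Beacon features after a `|` separator, and expands each remaining identifier into a full URL. A minimal example Beacon file (the target URL, identifiers, and file path are made up for illustration):

```
#FORMAT: BEACON
#TARGET: https://www.example.org/resource/{ID}

10001
10002 | some additional Beacon annotation
10003
```

Reading such a file would yield a plain list of composed URLs:

```
read_list('downloads/sample/beacon.txt')
# ['https://www.example.org/resource/10001',
#  'https://www.example.org/resource/10002',
#  'https://www.example.org/resource/10003']
```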
6 changes: 3 additions & 3 deletions helpers/status.py
@@ -64,17 +64,17 @@ def echo_help():
list_triples: all RDF triples in a Hydra API (requires -source_url)
beacon: beacon file of all resources listed in an API (requires -source_url)
beacon: Beacon file of all resources listed in an API (requires -source_url)
resources: all resources of an API or beacon (requires -source_url/_file)
resources: all resources of an API or Beacon (requires -source_url/_file)
resource_triples: all RDF triples of resources (requires -source_url/_file/_folder)
resource_table: CSV table of data in resources (requires -source_url/_file/_folder)
-source_url '<url>': use this entry-point URL to scrape content (default: none)
-source_file '<path to file>': use the URLs in this beacon file to scrape content (default: none)
-source_file '<path to file>': use the URLs in this Beacon file to scrape content (default: none)
-source_folder '<name of folder>': use this folder (default: none, requires -content_type)
