fix(cff): doi structure parsing #121

Merged: 81 commits, Dec 17, 2024
c102e28
feat:write goal of cff parser
rmfranken Nov 14, 2024
5ae7f8c
feat: add function body and placeholder
rmfranken Nov 14, 2024
1206d30
refactor: do list(dict) formatting within the function
rmfranken Nov 14, 2024
751918a
fix: forgot arrow in type annotation
rmfranken Nov 14, 2024
9edcd6f
chore: black reformat
rmfranken Nov 14, 2024
68f48a2
feat: update parsers to output graphs instead of property "doubles", …
rmfranken Nov 15, 2024
84db808
chore: update docstrings
rmfranken Nov 15, 2024
b43f448
fix: remove python <= 3.12 req
rmfranken Nov 15, 2024
5fc788a
feat: add "affiliation" of authors to output
rmfranken Nov 18, 2024
2dd85dd
docs: add CFF file (#111)
rmfranken Nov 14, 2024
f17dd9f
feat: add test for cff author parsing
rmfranken Nov 18, 2024
93d1286
fix: python dependencies in poetry
rmfranken Nov 18, 2024
15dab2d
fix: adapt parser tests to graph structure instead of property struct…
rmfranken Nov 18, 2024
fa53024
fix: remove faulty cff function example description
rmfranken Nov 18, 2024
a536511
Update test_cff.py
rmfranken Nov 18, 2024
33e407d
Merge branch 'main' into cff_parsing
rmfranken Nov 18, 2024
d9fe975
Merge branch 'main' into cff_parsing
rmfranken Nov 19, 2024
a91b59c
fix: unused import
rmfranken Nov 20, 2024
b088a45
fix: typo
rmfranken Nov 20, 2024
d3eb1f4
refactor: rename variable
rmfranken Nov 20, 2024
b81b5b6
docs: add docstring parameter for parser class
rmfranken Nov 20, 2024
5238b64
refactor: rename variable
rmfranken Nov 20, 2024
5645f6d
feat: check if orcid is valid before writing
rmfranken Nov 20, 2024
6a25951
refactor: rename variable
rmfranken Nov 20, 2024
f7a1165
chore: remove pyshacl
rmfranken Nov 20, 2024
5aba09d
fix: typo
rmfranken Nov 20, 2024
aef27dd
fix: remove unused imports
rmfranken Nov 20, 2024
ee9238e
fix: tests for cff, add test for doi, move doi and orcid matchers to …
rmfranken Nov 22, 2024
195c778
docs:fix docs of valid_doi_extractor
rmfranken Nov 22, 2024
1ba0ee7
Merge branch 'main' into cff_parsing
rmfranken Nov 22, 2024
03adbf0
refactor: doi re matcher
rmfranken Nov 22, 2024
9673aac
chore: remove unneccessary comment
rmfranken Nov 22, 2024
420252e
chore(docker): bump base layer to python 3.13
cmdoret Nov 28, 2024
2a6272a
Update gimie/parsers/abstract.py
rmfranken Nov 28, 2024
a331a63
Update gimie/parsers/cff.py
rmfranken Nov 28, 2024
9d69267
chore(docker): use python 3.12 base
cmdoret Nov 28, 2024
b91df2a
fix: improve tests, rename some variables
rmfranken Nov 28, 2024
8f61fe2
Merge branch 'cff_parsing' of github.com:sdsc-ordes/gimie into cff_pa…
rmfranken Nov 28, 2024
477ab88
fix:rename the example in extract_doi_march
rmfranken Nov 28, 2024
0eb1423
fix: DOI from dict, not flat value
rmfranken Dec 16, 2024
43fdf51
fix:make cff example correct
rmfranken Dec 16, 2024
343defd
docs: adapt docstring example to match real CFF structure
rmfranken Dec 16, 2024
513a546
fix:docstring still fucked
rmfranken Dec 16, 2024
d650e6a
fix:typo docstring
rmfranken Dec 16, 2024
077ad37
fix:chatGPT's suggestion for docstring formatting
rmfranken Dec 16, 2024
5a16e3c
fix:OK, no multiline, and double escape newlines
rmfranken Dec 16, 2024
6e9d243
fix: spelling mistake in run as library docs (#113)
rmfranken Nov 19, 2024
1188f3b
fix: unused import
rmfranken Nov 20, 2024
409232a
fix: typo
rmfranken Nov 20, 2024
4dcfc24
refactor: rename variable
rmfranken Nov 20, 2024
faf0628
docs: add docstring parameter for parser class
rmfranken Nov 20, 2024
246167c
refactor: rename variable
rmfranken Nov 20, 2024
2d063b7
feat: check if orcid is valid before writing
rmfranken Nov 20, 2024
200348a
refactor: rename variable
rmfranken Nov 20, 2024
3d27929
chore: remove pyshacl
rmfranken Nov 20, 2024
d3bee47
fix: typo
rmfranken Nov 20, 2024
c39b22d
fix: remove unused imports
rmfranken Nov 20, 2024
3828419
fix: tests for cff, add test for doi, move doi and orcid matchers to …
rmfranken Nov 22, 2024
141f67a
docs:fix docs of valid_doi_extractor
rmfranken Nov 22, 2024
074ee88
Improve authentication error messages (#116)
raj921 Nov 19, 2024
fa7b3b4
ci: make conventional PR title check optional (#117)
cmdoret Nov 19, 2024
f74cab3
refactor: doi re matcher
rmfranken Nov 22, 2024
08a938f
chore: remove unneccessary comment
rmfranken Nov 22, 2024
1a50e72
fix: improve tests, rename some variables
rmfranken Nov 28, 2024
b3d988e
chore(docker): bump base layer to python 3.13
cmdoret Nov 28, 2024
cd583f8
Update gimie/parsers/abstract.py
rmfranken Nov 28, 2024
b2d399b
Update gimie/parsers/cff.py
rmfranken Nov 28, 2024
9499f9e
chore(docker): use python 3.12 base
cmdoret Nov 28, 2024
f0863d1
fix:rename the example in extract_doi_march
rmfranken Nov 28, 2024
4f5ef97
fix: DOI from dict, not flat value
rmfranken Dec 16, 2024
3dbda53
fix:make cff example correct
rmfranken Dec 16, 2024
9ac34d6
docs: adapt docstring example to match real CFF structure
rmfranken Dec 16, 2024
0cc3efe
fix:docstring still fucked
rmfranken Dec 16, 2024
d9a05df
fix:typo docstring
rmfranken Dec 16, 2024
ce86cbb
fix:chatGPT's suggestion for docstring formatting
rmfranken Dec 16, 2024
a4445e4
fix:OK, no multiline, and double escape newlines
rmfranken Dec 16, 2024
aa1dde6
Merge branch 'hotfix-CFF-doi' of github.com:sdsc-ordes/gimie into hot…
rmfranken Dec 16, 2024
70bcac9
Merge branch 'main' into hotfix-CFF-doi
rmfranken Dec 17, 2024
a884cbc
feat: support multiple DOI's
rmfranken Dec 17, 2024
df9295e
refactor(cff): reduce nesting
cmdoret Dec 17, 2024
76060ca
chore(cff): use type hint for list from python standard collection
rmfranken Dec 17, 2024
63 changes: 36 additions & 27 deletions gimie/parsers/cff.py
@@ -34,21 +34,22 @@ def __init__(self, subject: str):
         super().__init__(subject)
 
     def parse(self, data: bytes) -> Graph:
-        """Extracts a DOI link and list of authors from a CFF file and returns a
-        graph with a single triple <subject> <schema:citation> <doi>
+        """Extracts DOIs and list of authors from a CFF file and returns a
+        graph with triples <subject> <schema:citation> <doi>
         and a number of author objects with <schema:name> and <md4i:orcid> values.
-        If no DOI is found, it will not be included in the graph.
-        If no authors are found, it will not be included in the graph.
-        If neither authors nor DOI are found, an empty graph is returned.
+        If no DOIs are found, they will not be included in the graph.
+        If no authors are found, they will not be included in the graph.
+        If neither authors nor DOIs are found, an empty graph is returned.
         """
         extracted_cff_triples = Graph()
-        doi = get_cff_doi(data)
+        dois = get_cff_doi(data)
         authors = get_cff_authors(data)
 
-        if doi:
-            extracted_cff_triples.add(
-                (self.subject, SDO.citation, URIRef(doi))
-            )
+        if dois:
+            for doi in dois:
+                extracted_cff_triples.add(
+                    (self.subject, SDO.citation, URIRef(doi))
+                )
         if not authors:
             return extracted_cff_triples
         for author in authors:
@@ -119,8 +120,8 @@ def doi_to_url(doi: str) -> str:
     return f"https://doi.org/{doi_match}"
 
 
-def get_cff_doi(data: bytes) -> Optional[str]:
-    """Given a CFF file, returns the DOI, if any.
+def get_cff_doi(data: bytes) -> Optional[list[str]]:
+    """Given a CFF file, returns a list of DOIs, if any.
 
     Parameters
     ----------
@@ -129,15 +130,16 @@ def get_cff_doi(data: bytes) -> Optional[str]:
 
     Returns
     -------
-    str, optional
-        doi formatted as a valid url
+    list of str, optional
+        DOIs formatted as valid URLs
 
     Examples
     --------
-    >>> get_cff_doi(bytes("doi: 10.5281/zenodo.1234", encoding="utf8"))
-    'https://doi.org/10.5281/zenodo.1234'
+    >>> get_cff_doi(bytes("identifiers:\\n  - type: doi\\n    value: 10.5281/zenodo.1234\\n  - type: doi\\n    value: 10.5281/zenodo.5678", encoding="utf8"))
+    ['https://doi.org/10.5281/zenodo.1234', 'https://doi.org/10.5281/zenodo.5678']
+    >>> get_cff_doi(bytes("identifiers:\\n  - type: doi\\n    value: 10.5281/zenodo.9012", encoding="utf8"))
+    ['https://doi.org/10.5281/zenodo.9012']
     >>> get_cff_doi(bytes("abc: def", encoding="utf8"))
-
     """
 
     try:
@@ -146,18 +148,25 @@ def get_cff_doi(data: bytes) -> Optional[str]:
         logger.warning("cannot read CITATION.cff, skipped.")
         return None
 
+    doi_urls = []
+
     try:
-        doi_url = doi_to_url(cff["doi"])
-    # No doi in cff file
+        identifiers = cff["identifiers"]
     except (KeyError, TypeError):
-        logger.warning("CITATION.cff does not contain a 'doi' key.")
-        doi_url = None
-    # doi is malformed
-    except ValueError as err:
-        logger.warning(err)
-        doi_url = None
-
-    return doi_url
+        logger.warning(
+            "CITATION.cff does not contain a valid 'identifiers' key."
+        )
+        return None
+
+    for identifier in identifiers:
+        if identifier.get("type") == "doi":
+            try:
+                doi_url = doi_to_url(identifier["value"])
+                doi_urls.append(doi_url)
+            except ValueError as err:
+                logger.warning(err)
+
+    return doi_urls or None
 
 
 def get_cff_authors(data: bytes) -> Optional[List[dict[str, str]]]:
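
The change to get_cff_doi above reads DOIs from the CFF `identifiers` list instead of a flat `doi` key, skips malformed entries, and returns None when nothing usable is found. A minimal standalone sketch of that behaviour follows. It works on an already-parsed mapping rather than raw bytes, and the names `DOI_PATTERN`, `to_doi_url` and `collect_doi_urls` are hypothetical stand-ins introduced here for illustration; in particular the regular expression is a simplified assumption, not the matcher used in gimie.

```python
import re
from typing import Optional

# Hypothetical, simplified DOI pattern (assumption); gimie's matcher may differ.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def to_doi_url(doi: str) -> str:
    """Turn a DOI string into a https://doi.org/ URL, or raise ValueError."""
    match = DOI_PATTERN.search(doi)
    if match is None:
        raise ValueError(f"{doi} is not a valid DOI")
    return f"https://doi.org/{match.group()}"


def collect_doi_urls(cff: dict) -> Optional[list[str]]:
    """Collect a URL for every identifier of type 'doi' in a parsed CFF mapping."""
    try:
        identifiers = cff["identifiers"]
    except (KeyError, TypeError):
        # No 'identifiers' key, or the parsed document is not a mapping.
        return None

    urls = []
    for identifier in identifiers:
        if identifier.get("type") != "doi":
            continue
        try:
            urls.append(to_doi_url(identifier["value"]))
        except ValueError:
            # A malformed DOI is skipped; the remaining identifiers are kept.
            continue
    # An empty list is reported as None, mirroring `doi_urls or None` in the diff.
    return urls or None
```

Returning None rather than an empty list keeps the caller's `if dois:` check in parse() unchanged, at the cost of an Optional return type.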
2 changes: 1 addition & 1 deletion poetry.lock


19 changes: 15 additions & 4 deletions tests/test_cff.py
@@ -73,12 +73,23 @@ def test_broken_cff(cff_file):
 def test_parse_doi():
     cff_file = b"""
     cff-version: 1.2.0
-    title: gimie
-    doi: 10.5281/zenodo.1234567
+    message: If you use this software, please cite it using these metadata.
+    title: 'napari: a multi-dimensional image viewer for Python'
+    identifiers:
+    - type: doi
+      value: 10.5281/zenodo.3555620
+    - type: doi
+      value: 10.21105/joss.01274
     """
-    obj = next(
+    parsed_dois = list(
         CffParser(subject=URIRef("https://example.org/"))
         .parse(data=cff_file)
         .objects()
     )
-    assert URIRef("https://doi.org/10.5281/zenodo.1234567") == obj
+    expected_dois = [
+        URIRef("https://doi.org/10.5281/zenodo.3555620"),
+        URIRef("https://doi.org/10.21105/joss.01274"),
+    ]
+    # parsed_dois already contains all parsed DOI objects
+    for doi in expected_dois:
+        assert doi in parsed_dois