Merging KG2.10.1 build code #411

Merged on Sep 8, 2024 (126 commits)
42d1a43
#390 first attempt at reducing dependencies
ecwood Jul 11, 2024
20eeff6
#390 hopefully these can go as well
ecwood Jul 11, 2024
281bf43
#398 first pass at clinical trials kg
ecwood Jul 16, 2024
4f338f5
#398 forgot the url map entry
ecwood Jul 16, 2024
cfd305e
#399 to avoid confusion
ecwood Jul 16, 2024
42d9872
#398 updating the Snakemake pipeline
ecwood Jul 16, 2024
134f5ec
#398 correcting a typo
ecwood Jul 16, 2024
a61d098
#398 correcting another typo
ecwood Jul 16, 2024
93168e7
#398 correcting typo in extract
ecwood Jul 16, 2024
dd73e65
#398 correcting typo in conversion
ecwood Jul 16, 2024
0c16ea3
#398 correcting to standardize in kg2_util
ecwood Jul 16, 2024
4a13131
#398 versioning attempt
ecwood Jul 16, 2024
974ed5c
#398 clean up the versioning
ecwood Jul 16, 2024
9d65fd9
#398 add an entry into KL/AT map
ecwood Jul 16, 2024
bdf1ad1
#398 reworking update date
ecwood Jul 16, 2024
f0f73fd
#398 revising some access patterns
ecwood Jul 16, 2024
73c085f
#398 changing data type for datetime
ecwood Jul 16, 2024
6bdee23
#398 have to save as a string afterwards
ecwood Jul 16, 2024
00d53f4
#398 handling an edge case
ecwood Jul 16, 2024
98b6ad5
#398 some debugging code
ecwood Jul 16, 2024
a6049fc
#398 more debugging code
ecwood Jul 16, 2024
0cc5805
#398 another (very strange) edge case
ecwood Jul 16, 2024
a163ae3
#393
ecwood Jul 16, 2024
65543c9
#398 remove debugging info
ecwood Jul 16, 2024
e5544dc
#140 architecture for versioning files
ecwood Jul 16, 2024
2796ecd
#140 comment out s3 command for ci
ecwood Jul 16, 2024
1270624
#140 a lot more changes to log file names and tsv output
ecwood Jul 16, 2024
f0aee45
#140 on the neo4j side
ecwood Jul 17, 2024
07c8549
#140 adding the name to the name of other build artifacts as well
ecwood Jul 17, 2024
184fa6e
#140 made sure its only defined once
ecwood Jul 17, 2024
49d75ed
#393 remove RepoDB from the build system
ecwood Jul 17, 2024
ce8d8de
#393 have to remove info from kg2_util as well
ecwood Jul 17, 2024
6bdb5a6
#400 first pass at this
ecwood Jul 17, 2024
31cb00a
#400 try this instead
ecwood Jul 17, 2024
4750665
#400 handling the expansion map (hopefully)
ecwood Jul 17, 2024
89591c3
#392 initial edge blocklist (no synonyms yet)
ecwood Jul 22, 2024
e33b320
#387 grouping together xml blocks
ecwood Jul 22, 2024
d7743bb
#392 autism synonyms
ecwood Jul 22, 2024
7c53a60
#392 full edge blocklist
ecwood Jul 22, 2024
f5c7274
#387 parses it into little dictionaries (generically)
ecwood Jul 25, 2024
6db935f
#387 corrected some bugs with the XML parsing
ecwood Jul 27, 2024
7b4ac97
#387 handling case where something is just one line and not in anothe…
ecwood Jul 27, 2024
31b4779
#404, testing it out on CI first
ecwood Aug 1, 2024
b7597b9
#404 predicate remapping for biolink 4.2.1
ecwood Aug 1, 2024
26f551b
Merge branch 'dependencypruning' of github.com:RTXteam/RTX-KG2 into m…
ecwood Aug 1, 2024
2e62525
#387 handle doctype special case from foodon
ecwood Aug 8, 2024
38634dd
#387 handle doctype special case from foodon
ecwood Aug 9, 2024
23ff6ea
#387 refactored for clarity
ecwood Aug 11, 2024
9b8dfc4
#387 more refactoring, but pre-sorting into classes
ecwood Aug 11, 2024
bab707c
#387 refactored into class form
ecwood Aug 12, 2024
a55212c
#387 added in output filing
ecwood Aug 13, 2024
e8d9e88
#387 moving bc of kg2_util
ecwood Aug 13, 2024
8d6668f
#387 slightly more efficient
ecwood Aug 13, 2024
b377ae9
#387 loads multiple files now
ecwood Aug 15, 2024
e735855
#387 save the name of the output file as well
ecwood Aug 17, 2024
462c0bf
#387 start of processing the ontologies JSON Lines file
ecwood Aug 23, 2024
b12e984
#387 additional weird sources due to FOODON
ecwood Aug 27, 2024
e9d6d68
#387 more additional weird source links due to FOODON
ecwood Aug 27, 2024
61c3f06
#387 even more additional weird source links due to FOODON
ecwood Aug 27, 2024
da66493
#387 final additional weird source links due to FOODON
ecwood Aug 27, 2024
0d40be8
#387 patch to get around weird ids showing up when trying to prefix m…
ecwood Aug 27, 2024
0a22964
#387 more weird prefixes
ecwood Aug 27, 2024
919d3b8
#387 today's work on the ontologies ETL
ecwood Aug 27, 2024
6613470
#387 finishing up the different edge types
ecwood Aug 27, 2024
0750062
#387 for testing purposes
ecwood Aug 29, 2024
7ed39cb
#387 don't need that print statement anymore
ecwood Aug 29, 2024
7d18400
#387 looks like we're not using this anymore
ecwood Aug 29, 2024
2eb8b00
#387 try out the new validate kg2 util
ecwood Aug 29, 2024
2375994
#387 drastic changes: REMOVAL OF ONTOBIO from kg2_util and validation
ecwood Aug 29, 2024
5508b9b
#387 the ACTUAL removal of ontobio
ecwood Aug 29, 2024
a890695
#387 addressing a name change
ecwood Aug 29, 2024
86492f5
#387 format adjustment
ecwood Aug 29, 2024
86d431f
#387 adjustments to the mapping file to go along with new comparison …
ecwood Aug 29, 2024
e63311b
#387 apparently these have to be different to match biolink
ecwood Aug 29, 2024
9d7213b
#387 kg2_util didn't previously commit correctly
ecwood Aug 29, 2024
e191988
#387 the recursive category picker is working python3 ontologies_jsonl…
ecwood Aug 29, 2024
314b54f
#387 some date restructuring, ontology node versioning, and changes t…
ecwood Sep 2, 2024
c36f46c
#387 need to have name for ontology node
ecwood Sep 2, 2024
d4ffa05
credits for me
ecwood Sep 2, 2024
7860506
#387 removing unnecessary curies
ecwood Sep 2, 2024
0f56540
#387 no longer want biolink as an ontology source due to the parsing …
ecwood Sep 2, 2024
b3c8cea
#387 we have this predicate again
ecwood Sep 2, 2024
c36d3c3
#387 remove sed-ing from validation tests now that biolink is gone
ecwood Sep 2, 2024
9a3c22f
#387 #390
ecwood Sep 2, 2024
e94e431
#387 ordo actually included in ETL
ecwood Sep 2, 2024
c8c63de
#387 moving to its permanent home
ecwood Sep 2, 2024
2ec98bd
#387 moving owlparser to its permanent home
ecwood Sep 2, 2024
82e6acb
#387 don't need this one anymore
ecwood Sep 2, 2024
192039c
#387 #405 rethreading the pipeline for new ETL
ecwood Sep 2, 2024
d09ce05
#387 forgot to add the new extract
ecwood Sep 2, 2024
76b996b
#387 adjusting some of the variables for new pipelining
ecwood Sep 2, 2024
168071e
#387 adjusting for new pipelining syntax error
ecwood Sep 2, 2024
8baa763
#387 adjusting for new pipelining naming error
ecwood Sep 2, 2024
58d1bdd
#387 cleaning up the formatting of the new files
ecwood Sep 2, 2024
59c6192
#387 comments about the inner workings of ontologies conversion
ecwood Sep 2, 2024
b085055
#387 archiving multi ont
ecwood Sep 2, 2024
321981c
#387 archiving build multi ont
ecwood Sep 2, 2024
cea05b7
updating executability for newer files
ecwood Sep 2, 2024
fd482ba
#387 comments through owlparser
ecwood Sep 2, 2024
7bd5e8f
#387 adjusting for CHEBI issues
ecwood Sep 2, 2024
0804556
#387 want to remove old ontologies
ecwood Sep 2, 2024
a1a7c6e
#387 have to fully handle CHEBI
ecwood Sep 2, 2024
b0aee1c
#387 revising predicate remap for new ontology etl
ecwood Sep 2, 2024
139b1d5
#387 updating provided by to infores for new ontologies etl
ecwood Sep 2, 2024
ee09f49
#387 adding biolink version node in and correcting source node category
ecwood Sep 2, 2024
4f17e56
#392 edge blocklist logic implemented
ecwood Sep 2, 2024
e3f0f8e
#387 correcting the pipeline for new ontologies input
ecwood Sep 2, 2024
0e96110
#392 restringing pipeline for edge blocklist
ecwood Sep 2, 2024
c499f55
#387 correcting biolink version number code
ecwood Sep 2, 2024
36c63c6
#387 use correct dictionary to map IRI
ecwood Sep 2, 2024
d54a3d6
#387 correct variable names
ecwood Sep 2, 2024
59347ab
#387 can use shortened link now that we don't actually have to downlo…
ecwood Sep 2, 2024
f718815
#387 actually just change the biolink link to the repo
ecwood Sep 2, 2024
0699d84
#140 correct the filename
ecwood Sep 2, 2024
e104912
#387 pipelining issue thwarted
ecwood Sep 2, 2024
bd93687
#405 umls cleanup issue
ecwood Sep 2, 2024
3488401
#408 #398 curl problems
ecwood Sep 2, 2024
848a24f
#408 issue with DisGeNET download
ecwood Sep 2, 2024
e7d76d6
#408 typo for download
ecwood Sep 2, 2024
7edd988
#408 download SMPDB while link is failing
ecwood Sep 2, 2024
5cef56a
#408 cURL issue with HMDB
ecwood Sep 3, 2024
893fb71
#408 build issue with knowledge_source node curies
ecwood Sep 4, 2024
b37ee73
#408 missing predicate
ecwood Sep 4, 2024
2579972
#408 bucket problem
ecwood Sep 8, 2024
08bd4c5
#408 kg2-versions entry for KG2.10.1
ecwood Sep 8, 2024
b158cc5
#408 rest of kg2-versions entry for KG2.10.1
ecwood Sep 8, 2024
#387 correcting biolink version number code
ecwood committed Sep 2, 2024
commit c499f551efec02e74e3e5a506f21867d8775523e
3 changes: 3 additions & 0 deletions convert/ontologies_jsonl_to_kg_jsonl.py
@@ -588,6 +588,9 @@ def construct_nodes_and_edges(nodes_output, edges_output):
     for ontology_item in input_data:
         process_ontology_item(ontology_item)

+    # Save the Biolink node information before processing
+    save_biolink_information(biolink_version_number)
+
     # Categorize every node and save the information in the information dictionary for the node
     for node_id in SAVED_NODE_INFO:
         categorize_node(node_id)
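
The ordering in this hunk can be illustrated with a minimal sketch. It assumes that `save_biolink_information` registers a Biolink version node in `SAVED_NODE_INFO`, so that the categorization loop that follows also visits that node. The function bodies, the `BIOLINK_NODE_ID` placeholder, the record layout, and the simplified `construct_nodes_sketch` signature are all hypothetical stand-ins; the real implementations in `convert/ontologies_jsonl_to_kg_jsonl.py` are not shown in this diff and may differ.

```python
# Hypothetical stand-ins only: the real function bodies are not part of this diff.
SAVED_NODE_INFO = {}

# Placeholder identifier for the Biolink version node (assumed, not the real CURIE).
BIOLINK_NODE_ID = "BIOLINK_VERSION_NODE"


def process_ontology_item(ontology_item):
    # Assumed behavior: store each ontology item under its id, not yet categorized.
    SAVED_NODE_INFO[ontology_item["id"]] = {
        "name": ontology_item["name"],
        "category": None,
    }


def save_biolink_information(biolink_version_number):
    # Assumed behavior: register a node describing the Biolink release in use.
    SAVED_NODE_INFO[BIOLINK_NODE_ID] = {
        "name": "Biolink " + biolink_version_number,
        "category": None,
    }


def categorize_node(node_id):
    # Assumed behavior: fill in a category for an already-saved node.
    SAVED_NODE_INFO[node_id]["category"] = "placeholder_category"


def construct_nodes_sketch(input_data, biolink_version_number):
    # Simplified signature for illustration; the real function takes output handles.
    for ontology_item in input_data:
        process_ontology_item(ontology_item)

    # Save the Biolink node before categorizing, so the loop below also covers it.
    save_biolink_information(biolink_version_number)

    for node_id in SAVED_NODE_INFO:
        categorize_node(node_id)

    return SAVED_NODE_INFO


if __name__ == "__main__":
    nodes = construct_nodes_sketch([{"id": "EX:1", "name": "example"}], "4.2.1")
    for node_id, info in nodes.items():
        print(node_id, info)
```

Under these assumptions, calling `save_biolink_information` after the categorization loop would leave the Biolink node with no category, which is one plausible reason for placing the call where this commit does.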