CORD-19 Dataset antiviral compounds #26

Open
motey opened this issue Apr 3, 2020 · 8 comments
Labels
Priority: Low This issue can be dealt with in the long term or is on hold Status: Suggested This issue is a suggestion for doing something new or different in CovidGraph Tag: Help Wanted Extra attention is needed Type: Data Source To identify an issue as a data source

Comments

@motey
Member

motey commented Apr 3, 2020

With version 5 of the CORD19 dataset a list of Anti-Viral Candidate Compounds was included.

Actually, I just realized that this data was attached by the team working on a Python tool around the CORD-19 dataset: https://github.com/josephsdavid/cord-19-tools
At the moment there is no documentation on where they got this data from.

The attached readme says the following:

CAS COVID-19 Anti-Viral Candidate Compounds Readme

* The dataset includes 49,437 anti-viral candidate compounds (known and similar) created from the  CAS REGISTRY of chemical substances;
* The dataset is being provided in SDfile format, which includes the complete Molfile representation and other information such as cas.rn, cas.index.name, molecular.formula, molecular.weight, melting.point.experimental, and other property data. 
* Dataset entities represent known anti-viral drugs and related chemical substances that are structurally similar to a known anti-viral.

We have one huge file containing a long list of compounds, each described in MOL format.
One entry looks like this:

Cobicistat
C40H53N7O5S2
1004316-88-4 Copyright (C) 2020 ACS
 54 58  0  0  1  0  0  0  0  0999 V2000
48827.327914512.9573    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
48827.3279 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
44637.796116931.7835    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
53016.859716931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
40448.266814512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386614512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737416931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
40448.2668 9675.3049    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
61395.913516931.7835    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
36258.737421769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208114512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
61395.913521769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
65585.445314512.9573    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
32069.208124188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676316931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
69774.977116931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.146914512.9598    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676321769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
73964.508914512.9573    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
69774.977121769.4359    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
19500.617616931.7860    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
23690.1469 9675.3073    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
78154.035816931.7835    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 2843.498713391.2040    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  875.850117810.6207    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000 9477.4613    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 7256.4786    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.8597 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866 9675.3049    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.3866    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 7256.4811    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
61395.9135 2418.8262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386624188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
57206.386629025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859721769.4359    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
53016.859731444.7408    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327924188.2622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
48827.327929025.9146    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208129025.9170    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
27879.676331444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737431444.7433    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
27879.676336282.3957    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
36258.737436282.3932    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
32069.208138701.2219    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
15311.088314512.9598    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10891.671516480.6083    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
14805.4142 9701.8075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
 7654.650912885.5324    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
10073.4772 8696.0031    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
82343.562714512.9573    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
86762.979416480.6083    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
82849.2367 9701.8075    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
90000.000012885.5324    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
87581.1738 8696.0031    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  6  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  2 27  1  0  0  0  0
  3  5  1  0  0  0  0
  4  6  1  0  0  0  0
  5  7  1  0  0  0  0
  5  8  2  3  0  0  0
  6  9  1  0  0  0  0
  7 10  1  1  0  0  0
  7 11  1  0  0  0  0
  9 12  1  6  0  0  0
  9 13  1  0  0  0  0
 10 14  1  0  0  0  0
 11 15  1  0  0  0  0
 12 33  1  0  0  0  0
 13 16  1  0  0  0  0
 14 39  1  0  0  0  0
 15 17  1  0  0  0  0
 15 18  2  3  0  0  0
 16 19  1  0  0  0  0
 16 20  2  3  0  0  0
 17 21  1  0  0  0  0
 17 22  1  0  0  0  0
 19 23  1  0  0  0  0
 21 45  1  0  0  0  0
 23 50  1  0  0  0  0
 24 25  1  0  0  0  0
 24 26  1  0  0  0  0
 24 48  1  0  0  0  0
 27 28  2  0  0  0  0
 27 29  1  0  0  0  0
 28 30  1  0  0  0  0
 29 31  2  0  0  0  0
 30 32  2  0  0  0  0
 31 32  1  0  0  0  0
 33 34  2  0  0  0  0
 33 35  1  0  0  0  0
 34 36  1  0  0  0  0
 35 37  2  0  0  0  0
 36 38  2  0  0  0  0
 37 38  1  0  0  0  0
 39 40  1  0  0  0  0
 39 41  1  0  0  0  0
 40 42  1  0  0  0  0
 41 43  1  0  0  0  0
 42 44  1  0  0  0  0
 43 44  1  0  0  0  0
 45 46  1  0  0  0  0
 45 47  2  0  0  0  0
 46 48  2  0  0  0  0
 47 49  1  0  0  0  0
 48 49  1  0  0  0  0
 50 51  2  0  0  0  0
 50 52  1  0  0  0  0
 51 53  1  0  0  0  0
 52 54  1  0  0  0  0
 53 54  2  0  0  0  0
M  END
> <cas.rn>
1004316-88-4

> <cas.index.name>
2,7,10,12-Tetraazatridecanoic acid, 12-methyl-13-[2-(1-methylethyl)-4-thiazolyl]-9-[2-(4-morpholinyl)ethyl]-8,11-dioxo-3,6-bis(phenylmethyl)-, 5-thiazolylmethyl ester, (3R,6R,9S)-

> <molecular.formula>
C40H53N7O5S2

> <molecular.weight>
776.02

> <boiling.point.predicted>
974.5±65.0 °C    Press: 760 Torr

> <density.predicted>
1.228±0.06 g/cm3    Temp: 20 °C; Press: 760 Torr

> <pka.predicted>
11.86±0.46    Most Acidic Temp: 25 °C
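
To make the record layout above concrete, here is a minimal sketch of reading such an SD file with the Python standard library only. It assumes the usual SDfile conventions: the molblock ends at `M  END`, data items start with a `> <tag>` header line, and records are separated by `$$$$` (the separator is not visible in the excerpt above). The function name `parse_sd_records` and the shortened `SAMPLE` record (atom and bond lines omitted) are illustrative, not part of the dataset or of any library.

```python
import re
from typing import Dict, Iterator, List

# Header line of a data item, e.g. "> <cas.rn>" -> tag "cas.rn".
_DATA_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def parse_sd_records(text: str) -> Iterator[Dict[str, object]]:
    """Yield one dict per SD record: name, raw molblock, and data fields."""
    text = text.replace("\r\n", "\n")
    for chunk in text.split("$$$$"):
        # Drop only the newline that follows the previous "$$$$" separator,
        # so a blank first (name) line inside a record is preserved.
        if chunk.startswith("\n"):
            chunk = chunk[1:]
        lines = chunk.splitlines()
        if not any(line.strip() for line in lines):
            continue
        end = next(
            (i for i, line in enumerate(lines) if line.startswith("M  END")),
            None,
        )
        if end is None:
            continue  # not a complete molblock, skip it
        fields: Dict[str, List[str]] = {}
        tag = None
        for line in lines[end + 1:]:
            header = _DATA_HEADER.match(line)
            if header:
                tag = header.group(1)
                fields[tag] = []
            elif not line.strip():
                tag = None  # a blank line closes the current data item
            elif tag is not None:
                fields[tag].append(line)
        yield {
            "name": lines[0].strip(),
            "molblock": "\n".join(lines[: end + 1]),
            "fields": {key: "\n".join(value) for key, value in fields.items()},
        }


# Shortened, illustrative record: atom and bond lines are left out.
SAMPLE = "\n".join([
    "Cobicistat",
    "C40H53N7O5S2",
    "1004316-88-4 Copyright (C) 2020 ACS",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <cas.rn>",
    "1004316-88-4",
    "",
    "> <molecular.weight>",
    "776.02",
    "",
    "$$$$",
    "",
])

if __name__ == "__main__":
    for record in parse_sd_records(SAMPLE):
        print(record["name"], record["fields"])
```

All values are kept as strings; predicted properties such as `boiling.point.predicted` carry units and conditions in the same field, so they would need their own parsing.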

Is this data interesting for our scope?
@mpreusse : "YES"

How can we connect this data to our graph?
As discussed with @mpreusse, we could search for the compound name ("Cobicistat" in the example above) or the molecular formula ("C40H53N7O5S2") in the :PatentAbstract{text} property and connect the matches.
Other ideas are welcome.
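
A rough sketch of that text-matching idea, done locally in Python before anything is written to the graph. It assumes we already have a mapping from CAS number to compound name and a mapping from abstract id to abstract text; the function name `find_mentions` and the sample abstracts are made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def find_mentions(
    compound_names: Dict[str, str],
    abstracts: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (cas_rn, abstract_id) pairs where the name occurs as a whole word."""
    patterns = {
        cas_rn: re.compile(
            r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE
        )
        for cas_rn, name in compound_names.items()
    }
    mentions = []
    for abstract_id, text in abstracts.items():
        for cas_rn, pattern in patterns.items():
            if pattern.search(text):
                mentions.append((cas_rn, abstract_id))
    return mentions


if __name__ == "__main__":
    names = {"1004316-88-4": "Cobicistat"}
    # Hypothetical abstract texts, only for demonstration.
    texts = {
        "p1": "A combination of darunavir and cobicistat was tested.",
        "p2": "This abstract does not mention the compound.",
    }
    print(find_mentions(names, texts))
```

This checks every name against every abstract, so with 49,437 compounds it will be slow on a large corpus. Inside Neo4j a comparable check could be written with `toLower(a.text) CONTAINS toLower($name)`, at the cost of also matching substrings of longer words.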

@dkrizic

dkrizic commented Apr 3, 2020

I have worked with compounds in Neo4j. In one project we created a molecular substructure search as a plugin for Neo4j that supported MOL V2000 and SMILES/SMARTS. Is this what you are asking about? I already mentioned this to @mpreusse.

@motey
Member Author

motey commented Apr 3, 2020

Sounds good. Is this plugin open source?

@dkrizic

dkrizic commented Apr 3, 2020

No, the plugin is not open source. There are multiple versions and iterations. The first one was based on the EPAM Indigo library (https://lifescience.opensource.epam.com/indigo/), which worked fine, but we always had clashes with the dependent libraries and class loader issues in Neo4j. We had to switch to another software vendor for a chemical library.

But... I suggest that we implement the following:

  • We store the compounds in Neo4j
    -- One node per compound
    -- The whole MOL or SMILES string in a property of that node
  • I will implement a Microservice that uses the Indigo library and offers a REST interface to do substructure search and similarity
  • I will write a plugin for Neo4j that integrates that service
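
For the first bullet (one node per compound, with the whole MOL string as a property), a minimal loading sketch in Python could look like the following. The `:Compound` label, the property names `casRn`, `name` and `molfile`, and the input record shape are assumptions made for this example and are not an existing CovidGraph schema. Using the CAS number as merge key is also only a placeholder; a later comment in this thread argues for InChIKeys instead.

```python
from typing import Dict, Iterable, List

# Cypher statement for a batched upsert; label and property names are assumed.
MERGE_COMPOUNDS = """
UNWIND $rows AS row
MERGE (c:Compound {casRn: row.casRn})
SET c.name = row.name, c.molfile = row.molfile
"""


def compound_rows(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn parsed records into parameter rows, one per distinct CAS number."""
    rows = []
    seen = set()
    for record in records:
        cas_rn = record.get("cas.rn")
        if not cas_rn or cas_rn in seen:
            continue  # skip records without a key and repeated keys
        seen.add(cas_rn)
        rows.append({
            "casRn": cas_rn,
            "name": record.get("name", ""),
            "molfile": record.get("molblock", ""),
        })
    return rows


if __name__ == "__main__":
    parsed = [
        {"cas.rn": "1004316-88-4", "name": "Cobicistat", "molblock": "..."},
        {"cas.rn": "1004316-88-4", "name": "Cobicistat", "molblock": "..."},
    ]
    print(compound_rows(parsed))
```

With the official Neo4j Python driver the rows would then be sent as `session.run(MERGE_COMPOUNDS, rows=rows)`, ideally in batches of a few thousand rows.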

It would help me to understand what we need "in a chemical way".

@motey
Member Author

motey commented Apr 3, 2020

Do we need/want a substructure search function for the application/end users?
Or do we just want to fingerprint the molecules once and connect them?
If the latter, what about using the Indigo library in a pre-processing step (in Python, for example, with https://pypi.org/project/epam.indigo/)?
From my (naive) point of view that looks a lot less complex, less error-prone and maybe faster, as it would process the data locally.
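
To illustrate the "fingerprint once and connect" variant, here is a small sketch of the connection step in plain Python. It assumes the fingerprints were already computed by a chemistry toolkit (Indigo, RDKit or similar) and are available as sets of on-bit positions; the function names and the threshold value are illustrative choices, not something fixed by the thread.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_pairs(
    fingerprints: Dict[str, FrozenSet[int]],
    threshold: float = 0.7,
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    pairs = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        score = tanimoto(fp_a, fp_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


if __name__ == "__main__":
    # Toy fingerprints, not derived from real molecules.
    toy = {
        "a": frozenset({1, 2, 3, 4}),
        "b": frozenset({2, 3, 4, 5}),
        "c": frozenset({9}),
    }
    print(similar_pairs(toy, threshold=0.5))
```

Each resulting pair could become a similarity relationship between two compound nodes. Note that comparing all 49,437 compounds pairwise means roughly 1.2 billion comparisons, so a real pre-process would probably use the toolkit's own bulk similarity functions rather than this loop.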

@seangrant82

Would this be helpful for linking what is in CORD-19 to this dataset: https://www.cas.org/covid-19-antiviral-compounds-dataset?utm_source=hootsuite&utm_medium=linkedin&utm_term=&utm_content=9a9f1234-6bd2-4673-9436-bb49800209ca&utm_campaign=COVID-19

@bramble50

For processing molecules I would recommend RDKit (https://github.com/rdkit/rdkit) rather than Indigo; it is much better supported and has a wider variety of functionality.

I suspect it will be enough just to connect the molecules to the existing data as compound nodes, ideally via compound->publication_id. You can always add features like similarity/substructure searching afterwards.

When adding or searching for molecules, I would stay away from non-unique keys such as the chemical formula (unless it is just metadata on the node). Even the CAS numbers used in the above example are not unique or persistent. If there is no unique identifier from the source database, or you want to add multiple chemical sources in the future, the best approach is to calculate InChIs or InChIKeys from the molfiles/SDFs above using RDKit (ideally with a molecular standardisation step as well).
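
A minimal sketch of that suggestion, assuming RDKit is installed with InChI support (the default in the common builds). `Chem.MolFromMolBlock` and `Chem.MolToInchiKey` are RDKit calls; the helper names `inchikey_from_molblock` and `is_inchikey` are made up for this example, and no standardisation step is included.

```python
import re
from typing import Optional

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_inchikey(value: str) -> bool:
    """Cheap format check, useful to reject formulas or CAS numbers used as keys."""
    return bool(_INCHIKEY.match(value))


def inchikey_from_molblock(molblock: str) -> Optional[str]:
    """Compute the InChIKey for one molblock, or None if RDKit cannot parse it."""
    # Imported lazily so the format check above works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromMolBlock(molblock)
    if mol is None:
        return None
    return Chem.MolToInchiKey(mol)


if __name__ == "__main__":
    print(is_inchikey("C40H53N7O5S2"))   # molecular formula, not a key
    print(is_inchikey("1004316-88-4"))   # CAS number, not a key
```

The resulting key could replace the CAS number or formula as the merge key of a compound node, which should also make it easier to line up entries from ChEMBL or PubChem later.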

Other datasets I would look at for mapping include ChEMBL, PubChem (although it's a bit noisy) and SureChEMBL (ChEMBL for patents). I can create extra issues for these if needed.

@motey
Member Author

motey commented Apr 8, 2020

Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit
(Did not read it, just stumbled on it and skimmed. Reminder: have a closer look.)
@sarmbruster was involved in that too

@sarmbruster

sarmbruster commented Apr 8, 2020

> Maybe interesting as well: https://github.com/rdkit/neo4j-rdkit
> (Did not read it, just stumbled on it and skimmed. Reminder: have a closer look.)
> @sarmbruster was involved in that too

There's a lightning talk recording by the author of the plugin, see https://neo4j.com/online-summit/session/rdkit-neo4j-integration. I've just acted as a mentor.

@Jiros Jiros changed the title If/How to integrate CORD-19 Dataset antiviral compounts CORD-19 Dataset antiviral compounts Jul 10, 2020
@Jiros Jiros changed the title CORD-19 Dataset antiviral compounts CORD-19 Dataset antiviral compounds Jul 10, 2020
@Jiros Jiros transferred this issue from covidgraph/documentation Dec 7, 2020
@Jiros Jiros added Type: Data Source To identify an issue as a data source Status: Suggested This issue is a suggestion for doing something new or different in CovidGraph Priority: Low This issue can be dealt with in the long term or is on hold Tag: Help Wanted Extra attention is needed labels Dec 7, 2020