* updated READMEs & docs & new workflow figure
* updated contributing guidelines
* made great changes to catch and phrasematcher() is much faster w/ lemma
* updated stopwords feature
* wordcloud
* tfidf plot nicer
* highlevel file for functions that shouldn't be changed
* added params py file for each section so users shouldn't need to edit core function files
* cleaning up code for v3 release
Showing 69 changed files with 1,258 additions and 1,937 deletions.
@@ -0,0 +1,18 @@
# Changelog

* **v1.0.0** [29/06/2020]
  - version presented in **JOSS** paper
* **v2.0.0** [10/05/2021]
  - includes `spacy PhraseMatcher`
  - users can provide their own annotation tags
  - plotting tf-idf
* **v3.0.0** [12/06/2024]
  - major version change, as the whole repository has been redesigned
  - updated scripts for usability so users only need to edit a params file
  - high-level script to import text cleaning & stop words
  - updated annotation script for stop words and lemma
  - plotting wordcloud

***

End of page
@@ -0,0 +1,46 @@
# Contributing Guidelines / Issues for Jabberwocky :dragon_face:

* Users are welcome to contribute work to this project and encouraged to create an [`Issue`](https://github.com/sap218/jabberwocky/issues).
* In either circumstance, please ensure titles/descriptions have as much information as possible, e.g. if creating a Bug Issue, try to retrace your steps w/ details & error messages.
* The primary maintainer(s) - currently [@sap218](https://github.com/sap218) - will address the request!
* Maintainers will always try their best to meet the needs of the user, while also considering what is best for **Jabberwocky**

## Contributing Code
* Users intending to contribute to this repository can open a **Pull request**
* Frequent contributors will be added to a contributors list for thanks and acknowledgement
* **Note**: please include as much information as possible, including comments in the code (w/ username to acknowledge the contribution), e.g.

```
print("hello world") # example comment as a reference - @yourusername
```

## Issues
* Please don't hesitate to create an **Issue**
* Issues can relate to anything: error reporting, feature requests, or questions for help, e.g.
  * If the `README` isn't clear, please do report this - I encourage suggestions!
* Issues will be labelled accordingly - see below for [`label`](https://github.com/sap218/jabberwocky/labels) information:

#### bug
* error reporting - any user problems will be tagged with `bug`

#### documentation
* pertains to Issues that relate to documentation, e.g. a request to expand/clarify a point in the `README`

#### duplicate
* although a question may have been asked before, don't hesitate to ask anyway - the Issue will be labelled as `duplicate` and will reference the solved/closed Issue

#### help
* can cover a wide range of Issues - from generic to specific code-related Qs

#### request
* this label highlights a user's request

#### wontfix
* there may be circumstances in which an Issue *shouldn't* be fixed
* for example, a user might run into a particular bug but a fix wouldn't make sense for Jabberwocky's scope
* the maintainer(s) will comment on why this label is applied, giving users some time (perhaps to rebut) before closing the Issue
* any other help will be provided (e.g. linking to a different tool)

***

End of page
@@ -1,21 +1,22 @@
MIT License

Copyright (c) 2019 Samantha C Pendleton
Copyright (c) 2020-present Samantha Pendleton | Jabberwocky

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -1,114 +1,66 @@
# Jabberwocky

[![DOI](https://joss.theoj.org/papers/10.21105/joss.02168/status.svg)](https://doi.org/10.21105/joss.02168) [![DOI](https://zenodo.org/badge/227571502.svg)](https://zenodo.org/badge/latestdoi/227571502)

**see [Jabberwocky site](https://sap218.github.io/jabberwocky/) for in-depth explanation and working scenarios (including test files)**

Jabberwocky is a toolkit for **ontologies**, since we all know ontologies are "*nonsense*". Not enough existing tools utilise the power of ontologies. Don't hesitate to create an [`issue`](https://github.com/sap218/jabberwocky/issues) or [`pull request`](https://github.com/sap218/jabberwocky/pulls) (see [**guidelines**](https://github.com/sap218/jabberwocky/blob/master/CONTRIBUTING.md) first).

#### Version

See `setup.py` in your local copy for the version number, or see `Releases`:
* **v1.0.0.0** [29/06/2020]
* **v2.0.0.0** [10/05/2021]
  - includes `spacy PhraseMatcher`
  - own synonym tags
  - plot output for tf-idf

##### Install
```
$ git clone https://github.com/sap218/jabberwocky
$ cd jabberwocky
$ python3 setup.py install --user
```
**note**: if you are using a virtual environment you can avoid `--user`

##### Prerequisites
```
$ pip3 install click BeautifulSoup4 scikit-learn pandas lxml pytest spacy matplotlib
```
or **after installing**, use the `requirements.txt` file:
```
$ pip3 install -r requirements.txt
```

#### Elements

command | description
------- | -----------
`bandersnatch` | extract synonyms from an RDF/XML syntax `OWL` ontology
`catch` | extract elements / sentences of text using key words
`bite` | run statistical tf-idf for important words from text
`arise` | adding / updating new synonyms to an ontology

#### Ontology formats
`jabberwocky` works with the `OWL` ontology format: `RDF/XML` - for example, well-known biomedical ontologies such as `doid.owl`, `hpo.owl`, and `uberon.owl` will all work, plus any you have created yourself.

#### Examples
For examples of Jabberwocky's commands in use, please see the **[site](https://sap218.github.io/jabberwocky/SCENARIO.html)**.

**OR** to run the automated tests (in the cloned directory):
```
$ git submodule init
$ git submodule update
$ tox
```

---

## bandersnatch
`bandersnatch` curates synonyms for a list of key terms / words of interest from an ontology of your choice; you provide a list of ontology synonym tags. **note**: it is recommended that your list of keywords matches the classes from your chosen ontology exactly (all in lowercase).
```
$ jab-bandersnatch -o hpo.owl -s ontology_synonym_tags.txt -k words_of_interest.txt
```
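
To give a feel for what extracting synonyms from an RDF/XML `OWL` file involves, here is a minimal sketch using only the Python standard library. It is not Jabberwocky's own implementation: the function name `extract_synonyms`, the inline toy ontology, and the use of `oboInOwl:hasExactSynonym` as the synonym tag are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly found in RDF/XML OWL ontologies.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}

# Tiny made-up ontology standing in for a real file such as hpo.owl.
OWL_TEXT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/onto#0001">
    <rdfs:label>headache</rdfs:label>
    <oboInOwl:hasExactSynonym>cephalalgia</oboInOwl:hasExactSynonym>
    <oboInOwl:hasExactSynonym>head pain</oboInOwl:hasExactSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/onto#0002">
    <rdfs:label>fever</rdfs:label>
    <oboInOwl:hasExactSynonym>pyrexia</oboInOwl:hasExactSynonym>
  </owl:Class>
</rdf:RDF>
"""


def extract_synonyms(owl_text, keywords, synonym_tags):
    """Map each keyword to the synonyms of the class whose label matches it."""
    root = ET.fromstring(owl_text)
    wanted = {k.lower() for k in keywords}
    result = {}
    for cls in root.findall("owl:Class", NS):
        label = cls.find("rdfs:label", NS)
        if label is None or label.text is None:
            continue
        name = label.text.strip().lower()
        if name not in wanted:
            continue
        found = []
        for tag in synonym_tags:
            found.extend(el.text.strip() for el in cls.findall(tag, NS) if el.text)
        result[name] = found
    return result


synonyms = extract_synonyms(OWL_TEXT, ["headache"], ["oboInOwl:hasExactSynonym"])
print(synonyms)
```

Matching on the lowercase class label mirrors the note above about keeping keywords identical to the ontology's classes.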

## catch
`catch` essentially "catches" key elements / sentences from textual data using a `.json` of key terms and their synonyms; you can use the output from `bandersnatch`. A user will also provide a `.txt` or `.json` of the text data. **note**: if a `.json` of text data is provided, you need to specify the parameter for the field that contains the textual data to process.
```
$ jab-catch -k label_with_synonyms.json -t facebook_posts.json -p user-comment -i inner-user-comment-reply
```
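
The idea of "catching" sentences can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses regular expressions rather than whatever matcher Jabberwocky uses internally, and the function name `catch_sentences` and the sample data are hypothetical.

```python
import re


def catch_sentences(text, terms):
    """Return (label, sentence) pairs for sentences mentioning a label or one of its synonyms."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    hits = []
    for sentence in sentences:
        for label, synonyms in terms.items():
            phrases = [label] + list(synonyms)
            # Whole-word, case-insensitive match of any phrase.
            pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                hits.append((label, sentence))
    return hits


# Same shape as a label-with-synonyms JSON: {label: [synonym, ...]}.
terms = {"headache": ["cephalalgia", "head pain"]}
posts = "I have had head pain all week. The weather is nice! Is a headache normal?"

matches = catch_sentences(posts, terms)
for label, sentence in matches:
    print(label, "->", sentence)
```

Each hit keeps the ontology label alongside the sentence, so a sentence that only mentions a synonym is still reported under its key term.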

## bite
`bite` runs a tf-idf statistical analysis, searching for important terms in a text corpus. A user can provide a list of key terms to remove from the text so that they are excluded from the statistical model, meaning other terms may be ranked higher. **note**: as with `catch`, if you provide a `.json` of text data, you need to specify the field that contains the textual data to process. Using `-g True` means you'll get a bar plot of the top 30 terms (the default).
```
$ jab-bite -k label_with_synonyms.json -t twitter_posts.txt -g True
```
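
For readers unfamiliar with tf-idf, the following is a minimal pure-Python sketch of the ranking idea, including the removal of known key terms before scoring. It uses the textbook formulation (term frequency times `log(N / document frequency)`), which is an assumption here: library implementations such as scikit-learn apply smoothing and normalisation, so actual scores from `bite` will differ. The function name `tfidf_rank` and the sample posts are hypothetical.

```python
import math
from collections import Counter


def tfidf_rank(documents, remove_terms=()):
    """Rank words by their summed TF-IDF score across all documents."""
    removed = {t.lower() for t in remove_terms}
    # Lowercase, whitespace-tokenise, and drop the terms we already know about.
    tokenised = [
        [w for w in doc.lower().split() if w not in removed]
        for doc in documents
    ]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    scores = Counter()
    for tokens in tokenised:
        counts = Counter(tokens)
        for word, count in counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    return scores.most_common()


posts = ["headache again today", "bad headache and nausea", "nausea again"]
ranked = tfidf_rank(posts, remove_terms=["headache"])
for word, score in ranked:
    print(f"{word}\t{score:.3f}")
```

Because `headache` is removed before scoring, it cannot appear in the ranking, which leaves room for candidate new terms to surface.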

## arise
`arise` inserts synonyms in an ontology: **you** define these synonyms (e.g. "exact", "broad", "related", or "narrow") - these new synonyms may be based on the tf-idf statistical analysis from `bite`.
```
$ jab-arise -o pocketmonsters.owl -f tfidf_new_synonyms.tsv
```
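
Adding a synonym to a class in an RDF/XML ontology can be sketched with the standard library as below. This is an illustrative assumption rather than how `arise` is implemented: the function name `add_synonym`, the toy ontology, and the use of the `oboInOwl` synonym properties (`hasExactSynonym`, `hasBroadSynonym`, and so on) are choices made for the example.

```python
import xml.etree.ElementTree as ET

RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
OBO = "http://www.geneontology.org/formats/oboInOwl#"

# Tiny made-up ontology with a single class and no synonyms yet.
OWL_TEXT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/onto#0001">
    <rdfs:label>headache</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""


def add_synonym(root, label, synonym, tag="hasExactSynonym"):
    """Attach a synonym element to the class with the given label.

    Returns True if a matching class was found, False otherwise.
    """
    for cls in root.iter(f"{{{OWL}}}Class"):
        lab = cls.find(f"{{{RDFS}}}label")
        if lab is None or not lab.text:
            continue
        if lab.text.strip().lower() != label.lower():
            continue
        existing = {el.text for el in cls.findall(f"{{{OBO}}}{tag}")}
        if synonym not in existing:  # avoid inserting duplicates
            new = ET.SubElement(cls, f"{{{OBO}}}{tag}")
            new.text = synonym
        return True
    return False


root = ET.fromstring(OWL_TEXT)
found = add_synonym(root, "headache", "head pain", tag="hasExactSynonym")
added = [el.text for el in root.iter(f"{{{OBO}}}hasExactSynonym")]
print(found, added)
```

Calling `add_synonym` a second time with the same arguments leaves the ontology unchanged, so re-running an update does not duplicate annotations.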

---

## Thanks! :dragon:

The poem "Jabberwocky", written by Lewis Carroll, is described as a "nonsense" poem.

**Contributors** - thank you!
- [@majensen](https://github.com/majensen) for setting up automated testing w/ `pytest` - [see pull request #13 for more details](https://github.com/sap218/jabberwocky/pull/13)

**Citing**
```
@article{Pendleton2020,
doi = {10.21105/joss.02168},
url = {https://doi.org/10.21105/joss.02168},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {51},
pages = {2168},
author = {Samantha C. Pendleton and Georgios V. Gkoutos},
title = {Jabberwocky: an ontology-aware toolkit for manipulating text},
journal = {Journal of Open Source Software}
}
```

---

## ONE LAST THING...

You can combine these commands into a step-by-step process of ontology synonym development and text analysis - see the [SCENARIO](https://sap218.github.io/jabberwocky/SCENARIO.html) for a working example of this process.

![jabberwocky cycle](/images/cycle.jpg)
# Jabberwocky

[![DOI](https://joss.theoj.org/papers/10.21105/joss.02168/status.svg)](https://doi.org/10.21105/joss.02168)

Jabberwocky is a toolkit for NLP and **ontologies**, since we all know ontologies are *nonsense*.

## Functionality

Read the [documentation](https://sap218.github.io/jabberwocky/) for more detail.

script | description
------- | -----------
`bandersnatch` | extract annotations from an ontology (`OWL` RDF/XML syntax)
`catch` | text mining (grep) a corpus using key words/phrases
`bite` | TF-IDF for ranking important terms from corpus
`arise` | adding / updating ontology concepts with new annotations

## Running

Within each directory, there is a file `params_*.py` which users can edit.
This means users shouldn't need to edit the main/primary script.

Check the individual `READMEs` for parameter information.

## Workflow

When combining these Jabberwocky functions, users can create an NLP workflow.

![workflow](/docs/workflow.png)

#### Prerequisites

Check [`requirements.py`](https://github.com/sap218/jabberwocky/blob/master/requirements.py) for a list of packages and versions.

#### AOB

- Check the [**Changelog**](https://github.com/sap218/jabberwocky/blob/master/changelog.md) for version information
- [License](https://github.com/sap218/jabberwocky/blob/master/LICENSE) is **MIT**
- The poem, Jabberwocky, written by Lewis Carroll, is described as a "nonsense" poem :dragon:

## Contributing

Please read the [**Contributing Guidelines**](https://github.com/sap218/jabberwocky/blob/master/contributing.md).

- [@majensen](https://github.com/majensen) set up automated testing w/ `pytest` in v1.0 - see [pull request #13](https://github.com/sap218/jabberwocky/pull/13) for more details

## Citing

```
@article{Pendleton2020,
doi = {10.21105/joss.02168},
url = {https://doi.org/10.21105/joss.02168},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {51},
pages = {2168},
author = {Samantha C. Pendleton and Georgios V. Gkoutos},
title = {Jabberwocky: an ontology-aware toolkit for manipulating text},
journal = {Journal of Open Source Software}
}
```

***

End of page
@@ -0,0 +1,12 @@
# README - `arise`

## `ontology_name`
- ontology file (+path)

## `annotation_file`
- file of annotations
- can be either `.tsv` or `.csv`
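
As a rough idea of how these two parameters might look in a params file, here is a hedged sketch. Only the names `ontology_name` and `annotation_file` come from this README; the paths, the extension check, and the `delimiter` variable are assumptions added for illustration and may not match the real `params_*.py`.

```python
from pathlib import Path

# Hypothetical values: point these at your own files.
ontology_name = "ontologies/pocketmonsters.owl"   # ontology file (+path)
annotation_file = "annotations/new_synonyms.tsv"  # file of annotations (.tsv or .csv)

# Illustrative sanity check: pick a delimiter from the file extension.
suffix = Path(annotation_file).suffix.lower()
if suffix not in {".tsv", ".csv"}:
    raise ValueError(f"annotation_file must be .tsv or .csv, got {suffix!r}")
delimiter = "\t" if suffix == ".tsv" else ","

print(ontology_name, annotation_file, repr(delimiter))
```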

***

End of page