Jabberwocky v3.0 (#15)
* updated READMEs & docs & new workflow figure
* updated contributing guidelines 
* made great changes to catch and phrasematcher() is much faster w/ lemma
* updated stopwords feature
* wordcloud
* tfidf plot nicer
* highlevel file for functions that shouldn't be changed
* added params py file for each section so users shouldn't need to edit core function files
* cleaning up code for v3 release
sap218 authored Jun 12, 2024
1 parent 0320ca4 commit 7f09b2a
Showing 69 changed files with 1,258 additions and 1,937 deletions.
3 changes: 0 additions & 3 deletions .gitmodules

This file was deleted.

52 changes: 0 additions & 52 deletions CONTRIBUTING.md

This file was deleted.

18 changes: 18 additions & 0 deletions Changelog.md
@@ -0,0 +1,18 @@
# Changelog

* **v1.0.0** [29/06/2020]
- version presented in **JOSS** paper
* **v2.0.0** [10/05/2021]
- includes `spacy PhraseMatcher`
- users can provide their own annotation tags
- plotting tf-idf
* **v3.0.0** [12/06/2024]
- major version change, as the whole repository has been redesigned
- updated scripts for usability so users only need to edit a params file
- high-level script to import text cleaning & stop words
- updated annotation script for stop words and lemma
- plotting wordcloud

***

End of page
46 changes: 46 additions & 0 deletions Contributing.md
@@ -0,0 +1,46 @@
# Contributing Guidelines / Issues for Jabberwocky :dragon_face:

* Users are welcome to contribute work to this project and encouraged to create an [`Issue`](https://github.com/sap218/jabberwocky/issues).
* In either circumstance, please ensure titles/descriptions have as much information as possible, e.g. if creating a Bug Issue, try and trace your steps w/ details & error messages.
* The primary maintainer(s) - currently [@sap218](https://github.com/sap218) - will address the request!
* Maintainers will always try their best to meet the needs of the user while also considering what is best for **Jabberwocky**

## Contributing Code
* Users intending to contribute to this repository can open a **Pull request**
* Frequent contributors will be added to a contributors list for thanks and acknowledgement
* **Note**: as much information as possible is desired including code having comments (w/ username to acknowledge contribution), e.g.

```
print("hello world") # example comment as a reference - @yourusername
```

## Issues
* Please don't hesitate to create an **Issue**
* Issues can relate to anything: error reporting, feature request, or questions for help, e.g.
* If the `README` isn't clear, please do report this - I encourage suggestions!
* Issues will be labelled accordingly - see below for [`label`](https://github.com/sap218/jabberwocky/labels) information:

#### bug
* error reporting - any user problems will be tagged with `bug`

#### documentation
* pertains to Issues that relate to documentation, e.g. a request to expand/clarify a point in the `README`

#### duplicate
* although a question may have been asked before, don't hesitate to ask anyway - this Issue will be labelled as `duplicate` and reference the solved/closed Issue

#### help
* can cover a wide range of Issues - from generic to specific code-related questions

#### request
* this label highlights a user's request

#### wontfix
* there may be circumstances that an Issue *shouldn't* be fixed
* for example, a user might run into a particular bug but a fix wouldn't make sense for Jabberwocky's scope
* the maintainer(s) will comment on why this label is applied, giving users some time (perhaps to rebut) before closing the Issue
* any other help will be provided (e.g. linking to a different tool)

***

End of page
33 changes: 17 additions & 16 deletions LICENSE
@@ -1,21 +1,22 @@
MIT License

Copyright (c) 2019 Samantha C Pendleton
Copyright (c) 2020-present Samantha Pendleton | Jabberwocky

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1 change: 0 additions & 1 deletion MANIFEST.in

This file was deleted.

180 changes: 66 additions & 114 deletions README.md
@@ -1,114 +1,66 @@
# Jabberwocky

[![DOI](https://joss.theoj.org/papers/10.21105/joss.02168/status.svg)](https://doi.org/10.21105/joss.02168) [![DOI](https://zenodo.org/badge/227571502.svg)](https://zenodo.org/badge/latestdoi/227571502)

**see [Jabberwocky site](https://sap218.github.io/jabberwocky/) for in-depth explanation and working scenarios (including test files)**

Jabberwocky is a toolkit for **ontologies**, since we all know ontologies are "*nonsense*". Not enough existing tools utilise the power of ontologies. Don't hesitate to create an [`issue`](https://github.com/sap218/jabberwocky/issues) or [`pull request`](https://github.com/sap218/jabberwocky/pulls) (see [**guidelines**](https://github.com/sap218/jabberwocky/blob/master/CONTRIBUTING.md) first).

#### Version

See `setup.py` in your local copy for version number | or `Releases`:
* **v1.0.0.0** [29/06/2020]
* **v2.0.0.0** [10/05/2021]
- includes `spacy PhraseMatcher`
- own synonym tags
- plot output for tf-idf

##### Install
```
$ git clone https://github.com/sap218/jabberwocky
$ cd jabberwocky
$ python3 setup.py install --user
```
**note**: if you are using a virtual environment you can avoid `--user`

##### Prerequisites
```
$ pip3 install click BeautifulSoup4 scikit-learn pandas lxml pytest spacy matplotlib
```
or **after installing**, use the `requirements.txt` file:
```
$ pip3 install -r requirements.txt
```

#### Elements

command | description
------- | -----------
`bandersnatch` | extract synonyms from an RDF/XML syntax `OWL` ontology
`catch` | extract elements / sentences of text using key words
`bite` | run statistical tf-idf for important words from text
`arise` | adding / updating new synonyms to an ontology

#### Ontology formats
`jabberwocky` works with the `OWL` ontology format: `RDF/XML` - for example, well-known biomedical ontologies such as `doid.owl`, `hpo.owl`, and `uberon.owl` will all work, plus your own created.

#### Examples
for examples of Jabberwocky's commands in use, please see the **[site](https://sap218.github.io/jabberwocky/SCENARIO.html)**.

**OR** to run the automated tests (in the cloned directory):
```
$ git submodule init
$ git submodule update
$ tox
```

---

## bandersnatch
`bandersnatch` curates synonyms for a list of key terms / words of interest from an ontology of your choice; you provide a list of ontology synonym tags. **note**: it is recommended that your keywords exactly match the classes from your chosen ontology (all in lowercase).
```
$ jab-bandersnatch -o hpo.owl -s ontology_synonym_tags.txt -k words_of_interest.txt
```

## catch
`catch` essentially "catches" key elements / sentences from textual data using a `.json` of key terms and their synonyms; you can use the output from `bandersnatch`. A user will also provide a `.txt` or `.json` of the text data. **note**: if a `.json` of text data is provided, you need to specify the parameter for the field that contains the textual data to process.
```
$ jab-catch -k label_with_synonyms.json -t facebook_posts.json -p user-comment -i inner-user-comment-reply
```

## bite
`bite` runs a tf-idf statistical analysis, searching for important terms in a text corpus. A user can provide a list of key terms to remove from the text so that they are excluded from the statistical model - meaning other terms may be ranked higher. **note**: as with `catch`, if you provide a `.json` of text data, you need to specify the field that contains the textual data to process. Using `-g True` means you'll get a bar plot of the (default) top 30 terms.
```
$ jab-bite -k label_with_synonyms.json -t twitter_posts.txt -g True
```

## arise
`arise` inserts synonyms in an ontology: **you** define these synonyms (e.g. "exact", "broad", "related", or "narrow") - these new synonyms may be based on the tf-idf statistical analysis from `bite`.
```
$ jab-arise -o pocketmonsters.owl -f tfidf_new_synonyms.tsv
```

---

## Thanks! :dragon:

the poem "Jabberwocky" written by Lewis Carroll is described as a "nonsense" poem.

**Contributors** - thank you!
- [@majensen](https://github.com/majensen) for setting up automated testing w/ `pytest` - [see pull request #13 for more details](https://github.com/sap218/jabberwocky/pull/13)

**Citing**
```
@article{Pendleton2020,
doi = {10.21105/joss.02168},
url = {https://doi.org/10.21105/joss.02168},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {51},
pages = {2168},
author = {Samantha C. Pendleton and Georgios V. Gkoutos},
title = {Jabberwocky: an ontology-aware toolkit for manipulating text},
journal = {Journal of Open Source Software}
}
```

---

## ONE LAST THING...

You can combine these commands together to form a process of steps of ontology synonym development and text analysis - see the [SCENARIO](https://sap218.github.io/jabberwocky/SCENARIO.html) for a working example of this process.

![jabberwocky cycle](/images/cycle.jpg)
# Jabberwocky

[![DOI](https://joss.theoj.org/papers/10.21105/joss.02168/status.svg)](https://doi.org/10.21105/joss.02168)

Jabberwocky is a toolkit for NLP and **ontologies**, since we all know ontologies are *nonsense*.

## Functionality

Read the [documentation](https://sap218.github.io/jabberwocky/) for more detail.

script | description
------- | -----------
`bandersnatch` | extract annotations from an ontology (`OWL` RDF/XML syntax)
`catch` | text mining (grep) a corpus using key words/phrases
`bite` | TF-IDF for ranking important terms from corpus
`arise` | adding / updating ontology concepts with new annotations
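
As a rough illustration of the ranking idea behind `bite` (not the script's actual implementation, which is not shown in this diff), TF-IDF can be sketched in plain Python. The function name `tfidf_scores`, the whitespace tokenisation, and the sample corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per document (hypothetical helper)."""
    # naive whitespace tokenisation; a real pipeline would clean text first
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        # term frequency multiplied by log inverse document frequency
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# made-up corpus for illustration only
corpus = ["the dragon bites", "the dragon sleeps", "the knight arises"]
scores = tfidf_scores(corpus)
```

A term that appears in every document (here, "the") scores zero, which is why removing stop words or known key terms lets other terms rank higher.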

## Running

Within each directory, there is a file `params_*.py` which users can edit.
This means users shouldn't need to edit the main/primary script.

Check the individual `READMEs` for parameter information.
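
The params-file pattern described above might look roughly like the sketch below. A `SimpleNamespace` stands in for a `params_*.py` module so the example is self-contained; the attribute names (`corpus_file`, `output_name`, `use_stopwords`) and the `describe_run` function are hypothetical and not taken from the repository.

```python
from types import SimpleNamespace

# Stand-in for a `params_*.py` module; these attribute names are
# hypothetical and not taken from the repository.
params = SimpleNamespace(
    corpus_file="corpus.txt",
    output_name="results",
    use_stopwords=True,
)


def describe_run(p):
    """Build a summary of the settings a core script would read."""
    mode = "with" if p.use_stopwords else "without"
    return f"{p.corpus_file} -> {p.output_name} ({mode} stop words)"


summary = describe_run(params)
```

Keeping user-editable values in one place means the core script only reads settings and never needs to be modified.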

## Workflow

When combining these Jabberwocky functions, users can create an NLP workflow.

![workflow](/docs/workflow.png)

#### Prerequisites

Check [`requirements.py`](https://github.com/sap218/jabberwocky/blob/master/requirements.py) for a list of packages and versions.

#### AOB

- Check the [**Changelog**](https://github.com/sap218/jabberwocky/blob/master/Changelog.md) for version information
- [License](https://github.com/sap218/jabberwocky/blob/master/LICENSE) is **MIT**
- The poem, Jabberwocky, written by Lewis Carroll, is described as a "nonsense" poem :dragon:

## Contributing

Please read the [**Contributing Guidelines**](https://github.com/sap218/jabberwocky/blob/master/Contributing.md).

- [@majensen](https://github.com/majensen) set up automated testing w/ `pytest` in v1.0 - see [pull request #13](https://github.com/sap218/jabberwocky/pull/13) for more details

## Citing

```
@article{Pendleton2020,
doi = {10.21105/joss.02168},
url = {https://doi.org/10.21105/joss.02168},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {51},
pages = {2168},
author = {Samantha C. Pendleton and Georgios V. Gkoutos},
title = {Jabberwocky: an ontology-aware toolkit for manipulating text},
journal = {Journal of Open Source Software}
}
```

***

End of page
12 changes: 12 additions & 0 deletions arise/README.md
@@ -0,0 +1,12 @@
# README - `arise`

## `ontology_name`
- ontology file (+path)

## `annotation_file`
- file of annotations
- can be either `.tsv` or `.csv`
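
Since `annotation_file` may be either `.tsv` or `.csv`, one way to handle both is to pick the delimiter from the file extension. This is a minimal sketch using Python's standard `csv` module; the helper names and the three-column row layout are assumptions for illustration, not the format `arise` actually expects.

```python
import csv
import io
from pathlib import Path


def delimiter_for(filename):
    """Choose a delimiter from the file extension (.tsv or .csv)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".tsv":
        return "\t"
    if suffix == ".csv":
        return ","
    raise ValueError(f"unsupported annotation file: {filename}")


def parse_annotations(text, filename):
    """Parse non-empty rows from text, using the delimiter implied by filename."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter_for(filename))
    return [row for row in reader if row]


# hypothetical row layout: concept, annotation type, new annotation
rows = parse_annotations("dragon\tsynonym\twyvern\n", "annotations.tsv")
```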

***

End of page
Empty file removed arise/__init__.py
Empty file.