Every day, medical data sources such as scientific literature, medical web pages, health-related social media posts, clinical notes, and drug reviews generate large amounts of text. Processing this data efficiently is a daunting task without suitable computational strategies, which makes text classification an essential operation in big data text analytics. In this contribution, we present bigNN, an open-source software package for big data text classification. It implements a word2vec neural network model on Apache Spark to classify sentences at big data scale in a timely fashion. The software offers a graphical user interface, and it facilitates reproducible research in sentence analysis by letting users configure different sets of Apache Spark and word2vec neural network parameters. We also introduce an application of bigNN in the medical informatics domain. bigNN is fully documented and is publicly and freely available at https://github.com/bircatmcri/bigNN.
bigNN includes the following packages:
Package Name | Description |
---|---|
edu.mfldclin.mcrf.bignn.gui | Implementation of the graphical user interface |
edu.mfldclin.mcrf.bignn.setting | Implementation of the pre-defined and user-defined settings required by the system |
edu.mfldclin.mcrf.bignn.learning | Implementation of text pre-processing and neural network learning model |
edu.mfldclin.mcrf.bignn.evaluation | Evaluation of the neural network predictive model |
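
To make the roles of the learning and evaluation packages more concrete, here is a minimal, self-contained Java sketch of two steps such a pipeline typically involves: tokenizing a sentence and scoring binary predictions. It is not taken from the bigNN code base. The class name `SentenceEvalSketch`, the methods `tokenize` and `evaluate`, the tokenization rule, and the choice of metrics (accuracy, precision, recall, F1) are all assumptions made for illustration, and the sketch does not use Apache Spark or word2vec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: all names here are hypothetical and are not
// part of the bigNN packages listed above.
final class SentenceEvalSketch {

    /** Lower-cases a sentence and splits it into alphanumeric tokens. */
    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        String lower = sentence.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /**
     * Compares predicted labels against actual labels for a binary task.
     * Returns {accuracy, precision, recall, f1}; a metric whose denominator
     * is zero is reported as 0.0.
     */
    static double[] evaluate(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && actual[i]) tp++;
            else if (predicted[i]) fp++;
            else if (actual[i]) fn++;
            else tn++;
        }
        int total = actual.length;
        double accuracy = total == 0 ? 0.0 : (tp + tn) / (double) total;
        double precision = tp + fp == 0 ? 0.0 : tp / (double) (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : tp / (double) (tp + fn);
        double f1 = precision + recall == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {accuracy, precision, recall, f1};
    }

    public static void main(String[] args) {
        // Made-up example sentence and labels, used only to exercise the code.
        System.out.println(tokenize("Aspirin may cause GI bleeding."));
        boolean[] actual = {true, true, false, false, true};
        boolean[] predicted = {true, false, false, true, true};
        double[] m = evaluate(actual, predicted);
        System.out.printf(Locale.ROOT,
                "accuracy=%.3f precision=%.3f recall=%.3f f1=%.3f%n",
                m[0], m[1], m[2], m[3]);
    }
}
```

In a Spark-based setting the same counts would normally be computed over a distributed dataset; the plain arrays here only keep the sketch runnable without extra dependencies.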
bigNN depends on:
- Apache Spark 2.10
- Java SE 8
The bigNN software architectural model is shown in the following figure.
Contributors:
- Ahmad P. Tafti (Marshfield Clinic Research Institute)
- Ehsun Behravesh (IEEE Member)
- Mehdi Assefi (University of Georgia)
- Eric LaRose (Marshfield Clinic Research Institute)
- Jonathan Badger (Marshfield Clinic Research Institute)
- John Mayer (Marshfield Clinic Research Institute)
- AnHai Doan (University of Wisconsin-Madison)
- David Page (University of Wisconsin-Madison)
- Peggy Peissig (Marshfield Clinic Research Institute)
The project described was supported by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR000427. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
The workflow and architectural model of bigNN are fully explained in [1]. Publications that use bigNN are encouraged to cite the following two papers. Thanks!
[1] Tafti, A.P., Behravesh, E., Assefi, M., LaRose, E., Badger, J., Mayer, J., Doan, A., Page, D. and Peissig, P., 2017. bigNN: an open-source big data toolkit focused on biomedical sentence classification. IEEE BIG DATA 2017.
[2] Tafti, A.P., Badger, J., LaRose, E., Shirzadi, E., Mahnke, A., Mayer, J., Ye, Z., Page, D. and Peissig, P., 2017. Adverse Drug Event Discovery Using Biomedical Literature: A Big Data Neural Network Adventure. JMIR Medical Informatics, 5(4), p.e51.