Generates cliques using MDS-Clique from topics extracted by LDA from a corpus.
Uses Python 3 and various Python libraries (gensim, networkx, scikit-learn, etc.).
- Use virtualenv to create a virtual Python 3 environment within the git repository: virtualenv -p python3 venv
- Activate your Python virtualenv: source venv/bin/activate (you should now see something like (venv) in your console)
- Install the required Python libraries: pip install -r requirements.txt
- Clone https://github.com/hateno/scikit-learn.git and check out branch v0.19.0
- Ensure you are still within the same virtualenv from the previous section, then run python setup.py build followed by python setup.py install (this may take a while)
- Configure config.ini, then run the following commands in order:
python corpus.py
python gen_topic.py
python sim.py -dim 2 -data corpus
Run corpus.py with config.ini filled out (see the config.ini section). It reads a corpus (a directory of text documents) and pre-processes it (e.g. stemming and tokenization). Then execute gen_topic.py, which uses the output artifacts of corpus.py to perform LDA topic modeling over the pre-processed corpus. Finally, execute sim.py, which will either perform a specific experiment or simply execute MDS-Clique (see the sections below).
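The three stages have to run in this order because each one consumes the artifacts of the previous one. A minimal Python sketch of a wrapper that enforces the order is shown here; the helper names pipeline_commands and run_pipeline are hypothetical and not part of the repository, and the sim.py arguments are simply the ones used in the quick-start commands.

```python
import subprocess
import sys

# Stage order as described in this README; script arguments are the
# quick-start defaults. The helper names below are hypothetical.
PIPELINE = [
    ["corpus.py"],                                # pre-process the corpus
    ["gen_topic.py"],                             # LDA topic modeling
    ["sim.py", "-dim", "2", "-data", "corpus"],   # MDS-Clique on the topics
]


def pipeline_commands(python=sys.executable):
    """Return the full argument list for each stage, in execution order."""
    return [[python, *stage] for stage in PIPELINE]


def run_pipeline(cwd="."):
    """Run each stage in order and stop at the first one that fails."""
    for argv in pipeline_commands():
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later stage never runs on missing artifacts.
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling run_pipeline() from the repository root, inside the activated virtualenv, would be equivalent to typing the three commands by hand.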
Execute python sim.py --help (make sure you are in your virtualenv) and go through the commands.
Generate a random pre-computed dissimilarity matrix and run MDS-Clique using the standard deviation measure (named stress); cliques are written to out/cliques_<num>:
python sim.py -dim 2 -data random --matrix -clique stress
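For readers unfamiliar with the term, a pre-computed dissimilarity matrix is a square, symmetric matrix with a zero diagonal and non-negative entries. The sketch below only illustrates that shape with the standard library; how sim.py actually generates its random matrix is not shown in this README, and the function name and value range are assumptions.

```python
import random


def random_dissimilarity_matrix(n, seed=None):
    """Build an n x n symmetric matrix with a zero diagonal.

    Illustrative only: entries are drawn uniformly from [0, 1),
    which is an assumption, not the distribution sim.py uses.
    """
    rng = random.Random(seed)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = rng.random()
            matrix[i][j] = value   # upper triangle
            matrix[j][i] = value   # mirror to keep the matrix symmetric
    return matrix
```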
Use the extracted LDA topics and run MDS-Clique using the distance measure; cliques are written to out/cliques_<num>. Note that -clusters <num> needs to be higher than the number of topics extracted, or an error will be thrown:
python sim.py -dim 2 -data corpus -clusters 3 -clique distance
Run the RMDS experiment. Set -data none, since each sample will generate its own random data set:
python sim.py -dim 2 -data none --matrix -clique stress --rmds
Each experiment is denoted by a flag --<experiment_codename>. By default an experiment runs 8 samples and uses 1/4 of the max cores available on the system. You can manually specify the number of samples with -e <num_samples> and the number of cores with -c <num_cores>.
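To get a feel for what the default core count means on a given machine, the quarter rule can be sketched as below. The rounding (integer division) and the floor of one core are assumptions made for illustration; sim.py may compute the value differently.

```python
import os

DEFAULT_SAMPLES = 8  # default number of samples stated in this README


def default_cores(total=None):
    """Return a quarter of the available cores, but never less than one.

    Integer division and the minimum of 1 are assumptions.
    """
    if total is None:
        total = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total // 4)
```

On an 8-core machine this gives 2 cores; passing -c <num_cores> overrides the default.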
Relative MDS experiment (k-values are hard-coded)
python sim.py -dim 2 -data random --matrix --relative
MDS-Clique RMDS experiment
python sim.py -dim 2 -data none --matrix --rmds
MDS-Clique experiment
python sim.py -dim 2 -data none --matrix --rclique
Relative Online experiment
python sim.py -dim 2 -data none --matrix --relativeonline
Online Clique experiment
python sim.py -dim 2 -data none --matrix -clique stress --onlineclique
Online experiment
python sim.py -dim 2 -data none --matrix --online
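When scripting several experiments, it can help to keep the commands above in one table and add the optional -e and -c overrides programmatically. The following is a sketch only: the dictionary keys and the experiment_command helper are hypothetical, the argument lists are copied from the commands listed in this section, and appending -e and -c at the end assumes sim.py accepts its options in any order.

```python
# Arguments copied from the experiment commands in this README.
# The dictionary keys are hypothetical labels, not sim.py identifiers.
EXPERIMENTS = {
    "relative": ["-data", "random", "--matrix", "--relative"],
    "rmds": ["-data", "none", "--matrix", "--rmds"],
    "rclique": ["-data", "none", "--matrix", "--rclique"],
    "relativeonline": ["-data", "none", "--matrix", "--relativeonline"],
    "onlineclique": ["-data", "none", "--matrix", "-clique", "stress",
                     "--onlineclique"],
    "online": ["-data", "none", "--matrix", "--online"],
}


def experiment_command(name, samples=None, cores=None):
    """Build the argument list for one experiment run."""
    argv = ["python", "sim.py", "-dim", "2", *EXPERIMENTS[name]]
    if samples is not None:
        argv += ["-e", str(samples)]   # override the default of 8 samples
    if cores is not None:
        argv += ["-c", str(cores)]     # override the default core count
    return argv
```

For example, experiment_command("rmds", samples=16, cores=2) yields the RMDS command with 16 samples on 2 cores, ready to pass to subprocess.run.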
- corpus: path to the corpus directory (text documents)
  - A directory of plain text document(s) that will be pre-processed by corpus.py
- mds_seed: sets the MDS random_state
  - Optional. If blank, the MDS algorithm (SMACOF) starts from a random configuration and will therefore most likely produce a different final result on each run. Set the seed value if you want a deterministic solution (useful for debugging). Used by sim.py
[Global]
corpus = /path/to/sample-corpus/
mds_seed = 7
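The sample above can be read with Python's standard configparser module. The sketch below shows one way to treat a blank mds_seed as "no seed"; the function name parse_settings is hypothetical, and the scripts in this repository may read the file differently.

```python
import configparser


def parse_settings(text):
    """Parse config.ini content and return (corpus_path, mds_seed).

    mds_seed is returned as None when the option is blank or missing,
    which corresponds to a random starting configuration.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Global"]
    corpus = section["corpus"]                       # required
    raw_seed = section.get("mds_seed", fallback="").strip()
    seed = int(raw_seed) if raw_seed else None       # optional
    return corpus, seed
```

To load the real file, pass its contents, for example parse_settings(open("config.ini").read()).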
Run in interactive debug mode
ipython -i -c "%run sim.py -dim 2 -data corpus" --pdb
You may need to manually create the store, out/final, out/experiment, and out/ident directories.
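If you prefer not to create these directories by hand, a short script can do it. This is a minimal sketch using pathlib; the helper name ensure_directories is hypothetical, and it assumes the directories are expected relative to the repository root.

```python
from pathlib import Path

# Directory names taken from this README.
REQUIRED_DIRS = ["store", "out/final", "out/experiment", "out/ident"]


def ensure_directories(root="."):
    """Create the required directories under root if they do not exist."""
    created = []
    for rel in REQUIRED_DIRS:
        path = Path(root) / rel
        # parents=True also creates out/; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Run ensure_directories() once from the repository root before the first call to sim.py.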