
Opening-ETDs/ScanBank

 
 


Dataset

Our gold standard dataset for figure extraction from scanned ETDs can be downloaded using this link.

Original Readme

The original readme (readme from the original fork) can be found here.

Instructions to run this fork:

Create python environment

Use the requirements.txt from the repository's root to create your python environment.

Another quick way to set up the environment using Anaconda is:

ENV_NAME=deepfigures_3 && conda remove --name $ENV_NAME --all -y && conda create --name $ENV_NAME python=3.6 -y && source activate $ENV_NAME && pip install -r /home/sampanna/deepfigures-open/requirements.txt --no-cache-dir

Make the C++ dependencies

cd /home/sampanna/deepfigures-open/vendor/tensorboxresnet/tensorboxresnet/utils && make

Install texlive (optional, needed for data generation)

If you have sudo access, run:

sudo apt-get install texlive-latex-base \
    texlive-fonts-recommended \
    texlive-fonts-extra \
    texlive-latex-extra \
    texlive-font-utils

If you do not have sudo access, run:

wget http://mirror.ctan.org/systems/texlive/tlnet/install-tl-unx.tar.gz
tar -xvf install-tl-unx.tar.gz > untar.log
cd `head -1 untar.log`
./install-tl -profile texlive.profile

Set AWS credentials (optional, needed for data generation)

If you need to download data from AWS, add your credentials to the credentials file. A sample file looks like this:

[default]
aws_access_key_id=dummy_sample_credentials
aws_secret_access_key=dummy_sample_credentials_dummy_sample_credentials
aws_session_token=dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_dummy_sample_credentials_

Also, don't forget to set the ARXIV_DATA_TMP_DIR and ARXIV_DATA_OUTPUT_DIR variables as mentioned in the README.md.
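
Before starting the pipeline, it can help to confirm that the credentials file parses and that both directory variables are set. The following is a minimal sketch and is not part of the repository: the helper names load_aws_credentials and missing_env_vars are hypothetical, and it assumes the credentials file uses the INI layout shown above.

```python
import configparser
import os

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")
REQUIRED_ENV = ("ARXIV_DATA_TMP_DIR", "ARXIV_DATA_OUTPUT_DIR")


def load_aws_credentials(path, profile="default"):
    """Parse an INI-style credentials file and return one profile as a dict."""
    # Interpolation is disabled so that a '%' in a secret is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"could not read credentials file: {path}")
    if profile not in parser:
        raise KeyError(f"profile [{profile}] not found in {path}")
    creds = dict(parser[profile])
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError(f"missing keys in [{profile}]: {missing}")
    return creds


def missing_env_vars(env=None):
    """Return the required directory variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

Calling load_aws_credentials("./credentials") raises early if a key is absent, and missing_env_vars() returns an empty list once both variables are exported.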

Test the pre-built Docker image:

sudo docker run --gpus all -it --volume /home/sampanna/deepfigures-results:/work/host-output --volume /home/sampanna/deepfigures-results/31219:/work/host-input sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.5 /bin/bash

This command will pull the sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.5 docker image from Docker Hub, run it and give us bash access to it. If this image is already pulled, this command will simply run it. sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.5 is also available for CPU use-cases.

Note: Please check the latest version before pulling.

In the above command, the first '--volume' argument connects the local output directory with the docker output directory. The second '--volume' argument does the same for the input directory. Please modify the local file paths as per your local host system. More info here.

Further, the --gpus all option tells docker to use all the GPUs available on the system. Try running nvidia-smi once inside the container to check if GPUs are accessible. The --gpus all option is not required when running the CPU docker image.
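
The pieces of the docker run command described above can be assembled programmatically, which makes it easier to swap host paths or switch between the GPU and CPU images. This is only a sketch: build_docker_run is a hypothetical helper, not something shipped in the repository, and the container paths /work/host-output and /work/host-input are taken from the command above.

```python
import shlex


def build_docker_run(image, host_output, host_input, use_gpu=True, command="/bin/bash"):
    """Return the argv list for a `docker run` call with both volumes mounted."""
    argv = ["docker", "run"]
    if use_gpu:
        # Only needed for the GPU image; the CPU image runs without it.
        argv += ["--gpus", "all"]
    argv += [
        "-it",
        "--volume", f"{host_output}:/work/host-output",
        "--volume", f"{host_input}:/work/host-input",
        image,
    ]
    # Split the in-container command the way a shell would.
    argv += shlex.split(command)
    return argv
```

The returned list can be passed to subprocess.run(argv, check=True); prefix it with "sudo" if your Docker setup requires it.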

Generate data:

docker run --gpus all -it --volume /home/sampanna/deepfigures-results:/work/host-output --volume /home/sampanna/deepfigures-results:/work/host-input sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.6 python deepfigures/data_generation/arxiv_pipeline.py

This command will run the deepfigures/data_generation/arxiv_pipeline.py script from the source code which will:

  • Download data from AWS's requester-pays buckets using the credentials set above.
  • Cache this data in the directory /work/host-output/download_cache.
  • Unzip and generate the relevant training data.

Transform data:

docker run --gpus all -it --volume /home/sampanna/deepfigures-results:/work/host-output --volume /home/sampanna/deepfigures-results:/work/host-input sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.6 python figure_json_transformer.py
docker run --gpus all -it --volume /home/sampanna/deepfigures-results:/work/host-output --volume /home/sampanna/deepfigures-results:/work/host-input sampyash/vt_cs_6604_digital_libraries:deepfigures_cpu_0.0.6 python figure_boundaries_train_test_split.py

The data generated by arxiv_pipeline.py is not in the format that tensorbox needs for training, so the first command transforms it. The second command splits the data into train and test sets.
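
The idea behind the split step can be illustrated with a small, self-contained sketch. It is not the code in figure_boundaries_train_test_split.py: the function name, the 80/20 ratio, and the fixed seed are assumptions chosen only for illustration.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` deterministically and split it in two."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled items form the test set, the rest the train set.
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed means the same annotations land in the test set every time, which keeps evaluation numbers comparable between training runs.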

Train the model:

python manage.py train /work/host-input/weights/hypes.json /home/sampanna/deepfigures-results /home/sampanna/deepfigures-results

Here, the python environment created in one of the steps above should be activated.

  • The first argument to manage.py is the train command.
  • /work/host-input/weights/hypes.json is the path to the hyper-parameters as visible from inside the docker container.
  • /home/sampanna/deepfigures-results is the host's input directory for the container. This will be linked to /work/host-input.
  • /home/sampanna/deepfigures-results is the host's output directory for the container. This will be linked to /work/host-output.

Run detection:

python manage.py detectfigures '/home/sampanna/workspace/bdts2/deepfigures-results' '/home/sampanna/workspace/bdts2/deepfigures-results/LD5655.V855_1935.C555.pdf'

Here, the python environment created in one of the steps above should be activated.

  • The first argument to manage.py is the detectfigures command.
  • '/home/sampanna/workspace/bdts2/deepfigures-results' is the host path to the output directory to put the detection results in.
  • '/home/sampanna/workspace/bdts2/deepfigures-results/LD5655.V855_1935.C555.pdf' is the host path to the PDF file to be processed.
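
To process a whole directory of PDFs, the same command can be generated once per file. The sketch below assumes the argument order shown above (output directory first, then the PDF path); detection_commands is a hypothetical helper and not part of manage.py.

```python
import sys
from pathlib import Path


def detection_commands(output_dir, pdf_dir, python=sys.executable):
    """Build one `manage.py detectfigures` argv list per PDF in `pdf_dir`."""
    # Sorting gives a stable processing order; the match is case-sensitive.
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    return [
        [python, "manage.py", "detectfigures", str(output_dir), str(pdf)]
        for pdf in pdfs
    ]
```

Each list can be run with subprocess.run(cmd, check=True) from the repository root, with the Python environment activated.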

Instructions to run on ARC using Singularity:

Docker is not available on Virginia Tech's Advanced Research Computing (ARC) HPC cluster. However, Singularity can be used to run pre-built Docker images on ARC.

Load the module:

Each time you ssh into either the login node or any of the compute nodes, please load the Singularity module using:

module load singularity/3.3.0

Create the singularity directory:

mkdir /work/cascades/${USER}/singularity

Make the directory required for Singularity.

Pull the Docker image:

singularity pull docker://sampyash/vt_cs_6604_digital_libraries:deepfigures_gpu_0.0.6
  • This command will pull the given image from Docker Hub.
  • This command needs internet access and hence needs to be run on the login node.
  • This command will take some time.

Run the pulled image:

singularity run --nv -B /home/sampanna/deepfigures-results:/work/host-output -B /home/sampanna/deepfigures-results:/work/host-input /work/cascades/sampanna/singularity/vt_cs_6604_digital_libraries_deepfigures_gpu_0.0.6.sif /bin/bash
  • This command will run the pulled Docker image and give the user shell access inside the container.
  • The --nv flag is analogous to the --gpus all option of Docker.
  • The -B flag is analogous to the --volume option of Docker.
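
The correspondence between the Docker and Singularity options can be captured in a short sketch. Both helpers are hypothetical. sif_filename assumes Singularity names a pulled image <name>_<tag>.sif, which matches the .sif file name in the run command above but may differ on other Singularity versions.

```python
import shlex


def sif_filename(docker_ref):
    """Derive the assumed .sif file name from a reference like user/name:tag."""
    prefix = "docker://"
    ref = docker_ref[len(prefix):] if docker_ref.startswith(prefix) else docker_ref
    name, _, tag = ref.rpartition(":")
    if not name:
        # No explicit tag was given, so fall back to "latest".
        name, tag = ref, "latest"
    return f"{name.split('/')[-1]}_{tag}.sif"


def build_singularity_run(sif_path, host_output, host_input, use_gpu=True, command="/bin/bash"):
    """Return the argv list for `singularity run` with both bind mounts."""
    argv = ["singularity", "run"]
    if use_gpu:
        argv.append("--nv")  # counterpart of Docker's --gpus all
    argv += [
        "-B", f"{host_output}:/work/host-output",  # counterpart of --volume
        "-B", f"{host_input}:/work/host-input",
        str(sif_path),
    ]
    argv += shlex.split(command)
    return argv
```

Passing a different command, for example "python figure_json_transformer.py", gives the Singularity equivalent of the corresponding Docker step.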

Running the remaining commands is straightforward and is left as an exercise for the reader.

Why was this fork made:

The master branch of the original repository was not working for me, so I debugged it and made this fork. The following changes were made.

Made changes to the Dockerfiles.

Neither Dockerfile (CPU or GPU) was building; the build failed with an error related to 'libjasper1 libjasper-dev not found'. The Dockerfiles were changed to make them buildable, and the built images have been pushed to Docker Hub. Link here. You can simply fetch the two images and re-tag them as deepfigures-cpu:0.0.6 and deepfigures-gpu:0.0.6. Support was also added for reading AWS credentials from the ./credentials file.

Added the pre-built pdffigures jar.

The pdffigures jar has been built and committed to the bin folder of this repository, so you should not need to build it. Java 8 must be installed on your system for it to work.

scipy version downgrade

Version 1.3.0 of scipy does not have imread and imsave in scipy.misc. As a result, the import statement from scipy.misc import imread, imsave in detections.py was not working. Hence, downgraded the version of scipy to 1.1.0 in requirements.txt. The import worked as a result.
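
A quick way to tell whether an installed SciPy still provides these functions is to compare its version against the release that dropped them. The sketch below is an illustration only: has_misc_image_io is a hypothetical helper, and the 1.2 cut-off is my understanding of the SciPy release notes rather than something stated in this repository.

```python
def has_misc_image_io(scipy_version):
    """Return True if this SciPy release still ships scipy.misc.imread/imsave."""
    # Only the major and minor parts matter, e.g. "1.1.0" -> (1, 1).
    major, minor = (int(part) for part in scipy_version.split(".")[:2])
    # Assumption: imread and imsave were removed starting with SciPy 1.2.
    return (major, minor) < (1, 2)
```

With SciPy installed, has_misc_image_io(scipy.__version__) tells you whether the import in detections.py can succeed in the current environment.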

sp.optimize was not getting imported.

Imported it explicitly using from scipy import optimize, and the code now calls its functions through that import. Note that scipy.optimize is a module, not a callable.
