Tools used for data processing for ACSAC 2017, CCS 2017, and some others.
The scripts are usually directly runnable from the command line. Try invoking them
with the --help option for more info.
- scripts are related to the HTTPS Ecosystem dataset processing.
- censys_sonarssl_* scripts are related to the SonarSSL dataset processing.
- censys_* scripts other than the above are general tools for processing Censys TLS scan data.
- pgp_* scripts are related to the PGP dataset processing. The main processing script
Some recoding scripts may need a large amount of RAM (e.g., eco recode needs 80 GB of RAM).
The processing is built on the PBSPro job scheduler. Jobs are generated as shell scripts and submitted to the scheduler. Jobs usually write their results to the shared data mount.
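As an illustration of what "jobs generated as shell scripts" can look like, here is a minimal Python 3 sketch. It is not the generator shipped in this repo: the function names, the job file naming, and the default resource values are assumptions chosen to mirror the qsub examples in this README.

```python
import os
import stat
import tempfile


def render_job(command, ncpus=1, mem="1gb", scratch="1gb", walltime="48:00:00"):
    """Return the text of a PBS job script that runs a single shell command."""
    lines = [
        "#!/bin/bash",
        # Resource requests as PBS directives (same syntax as `qsub -l ...`).
        "#PBS -l select=1:ncpus={}:mem={}:scratch_local={}".format(ncpus, mem, scratch),
        "#PBS -l walltime={}".format(walltime),
        "",
        command,
        "",
    ]
    return "\n".join(lines)


def write_jobs(commands, jobs_dir):
    """Write one executable job script per command and return the file paths."""
    os.makedirs(jobs_dir, exist_ok=True)
    paths = []
    for idx, command in enumerate(commands):
        # Hypothetical naming scheme: job-00000.sh, job-00001.sh, ...
        path = os.path.join(jobs_dir, "job-{:05d}.sh".format(idx))
        with open(path, "w") as fh:
            fh.write(render_job(command))
        os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    for job_path in write_jobs(["echo first", "echo second"], demo_dir):
        print(job_path)
```

Each generated file could then be handed to the scheduler with qsub, one submission per file.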
- For Censys scripts you need a user account to get the data.
- Convert the links to a JSON file: when logged in, download the HTML page with the links, then run:
python codesign/ downloadedpage.html >> censys_links.json
- Process the dataset on the fly (one file only):
stdbuf -eL --debug --link-file censys_links.json --link-idx 10 --data "/tmp" --continue --sec
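The link conversion step above can be pictured with the following Python 3 sketch. It is only an illustration and not the converter from this repo: the class name, the `.lz4` suffix filter, and the JSON layout (a list of indexed links) are all assumptions, and the real script may produce a different structure.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_to_json(html_text, suffix=".lz4"):
    """Return a JSON string listing the links that end with `suffix`.

    The output layout is hypothetical: a list of {"idx", "href"} records.
    """
    parser = LinkCollector()
    parser.feed(html_text)
    selected = [href for href in parser.links if href.endswith(suffix)]
    records = [{"idx": i, "href": href} for i, href in enumerate(selected)]
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    page = '<a href="https://example.org/scan.json.lz4">scan</a> <a href="/about">about</a>'
    print(links_to_json(page))
```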
For real large-scale processing, generate jobs as described below and schedule them with PBSPro.
export DATADIR="/storage/praha1/home/$LOGNAME/results"
mkdir -p $DATADIR
export HOMEDIR="/storage/praha1/home/$LOGNAME"
export PYENV_ROOT="$HOMEDIR/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
pyenv local 2.7.13
mkdir -p jobs
cd jobs
python cas/codesign/ \
--home=$HOMEDIR \
--wrapper ${HOMEDIR}/cas/ \
--data=$DATADIR \
--jobs-dir=jobs \
# or
python ../cas/codesign/ \
--data /storage/brno3-cerit/home/ph4r05/fulltls \
--wrapper /storage/praha1/home/ph4r05/cas/ \
--home /storage/praha1/home/ph4r05/cas \
Interactive jobs (qsub -I) are handy, e.g., for debugging a script or preparing the environment, since the frontends are quite slow.
qsub -l select=1:ncpus=1:mem=1gb:scratch_local=1gb -l walltime=48:00:00 -I
qsub -l select=1:ncpus=1:mem=1gb:scratch_local=1gb:vnode=tarkil3 -l walltime=48:00:00 -I
qsub -l select=1:ncpus=1:mem=1gb:scratch_local=1gb:cl_tarkil=True -l walltime=48:00:00 -I
qdel 1085540
curl -s 2>&1 | lz4 -d -c - | head -n 1
For that you may need to install lz4:
sudo apt-get install liblz4-tool
- One may use export scripts, but this is a bit slower
- Faster solution: mysqldump + sqlite import.
In the latter case the schema is created by the export script.
Note that the modified mysql2sqlite script from this repo is needed to import hex-coded blobs.
mysqldump --skip-extended-insert --compact --hex-blob -u codesign -p codesign \
--tables maven_artifact maven_signature pgp_key > maven_dump.sql
./mysql2sqlite maven_dump.sql | sqlite3 maven.sqlite
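To show why hex-coded blobs need special handling during the import: mysqldump with --hex-blob writes binary values as `0x...` literals, while SQLite expects blob literals in the form `X'...'`. The Python 3 sketch below illustrates that rewrite on a single INSERT line. It is not the modified mysql2sqlite script; the function name, the regular expression, and the `demo` table are assumptions, and the naive pattern would also rewrite a `0x...` token that happens to appear inside a text value.

```python
import re
import sqlite3

# A 0x... token that is not preceded by an identifier character or a quote.
HEX_LITERAL = re.compile(r"(?<![0-9A-Za-z_'])0x([0-9A-Fa-f]+)")


def hex_blobs_to_sqlite(line):
    """Rewrite MySQL hex literals (0xABCD) as SQLite blob literals (X'ABCD')."""
    return HEX_LITERAL.sub(lambda m: "X'%s'" % m.group(1), line)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical table used only for this demonstration.
    conn.execute("CREATE TABLE demo (id INTEGER, payload BLOB)")
    conn.execute(hex_blobs_to_sqlite("INSERT INTO demo VALUES (1,0xDEADBEEF);"))
    print(conn.execute("SELECT payload FROM demo").fetchone())
```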
Forwarding MySQL port from one machine to another via SSH tunnel.
Please note that SSH tunnel forwarding does not allow port binding on the wildcard address (0::) by default; forwarded ports are bound to the loopback interface only.
There are 2 ways to do the port binding.
- Connecting from Meta to the DB server, using local tunneling. The client is in charge and can open as many tunnels as desired.
- Connecting from the DB server to Meta, using remote tunneling. The server creates a single connection to some server hub.
Binding on the local interface only can be mitigated by using
- Create a new SSH key on Meta; this key will only be allowed to do port forwarding on the DB server
- DB server
command="echo 'This account can only be used for port forward'",no-agent-forwarding,no-X11-forwarding,permitopen="localhost:3306" ssh-rsa AAAAB3NzaC1y....
- Create a tunnel on the Meta. Ideally do that in the
ssh -nNT -L 60123:localhost:3306 klivm &
- Socat hack, forwarding local bound 60123 port to the global bound 60124. Ideally do that in the
- Socat can be found here:
socat tcp-listen:60124,reuseaddr,fork tcp:localhost:60123 &
Use port 60124 for the MySQL connection.
Alternatively, you can use another SSH tunnel from the worker node on Meta to the frontend node (benefit: an encrypted connection from the worker node to the frontend).
ssh -nNT -L 60125:localhost:60123 tarkil &
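Before pointing a MySQL client at a forwarded port, it can help to confirm that something is actually listening there. The following Python 3 sketch does a plain TCP connect check; the function name is hypothetical, and the port numbers in the demo are simply the ones used in the tunnel examples above.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for port in (60123, 60124, 60125):
        state = "open" if port_open("localhost", port) else "closed"
        print("localhost:%d is %s" % (port, state))
```

A successful connect only shows that the tunnel endpoint accepts connections; it does not verify that MySQL answers on the other side.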
# MPFR module load, or manual installation:
module add mpfr-3.1.4
export CWD=$HOME
# install GMP
tar -xjvf gmp-6.1.2.tar.bz2
cd gmp-6.1.2
./configure --prefix=$CWD
make && make install
cd $CWD
# install MPC
tar -xzvf mpc-1.0.3.tar.gz
cd mpc-1.0.3
./configure --prefix=$CWD
make && make install
cd $CWD
# Install gmpy2
env CFLAGS="-I${CWD}/include" LDFLAGS="-L${CWD}/lib" pip install --global-option=build_ext --global-option="-I${CWD}/include" gmpy2
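After the build, a quick import check can confirm that gmpy2 was compiled and linked against the locally installed libraries. This is a small sketch with a hypothetical helper name; it returns None instead of failing when gmpy2 is not importable.

```python
def gmpy2_version():
    """Return the installed gmpy2 version string, or None if it cannot be imported."""
    try:
        import gmpy2
    except ImportError:
        return None
    return gmpy2.version()


if __name__ == "__main__":
    version = gmpy2_version()
    if version is None:
        print("gmpy2 is not importable; check CFLAGS/LDFLAGS and the library paths")
    else:
        print("gmpy2 %s is available" % version)
```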
The Message Passing Interface (MPI) enables efficient parallel computation.
Request nodes:
qsub -l select=4:ncpus=1:mem=1gb:scratch_local=1gb -l walltime=1:00:00 -l place=scatter
The script can contain something like this:
mpirun -machinefile $PBS_NODEFILE python
The mpirun executable will execute the script on each node in the machine file. Examples:
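A script launched this way typically asks MPI for its rank and takes its own share of the work. The sketch below assumes the mpi4py package, which this README does not mention, and falls back to a single process when it is missing; the `partition` helper and the toy work list are illustrative only.

```python
def partition(items, rank, size):
    """Return the slice of `items` assigned to process `rank` out of `size`."""
    # Round-robin split: rank 0 gets items 0, size, 2*size, ...
    return items[rank::size]


def main():
    try:
        from mpi4py import MPI  # assumption: mpi4py is installed on the nodes
    except ImportError:
        rank, size = 0, 1  # run as a single process without MPI
    else:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

    work = partition(list(range(10)), rank, size)
    print("rank %d of %d handles %s" % (rank, size, work))


if __name__ == "__main__":
    main()
```

With four processes, every item is handled by exactly one rank, because the round-robin slices are disjoint and together cover the whole list.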
pip install --upgrade --find-links=. .
pip install MySql-Python
pip install SQLAlchemy
sudo apt-get install python-pip python-dev libmysqlclient-dev build-essential
sudo apt-get install libsasl2-dev python-dev libldap2-dev libssl-dev libffi-dev libsqlite3-dev libreadline-dev lbzip2
sudo yum install gcc gcc-c++ make automake autoconf libtool
sudo yum install python-devel python-pip gcc gcc-c++ make automake autoconf libtool openssl-devel libffi-devel dialog
sudo yum install mysql-devel redhat-rpm-config readline-devel libzip-devel bzip2-devel
sudo yum install openldap-devel
pip install pyopenssl
pip install pycrypto
pip install git+
pip install --upgrade --find-links=. .
It is usually recommended to create a new Python virtual environment for the project:
virtualenv ~/pyenv
source ~/pyenv/bin/activate
pip install --upgrade pip
pip install --upgrade --find-links=. .
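To double-check that the virtual environment is really active before installing, one can inspect the interpreter prefixes. This sketch uses a hypothetical helper name and covers both the `sys.real_prefix` attribute set by older virtualenv releases and the `sys.base_prefix` attribute used by newer ones.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv sets sys.real_prefix; newer tools set sys.base_prefix.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("virtual environment active: %s" % in_virtualenv())
```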
module add cmake-3.6.2
module add gcc-4.8.2
It won't work with a lower Python version. Use pyenv to install a new Python version.
It internally downloads the Python sources and installs them to ~/.pyenv
git clone ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
exec $SHELL
pyenv install 2.7.13
pyenv install 3.6.2
pyenv local 2.7.13
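A script can guard against an interpreter that is too old with a simple version comparison. In the sketch below the minimum of 2.7.13 is an assumption taken from the pyenv commands above, not a documented requirement, and the helper name is hypothetical.

```python
import sys


def meets_minimum(version_info=None, minimum=(2, 7, 13)):
    """Return True if the given (or running) Python version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major, minor, and micro as plain tuples.
    return tuple(version_info[:3]) >= tuple(minimum)


if __name__ == "__main__":
    if not meets_minimum():
        sys.exit("Python is too old; install a newer version with pyenv")
    print("Python version is sufficient")
```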
pip install -U pip setuptools twine
python sdist
twine upload dist/*