This repository contains the code and benchmarks for the SIGMOD 2019 paper: JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. Follow the steps here to run the experiments.
PostgreSQL and Go are required to run the experiments.
- Download PostgreSQL and install it from source. Make sure to use --prefix=$HOME to install to your home directory, and --with-pgport=5442 to set the port for both server and client.
- Initialize a database directory: initdb -D pg_data
- Use the configuration file in conf/postgresql.conf to start a server: postgres -D pg_data -c config_file=conf/postgresql.conf
- Create a new database with the same name as your Unix user name: createdb <dbname>
- Test the client-server connection using: psql -p 5442
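The PostgreSQL steps above can be collected into one shell sketch. It assumes PostgreSQL has already been built from source with the flags above, that the commands run from the repository root so that conf/postgresql.conf resolves, and that the database is named after the current Unix user. The function names and the pg.log file are illustrative and not part of this repository.

```shell
#!/usr/bin/env bash
# Sketch only: the josie_pg_* functions and PG_LOG are hypothetical names.
PG_PORT=5442      # must match --with-pgport used when building PostgreSQL
PG_DATA=pg_data   # database directory from the steps above
PG_LOG=pg.log     # assumed location for the server log

# Print the configure flags to pass when building PostgreSQL from source.
josie_pg_configure_flags() {
  echo "--prefix=$HOME --with-pgport=$PG_PORT"
}

# Initialize the data directory, start the server, and create the database.
josie_pg_setup() {
  initdb -D "$PG_DATA" || return 1
  # Start the server in the background with the repository's configuration.
  postgres -D "$PG_DATA" -c config_file=conf/postgresql.conf >"$PG_LOG" 2>&1 &
  # Wait up to 30 seconds for the server to accept connections.
  for _ in $(seq 1 30); do
    pg_isready -p "$PG_PORT" >/dev/null 2>&1 && break
    sleep 1
  done
  createdb -p "$PG_PORT" "$(whoami)" || return 1
  # A trivial query confirms the client-server connection.
  psql -p "$PG_PORT" -c 'SELECT 1;'
}
```

Calling josie_pg_setup once from the repository root runs the whole sequence; josie_pg_configure_flags only prints the flags so they can be passed to ./configure.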
- Download and install the Go programming language
- Create a directory under your home directory: mkdir ~/go. This will be your Go path.
- Make sure you have set up $GOPATH in your bash environment by adding the following lines to your bash profile, then restart your bash session:
export GOPATH=$HOME/go
export GOBIN=$GOPATH/bin
export PATH=$GOBIN:$PATH
- Important: check out this repository under your go path:
mkdir -p ~/go/src/github.com/ekzhu/josie
git clone git@github.com:ekzhu/josie.git ~/go/src/github.com/ekzhu/josie
Now go into the project directory at ~/go/src/github.com/ekzhu/josie.
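Because the repository has to sit under the Go path, a quick check before continuing can save time. The following is a minimal sketch; josie_repo_dir and josie_check_go_env are hypothetical helper names, not part of this repository.

```shell
# Sketch only: both helpers are hypothetical and not shipped with this repository.

# Print the location where the repository is expected to be checked out.
josie_repo_dir() {
  echo "${GOPATH:-$HOME/go}/src/github.com/ekzhu/josie"
}

# Fail with a message if GOPATH is unset or the repository is not under it.
josie_check_go_env() {
  if [ -z "${GOPATH:-}" ]; then
    echo "GOPATH is not set" >&2
    return 1
  fi
  if [ ! -d "$(josie_repo_dir)" ]; then
    echo "repository not found at $(josie_repo_dir)" >&2
    return 1
  fi
  echo "ok: $(josie_repo_dir)"
}
```

Running josie_check_go_env in a fresh bash session prints the expected path when the setup is correct, and an error on stderr otherwise.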
First, download the benchmarks in the form of Postgres dumps. Uncompress the dump files (use gzip -d) and run the SQL files (or use pg_restore) to load the benchmarks into Postgres. Make sure to use the same port setting you used when installing Postgres earlier, so the dump files are imported into the right database.
Then, run the SQL script create_indexes.sql to create indexes for the sets and posting lists tables.
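The loading steps above can be sketched as a small shell helper. It assumes the dumps are plain SQL files compressed as *.sql.gz, that create_indexes.sql is in the current directory, and that the default database for the current user is the target. The names josie_run, josie_load_dumps, and DRY_RUN are hypothetical.

```shell
# Sketch only: josie_run, josie_load_dumps, and DRY_RUN are hypothetical names;
# the *.sql.gz pattern is an assumption about how the dump files are named.
PG_PORT=5442   # same port used when installing Postgres

# Print the command when DRY_RUN=1, otherwise execute it.
josie_run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Uncompress and load every dump in the given directory, then build indexes.
josie_load_dumps() {
  dump_dir=${1:-.}
  for gz in "$dump_dir"/*.sql.gz; do
    [ -e "$gz" ] || continue   # skip when no file matches the pattern
    josie_run gzip -d "$gz" || return 1
    josie_run psql -p "$PG_PORT" -f "${gz%.gz}" || return 1
  done
  # Create indexes for the sets and posting lists tables after loading.
  josie_run psql -p "$PG_PORT" -f create_indexes.sql
}
```

Running DRY_RUN=1 josie_load_dumps path/to/dumps first prints the commands without touching the database. If a dump is in a non-text format, pg_restore would be needed in place of psql -f.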
We use the targets defined in the Makefile to run experiments.
First, you need to generate a cost sample table to compute the read cost of sets and posting lists:
make sample_cost_canada_us_uk
make sample_cost_webtable
To run experiments using the Open Data benchmark:
make canada_us_uk
To run experiments using the Web Table benchmark:
make webtable
Note: the experiments can take many hours or even days depending on your
hardware (an SSD will be much faster than an HDD).
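Since a run can last for days, it may help to run the cost sampling and the experiment back to back and keep a log. The following is a minimal sketch that relies only on the Makefile targets named above; josie_targets, josie_run_benchmark, and the logs directory are hypothetical.

```shell
# Sketch only: josie_targets, josie_run_benchmark, and logs/ are hypothetical.

# Print the make targets for one benchmark, cost sampling first.
josie_targets() {
  echo "sample_cost_$1 $1"
}

# Run the targets in order, appending all output to logs/<benchmark>.log.
josie_run_benchmark() {
  mkdir -p logs
  for target in $(josie_targets "$1"); do
    echo "$(date) starting $target" >> "logs/$1.log"
    make "$target" >> "logs/$1.log" 2>&1 || return 1
  done
}
```

For example, josie_run_benchmark canada_us_uk or josie_run_benchmark webtable, ideally inside a screen or tmux session so the run survives a dropped connection.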
To fine-tune which experiments to run, you can modify exp.go.
Results are located in the results directory. Use the targets defined in the Makefile to plot results:
make plot
The output plots are located in the plots directory.