Python Demo
This page describes how to use the Python demo (demo/rehashing.py) to:
- Investigate how many samples are required on average to answer random queries from the dataset to within a desired relative error.
- Tune the parameters of eLSH
- Visualize the local structure of queries from the dataset to gain intuition about the distance scales that contribute to the query density
- Prepare the config file to use the efficient C++ implementation
We assume that:
- the dataset in question is stored in a CSV file with each row corresponding to a vector (point).
- the kernel k(x,y) is a decreasing function of the Euclidean distance ||x-y|| and the bandwidth \sigma is known.
- the desired relative accuracy \epsilon \in (0,1) is known.
- a threshold \tau \in (0,1) is given that defines a lower bound on the densities we want to approximate with relative error at most \epsilon.
- Bandwidth selection for kernel density estimation or kernel regression is still an active topic of research with several different approaches; we do not discuss here how to choose the bandwidth.
- If the dataset is large (e.g., > 500K rows), we recommend running the demo on a smaller CSV file obtained by sampling N = 18 / (\epsilon^2 \tau) random rows (points); see the sketch after this list.
- Recommended values for kernel density estimation are \epsilon = 0.25 and \tau = 0.1 / \sqrt{n}.
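A minimal sketch of this subsampling step, assuming the dataset lives in datasets/covtype.data and writing the sample to a hypothetical datasets/covtype_sample.csv:
import numpy as np

eps, tau = 0.25, 0.0001  # example accuracy and density threshold (see recommendations above)
N = int(np.ceil(18 / (eps**2 * tau)))  # recommended number of rows N = 18 / (eps^2 * tau)
data = np.loadtxt("datasets/covtype.data", delimiter=",")  # one point (row) per line
idx = np.random.choice(data.shape[0], size=min(N, data.shape[0]), replace=False)
np.savetxt("datasets/covtype_sample.csv", data[idx], delimiter=",")  # smaller CSV for the demo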
The first step is to load the dataset and specify the kernel evaluation problem by instantiating:
- the weights w_i to be used, for i \in [n]
- the kernel function k(x,y)
- the bandwidth \sigma
- the threshold \tau
- the target relative error \epsilon
In this demo we are going to apply our method to the Covertype UCI dataset.
import numpy as np
from functools import partial
from rehashing import rehashing, eLSH_prob  # defined in demo/rehashing.py (import path assumed)
# load the example dataset (one point per row)
points = np.loadtxt(open("datasets/covtype.data", "rb"), delimiter=",", skiprows=0)
points = points.transpose()
(d,n) = points.shape
# specify kernel evaluation parameters
weights = np.ones(n) / n # in general weights can be non-negative and sum to 1
sigma = np.sqrt(np.mean(np.mean(points**2))) / 5 # example bandwidth
# exponential kernel
kernel_fun = lambda x,y: np.exp(-np.linalg.norm(x-y) / sigma)
# lower bound on the densities of interest; suggested value is 0.1 / sqrt(n)
tau = 0.0001
# target multiplicative accuracy
eps = 0.2
# Initialize data structure
covtype = rehashing(points, weights, kernel_fun)
# Create sketch
covtype.create_sketch(method='random', accuracy=eps, threshold=tau,\
num_of_means=1)
The hashing scheme with label 0 corresponds to the setting of parameters that theory suggests. The schemes with labels 1 and 2 correspond to hashing schemes where the power (the number of hash functions concatenated to produce a single sample), and hence the evaluation time, is reduced by a factor of 5 and 10, respectively.
# Set parameters for hashing through eLSH
hash_probs = []
R = np.log(1/eps/tau) # effective diameter for exponential kernel
# For illustration we apply our method with three hashing schemes,
# but one can extend this to an arbitrary number of hashing schemes
# Hashing scheme #0
kappa_0 = int(np.ceil(np.sqrt(2*np.pi)*R*np.log(1/tau))) # "power"
w0 = np.sqrt(2 / np.pi) * 2 * kappa_0 # set hash bucket width
p0 = partial(eLSH_prob, w=w0*sigma, k=kappa_0) # collision probability
hash_probs.append(p0) # add to list
# Hashing scheme #1
kappa_1 = kappa_0 // 5 # "power" reduced by a factor of 5
w1 = np.sqrt(2 / np.pi) * 2 * kappa_1 # set hash bucket width
p1 = partial(eLSH_prob, w=w1*sigma, k=kappa_1) # collision probability
hash_probs.append(p1) # add to list
# Hashing scheme #2
kappa_2 = kappa_0 // 10 # "power" reduced by a factor of 10
w2 = np.sqrt(2 / np.pi) * 2 * kappa_2 # set hash bucket width
p2 = partial(eLSH_prob, w=w2*sigma, k=kappa_2) # collision probability
hash_probs.append(p2) # add to list
# Run diagnostic and visualization
covtype.diagnostic(hash_probs, acc=eps, threshold=tau, \
num_queries=100, visualization_flag=True)
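The collision-probability function eLSH_prob is provided by demo/rehashing.py. For intuition, below is a minimal sketch of the standard Euclidean (p-stable) LSH collision probability of Datar et al., raised to the k-th power to model k concatenated hash functions; the signature (distance as the first argument, keyword arguments w and k) mirrors the partial(...) calls above, but it is an assumption rather than the demo's exact implementation.
import numpy as np
from scipy.stats import norm

def elsh_prob_sketch(c, w, k):
    # Probability that two points at Euclidean distance c fall in the same
    # bucket for all k concatenated hash functions with bucket width w
    # (assumed to match what eLSH_prob computes in demo/rehashing.py).
    if c <= 0:
        return 1.0
    r = w / c
    p_single = 1 - 2 * norm.cdf(-r) - (2 / (np.sqrt(2 * np.pi) * r)) * (1 - np.exp(-r**2 / 2))
    return p_single ** k
For example, elsh_prob_sketch(sigma, w=w0*sigma, k=kappa_0) would give the probability that two points at distance sigma collide under hashing scheme #0.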
In this setting, we observe that the hashing scheme with label 1 has lower estimated relative variance than scheme 0 and is also faster to evaluate. All hashing schemes have lower estimated relative variance (sample complexity) than uniform random sampling.
We can also use the diagnostic procedure to produce a visualization of the local query structure of the dataset. For more information see the Visualizations page.
After running the diagnostic procedure, we have identified a good setting of parameters for the hashing scheme, which we can use to create a config file for the efficient C++ implementation.