update node2vec #58

Open
wants to merge 36 commits into
base: master
e36134c
full_graph_link_predictor
YueZhong-bio Jun 9, 2020
1db25a1
Merge pull request #1 from seqRep/full_graph_link_predictor
YueZhong-bio Jun 9, 2020
0b0120d
Merge branch 'master' into master
mufeili Jun 11, 2020
bfda28a
Create gcn_link_predictor.py
YueZhong-bio Jun 16, 2020
ced4414
Create sage_link_predictor.py
YueZhong-bio Jun 16, 2020
2a5c8ad
Create test_link_prediction.py
YueZhong-bio Jun 16, 2020
8ebcabd
Create full_graph_link_predictor.py
YueZhong-bio Jun 16, 2020
da93c7e
Create logger.py
YueZhong-bio Jun 16, 2020
6bb813c
Add files via upload
YueZhong-bio Jun 16, 2020
032f875
Add files via upload
YueZhong-bio Jun 16, 2020
b06fb19
Update README.md
YueZhong-bio Jun 16, 2020
7930972
Update README.md
YueZhong-bio Jun 16, 2020
1d310df
Update README.md
YueZhong-bio Jun 16, 2020
3bec551
Delete full_graph_link_predictor.py
YueZhong-bio Jun 16, 2020
9099236
Merge branch 'master' into link_predictor_zy
YueZhong-bio Jun 16, 2020
28d292f
Merge branch 'master' into link_predictor_zy
mufeili Jun 18, 2020
f949043
Merge branch 'master' into link_predictor_zy
mufeili Jun 22, 2020
d005beb
Update
mufeili Jun 22, 2020
b86f0cf
Update
mufeili Jun 22, 2020
c3c0438
Fix
mufeili Jun 23, 2020
2a744ac
Merge pull request #3 from YueZhong-bio/lmf
YueZhong-bio Jun 23, 2020
a2bde0f
Update (#5)
mufeili Jun 23, 2020
5435ffd
Fix (#6)
mufeili Jun 23, 2020
958d5b3
Try CI (#7)
mufeili Jun 24, 2020
fa27bf3
Merge branch 'master' into link_predictor_zy
mufeili Jun 24, 2020
b490b12
CI (#8)
mufeili Jun 24, 2020
03eb4b1
Update full_graph_link_predictor.py
YueZhong-bio Jul 25, 2020
53459ad
Update README.md
YueZhong-bio Jul 25, 2020
a518ea2
Update full_graph_link_predictor.py
YueZhong-bio Jul 25, 2020
0a0fbf0
Update full_graph_link_predictor.py
YueZhong-bio Jul 25, 2020
40750b0
Merge branch 'master' into link_predictor_zy
mufeili Jul 26, 2020
d9289f7
Merge branch 'master' into link_predictor_zy
YueZhong-bio Aug 13, 2020
ac3c394
Update README.md
YueZhong-bio Aug 13, 2020
cf92a7e
Update README.md
YueZhong-bio Aug 13, 2020
d390185
Add files via upload
YueZhong-bio Aug 13, 2020
cb31eb6
Update README.md
YueZhong-bio Aug 13, 2020
1 change: 0 additions & 1 deletion examples/README.md
@@ -16,7 +16,6 @@ We provide various examples across 3 applications -- property prediction, genera
- [PubChem Aromaticity with DGL](../python/dgllife/data/pubchem_aromaticity.py)
- OGB [[paper]](https://arxiv.org/abs/2005.00687)
- [ogbl-ppa](link_prediction/ogbl-ppa)
- [ogbg-ppa](property_prediction/ogbg_ppa)
- AstraZeneca Experimental Solubility from ChEMBL [[record]](https://www.ebi.ac.uk/chembl/document_report_card/CHEMBL3301361/)
- [Dataset](../python/dgllife/data/astrazeneca_chembl_solubility.py)

39 changes: 30 additions & 9 deletions examples/link_prediction/ogbl-ppa/README.md
@@ -27,6 +27,7 @@ The optional arguments are as follows:
```
--use_gpu, use gpu for computation
--use_sage, use GraphSAGE rather than GCN
--use_node_embedding, use pre-computed node2vec embeddings (loaded from embedding.pt) as node features
--num_layers, number of GNN layers to use as well as linear layers for final link prediction (default=3)
--hidden_feats, size for hidden representations (default=256)
--dropout, (default=0.0)
@@ -37,22 +38,42 @@ The optional arguments are as follows:
--runs, number of random experiments to perform (default=1)
```


Full-batch GCN training based on Node2Vec features.
To generate Node2Vec features, please run ```python node2vec.py```. This script requires node embeddings be saved in ```embedding.pt```.
Contributor review comment: No "be"


The optional arguments are as follows:

```
--embedding_dim, the size of each embedding vector (default=128)
--walk_length, the walk length (default=40)
--context_size, the actual context size which is considered for positive samples (default=20)
--walks_per_node, the number of walks to sample for each node (default=10)
--batch_size, batch size to use for sampling (default=256)
--lr, learning rate (default=0.01)
--epochs, number of epochs for training (default=2)
--log_steps, number of steps between log messages (default=1)
```
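
To make the interplay of `--walk_length` and `--context_size` concrete, here is a minimal pure-Python sketch of how one sampled walk is cut into overlapping context windows, mirroring the slicing loop in `pos_sample` of `node2vec.py`. The helper name `context_windows` and the toy walk are illustrative assumptions, not part of this PR.

```python
def context_windows(walk, context_size):
    """Split one walk (a list of node IDs) into overlapping windows.

    Hypothetical helper for illustration: the first node of each window acts
    as the source and the remaining nodes as its positive context.
    """
    if context_size > len(walk):
        raise ValueError("context_size must not exceed the walk length")
    num_windows = len(walk) - context_size + 1
    return [walk[j:j + context_size] for j in range(num_windows)]


# A toy walk of 5 nodes with a context size of 3 yields 3 windows.
toy_walk = [7, 2, 9, 4, 1]
windows = context_windows(toy_walk, context_size=3)
print(windows)  # [[7, 2, 9], [2, 9, 4], [9, 4, 1]]

# With the defaults listed above (walk_length=40, context_size=20),
# each walk contributes 40 - 20 + 1 = 21 windows.
print(len(context_windows(list(range(40)), context_size=20)))  # 21
```

A smaller `context_size` relative to `walk_length` therefore produces more training windows per sampled walk, at the cost of a shorter context per source node.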


## Performance

For model evaluation, we consider hits@100 -- ranking each true link against 3,000,000 randomly-sampled
negative edges, and counting the ratio of positive edges that are ranked at 100-th place or above.
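
As a rough illustration of the metric just described, the sketch below computes hits@K on plain Python lists. It assumes that a positive edge counts as a hit only when its score is strictly greater than the K-th highest negative score; the function name `hits_at_k` and the toy scores are hypothetical, and this is not the OGB evaluator itself.

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges ranked within the top k against the negatives.

    Illustrative sketch only. Assumption: ties with the k-th best negative
    score do not count as hits.
    """
    if len(neg_scores) < k:
        # Fewer than k negatives: every positive is within the top k.
        return 1.0
    kth_best_negative = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for score in pos_scores if score > kth_best_negative)
    return hits / len(pos_scores)


pos = [0.9, 0.5, 0.1]       # scores of true links (toy values)
neg = [0.8, 0.6, 0.4, 0.2]  # scores of sampled negative edges (toy values)

print(hits_at_k(pos, neg, k=2))  # only 0.9 beats the 2nd-best negative (0.6)
print(hits_at_k(pos, neg, k=3))  # 0.9 and 0.5 beat the 3rd-best negative (0.4)
```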

Using the default parameters, the performance of 10 random runs is as follows.

| Method | Train hits@100 | Validation hits@100 | Test hits@100 |
| --------- | -------------- | ------------------- | ------------- |
| GCN | 12.87 ± 5.07 | 12.39 ± 4.85 | 11.65 ± 4.56 |
| GraphSAGE | 9.58 ± 0.99 | 9.44 ± 0.96 | 9.86 ± 1.21 |

| Method | Average Time (hour) / epoch |
| --------- | --------------------------- |
| GCN | 1.38 |
| GraphSAGE | 1.47 |
| Method       | Train hits@100 | Validation hits@100 | Test hits@100 |
| ------------ | -------------- | ------------------- | ------------- |
| GCN          | 23.95 ± 2.80   | 22.60 ± 2.59        | 21.30 ± 3.41  |
| GraphSAGE    | 13.88 ± 1.73   | 13.06 ± 1.51        | 11.90 ± 1.34  |
| Node2vec+GCN | 27.98 ± 2.63   | 26.45 ± 2.49        | 25.81 ± 2.58  |

| Method       | Average Time (hour) / epoch |
| ------------ | --------------------------- |
| GCN          | 1.25                        |
| GraphSAGE    | 1.28                        |
| Node2vec+GCN | 1.29                        |

## References

12 changes: 11 additions & 1 deletion examples/link_prediction/ogbl-ppa/full_graph_link_predictor.py
@@ -39,6 +39,9 @@ def train(model, predictor, g, x, splitted_edge, optimizer, batch_size):
loss = pos_loss + neg_loss
optimizer.zero_grad()
loss.backward()
# Clip the gradient norm of the GNN and of the predictor to 1.0, each separately
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
torch.nn.utils.clip_grad_norm_(predictor.parameters(), 1.0)
optimizer.step()

num_samples = pos_out.size(0)
@@ -129,6 +132,8 @@ def main():
help='Print training progress every {log_steps} epochs (default: 1)')
parser.add_argument('--use_sage', action='store_true',
help='Use GraphSAGE rather than GCN (default: False)')
parser.add_argument('--use_node_embedding', action='store_true',
                    help='Use node2vec embeddings loaded from embedding.pt as node features (default: False)')
parser.add_argument('--num_layers', type=int, default=3,
help='Number of GNN layers to use as well as '
'linear layers to use for final link prediction (default: 3)')
@@ -160,7 +165,12 @@ def main():
data.readonly(False)
data.add_edges(data.nodes(), data.nodes())
splitted_edge = dataset.get_edge_split()
x = data.ndata['feat'].float().to(device)

if args.use_node_embedding:
    # Use the embeddings pre-computed by node2vec.py instead of the raw node features
    x = torch.load('embedding.pt')
    x = x.to(device)
else:
    x = data.ndata['feat'].float().to(device)

if args.use_sage:
    model = GraphSAGE(in_feats=x.size(-1),
Expand Down
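
The gradient clipping added to `train` in `full_graph_link_predictor.py` rescales gradients whose overall L2 norm exceeds a threshold. The pure-Python sketch below shows the idea on plain lists; `clip_by_global_norm` and its `eps` default are hypothetical names and values chosen for illustration, not the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale gradient vectors so that their joint L2 norm is at most max_norm.

    Hypothetical helper for illustration: `grads` is a list of lists of floats,
    one inner list per parameter. Returns the clipped gradients and the norm
    measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Never scale gradients up; only shrink them when the norm is too large.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Two parameters whose gradients have a joint norm of 5.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before)                                       # 5.0
print([[round(g, 3) for g in vec] for vec in clipped])   # [[0.6], [0.8]]

# Gradients that are already small enough are left unchanged.
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
print(small)  # [[0.3, 0.4]]
```

Because the PR calls `clip_grad_norm_` once for `model.parameters()` and once for `predictor.parameters()`, each parameter group is capped at a norm of 1.0 on its own rather than jointly.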
203 changes: 203 additions & 0 deletions examples/link_prediction/ogbl-ppa/node2vec.py
@@ -0,0 +1,203 @@
import argparse

import dgl
import torch
from ogb.linkproppred import DglLinkPropPredDataset
from sklearn.linear_model import LogisticRegression
from torch.nn import Embedding
from torch.utils.data import DataLoader
from torch_sparse import SparseTensor


def save_embedding(model):
    # Store the learned embedding matrix on CPU; full_graph_link_predictor.py
    # loads it when --use_node_embedding is set.
    torch.save(model.embedding.weight.data.cpu(), 'embedding.pt')


EPS = 1e-15


class Node2Vec(torch.nn.Module):
    r"""The Node2Vec model from the
    `"node2vec: Scalable Feature Learning for Networks"
    <https://arxiv.org/abs/1607.00653>`_ paper where random walks of
    length :obj:`walk_length` are sampled in a given graph, and node embeddings
    are learned via negative sampling optimization.

    Args:
        data (DGLGraph): The graph.
        edge_index (LongTensor): The edge indices.
        embedding_dim (int): The size of each embedding vector.
        walk_length (int): The walk length.
        context_size (int): The actual context size which is considered for
            positive samples. This parameter increases the effective sampling
            rate by reusing samples across different source nodes.
        walks_per_node (int, optional): The number of walks to sample for each
            node. (default: :obj:`1`)
        p (float, optional): Likelihood of immediately revisiting a node in the
            walk. Stored but not used by the uniform DGL walk sampler in
            :meth:`pos_sample`. (default: :obj:`1`)
        q (float, optional): Control parameter to interpolate between
            breadth-first strategy and depth-first strategy. Stored but not
            used by the uniform DGL walk sampler. (default: :obj:`1`)
        num_negative_samples (int, optional): The number of negative samples to
            use for each positive sample. (default: :obj:`1`)
        num_nodes (int, optional): The number of nodes. (default: :obj:`None`)
        sparse (bool, optional): If set to :obj:`True`, gradients w.r.t. the
            weight matrix will be sparse. (default: :obj:`False`)
    """
    def __init__(self, data, edge_index, embedding_dim, walk_length,
                 context_size, walks_per_node=1, p=1, q=1,
                 num_negative_samples=1, num_nodes=None, sparse=False):
        super(Node2Vec, self).__init__()

        self.data = data
        N = num_nodes
        row, col = edge_index
        self.adj = SparseTensor(row=row, col=col, sparse_sizes=(N, N))
        self.adj = self.adj.to('cpu')

        assert walk_length >= context_size

        self.embedding_dim = embedding_dim
        # A walk of `walk_length` nodes takes `walk_length - 1` steps.
        self.walk_length = walk_length - 1
        self.context_size = context_size
        self.walks_per_node = walks_per_node
        self.p = p
        self.q = q
        self.num_negative_samples = num_negative_samples

        self.embedding = Embedding(N, embedding_dim, sparse=sparse)

        self.reset_parameters()

    def reset_parameters(self):
        self.embedding.reset_parameters()

    def forward(self, batch=None):
        """Returns the embeddings for the nodes in :obj:`batch`."""
        emb = self.embedding.weight
        return emb if batch is None else emb[batch]

    def loader(self, **kwargs):
        return DataLoader(range(self.adj.sparse_size(0)),
                          collate_fn=self.sample, **kwargs)

    def pos_sample(self, batch):
        batch = batch.repeat(self.walks_per_node)
        seed = batch.long()
        # random_walk returns (traces, types); keep only the node-ID traces.
        rw = dgl.sampling.random_walk(dgl.graph(self.data.edges()), seed,
                                      length=self.walk_length)[0]

        # Slide a window of `context_size` nodes over every walk.
        walks = []
        num_walks_per_rw = 1 + self.walk_length + 1 - self.context_size
        for j in range(num_walks_per_rw):
            walks.append(rw[:, j:j + self.context_size])

        return torch.cat(walks, dim=0)

    def neg_sample(self, batch):
        batch = batch.repeat(self.walks_per_node * self.num_negative_samples)

        # Negative "walks" start at the batch nodes and continue with
        # uniformly random node IDs.
        rw = torch.randint(self.adj.sparse_size(0),
                           (batch.size(0), self.walk_length))
        rw = torch.cat([batch.view(-1, 1), rw], dim=-1)

        walks = []
        num_walks_per_rw = 1 + self.walk_length + 1 - self.context_size
        for j in range(num_walks_per_rw):
            walks.append(rw[:, j:j + self.context_size])
        return torch.cat(walks, dim=0)

    def sample(self, batch):
        if not isinstance(batch, torch.Tensor):
            batch = torch.tensor(batch)
        return self.pos_sample(batch), self.neg_sample(batch)

    def loss(self, pos_rw, neg_rw):
        r"""Computes the loss given positive and negative random walks."""

        # Positive loss.
        start, rest = pos_rw[:, 0], pos_rw[:, 1:].contiguous()

        h_start = self.embedding(start).view(pos_rw.size(0), 1,
                                             self.embedding_dim)
        h_rest = self.embedding(rest.view(-1)).view(pos_rw.size(0), -1,
                                                    self.embedding_dim)

        out = (h_start * h_rest).sum(dim=-1).view(-1)
        pos_loss = -torch.log(torch.sigmoid(out) + EPS).mean()

        # Negative loss.
        start, rest = neg_rw[:, 0], neg_rw[:, 1:].contiguous()

        h_start = self.embedding(start).view(neg_rw.size(0), 1,
                                             self.embedding_dim)
        h_rest = self.embedding(rest.view(-1)).view(neg_rw.size(0), -1,
                                                    self.embedding_dim)

        out = (h_start * h_rest).sum(dim=-1).view(-1)
        neg_loss = -torch.log(1 - torch.sigmoid(out) + EPS).mean()

        return pos_loss + neg_loss

    def test(self, train_z, train_y, test_z, test_y, solver='lbfgs',
             multi_class='auto', *args, **kwargs):
        r"""Evaluates latent space quality via a logistic regression downstream
        task."""
        clf = LogisticRegression(solver=solver, multi_class=multi_class, *args,
                                 **kwargs).fit(train_z.detach().cpu().numpy(),
                                               train_y.detach().cpu().numpy())
        return clf.score(test_z.detach().cpu().numpy(),
                         test_y.detach().cpu().numpy())

    def __repr__(self):
        return '{}({}, {})'.format(self.__class__.__name__,
                                   self.embedding.weight.size(0),
                                   self.embedding.weight.size(1))

def main():
    parser = argparse.ArgumentParser(description='OGBL-PPA (Node2Vec)')
    parser.add_argument('--device', type=int, default=0)
    parser.add_argument('--embedding_dim', type=int, default=128)
    parser.add_argument('--walk_length', type=int, default=40)
    parser.add_argument('--context_size', type=int, default=20)
    parser.add_argument('--walks_per_node', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=256)
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--log_steps', type=int, default=1)
    args = parser.parse_args()

    device = f'cuda:{args.device}' if torch.cuda.is_available() else 'cpu'
    device = torch.device(device)

    dataset = DglLinkPropPredDataset(name='ogbl-ppa')
    data = dataset[0]
    edge_index = torch.stack((data.edges()[0], data.edges()[1]), dim=0)

    model = Node2Vec(data, edge_index, args.embedding_dim, args.walk_length,
                     args.context_size, args.walks_per_node,
                     num_nodes=data.number_of_nodes(),
                     sparse=True).to(device)

    loader = model.loader(batch_size=args.batch_size, shuffle=True,
                          num_workers=4)
    optimizer = torch.optim.SparseAdam(model.parameters(), lr=args.lr)

    model.train()
    for epoch in range(1, args.epochs + 1):
        for i, (pos_rw, neg_rw) in enumerate(loader):

            optimizer.zero_grad()
            loss = model.loss(pos_rw.to(device), neg_rw.to(device))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            optimizer.step()

            if (i + 1) % args.log_steps == 0:
                print(f'Epoch: {epoch:02d}, Step: {i+1:03d}/{len(loader)}, '
                      f'Loss: {loss:.4f}')

            if (i + 1) % 100 == 0:  # Save model every 100 steps.
                save_embedding(model)
        # Also save at the end of every epoch.
        save_embedding(model)


if __name__ == "__main__":
    main()
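
The objective in `Node2Vec.loss` can be checked by hand on a tiny case. The sketch below recomputes it in pure Python for a single positive window and a single negative window; the toy embeddings and the helper names (`sigmoid`, `dot`, `window_loss`) are assumptions made for illustration and are not part of `node2vec.py`.

```python
import math

EPS = 1e-15  # same stabiliser as in node2vec.py


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def window_loss(emb, pos_walk, neg_walk):
    """Skip-gram loss with negative sampling for one positive and one negative window.

    The first node of each window is the source; the rest are its context.
    Positive pairs are pushed towards sigmoid(dot) = 1, negative pairs towards 0.
    """
    def mean_term(walk, positive):
        start, rest = walk[0], walk[1:]
        terms = []
        for node in rest:
            prob = sigmoid(dot(emb[start], emb[node]))
            if positive:
                terms.append(-math.log(prob + EPS))
            else:
                terms.append(-math.log(1.0 - prob + EPS))
        return sum(terms) / len(terms)

    return mean_term(pos_walk, True) + mean_term(neg_walk, False)


# Toy embeddings: nodes 0 and 1 point the same way, node 2 the opposite way.
emb = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [-1.0, 0.0]}

good = window_loss(emb, pos_walk=[0, 1], neg_walk=[0, 2])
bad = window_loss(emb, pos_walk=[0, 2], neg_walk=[0, 1])
print(good < bad)  # True
```

The loss is lower when co-occurring nodes have similar embeddings and randomly paired nodes have dissimilar ones, which is the property training tries to reach.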
3 changes: 3 additions & 0 deletions tests/model/test_gnn.py
@@ -353,6 +353,9 @@ def test_graphsage():
activation=[F.relu, F.relu],
dropout=[0.2, 0.2],
aggregator_type=['gcn', 'gcn']).to(device)
assert gnn(g, node_feats).shape == torch.Size([3, 1])
assert gnn(bg, batch_node_feats).shape == torch.Size([8, 1])

gnn.reset_parameters()
assert gnn(g, node_feats).shape == torch.Size([3, 1])
assert gnn(bg, batch_node_feats).shape == torch.Size([8, 1])