Add TokenGT model #9834

Open
wants to merge 33 commits into
base: master

Conversation

michailmelonas

@michailmelonas michailmelonas commented Dec 9, 2024

PyG implementation of the Tokenized Graph Transformer following "Pure Transformers are Powerful Graph Learners" (https://arxiv.org/pdf/2207.02505). Includes support for both Laplacian eigenvectors and ORF node identifiers (implemented via a simple data Transform object). A graph regression example is also included.

For a detailed blog post about the implementation, see https://medium.com/stanford-cs224w/pyg-implementation-tokengt-e4aa74dc867b.
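For readers new to node identifiers, here is a minimal, dependency-free sketch of the orthogonal-random-features idea. It is not the Transform added in this PR: the function names are hypothetical, and it uses plain Gram-Schmidt on random Gaussian vectors instead of any PyG or PyTorch call. Following the paper's description, a node token carries its own identifier twice and an edge token carries the identifiers of its two endpoints.

```python
import math
import random


def orthonormal_node_identifiers(num_nodes, seed=0):
    """Return num_nodes mutually orthonormal vectors of length num_nodes.

    Hypothetical helper (not the PR's Transform): orthonormalizes random
    Gaussian vectors with modified Gram-Schmidt.
    """
    rng = random.Random(seed)
    basis = []
    while len(basis) < num_nodes:
        vec = [rng.gauss(0.0, 1.0) for _ in range(num_nodes)]
        # Remove the components along every vector accepted so far.
        for prev in basis:
            dot = sum(a * b for a, b in zip(vec, prev))
            vec = [a - dot * b for a, b in zip(vec, prev)]
        norm = math.sqrt(sum(a * a for a in vec))
        if norm > 1e-8:  # Skip the (unlikely) degenerate draw.
            basis.append([a / norm for a in vec])
    return basis


def node_token_identifier(ids, v):
    """Identifier part of a node token: the node's vector, repeated."""
    return ids[v] + ids[v]


def edge_token_identifier(ids, u, v):
    """Identifier part of an edge token: the vectors of both endpoints."""
    return ids[u] + ids[v]
```

Because the vectors are orthonormal, the dot product of two identifiers is 1 for the same node and 0 otherwise, which is what lets attention recover which tokens share a node.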

@michailmelonas
Author

@wsad1 @EdisonLeeeee @akihironitta any thoughts on when this contribution will get reviewed? :)

@puririshi98
Contributor

@michailmelonas this is cool, I'll review and help merge as soon as my time allows.

@puririshi98 puririshi98 self-requested a review January 14, 2025 20:56
@puririshi98
Contributor

this looks good, will do a deep review soon

@puririshi98
Contributor

puririshi98 commented Jan 15, 2025

This is good at a high level. However, I want to see how it compares to existing work. Can you please update this example:
https://github.com/pyg-team/pytorch_geometric/blob/master/examples/ogbn_train.py#L31
to have a "--gnn-choice" argparse option with choices ["sage", "gat", "tokengt_graph_transformer"], and run all three in your environment to see how they compare? Please make the model with the highest test accuracy the default. I can review a little closer once that initial test is done.
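A minimal sketch of the requested option, using only the standard-library `argparse` module. The helper names and the default value are assumptions for illustration; the actual argument layout of `ogbn_train.py` is not shown in this thread, and the default should be whichever model scores best.

```python
import argparse

GNN_CHOICES = ["sage", "gat", "tokengt_graph_transformer"]


def build_parser():
    """Hypothetical parser; not copied from ogbn_train.py."""
    parser = argparse.ArgumentParser(description="Model selection sketch")
    parser.add_argument(
        "--gnn-choice",
        type=str,
        default="sage",  # Placeholder: set to the best-performing model.
        choices=GNN_CHOICES,
        help="Which model to train.",
    )
    return parser


def selected_model(argv=None):
    args = build_parser().parse_args(argv)
    # argparse exposes "--gnn-choice" as the attribute "gnn_choice".
    return args.gnn_choice
```

For example, `selected_model(["--gnn-choice", "gat"])` returns `"gat"`, and a value outside `GNN_CHOICES` makes `argparse` exit with a usage error.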

@michailmelonas
Author

@puririshi98 sure thing, will do asap.

@puririshi98
Contributor

@michailmelonas lmk when ready for further review

@michailmelonas
Author

@puririshi98 apologies for only getting back to you now - have been swamped at work.

TokenGT requires specifying n_nodes orthogonal vectors ("node identifiers"). This is infeasible for the ogbn-papers100M dataset which has over 100M nodes. Therefore, rather than amending ogbn_train.py, I instead added token_gt_ogbn.py: a script that makes it easy to benchmark TokenGT against GCN on the ogbg-molhiv dataset (ideally, I'd like to run the model on PCQM4Mv2 as in the paper, but given my computational resources this was the best I could do). Running said script, I get slightly worse (but comparable) results for TokenGT vs GCN: the former has a validation ROC-AUC of 0.774 and the latter has 0.819.
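A rough back-of-the-envelope sketch of why the node identifiers do not scale to that dataset. It assumes the identifiers are stored densely as float32 and uses the fact that n mutually orthogonal nonzero vectors need at least n dimensions each; the 100,000,000 figure is only the "over 100M" lower bound from the comment above, and the 30-node graph is an illustrative small molecule, not a dataset statistic.

```python
def dense_identifier_bytes(num_nodes, bytes_per_value=4):
    """Bytes for num_nodes identifiers of length num_nodes (float32 by default)."""
    return num_nodes * num_nodes * bytes_per_value


# Lower bound for a single graph with "over 100M nodes".
large_graph = dense_identifier_bytes(100_000_000)
print(large_graph / 1e15, "PB")  # 40.0 PB

# Illustrative small molecular graph (hypothetical size).
small_graph = dense_identifier_bytes(30)
print(small_graph, "bytes")  # 3600 bytes
```

The quadratic growth is why a graph-level benchmark with many small graphs is a more practical fit than one very large graph.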

@puririshi98
Contributor

puririshi98 commented Jan 28, 2025

I think, as a sanity check to get this merged, you should make an example that uses some open-source dataset (check RelBench or OGB) to show higher accuracy than GCN and SAGE (with an argparse option to choose between the three, defaulting to your graph transformer). It will be a good research experience for you.

@michailmelonas
Author

Okay, will do. Will most likely only get to this next week. Apologies that this is dragging.


codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.39%. Comparing base (aa6cf80) to head (e5db369).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9834      +/-   ##
==========================================
- Coverage   86.79%   86.39%   -0.41%     
==========================================
  Files         490      492       +2     
  Lines       32436    32594     +158     
==========================================
+ Hits        28154    28159       +5     
- Misses       4282     4435     +153     


@puririshi98
Contributor

> Okay, will do. Will most likely only get to this next week. Apologies that this is dragging.

no problem, looking forward to seeing what you can do :)

@puririshi98
Contributor

Checking in @michailmelonas, how's it going?

@puririshi98
Contributor

puririshi98 commented Mar 1, 2025

c2bbb41

please ensure you follow this

@michailmelonas
Author

@puririshi98 really sorry for the late response - between coursework (https://web.stanford.edu/class/cs234/) and working full time I've not had a chance to get to this. I think the PCQM4Mv2 dataset (https://ogb.stanford.edu/docs/lsc/pcqm4mv2/) would be best to benchmark TokenGT against GCN/GraphSAGE (this is what was used in the original paper). I've obtained some GPU credits to run this experiment. Realistically, I won't be able to start with this in the next two weeks, but will get to it asap post 18/03.


4 participants