Code fragments from developer forums often migrate into applications through code reuse. Because such programs are incomplete, analyzing them early to determine whether they contain potential vulnerabilities is challenging. In this work, we introduce NeuralPDA, a neural network-based program dependence analysis tool for both complete and partial programs. Our tool efficiently incorporates intra-statement and inter-statement contextual features into statement representations, thereby modeling program dependence analysis as a statement-pair dependence decoding task. In the empirical evaluation, we report that NeuralPDA predicts the CFG and PDG edges in complete Java and C/C++ code with combined F-scores of 94.29% and 92.46%, respectively. The F-scores for partial Java and C/C++ code range from 94.29% to 97.17% and from 92.46% to 96.01%, respectively. We also test the usefulness of the PDGs predicted by NeuralPDA (i.e., PDG*) on the downstream task of method-level vulnerability detection. We find that the performance of the vulnerability detection tool utilizing PDG* is only 1.1% lower than that of the same tool utilizing the PDGs generated by a program analysis tool. We also report the detection of 14 real-world vulnerable code snippets from StackOverflow by a machine learning-based vulnerability detection tool that employs the PDGs predicted by NeuralPDA for these code snippets.
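To make the statement-pair framing concrete, the sketch below shows one way pairwise scores could be turned into dependence edges and compared against reference edges with an F-score. It is an illustration only, not NeuralPDA's implementation: the function names `decode_edges` and `f_score`, the 0.5 threshold, the exclusion of self-pairs, and the score values are all assumptions made for this example.

```python
from itertools import permutations


def decode_edges(scores, threshold=0.5):
    """Return the (src, dst) statement-index pairs whose score exceeds the threshold.

    `scores[i][j]` is a hypothetical probability that statement j depends on
    statement i. Self-pairs (i == j) are skipped as a simplifying assumption.
    """
    n = len(scores)
    return {(i, j) for i, j in permutations(range(n), 2) if scores[i][j] > threshold}


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over two sets of edges."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up scores for a three-statement snippet (illustrative values only).
scores = [
    [0.0, 0.9, 0.2],
    [0.1, 0.0, 0.8],
    [0.7, 0.3, 0.0],
]
predicted = decode_edges(scores)   # edges kept after thresholding
gold = {(0, 1), (1, 2)}            # hypothetical reference edges
print(sorted(predicted), round(f_score(predicted, gold), 2))
```

In this toy case one of the three predicted edges is spurious, so precision is 2/3, recall is 1, and the F-score is 0.8.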
Here are the links for datasets used in this paper:
- Java dataset for intrinsic evaluation: link
- C/C++ dataset for intrinsic evaluation: link
- Java dataset for method-level vulnerability detection: link
Here are the links for pre-trained RobertaTokenizer objects for Java and C/C++: link
Pre-trained NeuralPDA model weights (w/o statement types): link
$ python run.py [options]
Options:
-h, --help show this help message and exit
--data_dir DATA_DIR Path to datasets directory.
--output_dir OUTPUT_DIR
The output directory where the model saves model checkpoints
--lang {c,java} Programming language.
--expt_name EXPT_NAME
Name of experiment to log in Weights and Biases.
--max_tokens MAX_TOKENS
Maximum number of tokens in a statement
--max_stmts MAX_STMTS
Maximum number of statements
--num_layers NUM_LAYERS
Number of layers for Transformer encoder
--num_layers_stmt NUM_LAYERS_STMT
Number of layers for Self-Attention Network
--forward_activation FORWARD_ACTIVATION
Non-linear activation function in encoder
--hidden_size HIDDEN_SIZE
Hidden size of decoding MLP
--intermediate_size INTERMEDIATE_SIZE
Dimensionality of feed-forward layer in Transformer
--embedding_size EMBEDDING_SIZE
Dimensionality of encoder layers
--num_heads NUM_HEADS
Number of attention heads
--vocab_size VOCAB_SIZE
Vocabulary size
--use_stmt_types Use statement type information.
--no_ssan Do not use self-attention network for statement
encoding.
--no_pe Do not use statement-level position encoding.
--no_tr Do not use transformer encoder.
--load_model_path LOAD_MODEL_PATH
Path to trained model: Should contain the .bin files
--do_train Whether to run training.
--do_eval Whether to run eval on the dev set.
--do_eval_top_k Whether to run eval on the partitioned dev set.
--do_predict Whether to predict on given dataset.
--log_interval LOG_INTERVAL
--train_batch_size TRAIN_BATCH_SIZE
Batch size per GPU/CPU for training.
--eval_batch_size EVAL_BATCH_SIZE
Batch size per GPU/CPU for evaluation.
--learning_rate LEARNING_RATE
The initial learning rate for Adam.
--weight_decay WEIGHT_DECAY
Weight decay if we apply some.
--dropout_rate DROPOUT_RATE
Dropout rate.
--adam_epsilon ADAM_EPSILON
Epsilon for Adam optimizer.
--num_train_epochs NUM_TRAIN_EPOCHS
Total number of training epochs to perform.
--seed SEED random seed for initialization
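For readers who want to script against this interface, the following is a minimal sketch of how a handful of the options listed above could be declared with Python's standard `argparse` module. It is not the repository's `run.py`: the `build_parser` helper is hypothetical, the argument types are assumptions inferred from the help text, and no default values are set because the help output does not show them.

```python
import argparse


def build_parser():
    """Declare a small subset of the options shown in the help text above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("--data_dir", type=str, help="Path to datasets directory.")
    parser.add_argument("--output_dir", type=str,
                        help="The output directory where the model saves model checkpoints")
    parser.add_argument("--lang", choices=["c", "java"], help="Programming language.")
    parser.add_argument("--expt_name", type=str,
                        help="Name of experiment to log in Weights and Biases.")
    # The integer type is an assumption; the help text only says "number of statements".
    parser.add_argument("--max_stmts", type=int, help="Maximum number of statements")
    parser.add_argument("--load_model_path", type=str,
                        help="Path to trained model: Should contain the .bin files")
    # Flags without a metavar in the help text are modeled as boolean switches.
    parser.add_argument("--use_stmt_types", action="store_true",
                        help="Use statement type information.")
    parser.add_argument("--do_train", action="store_true", help="Whether to run training.")
    parser.add_argument("--do_eval", action="store_true",
                        help="Whether to run eval on the dev set.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--data_dir", "./datasets/",
    "--output_dir", "./outputs/intrinsic/java_8",
    "--lang", "java",
    "--do_train",
    "--use_stmt_types",
    "--max_stmts", "8",
    "--expt_name", "intrinsic-java-8",
])
print(args.lang, args.max_stmts, args.do_train, args.do_eval)
```

Passing the argument list explicitly keeps the example independent of the command line; a real entry point would normally call `parse_args()` with no arguments.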
- Training
$ python run.py --data_dir ./datasets/ --output_dir ./outputs/intrinsic/java_8 --lang java --do_train --use_stmt_types --max_stmts 8 --expt_name intrinsic-java-8
- Inference
$ python run.py --data_dir ./datasets/ --output_dir ./no_output --lang java --do_eval --use_stmt_types --max_stmts 8 --load_model_path ./outputs/intrinsic/java_8/Epoch_4/model.ckpt
- Make Predictions
$ python run.py --lang java --do_predict --use_stmt_types --max_stmts 8 --load_model_path ./outputs/intrinsic/java_8/Epoch_4/model.ckpt
$ python infer.py --lang java -i <path-to-input(s)> -o {json|html}