This project re-implements the VarNaming task model described in the paper Learning to Represent Programs with Graphs, which predicts the name of a variable based on its usage.
The project also supports applying the VarNaming model to the MethodNaming task (predicting the name of a method from its usage or its definition).
If you use the provided implementation in your research, please cite the Learning to Represent Programs with Graphs paper, and include a link to this repository as a footnote.
Ensure you have the following packages installed (these can all be installed with pip3):
- numpy == 1.16.3
- pyYAML == 5.1
- tensorflow-gpu (or tensorflow) == 1.13.1
- dpu_utils == 0.1.28
- protobuf == 3.7.1
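As a quick sanity check, the pinned versions above can be compared against what is actually installed. The following is a minimal sketch and not part of the repository: the names `PINNED`, `installed_version` and `find_mismatches` are hypothetical, and the package names are assumed to match the names used on PyPI.

```python
# Pinned versions copied from the list above.
PINNED = {
    "numpy": "1.16.3",
    "PyYAML": "5.1",
    "tensorflow-gpu": "1.13.1",  # use "tensorflow" instead for the CPU build
    "dpu_utils": "0.1.28",
    "protobuf": "3.7.1",
}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        metadata = None
    if metadata is not None:
        try:
            return metadata.version(package)
        except metadata.PackageNotFoundError:
            return None
    import pkg_resources  # fallback for older interpreters (needs setuptools)
    try:
        return pkg_resources.get_distribution(package).version
    except pkg_resources.DistributionNotFound:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print("{}: expected {}, found {}".format(package, expected, found))
```

A package that is not installed is reported with `None` as the found version.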
The corpus pre-processing functions are designed to work with .proto graph files, which can be extracted from program source code using the feature extractor available here.
Once you have obtained a corpus of .proto graph files, use the corpus_extractor.py script located in the data_processing folder to split it into training, validation and test datasets:
- Create empty directories for training, validation and test datasets
- Specify their paths, as well as the corpus path, in the config.yml file:
corpus_path: "path-to-corpus"
train_path: "path-to-train-data-output"
val_path: "path-to-val-data-output"
test_path: "path-to-test-data-output"
- Navigate into the repository directory
- Run corpus_extractor.py:
python3 ./data_processing/corpus_extractor.py
This will extract all samples from the corpus, randomly shuffle them, split them into train/val/test partitions, and copy these partitions into the specified train, val and test folders.
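The shuffle-and-split step described above can be illustrated with a short sketch. This is not the code in corpus_extractor.py: the function name `split_samples`, the default fractions and the fixed seed are assumptions chosen for illustration, and the real script may use different ratios and operate on extracted samples rather than on file paths.

```python
import random


def split_samples(sample_paths, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle sample paths and split them into train/val/test lists.

    The fractions and the seed are illustrative assumptions, not values
    taken from the repository.
    """
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(sample_paths)          # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split

    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)

    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Each returned list could then be copied into its own directory, for example with `shutil.copy`, so that the three partitions never share a sample.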
In order to train the model:
- Prepare training and validation dataset directories, as described in the Dataset Parsing section above
- Specify their paths in the config.yml file:
train_path: "path-to-train-data"
val_path: "path-to-val-data"
- Specify the token file path (where the extracted token vocabulary will be saved) and the checkpoint folder path (where the model checkpoint will be saved) in the config.yml file (note the fixed specification of the 'train.ckpt' file):
checkpoint_path: "path-to-checkpoint-folder/train.ckpt"
token_path: "path-to-vocabulary-txt-file"
- Navigate into the repository directory
- Run train.py:
python3 ./train.py
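Before starting a long training run, it can help to verify that the configuration is complete. The sketch below is a hypothetical helper, not part of the repository: it assumes config.yml parses into a flat dictionary with the four keys shown above, and the names `load_config` and `check_train_config` are made up for this example.

```python
import os

# Keys that the training steps above ask you to set in config.yml.
TRAIN_KEYS = ("train_path", "val_path", "checkpoint_path", "token_path")


def load_config(path="config.yml"):
    """Parse the YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so the checks below work without PyYAML
    with open(path) as handle:
        return yaml.safe_load(handle)


def check_train_config(config):
    """Return a list of human-readable problems found in a training config."""
    problems = []
    for key in TRAIN_KEYS:
        if not config.get(key):
            problems.append("missing or empty key: " + key)
    for key in ("train_path", "val_path"):
        value = config.get(key)
        if value and not os.path.isdir(value):
            problems.append("{} is not an existing directory: {}".format(key, value))
    checkpoint = config.get("checkpoint_path")
    if checkpoint and os.path.basename(checkpoint) != "train.ckpt":
        problems.append("checkpoint_path should end in train.ckpt")
    return problems
```

Calling `check_train_config(load_config())` from the repository directory returns an empty list when all four entries look plausible.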
In order to use the model for inference:
- Prepare the test dataset directory as described in the Dataset Parsing section above
- Specify its path in the config.yml file:
test_path: "path-to-test-data"
- Specify the token file path (where the extracted token vocabulary will be loaded from) and the checkpoint path (where the trained model will be loaded from) in the config.yml file:
checkpoint_path: "path-to-checkpoint-folder/train.ckpt"
token_path: "path-to-vocabulary-txt-file"
- Navigate into the repository directory
- Run infer.py:
python3 ./infer.py
In order to use the model for inference, as well as for computing extra sample information (including variable usage information and type information):
- Prepare the test dataset directory as described in the Dataset Parsing section above
- Specify its path in the config.yml file:
test_path: "path-to-test-data"
- Specify the token file path (where the extracted token vocabulary will be loaded from) and the checkpoint path (where the trained model will be loaded from) in the config.yml file:
checkpoint_path: "path-to-checkpoint-folder/train.ckpt"
token_path: "path-to-vocabulary-txt-file"
- Navigate into the repository directory
- Run detailed_infer.py:
python3 ./detailed_infer.py
The type of task you want the model to run can be specified by passing appropriate input arguments as follows:
- To run training/inference using the VarNaming task (computing variable usage information) no input arguments are required
- To run training/inference using the MethodNaming usage task (computing method usage information) add the string "mth_usage" as an input argument when calling the scripts
- To run training/inference using the MethodNaming definition task (computing method body information) add the string "mth_def" as an input argument when calling the scripts
For example, in order to train the model for the MethodNaming task using definition information, the script call will be the following:
python3 ./train.py mth_def
Similarly, for running inference using the MethodNaming usage task, the script call will be the following:
python3 ./infer.py mth_usage
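The mapping from the optional argument to the task can be sketched as follows. This is an illustration only and does not reproduce the argument parsing in the utils folder: the function name `select_task` and the returned labels are hypothetical, while the argument strings are the ones listed above.

```python
# Argument strings come from the list above; the labels are illustrative.
TASKS = {
    None: "VarNaming",
    "mth_usage": "MethodNaming (usage)",
    "mth_def": "MethodNaming (definition)",
}


def select_task(argv):
    """Return the task label for a command line such as ['train.py', 'mth_def']."""
    if len(argv) > 2:
        raise ValueError("expected at most one task argument")
    argument = argv[1] if len(argv) == 2 else None
    if argument not in TASKS:
        raise ValueError("unknown task argument: {!r}".format(argument))
    return TASKS[argument]
```

Passing `sys.argv` to `select_task` reproduces the behaviour described above: no argument selects VarNaming, and an unrecognised string is rejected instead of silently falling back to the default.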
The saved_models directory includes pre-trained models, which can be used to run inference directly, without any training. The paths to the saved checkpoint and vocabulary files need to be specified in the config.yml file in the usual way, as described in the "Inference" section above.
- data_processing: includes code for processing graph samples and corpus files
- model: includes the implementation of the VarNaming model
- saved_models: pre-trained models for the VarNaming and MethodNaming tasks
- utils: auxiliary code implementing various functionality, such as input argument parsing and vocabulary extraction
- train.py, infer.py, detailed_infer.py: files for running training and inference using the model, as described in the previous sections
- config.yml: configuration file storing the corpus, dataset, checkpoint and vocabulary paths as string properties
- graph_pb2.py: used for parsing .proto sample files