In this study, we explore the suitability of current state-of-the-art code summarization metrics for evaluating the similarity between two pieces of text: the reference and the generated summary. Traditionally, this evaluation is based on comparing the word tokens in the reference with those predicted by a code summarizer. However, we identify two scenarios where this assumption may fall short:
(i) The reference summary, often extracted from code comments in software repositories, may be of low quality due to outdated or less relevant information.
(ii) The generated summary might use different wording than the reference but still convey the same meaning, making it equally suitable for documenting the code snippet.
To address these limitations, we introduce a new dimension that assesses the extent to which the generated summary aligns with the semantics of the documented code snippet, independently of the reference summary. Building on an empirical evaluation of this aspect, we propose a new metric, SIDE (Summary alIgnment to coDe sEmantics), which leverages contrastive learning to better capture developers' assessment of the quality of automatically generated summaries.
To build SIDE, we relied on contrastive learning to learn an embedding space in which similar sample pairs (i.e., pairs sharing specific features) are clustered together, while dissimilar pairs are pushed apart. In our context, contrastive learning teaches the model to discriminate between textual summaries that are suitable for a given code snippet and those that are not.
In our study, we employ the triplet loss, which has been shown to encode positive/negative samples better than other contrastive losses. The triplet loss, proposed by Schroff et al., introduces the concept of an "anchor": given an anchor a, a positive sample p, and a negative sample n, the loss max(d(a, p) - d(a, n) + margin, 0) pulls the anchor closer to the positive than to the negative by at least the given margin, where d is a distance between embeddings. In our case, the anchor is the code to document, with a suitable summary representing the positive sample and an unsuitable summary representing the negative one.
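As an illustration, below is a minimal sketch of how such triplet-based training can be set up with the `sentence-transformers` library (the same library used by the evaluation script later in this document). The base checkpoint and the toy triplet are hypothetical placeholders, not the actual SIDE training configuration or data:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical toy triplet: (anchor = code, positive = suitable summary,
# negative = unsuitable summary); the real training data is linked below.
train_examples = [
    InputExample(texts=[
        "public int add(int a, int b) { return a + b; }",  # anchor
        "adds two integers and returns their sum",          # positive
        "closes the open database connection",              # negative
    ]),
]

# Placeholder base encoder, not necessarily the one used to build SIDE.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.TripletLoss(model=model)  # pulls anchors toward positives, away from negatives

# One illustrative epoch of triplet-loss fine-tuning.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```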
- The datasets used for training and evaluation are available here
- The model we trained to develop SIDE is publicly available at the following link: Models
- The scripts needed to conduct the analysis are available here
- Create a new virtual env using:
  ```
  python3 -m venv venv
  ```
- Activate the new env:
  ```
  source venv/bin/activate
  ```
- Install all the dependencies using pip:
  ```
  pip install -r requirements.txt
  ```
- Run the following script:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import util

DEVICE = "cpu"

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

checkPointFolder = "models/triplet-loss/..."  # specify the path to the best-performing checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkPointFolder)
model = AutoModel.from_pretrained(checkPointFolder).to(DEVICE)

method = """
public Object pop() throws EmptyStackException {
    try {
        Object aObject = this.stackElements.get(stackElements.size() - 1);
        stackElements.remove(stackElements.size() - 1);
        return aObject;
    } catch (Exception e) {
        throw new EmptyStackException(e);
    }
}
"""
codeSummary = "pop the top of the stack"

pair = [method, codeSummary]
encoded_input = tokenizer(pair, padding=True, truncation=True, return_tensors='pt').to(DEVICE)

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

# Normalize embeddings
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the code and summary embeddings is the SIDE score
sim = util.pytorch_cos_sim(sentence_embeddings[0], sentence_embeddings[1]).item()
print(sim)  # 0.836042046546936
```
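The printed value is the SIDE score for the (code, summary) pair: the cosine similarity between the normalized code and summary embeddings. Scores closer to 1 indicate a summary that is better aligned with the semantics of the documented code.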