
Challenges & Ideas related to automated bug detection

An overview of the process of creating an algorithm to identify bugs

Detecting bugs in code is a challenging task for various reasons.

Main Challenges​

Lack of training data​

To use machine learning models to predict bugs, one needs some kind of training data. There are two main sources of exploitable data: the code itself, and meta information about the project such as tickets, QRQCs and other documentation.

QRQC​

The main issue with using QRQCs is that their style and structure depend heavily on the project. There is no standard template used by every project in BAM. Moreover, the content can differ a lot: for example, some projects don't put code inside their QRQCs, others put screenshots of their code, and others link to a pull request that lives in a private, inaccessible GitHub repository. Finally, the organisation of the Notion is chaotic, which makes it hard to find projects, their boards and their QRQCs.

Git commits​

Another idea would be to use Git commits to train the model. While this has more potential, a major issue is that projects using "Squash and Merge" or "Squash and Rebase" strategies will not have detailed commit information and won't let us see the unit-level commits we need to understand a developer's workflow. We could use other projects, but this limits the available training data.

Confidentiality and Access issues​

A lot of projects keep their data in private Notion workspaces and repositories, and accessing them is not trivial. Moreover, requesting access to this data can take some time and carries confidentiality risks.

Possible solutions​

We can train on public projects as a proof of concept and later extend the training to existing projects once we get access to them. We just need to keep in mind that QRQC data is probably not usable, as cleaning it would be a full-time job.

To get started, we'll probably use flashlight-cloud, react-native-warehouse or react-native-enablers, for several reasons:

  • They all use commitizen-style commits, making fixes and bugs easy to identify (see the sketch after this list)
  • They are all BAM internal projects, so we face little risk from privacy issues.
  • They all use a merge-based strategy without squashing.
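As an illustration, here is a minimal sketch of how such fix commits could be listed, assuming GitPython is installed and the repository is cloned locally; the path and branch name are placeholders.

from git import Repo  # GitPython


def find_fix_commits(repo_path: str, branch: str = "main"):
    """Return commits whose message follows the commitizen `fix:` convention."""
    repo = Repo(repo_path)
    return [
        commit
        for commit in repo.iter_commits(branch)
        # commitizen prefixes bug fixes with "fix:" or "fix(scope):"
        if str(commit.summary).startswith("fix")
    ]


# Hypothetical local clone of one of the projects listed above.
for commit in find_fix_commits("./react-native-warehouse"):
    print(commit.hexsha[:8], commit.summary)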

Charles could also be considered, but it uses a squash strategy, making individual commits larger and less detailed. The advantage of Charles is that it is a realistic project.

Other public / BAM projects that are usable are:

Numerous strategies to try out​

Another challenge is that there are a lot of different approaches one could take to solve the issue of identifying bugs. To compare the approaches, we'll need a benchmark.
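The benchmark itself is still to be defined. As a rough sketch, if we can label a set of code snippets as buggy or not, each approach could be scored on the same labelled set; all names and data below are hypothetical.

from typing import Callable, List, Tuple

LabelledSnippet = Tuple[str, bool]  # (code, is_buggy)


def evaluate(approach: Callable[[str], bool], dataset: List[LabelledSnippet]) -> float:
    """Fraction of snippets for which the approach agrees with the label."""
    correct = sum(1 for code, is_buggy in dataset if approach(code) == is_buggy)
    return correct / len(dataset)


# Example usage with a trivial baseline that never flags anything as buggy.
dataset: List[LabelledSnippet] = [("const x = 1;", False), ("if (x = 1) {}", True)]
print(evaluate(lambda code: False, dataset))  # 0.5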

Our Approach​

Fine-tuning an embeddings model​

Embedding models convert a document into a vector. They can be used for search, but also for classification or for checking how similar two documents are. There are many pre-trained embedding models available online that specialize in representing the English language.

To train an embedding model, one can use several formats of datasets. The easiest format is simply a collection of pairs of documents that are known to be similar to one another.

Because we are working with files committed to Git, we use the commit messages and the file diffs as our pairs of text for the training.
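A minimal sketch of how such pairs could be extracted, again assuming GitPython and a local clone of one of the repositories listed above (the path and branch name are placeholders):

from typing import List, Tuple
from git import Repo


def build_training_pairs(repo_path: str, limit: int = 1000) -> List[Tuple[str, str]]:
    """Pair each commit message with the patch of every file it touched."""
    repo = Repo(repo_path)
    pairs = []
    for commit in repo.iter_commits("main", max_count=limit):
        if not commit.parents:
            continue  # skip the root commit, it has no parent to diff against
        for diff in commit.parents[0].diff(commit, create_patch=True):
            patch = diff.diff.decode("utf-8", errors="ignore")
            if patch:
                pairs.append((str(commit.message).strip(), patch))
    return pairs


pairs = build_training_pairs("./react-native-warehouse")
print(len(pairs), "training pairs")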

To work with models, we use Jupyter notebooks. Moreover, as we are working with Macs, we can use torch with the Metal backend (MPS) to speed up computations.

import torch

if torch.backends.mps.is_available():
    mps_device = torch.device("mps")
    x = torch.ones(1, device=mps_device)
    print(x)
else:
    print("MPS device not found.")  # Not a Mac, or an old model.

We can easily load a model from Hugging Face:

from sentence_transformers import SentenceTransformer, losses
from transformers import AutoTokenizer

model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_id, device="mps").to("mps")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Needed for training / fine-tuning
train_loss = losses.MultipleNegativesRankingLoss(model=model)

We can evaluate the model on a given sentence

from typing import List

import torch.nn.functional as F
from torch import Tensor


def getSentenceEmbeddings(sentences: List[str]) -> Tensor:
    # Tokenize the sentences and move the batch to the MPS device.
    encoded_input = tokenizer(
        sentences, padding=True, truncation=True, return_tensors="pt"
    ).to("mps")

    with torch.no_grad():
        model_output = model(encoded_input)

    # L2-normalize so that distances between embeddings are comparable.
    sentence_embeddings = F.normalize(
        model_output["sentence_embedding"], p=2, dim=1
    )
    return sentence_embeddings


v1 = getSentenceEmbeddings(["Hello, everybody !"])[0]
v2 = getSentenceEmbeddings(["Goodbye, everybody !"])[0]
v3 = getSentenceEmbeddings(["I like cheesecakes !"])[0]

# 1 is closer to 2 than to 3, as one could expect.
print(torch.norm(v1 - v2))
print(torch.norm(v1 - v3))

More details on how to train the models are available in a notebook in the GitHub repository of the project.
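As a rough outline (the hyper-parameters and data loading shown here are placeholders; see the notebook for the actual values), fine-tuning with sentence-transformers consists of wrapping the pairs in InputExample objects and calling model.fit with the MultipleNegativesRankingLoss defined above:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

# `pairs`, `model` and `train_loss` come from the earlier sketches;
# batch size, epochs and warmup steps below are illustrative only.
train_examples = [InputExample(texts=[message, diff]) for message, diff in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="./fine-tuned-all-MiniLM-L6-v2",
)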

Fine-tuning the model is challenging as one needs quality data to yield good results. My first tests seem to indicate that code and commit messages are not good data to help the model understand code better.

Generating data​

Due to the incredibly large amounts of data required to perform adequate training, generating training data is the only realistic solution. Not only does it greatly increase the amount of data available, but it also allows the network to gain a finer understanding of the code as it can see very similar pieces of code and understand specifically what makes a bit of code good or bad.

To make sure that the generated code is actually buggy after a modification, one can use the principles of mutation testing. We mutate the code at random and make sure that it breaks a very strict test suite before adding it to the set of buggy code for training.
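As a minimal sketch of this idea, here is a mutation that flips == / != comparison operators in Python source using the ast module; the target file name and the pytest command are placeholders for the project's real code and test suite.

import ast
import random
import subprocess


class FlipComparison(ast.NodeTransformer):
    """Flip == / != operators with 50% probability to introduce plausible bugs."""

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        self.generic_visit(node)
        swaps = {ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}
        node.ops = [
            swaps[type(op)]() if type(op) in swaps and random.random() < 0.5 else op
            for op in node.ops
        ]
        return node


def mutate(source: str) -> str:
    tree = FlipComparison().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


def test_suite_fails() -> bool:
    # Placeholder: a real pipeline would run the project's own strict suite.
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode != 0


original = open("some_module.py").read()  # hypothetical target file
mutant = mutate(original)
if mutant != original:
    open("some_module.py", "w").write(mutant)
    if test_suite_fails():  # the tests catch the mutant, so it is a confirmed bug
        print("mutant accepted for the buggy training set")
    open("some_module.py", "w").write(original)  # restore the original code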