Finding and Fixing Bugs with Machine Learning

A review of a Microsoft Research blog post published in 2021.

Source: https://www.microsoft.com/en-us/research/blog/finding-and-fixing-bugs-with-deep-learning/

The authors point out the lack of training data for finding bugs, and use a GAN-style approach to get around it. They start with code that is assumed to be correct, then train two models at the same time. The first model, the bug selector, introduces subtle differences into the code. The second model, the bug detector, tries to find them.
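To make the alternation between the two models concrete, here is a toy sketch of my own, not the authors' training code. It assumes a fixed set of rewrite kinds, a selector reduced to one score per kind, and a detector reduced to one "skill" number per kind (its chance of catching that rewrite). The function name `adversarial_rounds` and the update rules are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def adversarial_rounds(detector_skill, rounds=50, lr=0.5):
    """Alternate selector and detector updates over a fixed set of rewrite kinds.

    detector_skill[k] is a hypothetical probability that the detector
    catches a bug of kind k. Both "models" are simple tables here.
    """
    skill = list(detector_skill)
    selector_scores = [0.0] * len(skill)
    for _ in range(rounds):
        probs = softmax(selector_scores)
        miss = [1.0 - s for s in skill]
        expected_miss = sum(p * m for p, m in zip(probs, miss))
        # Selector step: gradient ascent on the expected miss rate, so it
        # shifts probability toward rewrites the detector tends to miss.
        selector_scores = [
            sc + lr * p * (m - expected_miss)
            for sc, p, m in zip(selector_scores, probs, miss)
        ]
        # Detector step: it improves on each rewrite kind in proportion
        # to how often the selector shows it that kind.
        skill = [s + lr * p * (1.0 - s) for s, p in zip(skill, probs)]
    return softmax(selector_scores), skill

selector_probs, final_skill = adversarial_rounds([0.1, 0.5, 0.9])
print(selector_probs, final_skill)
```

The point of the sketch is only the shape of the loop: the selector chases the detector's weak spots, and the detector is trained on whatever the selector produces. The real models are neural networks that look at actual code.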

This addresses the data problem, since the buggy examples are generated rather than collected. Because the selector can only introduce subtle differences, they focus on bugs such as a "+" being replaced with a "-", as they consider more complex bugs to be out of reach of current models.
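As an illustration of the kind of small rewrite described here, the sketch below swaps a single operator in a piece of Python code using the standard `ast` module (`ast.unparse` needs Python 3.9 or later). The `SWAPS` table, the `OperatorSwapper` class and the `inject_bug` function are hypothetical names of mine. In the blog post the choice of rewrite is made by a learned model, whereas here the caller picks it by index.

```python
import ast

# Hypothetical swap table: each operator maps to a plausible "buggy" twin.
SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Lt: ast.LtE, ast.LtE: ast.Lt}

class OperatorSwapper(ast.NodeTransformer):
    """Swaps the target_index-th swappable operator, in visiting order."""

    def __init__(self, target_index):
        self.target_index = target_index
        self.seen = 0

    def _maybe_swap(self, op):
        swapped = SWAPS.get(type(op))
        if swapped is None:
            return op
        index = self.seen
        self.seen += 1
        return swapped() if index == self.target_index else op

    def visit_BinOp(self, node):
        self.generic_visit(node)  # handle nested expressions first
        node.op = self._maybe_swap(node.op)
        return node

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self._maybe_swap(op) for op in node.ops]
        return node

def inject_bug(source, target_index=0):
    """Return the source with one operator swapped, or unchanged if none matches."""
    tree = ast.parse(source)
    mutated = OperatorSwapper(target_index).visit(tree)
    return ast.unparse(mutated)

print(inject_bug("total = price + tax"))
```

A detector would then be trained to tell the original line from the rewritten one and to point at the swapped operator.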

While they found bugs in existing code and outperformed approaches that used code labeled as buggy, they still had a lot of false positives.

To represent code for their model, they used token-based embeddings that they merge using max pooling. This seems strange to me, as I think they lose a lot of information by doing this.
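A small sketch of why I think information is lost: max pooling takes the element-wise maximum over the token vectors, so the order of the tokens disappears. The 3-dimensional vectors in `EMBEDDINGS` are made up for illustration, and the sketch ignores whatever the rest of the real model does to recover structure.

```python
# Hypothetical 3-dimensional embeddings, invented for this example.
EMBEDDINGS = {
    "a": [0.9, 0.1, 0.3],
    "b": [0.2, 0.8, 0.4],
    "-": [0.5, 0.5, 0.7],
}

def max_pool(tokens):
    """Element-wise maximum over the embeddings of all tokens."""
    vectors = [EMBEDDINGS[t] for t in tokens]
    return [max(dim) for dim in zip(*vectors)]

# "a - b" and "b - a" mean different things but pool to the same vector.
print(max_pool(["a", "-", "b"]))
print(max_pool(["b", "-", "a"]))
```

Both calls return the same vector, so a model that only saw the pooled result could not tell the two expressions apart.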

LLM-based bug detection and bug fixing systems are in their infancy and have low accuracy overall: https://www.alibabacloud.com/blog/how-to-fix-bugs-automatically-through-machine-learning_599186

I think that one could combine LLM techniques with the earlier ones (GAN-like techniques, for example) to increase detection accuracy.