Deep learning to find bugs
A review of the following technical report published in 2017: https://software-lab.org/publications/DeepBugs_TR_Nov2017.pdf
The idea behind this framework is to apply transformations to correct code so that the produced code is buggy. This gives the system access to a large amount of training data. The injected bugs might not match "real world" bugs, and the system is closer to a linter, but it can take variable and function names into account, which linters usually ignore.
The paper confirms that the authors faced the same challenges we did when attempting to create a bug detector:
> A key reason for the current lack of learning-based bug detectors is that effective machine learning requires large amounts of training data
Among the transformations used to create buggy code, the authors propose (see the sketch after this list):

- swapping the arguments of a function call
- changing comparison operators (replacing `<=` with `<`, for example)
- mixing up x and y (`height - x` instead of `height - y`, for example)
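To make the first transformation concrete, here is a minimal sketch of argument swapping using Python's `ast` module. The paper targets JavaScript, so this is only an analogue, and the function name `swap_first_two_args` is mine, not the paper's:

```python
import ast

class SwapFirstTwoArgs(ast.NodeTransformer):
    """Mutate every call with two or more positional arguments
    by swapping the first two -- a synthetic 'swapped args' bug."""
    def visit_Call(self, node):
        self.generic_visit(node)  # also mutate nested calls
        if len(node.args) >= 2:
            node.args[0], node.args[1] = node.args[1], node.args[0]
        return node

def swap_first_two_args(source: str) -> str:
    """Return a 'buggy' variant of the given source code."""
    tree = ast.parse(source)
    tree = SwapFirstTwoArgs().visit(tree)
    return ast.unparse(tree)  # requires Python >= 3.9

print(swap_first_two_args("draw(width, height)"))  # -> draw(height, width)
```

Each correct call site yields a plausible-looking buggy twin, which is how the training set gets both positive and negative examples cheaply.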
> We find that the learned bug detectors have an accuracy between 84.23% and 94.53%
While this is quite accurate, developers dislike false positives (as anyone who uses linters knows), so there will be a UX challenge: presenting the warnings so that developers get useful feedback without feeling overwhelmed by notifications.
Internally, they use a simple word embedding model feeding a feedforward neural network. One obvious improvement over their approach would be to use an LLM instead of a word2vec-style model to generate the embeddings.
```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
# Dropout on the input layer to regularize the embedding features.
model.add(Dropout(0.2, input_shape=(x_length,)))
# A single hidden layer of 200 ReLU units.
model.add(Dense(200, activation="relu", kernel_initializer="normal"))
model.add(Dropout(0.2))
# model.add(Dense(200, activation="relu"))
# Sigmoid output: the probability that the code is buggy.
model.add(Dense(1, activation="sigmoid", kernel_initializer="normal"))
```
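Since the report frames bug detection as binary classification, training presumably looks something like the following. The optimizer choice and the `ys` list are my assumptions, not taken from the paper:

```python
import numpy as np

# Assumed setup: binary cross-entropy matches the sigmoid output.
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

# xs: concatenated embedding vectors; ys: 0 (correct) or 1 (buggy).
model.fit(np.array(xs), np.array(ys),
          batch_size=100, epochs=10, validation_split=0.2)
```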
In their model, the input is simply the concatenation of the vectors that represent the function name and the arguments given. Since this encoding has a fixed size covering only the first two arguments, the approach might struggle for functions with a large number of arguments. A bit of the relevant code that generates training data:
```python
# For all xy-pairs: the y value is the probability that the code is incorrect.
x_keep = callee_vector + argument0_vector + argument1_vector
x_keep += base_vector + argument0_type_vector + argument1_type_vector
x_keep += parameter0_vector + parameter1_vector  # + file_name_vector
y_keep = [0]  # The output is a probability; training labels are 0 or 1.
xs.append(x_keep)  # The list used to store the training data.
```
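The negative examples come from the transformations listed earlier; for the swapped-arguments detector, the buggy counterpart presumably swaps the two argument vectors and flips the label. A hypothetical sketch, with variable names mirroring the excerpt above:

```python
# Buggy counterpart: the same call, but with the argument embeddings swapped.
x_swap = callee_vector + argument1_vector + argument0_vector
x_swap += base_vector + argument1_type_vector + argument0_type_vector
x_swap += parameter0_vector + parameter1_vector
y_swap = [1]  # Label 1: this variant is synthetically incorrect.
xs.append(x_swap)
ys.append(y_swap)
```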
Another limitation of their approach is that the model will struggle to learn new types of bugs: the feature encoding is hand-crafted for each bug pattern they try to catch, so supporting a new pattern requires designing a new encoding and retraining.