Creating a dataset of bug examples

Question

Can we create a list of bug examples to test our models against?

Short answer

Yes, but the examples lack context.

Idea

We can build a set of bug examples by taking each commit of type fix (see Collect examples by extracting fix commits) and extracting the files it modifies, in their state just before the commit.
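
How a commit is recognized as a fix is covered on the page referenced above. As a rough illustration only, the sketch below assumes commit messages follow the Conventional Commits style ("fix: ..." or "fix(scope): ..."). The name is_fix_commit and the pattern are hypothetical and not taken from that page.

```python
import re

# Assumption: fix commits start with "fix:", "fix(scope):" or "fix!:"
# (Conventional Commits style). Adapt the pattern to your own convention.
FIX_PATTERN = re.compile(r"^fix(\([^)]*\))?!?:", re.IGNORECASE)


def is_fix_commit(message: str) -> bool:
    """Return True if the first line of the commit message marks a fix."""
    lines = message.strip().splitlines()
    first_line = lines[0] if lines else ""
    return bool(FIX_PATTERN.match(first_line))
```

With GitPython, a check such as is_fix_commit(c.message) could be used to skip commits that are not fixes before extracting any files.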

How to?

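from git import Repo

# url and outputDir are placeholders: the repository to clone and the local target directory.
repo = Repo.clone_from(url, outputDir)

for c in repo.iter_commits():
    # Skip the root commit: it has no parent, so there is no "before" state.
    if not c.parents:
        continue

    mainParent = c.parents[0]
    # Diff from the parent to the commit, so that a_path is the path in the parent.
    globalDiff = mainParent.diff(c)

    for fileDiff in globalDiff:
        # Files added by the commit do not exist in the parent.
        if fileDiff.new_file:
            continue

        filename = fileDiff.a_path
        # Read the file as it was just before the commit; undecodable bytes are replaced.
        content = mainParent.tree[filename].data_stream.read().decode("utf-8", errors="replace")
        print(content)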
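
The loop above only prints each file. To turn the output into a dataset, the examples need to be stored in some form. The following is a minimal sketch, assuming one JSON record per example (JSON Lines); the BugExample class, its fields, and to_jsonl are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class BugExample:
    commit: str   # hash of the fix commit
    path: str     # path of the file in the parent commit
    content: str  # file content just before the fix


def to_jsonl(examples):
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)
```

In the loop, print(content) could then be replaced by appending BugExample(c.hexsha, filename, content) to a list, which is written out once with to_jsonl.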