Creating a dataset of bug examples
Question
Can we create a list of bug examples to test our models against?
Short answer
Yes, but the examples lack context.
Idea
We can build a set of bug examples by taking each commit of type fix (see Collect examples by extracting fix commits) and extracting the files it modifies, in their state just before the commit.
How to?
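from git import Repo

# url and outputDir are placeholders: the repository to clone and the local target directory.
repo = Repo.clone_from(url, outputDir)
for c in repo.iter_commits():
    # The root commit has no parent, hence no "before" state.
    if len(c.parents) == 0:
        continue
    mainParent = c.parents[0]
    # Diff from the parent to the commit, so the "a" side is the state just before the commit.
    globalDiff = mainParent.diff(c)
    for fileDiff in globalDiff:
        # Files added by the commit did not exist before it.
        if fileDiff.new_file:
            continue
        filename = fileDiff.a_path
        content = mainParent.tree[filename].data_stream.read().decode("utf-8", errors="replace")
        print(content)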
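The loop above visits every commit, whereas the idea only concerns commits of type fix. The actual rule for recognising a fix commit is described in Collect examples by extracting fix commits. As a minimal sketch, assuming a simple keyword match on the first line of the commit message is good enough, a filter could look like this. The name `is_fix_commit` and the keyword list are hypothetical and not part of the original method.

```python
import re

# Hypothetical heuristic: keywords that often mark a bug-fix commit.
# This list is an assumption, not the rule used by the original method.
FIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bugfix|hotfix)\b", re.IGNORECASE)


def is_fix_commit(message: str) -> bool:
    """Return True if the first line of the commit message looks like a fix."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    return bool(FIX_PATTERN.search(subject))
```

Inside the loop, a check such as `if not is_fix_commit(c.message): continue` would skip the other commits. A keyword match will miss fixes whose messages do not use these words and will accept unrelated commits that do, so it is only a rough first filter.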