Use sub-files to add more context to the prompt
Strategy
Parse the file to test into an AST and use its import declarations to resolve the sub-files it depends on.
ADR - Explore AST with Babel vs TypeScript
=> We chose TypeScript for resolving sub-files.
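As a minimal sketch of what resolving sub-files through the TypeScript AST could look like, the function below parses a file with the TypeScript compiler API and collects the relative import specifiers. The name `listRelativeImports` and the rule "only relative imports count as sub-files" are assumptions made for illustration, not the project's actual implementation.

```typescript
import * as ts from "typescript";

// Hypothetical helper: collect the specifiers of relative imports ("./x", "../y").
// Package imports such as "react" are skipped, on the assumption that only
// project files count as sub-files.
export function listRelativeImports(fileName: string, sourceText: string): string[] {
  const sourceFile = ts.createSourceFile(
    fileName,
    sourceText,
    ts.ScriptTarget.Latest,
    /* setParentNodes */ true,
    ts.ScriptKind.TSX,
  );

  const specifiers: string[] = [];
  ts.forEachChild(sourceFile, (node) => {
    // Static imports are top-level statements, so one level of traversal is enough.
    if (ts.isImportDeclaration(node) && ts.isStringLiteral(node.moduleSpecifier)) {
      const specifier = node.moduleSpecifier.text;
      if (specifier.startsWith(".")) {
        specifiers.push(specifier);
      }
    }
  });
  return specifiers;
}
```

The sketch stops at the specifier strings. Turning them into file paths on disk (for example with `ts.resolveModuleName`) and reading their content is left out.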
Why?
To improve test generation by giving the AI more context: Hub Test Generation
Results
Testing strategy
- Our test data: files of the `warehouse` package in the `react-native-enablers` repo.
- Idea: generate tests with AI for all `warehouse` files which already have a test (so we can compare the generated tests with the human-written tests)
- Methodology:
  - Get all files to test
  - For each file to test, get all the sub-files imported in the file
  - Create a prompt:
    - A. Without the sub-files (previous version)
    - B. With the sub-files (new version)
  - Make the request to OpenAI/GPT for each file
  - Clean the response (i.e. remove the backticks or verbatim markers)
  - Replace the `render` method with the `renderWithProviders` method from utils (because we know it is a common mistake from the AI)
  - Run jest in the `warehouse` folder to count the successful tests
- Limitations: we crop the prompt to roughly 3000 tokens to avoid exceeding the maximum number of tokens per request
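The prompt-building and post-processing steps of the methodology can be sketched as follows. This is an illustration under stated assumptions, not the code used for the benchmark: the `PromptFile` shape, the function names, the prompt wording, and the "4 characters per token" heuristic are all hypothetical.

```typescript
// Hypothetical shape for a file and its content.
interface PromptFile {
  path: string;
  content: string;
}

// Assumption: about 4 characters per token, a rough heuristic and not a real tokenizer.
const CHARS_PER_TOKEN = 4;

// Crop text so that it stays within an approximate token budget.
function cropToTokenBudget(text: string, maxTokens: number): string {
  return text.slice(0, maxTokens * CHARS_PER_TOKEN);
}

// Build prompt A (empty subFiles) or prompt B (with subFiles), then crop it.
function buildPrompt(fileToTest: PromptFile, subFiles: PromptFile[], maxTokens = 3000): string {
  const parts = [
    "Write a jest test file for the following file.",
    `// ${fileToTest.path}\n${fileToTest.content}`,
    ...subFiles.map((f) => `// ${f.path}\n${f.content}`),
  ];
  return cropToTokenBudget(parts.join("\n\n"), maxTokens);
}

// Three backticks, built at runtime so this example stays easy to embed in Markdown.
const FENCE = "`".repeat(3);

// Drop the code-fence lines the model tends to wrap its answer in.
function cleanResponse(response: string): string {
  return response
    .split("\n")
    .filter((line) => !line.trim().startsWith(FENCE))
    .join("\n")
    .trim();
}

// Replace calls to render( with renderWithProviders(.
// The word boundary leaves rerender( untouched.
function replaceRender(testSource: string): string {
  return testSource.replace(/\brender\(/g, "renderWithProviders(");
}
```

The sketch only rewrites call sites. Adding the matching import for `renderWithProviders` is not handled here.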
Benchmark
| Experiment | Success rate (after cleaning) | Success rate (after manually replacing `render` with `renderWithProviders`) |
|---|---|---|
| A: Without sub-files in the prompt | 2/72 (2.8%) | 27/72 (37.5%) |
| B: With sub-files in the prompt | 5/63 (7.9%) | 29/59 (49.2%) |
Legend for X/Y:
- X is the number of tests written by the AI that passed,
- Y is the total number of generated tests whose file jest was able to parse.
For the full test procedure and benchmark, see the ai-research repo.
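As a small worked example, the percentages in the table are X divided by Y, rounded to one decimal place. The helper name `successRate` is hypothetical.

```typescript
// Format "passed / parsed" as a percentage with one decimal place.
function successRate(passed: number, parsed: number): string {
  if (parsed === 0) {
    return "n/a"; // no parsable test file, so no rate to report
  }
  return `${((passed / parsed) * 100).toFixed(1)}%`;
}

console.log(successRate(2, 72));  // 2.8%
console.log(successRate(27, 72)); // 37.5%
console.log(successRate(5, 63));  // 7.9%
console.log(successRate(29, 59)); // 49.2%
```

Note that Y differs between rows (72, 63, 59), so the rates are computed over different denominators and are not directly comparable as raw counts.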
Analysis of the results
A better success rate with sub-files
With the sub-files in the prompt, a larger share of the generated tests pass: 49.2% versus 37.5% after replacing `render` with `renderWithProviders`.
Fewer tests in total
More of the generated test files are invalid and so can't be parsed by jest: the number of parsable files drops from 72 without sub-files to 63 and 59 with sub-files. Hypothesised cause:
- We give the file and the sub-files in the same way in the prompt, and thus the AI tries to create a test file for each of the sub-files. This results in multiple test files written in the same answer and therefore saved to the same file, so the generated test file contains duplicated imports (which breaks the file).
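One way to check this hypothesis would be to scan each generated test file for import statements that appear more than once. The sketch below is an assumption-laden illustration: `findDuplicateImports` is a hypothetical helper, and it only catches identical single-line imports, not multi-line imports or the same name imported from different paths.

```typescript
// Return the import statements that appear more than once in a generated test file.
function findDuplicateImports(source: string): string[] {
  const counts = new Map<string, number>();
  for (const line of source.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith("import ")) {
      counts.set(trimmed, (counts.get(trimmed) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > 1)
    .map(([statement]) => statement);
}
```

A non-empty result for a file that jest could not parse would be consistent with the "several test files in one answer" explanation, without proving it.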