Understanding Code with OpenAI
Goal: Determine which model to use for bam.AI
Idea 1: Use Codex by training it on BAM code.
- Codex is deprecated and no longer available, but fine-tuning with OpenAI can be used to train the model on BAM code.
Idea 2: Use model fine-tuning with OpenAI
| Model | Training | Usage |
|---|---|---|
| Ada | $0.0004 / 1K tokens | $0.0016 / 1K tokens |
| Babbage | $0.0006 / 1K tokens | $0.0024 / 1K tokens |
| Curie | $0.0030 / 1K tokens | $0.0120 / 1K tokens |
| Davinci | $0.0300 / 1K tokens | $0.1200 / 1K tokens |
https://platform.openai.com/docs/guides/fine-tuning
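As a rough sanity check on the pricing table above, here is a minimal cost sketch. The 2M-token corpus size is a placeholder, not a measured BAM figure, and the 4-epoch default is an assumption based on OpenAI's historical fine-tuning defaults:

```python
# Per-model rates from the table above (USD per 1K tokens).
RATES = {
    "ada":     {"training": 0.0004, "usage": 0.0016},
    "babbage": {"training": 0.0006, "usage": 0.0024},
    "curie":   {"training": 0.0030, "usage": 0.0120},
    "davinci": {"training": 0.0300, "usage": 0.1200},
}

def fine_tuning_cost(model: str, training_tokens: int, epochs: int = 4) -> float:
    """Estimated training cost: training tokens are billed once per epoch."""
    rate = RATES[model]["training"]
    return training_tokens / 1000 * rate * epochs

# Hypothetical example: a 2M-token codebase fine-tuned on Curie for 4 epochs.
print(f"${fine_tuning_cost('curie', 2_000_000):.2f}")  # → $24.00
```

Plugging in the real token count from the counter mentioned at the end of these notes would give the actual figure.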
Alternatives to OpenAI
If we want to choose a text-embedding model to calculate distances and display related functions, as currently done:
Text-embedding models are ranked in a leaderboard here:
- With OpenAI, it is necessary to use text-embedding-ada-002: 0.4c per 1K tokens
- Less effective than other models, but accepts 8,191 input tokens instead of 512
- No newer version available from OpenAI
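Whichever embedding model is chosen, the distance computation itself stays the same. A minimal sketch, assuming cosine distance over the embedding vectors (the toy 3-d vectors stand in for the 1,536-d vectors text-embedding-ada-002 actually returns):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance between two embedding vectors: 0 = identical direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

# Toy vectors standing in for real embedding-model output.
print(cosine_distance([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # → 0.0
```

This is model-agnostic, so swapping OpenAI for a leaderboard alternative would not change the ranking logic, only the vectors fed in.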
Common Limitations of Models
- Often a token limit: 512 or even less for most models, compared to 8,191 for text-embedding-ada-002 and possibly 32k for ChatGPT
How to Overcome These Limitations?
We could consider splitting the code cleanly using the AST. This is somewhat complicated and slow, and some functions are probably longer than 500 tokens anyway.
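For Python sources, the stdlib `ast` module already makes the splitting part straightforward; a sketch of the idea (the function name and the two-function sample are illustrative, and each resulting chunk would still need its own token-count check):

```python
import ast

def split_functions(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function,
    so each chunk can be embedded separately within the token limit."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
print(split_functions(code))  # each chunk is one complete function definition
```

Functions that are themselves longer than the limit would still need a further split (e.g. by statement), which is where the slowness and complexity noted above come in.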
Actions Taken Following the Article
- Try to calculate token counts to get OpenAI pricing figures and compare them with DevOps costs
- The tool is available at https://github.com/bamlab/bamia/blob/main/research/token_counter/project_token_counter.py
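Before running the exact counter linked above, a quick ballpark can use OpenAI's rough rule of thumb of about 4 characters per English token. The helper names and the sample string are illustrative, not part of the tool:

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def usage_cost_usd(text: str, rate_per_1k: float) -> float:
    """Estimated usage cost for one request at a per-1K-token rate."""
    return approx_tokens(text) / 1000 * rate_per_1k

sample = "x" * 4000                    # ~1000 tokens under the heuristic
print(usage_cost_usd(sample, 0.0016))  # Ada usage rate from the table → 0.0016
```

The linked `project_token_counter.py` tool gives the exact per-model counts; this heuristic is only for a first-pass estimate.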