Understanding Code with OpenAI
Goal: Determine which model to use for bam.AI
Idea 1: Use Codex by training it on BAM code.
- Codex is deprecated and no longer available, but fine-tuning with OpenAI can be used to train the model on BAM code.
Idea 2: Use model fine-tuning with OpenAI
| Model | Training | Usage |
|---|---|---|
| Ada | $0.0004 / 1K tokens | $0.0016 / 1K tokens |
| Babbage | $0.0006 / 1K tokens | $0.0024 / 1K tokens |
| Curie | $0.0030 / 1K tokens | $0.0120 / 1K tokens |
| Davinci | $0.0300 / 1K tokens | $0.1200 / 1K tokens |
https://platform.openai.com/docs/guides/fine-tuning
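As a rough sanity check on the pricing table above, here is a minimal cost sketch. The 2M-token corpus size is a placeholder, not a measured BAM figure, and the 4-epoch default is an assumption based on OpenAI's historical fine-tuning defaults:

```python
# Per-model rates from the table above (USD per 1K tokens).
RATES = {
    "ada":     {"training": 0.0004, "usage": 0.0016},
    "babbage": {"training": 0.0006, "usage": 0.0024},
    "curie":   {"training": 0.0030, "usage": 0.0120},
    "davinci": {"training": 0.0300, "usage": 0.1200},
}

def fine_tuning_cost(model: str, training_tokens: int, epochs: int = 4) -> float:
    """Estimated training cost: training tokens are billed once per epoch."""
    rate = RATES[model]["training"]
    return training_tokens / 1000 * rate * epochs

# Hypothetical example: a 2M-token codebase fine-tuned on Curie for 4 epochs.
print(f"${fine_tuning_cost('curie', 2_000_000):.2f}")  # → $24.00
```

Plugging in the real token count from the counter mentioned at the end of these notes would give the actual figure.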
Alternatives to OpenAI
If we want to choose a text-embedding model to calculate distances and display related functions, as currently done:
Text-embedding models are ranked in a leaderboard here:
- With OpenAI, it is necessary to use text-embedding-ada-002: 0.4c per 1K tokens
- Less effective than other models, but accepts 8,191 input tokens instead of 512
- No newer version available from OpenAI
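Whichever embedding model is chosen, the distance computation itself stays the same. A minimal sketch, assuming cosine distance over the embedding vectors (the toy 3-d vectors stand in for the 1,536-d vectors text-embedding-ada-002 actually returns):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Distance between two embedding vectors: 0 = identical direction, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

# Toy vectors standing in for real embedding-model output.
print(cosine_distance([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # → 0.0
```

This is model-agnostic, so swapping OpenAI for a leaderboard alternative would not change the ranking logic, only the vectors fed in.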
Common Limitations of Models
- Often a token limit: 512 or even less for most models, compared to 8,191 for text-embedding-ada-002 and possibly 32k for ChatGPT
How to Overcome These Limitations?
We could consider splitting the code cleanly using the AST. This is somewhat complicated and slow, and some functions are probably longer than 500 tokens anyway.
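For Python sources, the stdlib `ast` module already makes the splitting part straightforward; a sketch of the idea (the function name and the two-function sample are illustrative, and each resulting chunk would still need its own token-count check):

```python
import ast

def split_functions(source: str) -> list[str]:
    """Split Python source into one chunk per top-level function,
    so each chunk can be embedded separately within the token limit."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

code = "def a():\n    return 1\n\ndef b():\n    return 2\n"
print(split_functions(code))  # each chunk is one complete function definition
```

Functions that are themselves longer than the limit would still need a further split (e.g. by statement), which is where the slowness and complexity noted above come in.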
Actions Taken Following the Article
- Try to calculate token counts to get OpenAI pricing figures and compare them with DevOps costs
- The tool is available at https://github.com/bamlab/bamia/blob/main/research/token_counter/project_token_counter.py
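Before running the exact counter linked above, a quick ballpark can use OpenAI's rough rule of thumb of about 4 characters per English token. The helper names and the sample string are illustrative, not part of the tool:

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def usage_cost_usd(text: str, rate_per_1k: float) -> float:
    """Estimated usage cost for one request at a per-1K-token rate."""
    return approx_tokens(text) / 1000 * rate_per_1k

sample = "x" * 4000                    # ~1000 tokens under the heuristic
print(usage_cost_usd(sample, 0.0016))  # Ada usage rate from the table → 0.0016
```

The linked `project_token_counter.py` tool gives the exact per-model counts; this heuristic is only for a first-pass estimate.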