Open Source Language Models for Code Generation, An Overview

Problem & Objective​

We aim to develop AI tools, but there is currently no BAM standard defining which AI models should be used or what their characteristics should be.

We study only open-source models, not those accessible via an API.

List of Studied Models​

Codegen2​

Codegen2 is a model published by Salesforce on May 3, 2023. The publication associated with the model is available here (it is very interesting to read and well written compared to what is usually found in research papers). It is an improvement on the Codegen model (which was mainly trained on Java and C++).

The model exists in four versions: 1B, 3.7B, 7B, and 16B parameters (B = billion).

It was trained on the Stack dataset, which consists of 3 TB of code under permissive licenses (MIT, Apache, but not GPL). JavaScript is the most represented language in this dataset, followed by Java and C. It nonetheless contains 200 GB of TypeScript.

The model is capable of generating code in the middle of existing code (infilling).

An example of use:

from transformers import AutoTokenizer, AutoModelForCausalLM

# We use the version of the model with 16 billion parameters.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-16B")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-16B", trust_remote_code=True, revision="main")

# Build an infilling prompt: the model generates the code that belongs
# between the prefix and the suffix, in place of the <mask_1> token.
def format(prefix, suffix):
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

# We want the model to complete the middle of the function.
# (A guiding comment could be added to the prefix to steer the model away
# from the version with exponential complexity.)
prefix = "def fibonacci(n):\n"
suffix = " return f"

text = format(prefix, suffix)
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=256)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False)[len(text):])

Bloom​

Bloom is a model published on July 6, 2022 by the BigScience collaboration (coordinated by Hugging Face and supported by the French government). The publication is available here. The model is described in detail here.

It is available in 6 versions with 560M, 1.1B, 1.7B, 3B, 7.1B, and 176B parameters. It is the open-source model most comparable to ChatGPT, as it was trained on the 1.6 TB ROOTS dataset, which contains both natural language text (the 3 most represented languages are English, Chinese, and French) and code (the most represented programming languages are Java, C++, and PHP, but it also contains JavaScript and TypeScript).

Bloom is also available in a finetuned version called Bloomz designed to follow prompt instructions.

SantaCoder​

SantaCoder is a model published by the BigCode project on February 24, 2023. The publication associated with the model is available here.

The model is capable of generating code in the middle of existing code. It has 1.1B parameters. It was trained on a filtered version of the Stack dataset containing only Java, JavaScript, and Python; it therefore has no knowledge of TypeScript. One of the paper's experiments was to train only on code from GitHub projects with many stars; since this filtering actually degraded performance compared to training without it, the paper is titled "Don't reach for the stars".

The model performs worse than Codex (the model used by Copilot) or Codegen; however, unlike them, it is capable of completing code "in the middle", which was not feasible with the other models available at the time of publication.
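
As with Codegen2, this infilling is driven by special tokens in the prompt. A minimal sketch, assuming the fill-in-the-middle format documented on the bigcode/santacoder model card (the fibonacci prefix and suffix are our own illustration):

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "bigcode/santacoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the prefix and the suffix, after the <fim-middle> token.
prefix = "def fibonacci(n):\n"
suffix = "\n    return result"
text = "<fim-prefix>" + prefix + "<fim-suffix>" + suffix + "<fim-middle>"

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False))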

Recommendations for Establishing Benchmarks and Tests​

The Concept of Token​

To function, generation models convert text into tokens. These are numbers that represent one or several characters, or even short words. This can sometimes make models complicated to compare, because not all models tokenize text in the same way. To simplify, we can consider that one word corresponds on average to 1.3 tokens and that a line of code contains on average around 10 words. Thus, if a model accepts 2048 tokens in its context window, it can read files of at least 150 lines (2048 / (1.3 × 10) ≈ 157).
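
A quick way to check these orders of magnitude is to tokenize a line of code yourself. A minimal sketch using the tokenizer of bigscience/bloom-560m (one of the models studied here); the example line is our own:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# Count words and tokens for a fairly typical line of code.
line = "const { data, loading, error, refetch } = useQuery(GET_USERNAME);"
token_ids = tokenizer(line).input_ids
print(len(line.split()), "words ->", len(token_ids), "tokens")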

The Importance of the Prompt​

The models tested here have generally not been fine-tuned. This means they are designed only to complete text, not to follow instructions. Keep this in mind when constructing prompts for these models.
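
For example, with a pure completion model it usually works better to provide the beginning of the code (possibly preceded by a comment) than to write an instruction. The two prompts below are our own illustration, not part of the original benchmark:

# Instruction-style prompt: a completion model will often just continue the
# sentence instead of producing the requested code.
instruction_prompt = "Write a TypeScript function that formats a date."

# Completion-style prompt: the model naturally continues the code.
completion_prompt = (
    "// Formats a Date as YYYY-MM-DD\n"
    "export const formatDate = (date: Date): string =>"
)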

Managing the Length of the Maximum Text to Generate with max_length​

We can define in advance the number of tokens to generate with the max_length property. It is preferable to set max_length to a small value such as 16 and to call the generation function several times, so that the result is displayed progressively rather than all at once; this increases the perceived generation speed.
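
A minimal sketch of this incremental generation loop, reusing bigscience/bloom-560m (the chunk size of 16 tokens and the number of iterations are arbitrary choices):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "export const Button = ({onPress, children}: Props) => {"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate 16 tokens at a time and print each chunk as soon as it is ready,
# instead of waiting for the full completion.
for _ in range(8):
    output_ids = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 16,
        do_sample=True,
        top_p=0.9,
    )
    chunk = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
    print(chunk, end="", flush=True)
    input_ids = output_ids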

Do Not Forget the GPU Memory Capacity​

To be able to run a model on one's computer, the GPU memory capacity (which is different from the computer's general RAM!) must be greater than the size of the model.

On M1 Macs, which use a unified memory architecture, the GPU and the CPU share the same memory, so we can run models of up to 16 GB in size. This is largely sufficient for most use cases.

However, this is something to keep in mind, especially if you want to deploy these models on Linux servers: the chosen GPU must be well dimensioned, large enough to run the model but not so large that it becomes needlessly expensive.

A good heuristic for the memory footprint is the number of parameters multiplied by the size of one parameter, since the parameters account for the vast majority of the memory used by the model. Most of the checkpoints studied here store each parameter in 32 bits (4 bytes); the Bloom checkpoints store theirs in 16 bits, so they require only half that space.
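
A quick sanity check against the summary table below (the parameter counts come from the table; the storage precisions are our assumption, inferred from the checkpoint sizes):

# Rough memory footprint: number of parameters x bytes per parameter.
def estimated_size_gb(n_params, bytes_per_param):
    return n_params * bytes_per_param / 1e9

# SantaCoder: ~1.1B parameters stored in 32 bits (4 bytes) -> ~4.4 GB,
# close to the 4.6 GB reported in the table.
print(estimated_size_gb(1.1e9, 4))

# Bloom-560m: 560M parameters stored in 16 bits (2 bytes) -> ~1.1 GB.
print(estimated_size_gb(560e6, 2))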

Summary​

| Model Name | Model Size | Number of Parameters | Max Prompt Size (in tokens) | Generation Time for 50 Tokens on a Mac M1 | Generated Text (languages covered) |
| --- | --- | --- | --- | --- | --- |
| Bloom-560m | 1.12 GB | 560M | 2048 | 3 seconds | 45 natural languages and 12 programming languages |
| Bloom-1b1 | 2.13 GB | 1.1B | 2048 | 5 seconds | 45 natural languages and 12 programming languages |
| SantaCoder | 4.6 GB | 1.1B | 2048 | 6.1 seconds | Python, Java, and JavaScript |
| Codegen2-1B | 4.13 GB | 1B | 2048 (cg1) | 7.6 seconds | Over 300 programming languages |

Larger models (e.g., the Codegen2 version with 3.7B parameters) take an excessive amount of time to load and run locally, and are not suitable for prototyping. Each time the Python application is launched, the model must be loaded into the computer's RAM, which slows down the iteration speed when 10 GB must be allocated every time.

Prompt Results​

Each of the previous models was prompted with export const Button = ({onPress, children}: Props) => { to estimate their ability to write React Native code in TypeScript from scratch.

They were also prompted with:

import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';

export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);

const { data, loading, error, refetch } = useQuery

This second prompt estimates their ability to complete "real" code; it is an adapted excerpt from an actual codebase.

Below are the completions produced by the models, with 128 tokens generated for the first prompt and 256 for the second.

Bloom-560m​

export const Button = ({onPress, children}: Props) => {
const onClick = prop => {
if (!props.onClick) {
props.onClick = () => {
}
}
if (props.onClick) {
onClick(props.onClick)
}
}

return (
<div onClick={props.onClick}>
{props.children.map((button) => {
return button.toElement().button ||
<Button key={button.id} onClick={props.onClick} />
})
})
;
}

export const ButtonGroup = (
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';

export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);

const { data, loading, error, refetch } = useQuery<
{
data: GetUsernameQueryVariables,
loading,
error: error,
refetch,
}>(query);

if (!query.status.ok || query.status.type === Error.NotFound) {
error = ApolloError.invalidRequest('GET', data);
} else {
const queryString = query.data.fullQueryString;
if (queryString.isArray()) {
const results = [];
for (const item of queryString) {
results.push({
key: item.value,
label: item.value.label,
text: item.value.text,
type: item.value.type,
date

Bloom-1b1

export const Button = ({onPress, children}: Props) => {
return {
onPress: () => {
const {onClick} = this.props;
onClick();
},

children: this.props.children,
onPressHandler: () => {
// TODO : Remove all onClick method which is not in the component
// and make it an array method with the onClick event in it.
// Useful for components that don't need to call the onClick event.
onClick.push(this.onPressHandler.bind(this));
},
onClickHandler: () =>
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';

export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);

const { data, loading, error, refetch } = useQuery<
GET_USERNAME, GetUsernameQueryVariables>(userToken)
.then(result => {
return new Promise((resolve, reject) => {
if (loading) {
const status = await useQuery<
ApolloError<any>,
ApolloError<any>
>(userToken).then((response) => {
reject(response.error());
});
const data = response.data;
loading = false;
}

if (loading) {
const error = useQuery<
ApolloError<any>,
ApolloError<any>
>(userToken).then((response) => {
reject(response

SantaCoder

export const Button = ({onPress, children}: Props) => {
const [isVisible, setIsVisible] = useState(false);
const [loading, setLoading] = useState(false);
const [loadingMore, setLoadingMore] = useState(false);
const [loadingMoreEnd, setLoadingMoreEnd] = useState(false);
const {height} = useDimensions();

const styles = StyleSheet.create({
container: {
width: '100%',
height: height / 3,
backgroundColor: '#F997A5',
alignItems: 'center',
justifyContent: 'center',
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);

const { data, loading, error, refetch } = useQuery<
GetUsernameQueryVariables,
GetUsernameQuery>(GET_USERNAME, {
variables: {
token: userToken,
},
});
const getusername: GetUsernameQueryVariables = data || {};

if (loading) {
return <span>loading</span>;
}

if (error) {
return <span>Error! Please refresh page to try again.</span>;
}

const { username } = getusername;

// Get user data from Apollo cache
const userData = useQuery(GET_USER_INFO, {
variables:

{
id: userToken,
},
});

return userData

Codegen2-1B

export const Button = ({onPress, children}: Props) => {
const [, setButtonState] = useState(false)
const onPressAction = useCallback((e) => onPress(e), [onPress])
const isDisabled = useCallback(() => setButtonState(!onPressAction()), [onPressAction])
const {isInline} = useContext(LayoutContext)
const inline = isInline
return (
<Box style={[isInline? styles.block : styles.inline]}>
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';

export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);

const { data, loading, error, refetch } = useQuery<
GetUsernameQuery,
GetUsernameQueryVariables
>(GET_USERNAME, {
variables: { userToken },
pollInterval: 0,
});

useEffect(() => {
refetch();
}, [userToken, refetch]);

if (loading) {
return <Loading />;
}

if (error) {
return <APOLLO_ERROR>{error.message}</APOLLO_

We can see that these models are not very effective at writing code from scratch, but they are capable of auto-completing code over a few tokens and could be useful for accelerating the writing of states or completing them.

You should also bear in mind that these models are not fine-tuned and can only complete code to resemble their training data, without following instructions or trying to write functional code.

The Bloom models are not trained extensively on TypeScript and are designed more for text generation.

Code to test the models​

from transformers import AutoTokenizer, AutoModelForCausalLM
import time

prompt = "export const Button = ({onPress, children}: Props) => {"

models = [
    {
        "name": "Bloom 560m",
        "prompt": prompt,
        "model_name": "bigscience/bloom-560m",
    },
    {
        "name": "Bloom 1b1",
        "prompt": prompt,
        "model_name": "bigscience/bloom-1b1",
    },
    {
        "name": "Codegen2 1B",
        "prompt": prompt,
        "model_name": "Salesforce/codegen2-1B",
    },
    # More models can be added here.
]

for m in models:
    print("-------------")
    print("Testing model: ", m["name"])
    print("-------------")

    # Load the tokenizer and the model weights from the Hugging Face Hub.
    tokenizer = AutoTokenizer.from_pretrained(m["model_name"])
    model = AutoModelForCausalLM.from_pretrained(
        m["model_name"], trust_remote_code=True, revision="main"
    )

    # Time only the generation itself, not the model loading.
    start_time = time.time()
    textToComplete = m["prompt"]
    tokens = tokenizer(textToComplete, return_tensors="pt")
    generated_ids = model.generate(
        tokens.input_ids,
        max_length=50,
        attention_mask=tokens.attention_mask,
        do_sample=True,
        top_k=50,
        top_p=0.9,
    )
    result = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    end_time = time.time()
    delta = end_time - start_time
    print(result)
    print("Generation time: ", round(delta * 100) / 100, "s")
    print("-------------")