Open Source Language Models for Code Generation: An Overview
Problem & Objective
We aim to develop AI-powered tools, but there is currently no BAM standard on which AI models to use or what their characteristics should be.
In this study we only consider open-source models, not those accessible solely through an API.
List of Studied Models
Codegen2
Codegen2 is a model published by Salesforce on May 3, 2023. The publication associated with the model is available here (it is very interesting to read and well written compared to what is usually found in research). It is an improvement on the Codegen model (which was mainly trained on Java and C++).
The model exists in four versions: 1B, 3.7B, 7B, and 16B parameters (B = billion).
It was trained on the Stack dataset, which consists of 3 TB of code under permissive licenses (MIT, Apache, but not GPL). JavaScript is the most represented language in this dataset, followed by Java and C. It also contains 200 GB of TypeScript.
The model is capable of generating code in the middle of existing code (infilling).
An example of use:
from transformers import AutoTokenizer, AutoModelForCausalLM

# We use the version of the model with 16 billion parameters.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-16B")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-16B", trust_remote_code=True, revision="main")

# Build an infilling prompt: the model generates the code that belongs between prefix and suffix.
def format(prefix, suffix):
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

# We want the model to complete the middle of the function.
# We put a comment (an illustrative one) so the model doesn't write the version with exponential complexity.
prefix = "def fibonacci(n):\n    # iterative implementation\n"
suffix = "    return f"

text = format(prefix, suffix)
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=256)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=False)[len(text):])
Bloom
Bloom is a model published on July 6, 2022 by the BigScience collaboration, led by Hugging Face with support from, among others, Nvidia, Microsoft, and the French government (the model was trained on the French public Jean Zay supercomputer). The publication is available here. The model is described in detail here.
It is available in six versions with 560M, 1.1B, 1.7B, 3B, 7.1B, and 176B parameters. It is the open-source model most comparable to ChatGPT, as it was trained on the 1.6 TB ROOTS dataset, which contains natural-language text (the three most represented languages are English, Chinese, and French) and programming languages (the most represented are Java, C++, and PHP, but it also contains JavaScript and TypeScript).
Bloom is also available in a fine-tuned version called Bloomz, designed to follow prompt instructions.
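As an illustration, Bloom can be loaded in the same way as the other models studied here. A minimal sketch (bigscience/bloom-560m is the smallest checkpoint on the Hugging Face Hub; the prompt and generation length are only illustrative):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal sketch: the smallest Bloom checkpoint, loaded like the other models in this article.
# Swap in "bigscience/bloomz-560m" for the instruction-tuned Bloomz variant.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

prompt = "// A TypeScript function that returns the nth Fibonacci number\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=64)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))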
SantaCoder
SantaCoder is a model published by the BigCode project on February 24, 2023. The publication associated with the model is available here.
The model is capable of generating code in the middle of existing code. It has 1.1B parameters. It was trained on a filtered version of the Stack dataset containing only Java, JavaScript, and Python; it therefore has no knowledge of TypeScript. One of the paper's experiments was to train a variant only on code from GitHub projects with many stars; since that variant performed worse, the paper is titled "Don't reach for the stars".
The model is less capable than Codex (the model used by Copilot) or Codegen; however, unlike them, it can complete code "in the middle", which was not feasible with the other models available at the time of publication.
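A minimal fill-in-the-middle sketch with SantaCoder (the <fim-prefix>/<fim-suffix>/<fim-middle> control tokens follow the bigcode/santacoder model card; the prefix and suffix below are illustrative assumptions, so verify the tokens against the tokenizer you actually download):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch of fill-in-the-middle with SantaCoder.
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/santacoder", trust_remote_code=True)

# Illustrative prefix and suffix; the model fills in what goes between them.
prefix = "def fibonacci(n):\n"
suffix = "\n    return result"
text = "<fim-prefix>" + prefix + "<fim-suffix>" + suffix + "<fim-middle>"

input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(generated_ids[0]))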
Recommendations for Establishing Benchmarks and Tests
The Concept of a Token
To function, generation models convert words into tokens: numbers that represent one or several characters, or even short words. This can make some models hard to compare, because they do not all tokenize text in the same way. As a simplification, we can consider that a word corresponds on average to about 1.3 tokens and that a line of code contains around 10 words. Thus, a model that accepts 2048 tokens in its context window can read files of at least 150 lines (2048 / (1.3 × 10) ≈ 157).
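Since each model ships its own tokenizer, these averages can be checked directly. A minimal sketch (the snippet is illustrative and the counts will differ from model to model):

from transformers import AutoTokenizer

# Compare how two of the studied models tokenize the same line of code.
snippet = "export const Button = ({onPress, children}: Props) => {"

for model_name in ["bigscience/bloom-560m", "Salesforce/codegen2-1B"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    token_ids = tokenizer(snippet).input_ids
    print(model_name, "->", len(token_ids), "tokens")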
The Importance of the Prompt
The tested models have generally not been fine-tuned to follow instructions: they are designed only to complete text. Keep this in mind when constructing prompts for them.
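Concretely, phrase prompts as the beginning of the code (or a code comment) to be continued, rather than as an instruction. A sketch of the difference, with purely illustrative prompts:

# An instruction-style prompt that these base models will NOT follow reliably:
instruction_prompt = "Write a React Native button component in TypeScript."

# A completion-style prompt: the model only has to continue plausible code.
completion_prompt = (
    "// A React Native button component written in TypeScript\n"
    "export const Button = ({onPress, children}: Props) => {"
)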
Managing the Maximum Length of the Text to Generate with max_length
We can define in advance the number of tokens we want to produce with the max_length property. It is preferable to set max_length to a small value such as 16 and to call the generation function several times, so that the result is displayed progressively rather than all at once; this improves the perceived generation speed. (Note that in the transformers library max_length counts the prompt tokens as well as the generated ones, whereas max_new_tokens counts only the newly generated tokens.)
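A minimal sketch of this incremental loop (max_new_tokens is used here so that each call produces exactly 16 new tokens regardless of the prompt length; the chunk size and number of iterations are illustrative):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

text = "export const Button = ({onPress, children}: Props) => {"
for _ in range(8):  # 8 chunks of 16 new tokens = up to 128 tokens in total
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_new_tokens=16)
    new_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    # Display only the newly generated part, then feed the whole text back in.
    print(new_text[len(text):], end="", flush=True)
    text = new_text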
Do Not Forget the GPU Memory Capacity
To be able to run a model on one's computer, the GPU memory capacity (which is different from the computer's general RAM!) must be greater than the size of the model.
On M1 Macs, which use a unified memory architecture, the GPU and CPU share the same RAM, so we can run models of up to 16 GB. This is largely sufficient for most use cases.
However, this is something to keep in mind, especially if you want to deploy these models on Linux servers: the GPU chosen must be well-dimensioned so that it is not too expensive yet still capable of running the model.
A good heuristic for the size of a model is to multiply its number of parameters by 2 bytes, since each parameter is stored on 16 bits and the parameters represent the vast majority of the memory used by the model.
For Bloom, the parameters can also be stored on 8 bits, in which case only half the memory space is required.
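This heuristic can be checked quickly once a model is loaded. A minimal sketch (bloom-560m is used here only because it is small):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Count parameters and estimate the memory they would occupy in half precision (16 bits = 2 bytes).
n_params = sum(p.numel() for p in model.parameters())
estimated_gb = n_params * 2 / 1024**3
print(f"{n_params / 1e6:.0f}M parameters ~ {estimated_gb:.2f} GB in 16-bit precision")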
Summary
| Model Name | Model Size | Number of Parameters | Max Prompt Size (in tokens) | Generation Time for 50 Tokens on a Mac M1 | Supported Languages (natural / programming) |
|---|---|---|---|---|---|
| Bloom-560m | 1.12 GB | 560M | 2048 | 3 seconds | 46 natural languages and 13 programming languages |
| Bloom-1b1 | 2.13 GB | 1.1B | 2048 | 5 seconds | 46 natural languages and 13 programming languages |
| SantaCoder | 4.6 GB | 1.1B | 2048 | 6.1 seconds | Python, Java, and JavaScript |
| Codegen2-1B | 4.13 GB | 1B | 2048 | 7.6 seconds | Over 300 programming languages |
Larger models (e.g., the Codegen2 version with 3.7B parameters) take an excessive amount of time to load and execute locally and are not suitable for prototyping. Each time the Python application is launched, the model must be loaded into the computer's RAM, which slows down iteration when 10 GB must be allocated each time.
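One way to limit this cost while prototyping is to load the model once in a long-running process and then submit prompts in a loop. A minimal sketch (the interactive loop and generation parameters are illustrative):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load once (the slow part), then iterate on prompts without restarting the process.
model_name = "Salesforce/codegen2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, revision="main")

while True:
    prompt = input("Prompt (empty line to quit): ")
    if not prompt:
        break
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_new_tokens=64)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))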
Prompt Results
Each of the previous models was prompted with
export const Button = ({onPress, children}: Props) => {
to estimate its ability to write React Native code in TypeScript from scratch.
They were also prompted with:
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);
const { data, loading, error, refetch } = useQuery
This estimates their ability to complete "real" code, the example being adapted from a real codebase.
Below are the completions of the models, with 128 tokens for the first prompt and 256 for the second.
Bloom-560m
export const Button = ({onPress, children}: Props) => {
const onClick = prop => {
if (!props.onClick) {
props.onClick = () => {
}
}
if (props.onClick) {
onClick(props.onClick)
}
}
return (
<div onClick={props.onClick}>
{props.children.map((button) => {
return button.toElement().button ||
<Button key={button.id} onClick={props.onClick} />
})
})
;
}
export const ButtonGroup = (
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);
const { data, loading, error, refetch } = useQuery<
{
data: GetUsernameQueryVariables,
loading,
error: error,
refetch,
}>(query);
if (!query.status.ok || query.status.type === Error.NotFound) {
error = ApolloError.invalidRequest('GET', data);
} else {
const queryString = query.data.fullQueryString;
if (queryString.isArray()) {
const results = [];
for (const item of queryString) {
results.push({
key: item.value,
label: item.value.label,
text: item.value.text,
type: item.value.type,
date
Bloom-1b1
export const Button = ({onPress, children}: Props) => {
return {
onPress: () => {
const {onClick} = this.props;
onClick();
},
children: this.props.children,
onPressHandler: () => {
// TODO : Remove all onClick method which is not in the component
// and make it an array method with the onClick event in it.
// Useful for components that don't need to call the onClick event.
onClick.push(this.onPressHandler.bind(this));
},
onClickHandler: () =>
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);
const { data, loading, error, refetch } = useQuery<
GET_USERNAME, GetUsernameQueryVariables>(userToken)
.then(result => {
return new Promise((resolve, reject) => {
if (loading) {
const status = await useQuery<
ApolloError<any>,
ApolloError<any>
>(userToken).then((response) => {
reject(response.error());
});
const data = response.data;
loading = false;
}
if (loading) {
const error = useQuery<
ApolloError<any>,
ApolloError<any>
>(userToken).then((response) => {
reject(response
SantaCoder
export const Button = ({onPress, children}: Props) => {
const [isVisible, setIsVisible] = useState(false);
const [loading, setLoading] = useState(false);
const [loadingMore, setLoadingMore] = useState(false);
const [loadingMoreEnd, setLoadingMoreEnd] = useState(false);
const {height} = useDimensions();
const styles = StyleSheet.create({
container: {
width: '100%',
height: height / 3,
backgroundColor: '#F997A5',
alignItems: 'center',
justifyContent: 'center',
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);
const { data, loading, error, refetch } = useQuery<
GetUsernameQueryVariables,
GetUsernameQuery>(GET_USERNAME, {
variables: {
token: userToken,
},
});
const getusername: GetUsernameQueryVariables = data || {};
if (loading) {
return <span>loading</span>;
}
if (error) {
return <span>Error! Please refresh page to try again.</span>;
}
const { username } = getusername;
// Get user data from Apollo cache
const userData = useQuery(GET_USER_INFO, {
variables:
{
id: userToken,
},
});
return userData
Codegen2-1B
export const Button = ({onPress, children}: Props) => {
const [, setButtonState] = useState(false)
const onPressAction = useCallback((e) => onPress(e), [onPress])
const isDisabled = useCallback(() => setButtonState(!onPressAction()), [onPressAction])
const {isInline} = useContext(LayoutContext)
const inline = isInline
return (
<Box style={[isInline? styles.block : styles.inline]}>
import { ApolloError, useQuery } from '@apollo/client';
import { GET_USERNAME } from '@modules/progress/infra/getUsername.query';
import { useUserStore } from '@modules/user/domain/user.store';
import type {
GetUsernameQuery,
GetUsernameQueryVariables,
} from '@shared/infra/queries.types';
export const useGetUsername = () => {
const userToken = useUserStore((state) => state.sessionToken);
const { data, loading, error, refetch } = useQuery<
GetUsernameQuery,
GetUsernameQueryVariables
>(GET_USERNAME, {
variables: { userToken },
pollInterval: 0,
});
useEffect(() => {
refetch();
}, [userToken, refetch]);
if (loading) {
return <Loading />;
}
if (error) {
return <APOLLO_ERROR>{error.message}</APOLLO_
We now see that these models are not very effective at writing code from scratch, but they can auto-complete code over a few dozen tokens and could be useful for accelerating the writing of code or completing it.
You should also bear in mind that these models are not fine-tuned: they can only complete code so that it resembles their training data, without following instructions or trying to produce code that works.
The Bloom models have not been trained extensively on TypeScript and are designed more for natural-language text generation.
Code to test the models
from transformers import AutoTokenizer, AutoModelForCausalLM
import time

prompt = "export const Button = ({onPress, children}: Props) => {"

models = [
    {
        "name": "Bloom 560m",
        "prompt": prompt,
        "model_name": "bigscience/bloom-560m",
    },
    {
        "name": "Bloom 1b1",
        "prompt": prompt,
        "model_name": "bigscience/bloom-1b1",
    },
    {
        "name": "Codegen2 1B",
        "prompt": prompt,
        "model_name": "Salesforce/codegen2-1B",
    },
    # More models can be added here.
]

for m in models:
    print("-------------")
    print("Testing model: ", m["name"])
    print("-------------")
    tokenizer = AutoTokenizer.from_pretrained(m["model_name"])
    model = AutoModelForCausalLM.from_pretrained(
        m["model_name"], trust_remote_code=True, revision="main"
    )
    # Time only the generation, not the model download/loading.
    start_time = time.time()
    textToComplete = m["prompt"]
    tokens = tokenizer(textToComplete, return_tensors="pt")
    generated_ids = model.generate(
        tokens.input_ids,
        max_length=50,
        attention_mask=tokens.attention_mask,
        do_sample=True,
        top_k=50,
        top_p=0.9,
    )
    result = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    end_time = time.time()
    delta = end_time - start_time
    print(result)
    print("Generation time: ", round(delta * 100) / 100, "s")
    print("-------------")