How to create your own LLM

Arnaud
3 min read · May 1, 2023


Recently, first with ChatGPT and now with LLaMA or Bloom, LLMs have become very popular. The main purpose of an LLM (Large Language Model) is to predict the next token of its input. Most of them are trained on English or on multiple languages and target generic tasks. However, why use a huge multilingual, multitask model when we only want to handle one task in one language? Let’s create our own optimized model.
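To make “predict the next token” concrete, here is a minimal sketch (using the publicly available bigscience/bloom-560m checkpoint, not the model we are about to build) that asks a Bloom model for its single most likely next token:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The last position of the logits holds the distribution over the next token
next_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_id))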

First steps

  1. Choose a model
  2. Choose datasets
  3. Create a tokenizer
  4. Train the model
  5. Build an interface

Models

Most models are described in research papers, and we can find their implementations on the web. In this article, we will use a model available on Hugging Face.

Hugging Face

Hugging Face provides datasets and models, but also Python libraries: Transformers, Datasets, and Tokenizers. Using these libraries, we can create and train models very easily.
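If needed, the three libraries can be installed with pip (transformers, datasets, tokenizers); a quick way to check that the environment is ready is simply to import them and print their versions:

import transformers, datasets, tokenizers

print(transformers.__version__, datasets.__version__, tokenizers.__version__)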

Settings

In most models, we can set many parameters. Below is the configuration of my small Bloom model.

from transformers import BloomConfig, AutoModelForCausalLM

# vocab_size must match the tokenizer's vocabulary (10,000 in the tokenizer section below)
config = BloomConfig(vocab_size=10000, hidden_size=1024, n_head=8, n_layer=12)
model = AutoModelForCausalLM.from_config(config)
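As a quick check of how big this configuration actually is (a small addition, reusing the model object from the snippet above), we can count its parameters:

# Reuses `model` from the snippet above
n_params = sum(p.numel() for p in model.parameters())
print("{:.1f}M parameters".format(n_params / 1e6))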

Datasets

To train our model, we will need three types of datasets (a short loading sketch follows the list).

  1. Generic language dataset: Common Crawl’s web crawl corpus
  2. Behavior (instruction-following) dataset: Alpaca
  3. Fine-tuning dataset: choose it according to your task
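For example, loading two of these with the Datasets library could look like the sketch below (the Wikipedia subset is the one used in the training section, while tatsu-lab/alpaca is assumed to be the Alpaca mirror on the Hub):

from datasets import load_dataset

# Generic language data: the French Wikipedia subset used in the training section
wiki_fr = load_dataset('bigscience-data/roots_fr_wikipedia', split='train')

# Behavior / instruction-following data (assumed Hub mirror of the Alpaca dataset)
alpaca = load_dataset('tatsu-lab/alpaca', split='train')

print(wiki_fr)
print(alpaca)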

Tokenizer

Nowadays, most models use BPE (Byte-Pair Encoding) tokenization. This kind of tokenizer needs to be trained on a corpus. Instead of using our whole dataset, we just need a corpus that represents our target language(s).

Below is the code I use to create a BPE tokenizer:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

lang: str = "fr"
file = '{}.tsv'.format(lang)

# Collect every character present in the corpus to seed the initial alphabet
charset = set(open(file, encoding="utf8").read())
print(len(charset))

# https://pypi.org/project/tokenizers/
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = trainers.BpeTrainer(vocab_size=10000, min_frequency=2, initial_alphabet=list(charset),
                              special_tokens=["[PAD]", "[BOS]", "[EOS]"])
tokenizer.train([file], trainer=trainer)

tokenizer.save("byte-level-bpe_{}.tokenizer.json".format(lang), pretty=True)

We also need three special tokens for Padding, Beginning Of Sentence, and End Of Sentence: [PAD], [BOS], [EOS].

Since they are passed first to the trainer, these tokens get indexes 0, 1, and 2.
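A quick way to check both the indexes and the tokenizer itself is to reload the saved file and round-trip a sentence (a minimal sanity check; the sample sentence is arbitrary):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("byte-level-bpe_fr.tokenizer.json")

# The special tokens were passed first to the trainer, so they get the lowest ids
for t in ["[PAD]", "[BOS]", "[EOS]"]:
    print(t, tok.token_to_id(t))  # expected: 0, 1, 2

enc = tok.encode("Bonjour, comment allez-vous ?")
print(enc.tokens)           # BPE sub-word pieces
print(tok.decode(enc.ids))  # should reconstruct the sentence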

Training

We will need a lot of computing power to train our model, but we can start on Google Colab or Kaggle to test the first phase.

Below is the code I use to train a Bloom model in French:

import torch
import datasets
import transformers
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, BloomConfig, PreTrainedTokenizerFast, DataCollatorForLanguageModeling, AutoTokenizer
import os

device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device", device)

lang: str = "fr"
hidden_size: int = 1024
model_name: str = "final_model_{}.pth".format(lang)
tokenizer_path: str = "byte-level-bpe_{}.tokenizer.json".format(lang)
checkpoint_path: str = 'checkpoint_{}'.format(lang)
output_dir: str = "bloom_work_{}".format(lang)
project_name: str = "BLOOM_{}".format(lang)

# Wrap the trained BPE tokenizer and map the special tokens to the ids chosen earlier
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path, padding_side='left')
tokenizer.pad_token_id = 0
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2

# Build a small Bloom model from scratch, sized to the tokenizer's vocabulary
config = BloomConfig(vocab_size=tokenizer.vocab_size, hidden_size=hidden_size, n_head=8, n_layer=12)
model = AutoModelForCausalLM.from_config(config)

max_length = 256
dataset_wiki = datasets.load_dataset('bigscience-data/roots_fr_wikipedia', split='train')
dataset_wiki = dataset_wiki.filter(lambda data: len(data['text']) > 10)  # drop near-empty articles
dataset_wiki = dataset_wiki.train_test_split(test_size=0.1, seed=33)
print(dataset_wiki)

def group_texts(data):
    # Tokenize and truncate each article to max_length tokens
    text = tokenizer(data['text'], add_special_tokens=True, truncation=True, max_length=max_length)
    return {'input_ids': text['input_ids']}

lm_dataset = dataset_wiki.map(group_texts, batched=True, num_proc=4)
print("lm_dataset", lm_dataset)

training_args = TrainingArguments(
    output_dir,
    save_total_limit=5,
    save_steps=2000,
    logging_steps=2000,
    evaluation_strategy="steps",
    fp16=(device == "cuda"),
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    run_name=project_name,
    num_train_epochs=8)

class PrinterCallback(transformers.TrainerCallback):
    # Generate a few sample completions at every evaluation step
    def on_evaluate(self, args, state, control, **kwargs):
        print("on_evaluate")
        testPrompt = ["Je suis vraiment ", "Comment ", "Il fait "]
        result = []
        for p in testPrompt:
            print("Prompt:", p)
            prompt = tokenizer(p, return_tensors='pt').to(device)
            output = model.generate(prompt['input_ids'], attention_mask=prompt['attention_mask'], max_length=64, early_stopping=True, pad_token_id=tokenizer.eos_token_id)
            string = tokenizer.decode(output[0], skip_special_tokens=True)
            result.append(string)
            print("Output:", string)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    callbacks=[PrinterCallback],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False))  # mlm=False -> causal LM objective

print("Train " + 50*'#')
print(checkpoint_path)
if os.path.exists(checkpoint_path):
    print("Resume")
    trainer.train(resume_from_checkpoint=checkpoint_path)
else:
    trainer.train()
torch.save(model, model_name)

You can adjust per_device_train_batch_size and max_length according to available memory.
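For example, if memory is tight, one option (a suggestion, not part of the original run) is to halve the batch size and compensate with gradient accumulation so the effective batch size stays at 32:

from transformers import TrainingArguments

# Hypothetical memory-friendlier variant of the arguments above:
# 16 samples per device x 2 accumulation steps = effective batch size of 32
training_args = TrainingArguments(
    "bloom_work_fr",
    save_total_limit=5,
    save_steps=2000,
    logging_steps=2000,
    evaluation_strategy="steps",
    fp16=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=8)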

Output

{'loss': 7.7644, 'learning_rate': 1.999934999935e-05, 'epoch': 0.0}
{'eval_loss': 7.048653602600098, 'eval_runtime': 2412.6769, 'eval_samples_per_second': 28.34, 'eval_steps_per_second': 3.543, 'epoch': 0.0}
Prompt: Je suis vraiment
Output: Je suis vraiment
Prompt: Comment
Output: Comment
Prompt: Il fait
Output: Il fait

Interface

Now we can build an interface using Hugging Face Spaces and Streamlit.

Here is the interface where you can test my small model for French and Japanese.
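A minimal Streamlit app could look like the sketch below (assuming the tokenizer and model files produced by the training script; the widget layout is just a suggestion):

import streamlit as st
import torch
from transformers import PreTrainedTokenizerFast

lang = "fr"
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-bpe_{}.tokenizer.json".format(lang), padding_side='left')
tokenizer.pad_token_id, tokenizer.bos_token_id, tokenizer.eos_token_id = 0, 1, 2

# The training script saved the whole model object, so torch.load restores it directly
model = torch.load("final_model_{}.pth".format(lang), map_location="cpu")
model.eval()

st.title("Small Bloom ({})".format(lang))
prompt = st.text_input("Prompt", "Je suis vraiment ")
if st.button("Generate"):
    inputs = tokenizer(prompt, return_tensors='pt')
    output = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                            max_length=64, pad_token_id=tokenizer.eos_token_id)
    st.write(tokenizer.decode(output[0], skip_special_tokens=True))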

Conclusion

Instead of relying on big companies’ huge and expensive models, build a model optimized for your own needs and stay independent.

If you want me to do it for you, feel free to contact me.

Note: this article will be improved later.
