How to create your own LLM

Arnaud
3 min read · May 1, 2023


Recently, first with ChatGPT and now with LLaMA or Bloom, LLMs have become very popular. The main purpose of an LLM (Large Language Model) is to predict the next token of its input. Most of them are trained on English or on multiple languages and target generic tasks. However, why use a huge multilingual, multitask model when we only want to handle one task in one language? Let’s create our own optimized model.
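To make “predict the next token” concrete, here is a minimal sketch (using the publicly available bigscience/bloom-560m checkpoint, not the model we are about to build) that asks a Bloom model for its single most likely next token:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The last position of the logits holds the distribution over the next token
next_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_id))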

First steps

  1. Choose a model
  2. Choose datasets
  3. Create a tokenizer
  4. Train the model
  5. Build an interface

Models

Most models are described in research papers, and we can find their implementations on the web. In this article, we will use a model available on Hugging Face.

Hugging Face

Hugging Face provides datasets and models, but also Python libraries: Transformers, Datasets, and Tokenizers. Using these libraries, we can create and train models very easily.
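If needed, the three libraries can be installed with pip (transformers, datasets, tokenizers); a quick way to check that the environment is ready is simply to import them and print their versions:

import transformers, datasets, tokenizers

print(transformers.__version__, datasets.__version__, tokenizers.__version__)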

Settings

In most models, we can set many parameters. Below is the configuration of my small Bloom model.

from transformers import BloomConfig, AutoModelForCausalLM

# vocab_size must match the tokenizer's vocabulary (10,000 in the tokenizer section below)
config = BloomConfig(vocab_size=10000, hidden_size=1024, n_head=8, n_layer=12)
model = AutoModelForCausalLM.from_config(config)
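As a quick check of how big this configuration actually is (a small addition, reusing the model object from the snippet above), we can count its parameters:

# Reuses `model` from the snippet above
n_params = sum(p.numel() for p in model.parameters())
print("{:.1f}M parameters".format(n_params / 1e6))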

Datasets

To train our model, we will need three types of datasets (a short loading sketch follows the list).

  1. Generic language dataset: Common Crawl’s web crawl corpus
  2. Behavior (instruction-following) dataset: Alpaca
  3. Fine-tuning dataset: choose it according to your task
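For example, loading two of these with the Datasets library could look like the sketch below (the Wikipedia subset is the one used in the training section, while tatsu-lab/alpaca is assumed to be the Alpaca mirror on the Hub):

from datasets import load_dataset

# Generic language data: the French Wikipedia subset used in the training section
wiki_fr = load_dataset('bigscience-data/roots_fr_wikipedia', split='train')

# Behavior / instruction-following data (assumed Hub mirror of the Alpaca dataset)
alpaca = load_dataset('tatsu-lab/alpaca', split='train')

print(wiki_fr)
print(alpaca)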

Tokenizer

Nowadays, most models use BPE (Byte-Pair Encoding) tokenization. This kind of tokenizer needs to be trained on a corpus. Instead of using our whole dataset, we just need a corpus that represents our target language(s).

Below is the code I use to create a BPE tokenizer:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

lang: str = "fr"
file = '{}.tsv'.format(lang)

# Collect every character present in the corpus to seed the initial alphabet
charset = set(open(file, encoding="utf8").read())
print(len(charset))

# https://pypi.org/project/tokenizers/
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

trainer = trainers.BpeTrainer(vocab_size=10000, min_frequency=2, initial_alphabet=list(charset),
                              special_tokens=["[PAD]", "[BOS]", "[EOS]"])
tokenizer.train([file], trainer=trainer)

tokenizer.save("byte-level-bpe_{}.tokenizer.json".format(lang), pretty=True)

We also need three special tokens for Padding, Beginning Of Sentence, and End Of Sentence: [PAD], [BOS], [EOS].

Since they are passed first to the trainer, these tokens get indexes 0, 1, and 2.
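A quick way to check both the indexes and the tokenizer itself is to reload the saved file and round-trip a sentence (a minimal sanity check; the sample sentence is arbitrary):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("byte-level-bpe_fr.tokenizer.json")

# The special tokens were passed first to the trainer, so they get the lowest ids
for t in ["[PAD]", "[BOS]", "[EOS]"]:
    print(t, tok.token_to_id(t))  # expected: 0, 1, 2

enc = tok.encode("Bonjour, comment allez-vous ?")
print(enc.tokens)           # BPE sub-word pieces
print(tok.decode(enc.ids))  # should reconstruct the sentence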

Training

We will need a lot of computing power to train our model, but we can start on Google Colab or Kaggle to test the first phase.

Below is the code I use to train a Bloom model in French:

import torch
import datasets
import transformers
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, BloomConfig, PreTrainedTokenizerFast, DataCollatorForLanguageModeling, AutoTokenizer
import os

device: str = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device", device)

lang: str = "fr"
hidden_size: int = 1024
model_name: str = "final_model_{}.pth".format(lang)
tokenizer_path: str = "byte-level-bpe_{}.tokenizer.json".format(lang)
checkpoint_path: str = 'checkpoint_{}'.format(lang)
output_dir: str = "bloom_work_{}".format(lang)
project_name: str = "BLOOM_{}".format(lang)

# Wrap the trained BPE tokenizer and map the special tokens to the ids chosen earlier
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path, padding_side='left')
tokenizer.pad_token_id = 0
tokenizer.bos_token_id = 1
tokenizer.eos_token_id = 2

# Build a small Bloom model from scratch, sized to the tokenizer's vocabulary
config = BloomConfig(vocab_size=tokenizer.vocab_size, hidden_size=hidden_size, n_head=8, n_layer=12)
model = AutoModelForCausalLM.from_config(config)

max_length = 256
dataset_wiki = datasets.load_dataset('bigscience-data/roots_fr_wikipedia', split='train')
dataset_wiki = dataset_wiki.filter(lambda data: len(data['text']) > 10)  # drop near-empty articles
dataset_wiki = dataset_wiki.train_test_split(test_size=0.1, seed=33)
print(dataset_wiki)

def group_texts(data):
    # Tokenize and truncate each article to max_length tokens
    text = tokenizer(data['text'], add_special_tokens=True, truncation=True, max_length=max_length)
    return {'input_ids': text['input_ids']}

lm_dataset = dataset_wiki.map(group_texts, batched=True, num_proc=4)
print("lm_dataset", lm_dataset)

training_args = TrainingArguments(
    output_dir,
    save_total_limit=5,
    save_steps=2000,
    logging_steps=2000,
    evaluation_strategy="steps",
    fp16=(device == "cuda"),
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    run_name=project_name,
    num_train_epochs=8)

class PrinterCallback(transformers.TrainerCallback):
    # Generate a few sample completions at every evaluation step
    def on_evaluate(self, args, state, control, **kwargs):
        print("on_evaluate")
        testPrompt = ["Je suis vraiment ", "Comment ", "Il fait "]
        result = []
        for p in testPrompt:
            print("Prompt:", p)
            prompt = tokenizer(p, return_tensors='pt').to(device)
            output = model.generate(prompt['input_ids'], attention_mask=prompt['attention_mask'], max_length=64, early_stopping=True, pad_token_id=tokenizer.eos_token_id)
            string = tokenizer.decode(output[0], skip_special_tokens=True)
            result.append(string)
            print("Output:", string)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=lm_dataset['train'],
    eval_dataset=lm_dataset['test'],
    callbacks=[PrinterCallback],
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False))  # mlm=False -> causal LM objective

print("Train " + 50*'#')
print(checkpoint_path)
if os.path.exists(checkpoint_path):
    print("Resume")
    trainer.train(resume_from_checkpoint=checkpoint_path)
else:
    trainer.train()
torch.save(model, model_name)

You can adjust per_device_train_batch_size and max_length according to available memory.
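For example, if memory is tight, one option (a suggestion, not part of the original run) is to halve the batch size and compensate with gradient accumulation so the effective batch size stays at 32:

from transformers import TrainingArguments

# Hypothetical memory-friendlier variant of the arguments above:
# 16 samples per device x 2 accumulation steps = effective batch size of 32
training_args = TrainingArguments(
    "bloom_work_fr",
    save_total_limit=5,
    save_steps=2000,
    logging_steps=2000,
    evaluation_strategy="steps",
    fp16=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=8)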

Output

{'loss': 7.7644, 'learning_rate': 1.999934999935e-05, 'epoch': 0.0}
{'eval_loss': 7.048653602600098, 'eval_runtime': 2412.6769, 'eval_samples_per_second': 28.34, 'eval_steps_per_second': 3.543, 'epoch': 0.0}
Prompt: Je suis vraiment
Output: Je suis vraiment
Prompt: Comment
Output: Comment
Prompt: Il fait
Output: Il fait

Interface

Now we can build an interface using Hugging Face Spaces and Streamlit.

Here is the interface where you can test my small model for French and Japanese.
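A minimal Streamlit app could look like the sketch below (assuming the tokenizer and model files produced by the training script; the widget layout is just a suggestion):

import streamlit as st
import torch
from transformers import PreTrainedTokenizerFast

lang = "fr"
tokenizer = PreTrainedTokenizerFast(tokenizer_file="byte-level-bpe_{}.tokenizer.json".format(lang), padding_side='left')
tokenizer.pad_token_id, tokenizer.bos_token_id, tokenizer.eos_token_id = 0, 1, 2

# The training script saved the whole model object, so torch.load restores it directly
model = torch.load("final_model_{}.pth".format(lang), map_location="cpu")
model.eval()

st.title("Small Bloom ({})".format(lang))
prompt = st.text_input("Prompt", "Je suis vraiment ")
if st.button("Generate"):
    inputs = tokenizer(prompt, return_tensors='pt')
    output = model.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'],
                            max_length=64, pad_token_id=tokenizer.eos_token_id)
    st.write(tokenizer.decode(output[0], skip_special_tokens=True))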

Conclusion

Instead of relying on big companies’ huge and expensive models, build a model optimized for your own needs and stay independent.

If you want me to do it for you, feel free to contact me.

Note: this article will be improved later.
