Sentiment Analysis - Training a RoBERTa Model
This blog post showcases how you can train the RoBERTa Large Language Model on your local machine using your preferred dataset.
AI IN THE CONSTRUCTION INDUSTRY
Mohamed Ashour
12/23/2023
7 min read


Overview
The RoBERTa Large Language Model can be deployed locally for text classification and for understanding possible trends within a text dataset. I am a strong advocate of deploying Large Language Models on powerful local machines rather than being fully dependent on cloud computing services.
Dataset description
The dataset at hand comprises two columns: Sentences and Sentiments. It was collected from a data lake of client feedback and contains circa 6,000 rows in a CSV file. The dataset is confidential and therefore will not be shared in this post.
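For illustration only (the file name below is a placeholder, since the real data is confidential), a dataset with this shape can be inspected with pandas as follows:
import pandas as pd

# Placeholder file name; the actual client-feedback CSV is not shared.
df = pd.read_csv("client_feedback.csv")

print(df.shape)                        # roughly (6000, 2)
print(df.columns.tolist())             # ['Sentences', 'Sentiments']
print(df["Sentiments"].value_counts()) # distribution of sentiment labels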
Computing specifications
Training a large language model requires a substantial amount of GPU computing power. The specifications of the personal computer I am using are listed below, followed by a quick check that the GPU is visible to the training framework:
CPU: core i9 12900K
GPU: Nvidia RTX 3090 24 GB GDDR6
RAM: 2x16 GB DDR5 4800 MHz
Windows: 11 Home Edition
Python: version 3.9.4
IDE: VScode
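Before starting a multi-hour training run, it is worth confirming that the GPU is actually visible. A minimal sketch using PyTorch, the framework the transformers Trainer used later in this post runs on:
import torch

# Confirm that the RTX 3090 is visible before starting training.
print(torch.cuda.is_available())           # expected: True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. the installed NVIDIA GPU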
Model Training
The localized training of the large language model passes by the following steps:
Step 1 - Importing required libraries and sorting out any dependencies
Step 2 - Loading the pre-trained model with its default weights and its checkpoints if applicable
Step 3 - Loading the user dataset, understanding the data profile, undertaking the required cleansing and presenting the final format of the data
Step 4 - Setting up the data collator using the loaded tokenizer
Step 5 - Defining the model training arguments and their parameters as well as the model trainer arguments and relevant parameters
Step 6 - Training the model on the dataset
Step 7 - Deployment of the trained large language model locally on your machine
Step 1: Importing The required libraries
Dealing with large language models requires importing the following libraries:
Pandas & NumPy --> used for arrays, matrices, data profiling, cleansing and formatting.
TensorFlow --> a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.
Transformers --> provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
Scikit-learn --> a Python library that provides many supervised and unsupervised learning algorithms. It is built on top of libraries you might already be familiar with, such as NumPy, pandas and Matplotlib.
Tokenizers --> the Hugging Face library that provides fast, ready-to-use tokenizer implementations for the Transformers models.
The full code for this step is as shown below:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import *
import tokenizers

# Optional: list the files available in the input directory
# (this path is Kaggle-style; point it at your own data folder when running locally).
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
Step 2: Loading the required pre-trained Large Language Model
This step is about a comprehensive setup for a natural language processing (NLP) task utilizing the RoBERTa model, a variant of the BERT model known for its effectiveness in NLP. This begins by importing essential classes from the `transformers` library: `RobertaTokenizer` for converting text into tokens compatible with the RoBERTa model, and `RobertaForMaskedLM` specifically tailored for masked language modeling tasks.
The tokenizer is initialized with a pre-trained `roberta-base` model, which is the foundational version of RoBERTa trained on a vast text corpus. This tokenizer effectively formats text for the RoBERTa model. Additionally, the `RobertaForMaskedLM.from_pretrained('roberta-base')` command loads a pre-trained RoBERTa model configured for masked language modeling, where the model predicts hidden words within a text.
Finally, the `LineByLineTextDataset` class needs to be imported for creating and managing datasets where each text line is an independent data point, essential for training or evaluating the language model. This setup is fundamental for executing sophisticated NLP tasks using the advanced capabilities of the RoBERTa model.
The full code for this step is as shown below:
from transformers import RobertaTokenizer, RobertaForMaskedLM
from transformers import LineByLineTextDataset
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')
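As a quick, optional check (the sentence below is just an illustration), the loaded tokenizer can be used to confirm that raw text is converted into token IDs as expected:
# Illustrative sentence; any text will do.
encoded = tokenizer("The project was delivered on time.")
print(encoded["input_ids"])                                   # list of token IDs
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding tokens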
Step 3: Loading the dataset
This step relies on creating an instance of the `LineByLineTextDataset` class, which is part of the `transformers` library, specifically designed for processing textual data in natural language processing tasks.
This class is initialized with three key parameters: `tokenizer`, `file_path`, and `block_size`. The `tokenizer` parameter is an instance of a tokenizer, here presumably one compatible with the RoBERTa model, used to convert the raw text into a format suitable for model processing.
The `file_path` parameter specifies the location of the dataset file, which contains the textual data to be processed; this dataset is expected to have one sentence per line, aligning with the line-by-line processing nature of the class. Lastly, `block_size` sets the maximum length of the sequences to be tokenized, in this case, 512 tokens.
This means the tokenizer will process each line of the dataset up to a maximum of 512 tokens, a common practice to maintain consistency in input size for training language models like RoBERTa.
The full code for this step is as shown below:
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="PATH TO THE DATASET FILE",
    block_size=512,
)
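Because the dataset described earlier is a CSV file, it first has to be written out as a plain-text file with one sentence per line before it can be passed to `LineByLineTextDataset`. A minimal sketch (the file names are placeholders, and the Sentences column follows the dataset description above):
import pandas as pd

# Placeholder file names; adjust to your own paths.
df = pd.read_csv("client_feedback.csv")

# Write one cleaned sentence per line, matching the line-by-line format
# expected by LineByLineTextDataset.
with open("feedback_sentences.txt", "w", encoding="utf-8") as f:
    for sentence in df["Sentences"].dropna():
        f.write(str(sentence).replace("\n", " ").strip() + "\n")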
Step 4: Setting up the data collator using the loaded tokenizer
This step imports the DataCollatorForLanguageModeling class from the transformers library, which is used to prepare batches of data for training language models, particularly for masked language modeling (MLM) tasks.
The DataCollatorForLanguageModeling class is initialized with three key parameters: tokenizer, mlm and mlm_probability. The tokenizer is an instance of a tokenizer compatible with the model being trained, and it is used to tokenize input text. The mlm parameter set to True specifies that the data collator should prepare data for a masked language modeling task, where certain tokens in the input are randomly masked and the model is trained to predict these masked tokens. The mlm_probability parameter, set here to 0.15, defines the probability with which tokens in the input will be masked.
This setup is commonly used in training BERT-like models, where understanding context and predicting missing words is a crucial part of the learning process.
The full code for this step is as shown below:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
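To see what the collator actually produces (an illustrative check, not part of the original walkthrough), you can pass it a single tokenized sentence; roughly 15% of the tokens are selected for masking (most of which are replaced by the mask token), and the labels keep the original IDs only at the selected positions:
# Illustrative sentence; tokenize it to a plain list of token IDs.
sample_ids = tokenizer("The client was satisfied with the service.")["input_ids"]

batch = data_collator([sample_ids])
print(tokenizer.decode(batch["input_ids"][0]))  # some tokens may appear as <mask>
print(batch["labels"][0])                       # -100 everywhere except the selected positions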
Step 5: Collating the required parameters for the model and the training process
This step is used to set up and initiate the training process of a language model using the transformers library, specifically for retraining or fine-tuning a model like RoBERTa.
Initially, the Trainer and TrainingArguments classes need to be imported. TrainingArguments is then instantiated with various parameters: output_dir specifies the directory where the output (such as trained model files) will be saved, overwrite_output_dir=True allows overwriting in that directory, num_train_epochs=25 defines the number of training epochs, per_device_train_batch_size=48 sets the batch size for each training device, save_steps=500 determines after how many steps a checkpoint is saved, save_total_limit=2 limits the number of saved checkpoints to two, and seed=1 sets the random seed for reproducibility.
Finally, a Trainer object is created, which takes the previously defined model, training arguments, data collator (for batching data), and training dataset. This setup is integral for training or fine-tuning transformer-based models like RoBERTa, particularly when handling large datasets and aiming for efficient and effective model performance.
The full code for this step is as shown below:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir="./roberta-retrained",
    overwrite_output_dir=True,
    num_train_epochs=25,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)
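To put these numbers in context (approximate arithmetic based on the dataset size quoted earlier): with circa 6,000 rows and a per-device batch size of 48, one epoch is roughly 125 optimisation steps, so 25 epochs amount to roughly 3,125 steps. With save_steps=500 a checkpoint is therefore written about every four epochs, and save_total_limit=2 keeps only the two most recent checkpoints.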
Step 6: Undertaking the actual training of the model using the loaded dataset
This step involves two primary operations using the Trainer class from the Hugging Face transformers library, commonly used for training and fine-tuning transformer models. First, trainer.train() initiates the training process of the pre-defined model (here, the RoBERTa model loaded in Step 2).
This training uses the parameters, dataset, and data collator specified during the Trainer object's initialization. The process involves going through the dataset in multiple epochs, adjusting the model's weights to minimize the loss function.
After training, trainer.save_model("./roberta-retrained") saves the trained model to a specified directory, in this case, ./roberta-retrained. This function ensures that all components of the trained model, including its configuration, weights, and possibly the tokenizer, are saved in a way that they can be easily reloaded for future inference or continued training.
This approach is crucial for preserving the state of a model post-training, allowing for its deployment or further fine-tuning.
The full code for this step is as shown below:
trainer.train()
trainer.save_model("./roberta-retrained")
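One optional addition (not part of the original snippet): because the Trainer above was not given the tokenizer, it typically does not save one alongside the model, so it can be worth saving the tokenizer into the same directory to make it self-contained:
# Save the tokenizer next to the retrained model so the directory can be
# reloaded on its own for inference later.
tokenizer.save_pretrained("./roberta-retrained")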
Step 7: Deploying the large language model locally on your machine
This step sets up and uses a masked language model pipeline from the Hugging Face transformers library, specifically using a retrained RoBERTa model. The pipeline function is a high-level utility that abstracts away much of the complexity involved in setting up and using transformer models.
In this instance, it creates a fill-mask pipeline, which is designed for tasks where certain words in a sentence are masked and the model predicts the most likely replacements for these masks. The pipeline is configured with the custom model located at ./roberta-retrained, the version of RoBERTa retrained in the previous steps, and with tokenizer="roberta-base", which is responsible for preprocessing the text into a format suitable for the model.
Finally, the fill_mask(" The desired prompt<mask>") function call uses the pipeline to predict the word that best completes the sentence, replacing the <mask> token with the model's prediction. This setup is particularly useful for applications involving contextual word prediction and understanding language nuances.
The full code for this step is as shown below:
from transformers import pipeline
fill_mask = pipeline(
    "fill-mask",
    model="./roberta-retrained",
    tokenizer="roberta-base"
)
fill_mask(" The desired prompt<mask>")
Conclusion
In this instance, the model accuracy did not change much from the original result. I had previously undertaken extensive work with the RoBERTa model and obtained an accuracy of circa 65% with the same dataset, and the accuracy in this instance was roughly the same. It is worth mentioning that the training took around three and a half hours on a dataset of circa 6,000 rows.
From this training exercise, I noticed the following points in the dataset which may have adversely impacted the accuracy:
The clients were working in different sectors, which could create a large disparity between their levels of satisfaction or dissatisfaction.
The dataset contained a large number of lengthy feedback entries that mixed positive and negative points in the same sentence, which made the model harder to train.
There was a wide variety of unreadable characters within the Sentences column, which made it difficult to formulate one common cleansing routine for all rows.
There was no timestamp for the sentences, which prevented me from understanding the analysis timeframe.
The Large Language Model used (RoBERTa in this case) is one of the smaller models available, so the resulting accuracy is unlikely to support a commercialised solution.
Resources
Multiple resources allowed me to undertake this exercise. The exact circumstances will differ from one person to another due to the Python dependencies that need to be installed on the user's machine. For clarity, the original article that helped me carry out this exercise is: https://towardsdatascience.com/transformers-retraining-roberta-base-using-the-roberta-mlm-procedure-7422160d5764