Published on March 28, 2023
Introduction to NLP
To understand BERT thoroughly, one needs to understand NLP, Transformers, and how transformers are used in NLP. Computers have historically had difficulty “understanding” language. Although computers are capable of reading, storing, and collecting textual input, they lack basic language context. Natural Language Processing (NLP), a branch of artificial intelligence, was therefore developed with the aim of making computers capable of reading, analysing, interpreting, and making sense of written and spoken language. To help computers “understand” human language, NLP combines linguistics, statistics, and machine learning.
Overview of the Transformer in NLP
BERT is built on the Transformer, a family of neural network architectures. The Transformer is based on the principle of self-attention and was introduced in the paper “Attention Is All You Need”. Self-attention means learning to weigh the significance of each word in relation to every other word in the input sequence. Attention, a powerful deep-learning technique, is the key to how transformers operate. Consider an analogy: can a human remember everything they saw on a given day? Certainly not! Our brains have limited but valuable memory, and our capacity to forget minor inputs aids our recall. In a similar vein, machine learning models need to learn to concentrate solely on the relevant information rather than spend computational resources processing irrelevant data. Transformers achieve this by assigning differential weights that signal which words in a sentence are most important for processing. The Transformer accomplishes this with two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. BERT, however, does not employ a decoder.
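To make self-attention concrete, below is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of the Transformer. It is an illustration only, not BERT’s actual implementation; the token vectors and weight matrices are random placeholders.

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Project the token vectors into queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Score every token against every other token, scaled by the key dimension
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax turns the scores into attention weights that sum to 1 per token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all value vectors in the sequence
    return weights @ V

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 8)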
How does BERT work?
Training with BERT relies on publicly released pre-trained BERT models. These models are offered with a range of parameter counts, from roughly 110 million (“BERT-BASE”) to roughly 340 million (“BERT-LARGE”). The architectures differ mainly in depth and width: BERT-BASE has 12 encoder layers and a hidden size of 768, BERT-LARGE has 24 encoder layers and a hidden size of 1,024, and smaller community variants go down to 2 encoder layers and a hidden size of 128.
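These architectural details can be checked directly with the Hugging Face transformers library; a small sketch follows (the printed values are the standard configurations of the two checkpoints):

from transformers import BertConfig

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)
    print(name,
          "- encoder layers:", config.num_hidden_layers,
          "hidden size:", config.hidden_size,
          "attention heads:", config.num_attention_heads)
# bert-base-uncased  - encoder layers: 12 hidden size: 768 attention heads: 12
# bert-large-uncased - encoder layers: 24 hidden size: 1024 attention heads: 16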
A sequence of tokens is the input to the BERT encoder. The tokens are first turned into vectors and then processed by the neural network. However, before processing can begin, BERT requires the input to be augmented with additional metadata: token embeddings that represent each token, segment embeddings that mark which sentence a token belongs to, and position embeddings that encode where each token sits in the sequence.
The inputs must first be tokenised with the BERT tokeniser so that they follow the BERT tokenisation approach. Two NLP tasks are used in BERT pre-training: Masked Language Modelling (MLM), where the model predicts randomly masked words from their surrounding context, and Next Sentence Prediction (NSP), where the model predicts whether one sentence follows another.
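Masked Language Modelling is easy to see in action with the transformers fill-mask pipeline. This is a quick illustration of the pre-training objective, separate from the stress-detection project below; the exact predictions will depend on the model version.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT fills in the [MASK] token using both the left and the right context
for prediction in unmasker("I feel very [MASK] before my exam."):
    print(prediction["token_str"], round(prediction["score"], 3))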
BERT converts words into numbers, and this step is important because machine learning models take numbers, not words, as input. It is what allows you to train machine learning models on textual data: BERT transforms your text so that it can be combined with other types of data and fed into a machine learning model to make predictions.
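As a short illustration of this conversion (the example sentence is arbitrary), the BERT tokeniser maps text to integer token IDs and adds the special [CLS] and [SEP] tokens that BERT expects:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I am feeling stressed today")
print(encoded["input_ids"])                                   # integer IDs, one per token plus [CLS]/[SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the tokens those IDs stand for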
What makes BERT different?
Unlike some other large language models such as GPT-3, BERT’s source code is publicly accessible (view BERT’s code on GitHub), which allows BERT to be used much more widely around the world. Developers can now get a cutting-edge model like BERT up and running quickly without spending a lot of time or money; instead, they can focus on fine-tuning BERT to tailor the model’s performance to their specific tasks. It is also worth keeping in mind that if one does not want to fine-tune BERT, thousands of free, open-source, pre-trained BERT models are already offered for particular use cases. Because BERT is a pre-trained model, fine-tuning it needs much less data, lets you choose which layers to tune, and takes advantage of transfer learning, so a fine-tuned model can be put to use almost immediately. BERT is also available pre-trained in more than 100 languages, which can be useful for projects that are not English-based.
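As a hedged sketch of the “choose which layers to tune” idea, the pre-trained encoder can be frozen so that only the small classification head is trained; whether this is appropriate depends on the task and the amount of data:

from transformers import TFBertForSequenceClassification

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Freeze the ~110M pre-trained encoder weights; only the classifier layer stays trainable
model.bert.trainable = False
model.summary()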
Implementation of BERT for Psychological Stress Detection
Now that we understand the fundamental ideas behind BERT, let’s take a look at a real-world example. For this guide, the dataset was collected from the website Tweet Sentiment to CSV, and each tweet is labelled with either a ‘Stress’ or a ‘Relax’ mood.
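A minimal sketch of loading such a dataset is shown below; the file name and the “text”/“label” column names are assumptions about the CSV export, with label 1 standing for ‘Stress’ and 0 for ‘Relax’:

import pandas as pd

df = pd.read_csv("stress_relax_tweets.csv")   # assumed file name
sentences = df["text"]                        # raw tweet text (assumed column)
labels = df["label"].values                   # 1 = Stress, 0 = Relax (assumed column)
print(df.shape)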
The model has been trained using the pre-trained BERT model in the following manner:
Tokenisation
The tokeniser for BERT has been created from the pre-trained model “bert-base-uncased”. In this context, the BertTokenizer class has been used.
Data Encoding
The data (train and test) has been encoded using the tokeniser, and the final train and test data have been generated. In this context, truncation=True and padding=True have been used as the encoding parameters.
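One way this encoding step can look is sketched below; train_texts and test_texts are assumed to hold the raw tweets after a train/test split, and the batched tokeniser call pads every sequence to a common length and truncates anything longer than max_length:

# Batched encoding: padding and truncation give fixed-length inputs for the model
train_encodings = bert_tokenizer(list(train_texts), truncation=True, padding=True, max_length=64)
test_encodings = bert_tokenizer(list(test_texts), truncation=True, padding=True, max_length=64)
print(len(train_encodings["input_ids"][0]))   # every sequence now shares the same length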
Model Training
The model has been trained on the generated train data with the following parameters:
learning_rate = 5e-5
epsilon = 1e-07
optimizer = Adam
loss = model_type.compute_loss
metrics = 'accuracy'
training data batch size = 16
validation data batch size = 16
epochs = 10
batch_size = 20
Install necessary libraries and packages such as:
!pip install transformers
from transformers import BertTokenizer, TFBertModel, BertConfig, TFBertForSequenceClassification
Using the pre-trained BERT model
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
Implementation of attention masking
import numpy as np

input_ids = []
attention_masks = []

# sentences: iterable of raw tweet texts; labels: the matching 0/1 mood labels
for sent in sentences:
    sent = str(sent)
    bert_inp = bert_tokenizer.encode_plus(sent, add_special_tokens=True, max_length=64,
                                          pad_to_max_length=True, return_attention_mask=True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])

input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
labels = np.array(labels)
Model training and fitting:
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score, classification_report

# train_inp/val_inp, train_mask/val_mask and train_label/val_label are assumed to come
# from splitting input_ids, attention_masks and labels into train and validation sets.
log_dir = 'tensorboard_logs'          # placeholder directory for TensorBoard logs
model_save_path = 'bert_model.h5'

# Save the best weights (lowest validation loss) and log training to TensorBoard
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path, save_weights_only=True,
                                       monitor='val_loss', mode='min', save_best_only=True),
    tf.keras.callbacks.TensorBoard(log_dir=log_dir)
]

print('\nBert Model', bert_model.summary())

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08)

bert_model.compile(loss=loss, optimizer=optimizer, metrics=[metric])

history = bert_model.fit(
    [train_inp, train_mask],
    train_label,
    batch_size=32,
    epochs=10,
    validation_data=([val_inp, val_mask], val_label),
    callbacks=callbacks
)

# Reload the best checkpoint into a fresh model and evaluate it on the validation set
trained_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
trained_model.compile(loss=loss, optimizer=optimizer, metrics=[metric])
trained_model.load_weights(model_save_path)

preds = trained_model.predict([val_inp, val_mask], batch_size=32)
pred_labels = [np.argmax(np.array(logits), axis=0) for logits in preds[0]]

f1 = f1_score(val_label, pred_labels)
print('F1 score', round(f1, 4) * 100, "%")
print('Classification Report')
print(classification_report(val_label, pred_labels, target_names=['Stress', 'Relax']))
After classifying the tweets, the confusion matrix has been visualised. Of the 334 validation observations, 325 are predicted correctly and 9 are misclassified.
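A minimal sketch of this visualisation with scikit-learn and matplotlib (the label names follow the order used in the classification report above):

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are the true moods, columns the predicted moods
cm = confusion_matrix(val_label, pred_labels)
ConfusionMatrixDisplay(cm, display_labels=['Stress', 'Relax']).plot(cmap='Blues')
plt.show()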
Conclusion
Undoubtedly, BERT represents a milestone in machine learning’s application to natural language processing. Future practical applications are anticipated to be numerous given how easy it is to use and how quickly it can be fine-tuned. It’s not an exaggeration to say that BERT has significantly altered the NLP landscape.