
Code example of fine-tuning BART with Huggingface: training a new tokenizer on the WMT16 dataset for translation

王林
2023-04-10 14:41:06

If you want to test a new architecture on a translation task, for example by training a new tokenizer on a custom dataset, handling everything yourself can be cumbersome. In this article I will walk through the preprocessing steps for adding a new tokenizer and show how to fine-tune the model with it.

Because the Huggingface Hub hosts many pre-trained models, it is easy to find a pre-trained tokenizer. But adding a new one can be a little tricky, so let's go through how to implement it from start to finish. First, load and preprocess the dataset.

Loading Dataset

We use the WMT16 dataset and its Romanian-English subset. The load_dataset() function will download and load any available dataset from Huggingface.

import datasets

dataset = datasets.load_dataset("stas/wmt16-en-ro-pre-processed", cache_dir="./wmt16-en_ro")

[Figure 1: contents of the loaded WMT16 en-ro dataset]

The contents of the dataset are shown in Figure 1 above. We need to "flatten" it so that we can access the data more easily, and then save it to disk.

def flatten(batch):
    batch['en'] = batch['translation']['en']
    batch['ro'] = batch['translation']['ro']

    return batch

# Map the 'flatten' function
train = dataset['train'].map(flatten)
test = dataset['test'].map(flatten)
validation = dataset['validation'].map(flatten)

# Save to disk
train.save_to_disk("./dataset/train")
test.save_to_disk("./dataset/test")
validation.save_to_disk("./dataset/validation")

As you can see in Figure 2 below, the nested "translation" field has been flattened out of the dataset.

[Figure 2: the flattened dataset]

Tokenizer

The Huggingface tokenizers library provides everything needed to train a tokenizer. A tokenizer consists of four basic components (although not all four are required):

Models: how the tokenizer breaks down each word. For example, given the word "playing": i) a BPE model decomposes it into the two tokens "play" and "ing", while ii) WordLevel treats it as a single token.

Normalizers: transformations applied to the raw text, such as Unicode normalization, lowercasing, or stripping characters.

Pre-Tokenizers: functions that control how the text is split before it reaches the model. For example, should the number 100 be treated as the single piece "100" or as "1", "0", "0"? (A short sketch after this list illustrates the difference.)

Post-Processors: post-processing that depends on the chosen pre-trained model, for example wrapping the input with the special tokens (such as beginning-of-sentence and end-of-sentence markers) that models like BERT expect.
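To make this concrete, here is a minimal sketch (my own addition, not from the original article, using only the tokenizers library) showing how the choice of pre-tokenizer decides whether "100" stays whole or is split into individual digits:

from tokenizers import pre_tokenizers

text = "it costs 100 dollars"

# Whitespace keeps the number together as a single piece "100"
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))

# Digits(individual_digits=True) isolates every digit: "1", "0", "0"
print(pre_tokenizers.Digits(individual_digits=True).pre_tokenize_str(text))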

The code below uses a BPE model, a lowercasing normalizer and a whitespace pre-tokenizer, and then initializes the trainer object. The main settings are:

1. The vocabulary size is set to 50265 to match BART's English tokenizer.

2. The special tokens, such as <s>, <pad>, </s>, <unk> and <mask>.

3. The initial alphabet, a predefined list of characters from which the vocabulary is built.

from tokenizers import normalizers, pre_tokenizers, Tokenizer, models, trainers

# Build a tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.normalizer = normalizers.Lowercase()
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50265,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

The final step is to connect the trainer and the BPE model and pass in the dataset. Depending on where the data comes from, different training functions can be used; here we use train_from_iterator().

def batch_iterator():
    batch_length = 1000
    for i in range(0, len(train), batch_length):
        yield train[i : i + batch_length]["ro"]

bpe_tokenizer.train_from_iterator(batch_iterator(), length=len(train), trainer=trainer)

bpe_tokenizer.save("./ro_tokenizer.json")

BART fine-tuning

The new Romanian tokenizer is now ready to use.

from transformers import AutoTokenizer, PreTrainedTokenizerFast

en_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Load the Romanian tokenizer trained above from its saved JSON file
ro_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./ro_tokenizer.json")
ro_tokenizer.pad_token = en_tokenizer.pad_token

def tokenize_dataset(sample):
    input = en_tokenizer(sample['en'], padding='max_length', max_length=120, truncation=True)
    label = ro_tokenizer(sample['ro'], padding='max_length', max_length=120, truncation=True)

    input["decoder_input_ids"] = label["input_ids"]
    input["decoder_attention_mask"] = label["attention_mask"]
    input["labels"] = label["input_ids"]

    return input

train_tokenized = train.map(tokenize_dataset, batched=True)
test_tokenized = test.map(tokenize_dataset, batched=True)
validation_tokenized = validation.map(tokenize_dataset, batched=True)

In the code above, the padding token has to be set on the Romanian tokenizer (it is copied from the English one), because the Romanian tokenizer is then used with padding when tokenizing the target sequences, so that all inputs have the same length.
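As a quick sanity check (my own addition, not from the original notebook, with made-up example sentences), tokenizing two Romanian sentences of different lengths after setting pad_token yields input_ids of identical length, with the padded positions marked by zeros in attention_mask:

# Hypothetical sentences, only to illustrate the effect of padding
enc = ro_tokenizer(
    ["o propoziție scurtă", "o propoziție ceva mai lungă decât prima"],
    padding='max_length', max_length=120, truncation=True
)
print(len(enc["input_ids"][0]), len(enc["input_ids"][1]))  # both 120
print(enc["attention_mask"][0][-5:])  # trailing zeros mark the padded positions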

The following is the training process:

from transformers import BartForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    predict_with_generate=True,
    logging_steps=2,    # set to 1000 for full training
    save_steps=64,      # set to 500 for full training
    eval_steps=64,      # set to 8000 for full training
    warmup_steps=1,     # set to 2000 for full training
    max_steps=128,      # delete for full training
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=False,         # True if GPU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=validation_tokenized,
)

trainer.train()

The process is straightforward: load the BART base model, define the training arguments, bind everything together with a Seq2SeqTrainer object, and start training with trainer.train(). The hyperparameters above are set for a quick test run; to get the best results you will need to tune them for a full training.

Inference

Inference is also very simple: just load the fine-tuned model and call its generate() method to translate. The only thing to pay attention to is using the appropriate tokenizer for the source (EN) and target (RO) sequences.
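As a rough sketch (not part of the original notebook; the checkpoint directory name is hypothetical and depends on what the Trainer saved), inference could look like this:

from transformers import BartForConditionalGeneration

# Hypothetical checkpoint path; use whichever directory trainer.train() produced
model = BartForConditionalGeneration.from_pretrained("./checkpoint-128")

# The source (EN) sentence goes through the English tokenizer
inputs = en_tokenizer("The weather is nice today.", return_tensors="pt")

# Generate the translation and decode it with the Romanian (target) tokenizer
output_ids = model.generate(inputs["input_ids"], max_length=120)
print(ro_tokenizer.decode(output_ids[0], skip_special_tokens=True))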

Summary

While tokenization may seem like a basic operation in natural language processing (NLP), it is a critical step that should not be overlooked. HuggingFace makes tokenizers so easy to use that it is tempting to forget the underlying principles and rely solely on pre-trained models. But when we want to train a new model ourselves, understanding the tokenization process and its impact on downstream tasks is essential, so it is worth being familiar with and mastering this basic operation.

Code of this article: https://github.com/AlaFalaki/tutorial_notebooks/blob/main/translation/hf_bart_translation.ipynb

