Code example of Huggingface fine-tuning BART: training a new tokenizer on the WMT16 dataset for translation
If you want to test a new architecture on a translation task, for example by training a new tokenizer on a custom dataset, the preprocessing can be cumbersome. In this article I will walk through the preprocessing steps needed to add a new tokenizer and then show how to fine-tune the model.
The Huggingface Hub hosts many pre-trained models, so finding a pre-trained tokenizer is easy; adding a new one, however, can be a little tricky. Let's go through the whole process, starting with loading and preprocessing the dataset.
We use the Romanian-English subset of the WMT16 dataset. The load_dataset() function will download and load any dataset available on the Huggingface Hub.
import datasets

dataset = datasets.load_dataset(
    "stas/wmt16-en-ro-pre-processed", cache_dir="./wmt16-en_ro"
)
Figure 1 shows the contents of the dataset. We need to "flatten" it so the data is easier to access, and then save it to disk.
def flatten(batch):
    batch['en'] = batch['translation']['en']
    batch['ro'] = batch['translation']['ro']
    return batch

# Map the 'flatten' function and drop the nested 'translation' column
train = dataset['train'].map(flatten, remove_columns=['translation'])
test = dataset['test'].map(flatten, remove_columns=['translation'])
validation = dataset['validation'].map(flatten, remove_columns=['translation'])

# Save to disk
train.save_to_disk("./dataset/train")
test.save_to_disk("./dataset/test")
validation.save_to_disk("./dataset/validation")
As you can see in Figure 2, the nested "translation" column has been removed from the dataset.
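If you want to double-check the result, here is a minimal sketch (the paths simply match the save_to_disk() calls above) that reloads a split from disk and inspects the first record:

from datasets import load_from_disk

train = load_from_disk("./dataset/train")
print(train)           # shows the remaining columns and the number of rows
print(train[0]["en"])  # English sentence of the first record
print(train[0]["ro"])  # its Romanian counterpart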
The tokenizers library provides everything needed to train a tokenizer. It is built from four basic components (not all four are required in every case):
Models: how the tokenizer breaks each word down. For example, given the word "playing": i) a BPE model splits it into the two tokens "play" and "ing", ii) WordLevel treats it as a single token.
Normalizers: transformations applied to the raw text, such as Unicode normalization, lowercasing, or stripping content.
Pre-Tokenizers: functions that give more flexibility over how the text is split. For example, how should numbers be handled: should 100 be treated as "100" or as "1", "0", "0"?
Post-Processors: post-processing specific to the chosen pre-trained model. For example, adding [BOS] (beginning of sentence) or [EOS] (end of sentence) tokens to BERT-style input (a small sketch follows this list).
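To make the last component concrete, here is a minimal, hypothetical sketch of a post-processor built with tokenizers.processors.TemplateProcessing; the [BOS]/[EOS] names and the ids 1 and 2 are placeholders for illustration, not values taken from this article's tokenizer:

from tokenizers import Tokenizer, models
from tokenizers.processors import TemplateProcessing

# Wrap every encoded sequence in [BOS] ... [EOS]
tok = Tokenizer(models.BPE())
tok.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    special_tokens=[("[BOS]", 1), ("[EOS]", 2)],
)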
The code below uses a BPE model, a lowercasing normalizer, and a whitespace pre-tokenizer. The trainer object is then initialized with default values, the main ones being:
1. A vocabulary size of 50265, to stay consistent with BART's English tokenizer.
2. Special tokens, such as <s>, <pad>, </s>, <unk> and <mask>.
3. An initial alphabet, a predefined list of characters used to seed the vocabulary (here the ByteLevel alphabet).
from tokenizers import normalizers, pre_tokenizers, Tokenizer, models, trainers

# Build a tokenizer
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.normalizer = normalizers.Lowercase()
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50265,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
The final step is to connect the trainer to the BPE model and pass in the dataset. Depending on where the data comes from, different training functions can be used; here we use train_from_iterator().
def batch_iterator():
    batch_length = 1000
    for i in range(0, len(train), batch_length):
        yield train[i : i + batch_length]["ro"]

bpe_tokenizer.train_from_iterator(
    batch_iterator(), length=len(train), trainer=trainer
)
bpe_tokenizer.save("./ro_tokenizer.json")
The new tokenizer is now available.
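As an optional sanity check, this small sketch (the sentence is just an example) reloads the saved tokenizer with the tokenizers API and encodes a Romanian sentence:

from tokenizers import Tokenizer

ro_check = Tokenizer.from_file("./ro_tokenizer.json")
encoded = ro_check.encode("Bună dimineața!")
print(encoded.tokens)  # the BPE pieces produced by the newly trained tokenizer
print(encoded.ids)     # their ids in the new vocabulary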
from transformers import AutoTokenizer, PreTrainedTokenizerFast

en_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
# Load the newly trained tokenizer from its JSON file
ro_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./ro_tokenizer.json")
ro_tokenizer.pad_token = en_tokenizer.pad_token

def tokenize_dataset(sample):
    input = en_tokenizer(sample['en'], padding='max_length', max_length=120, truncation=True)
    label = ro_tokenizer(sample['ro'], padding='max_length', max_length=120, truncation=True)
    input["decoder_input_ids"] = label["input_ids"]
    input["decoder_attention_mask"] = label["attention_mask"]
    input["labels"] = label["input_ids"]
    return input

train_tokenized = train.map(tokenize_dataset, batched=True)
test_tokenized = test.map(tokenize_dataset, batched=True)
validation_tokenized = validation.map(tokenize_dataset, batched=True)
Note the line that sets the padding token for the Romanian tokenizer (ro_tokenizer.pad_token = en_tokenizer.pad_token). It is needed because the tokenizer is later called with padding='max_length' inside tokenize_dataset(), so that all inputs end up the same length.
The following is the training process:
from transformers import BartForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    predict_with_generate=True,
    logging_steps=2,    # set to 1000 for full training
    save_steps=64,      # set to 500 for full training
    eval_steps=64,      # set to 8000 for full training
    warmup_steps=1,     # set to 2000 for full training
    max_steps=128,      # delete for full training
    overwrite_output_dir=True,
    save_total_limit=3,
    fp16=False,         # True if GPU
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=validation_tokenized,
)

trainer.train()
The process is straightforward: load the BART base model, set the training arguments, bind everything together with the Seq2SeqTrainer object, and start training with trainer.train(). The hyperparameters above are for quick testing only, so if you want the best results you will need to tune them properly; they are enough, however, to run the example end to end.
Inference is also simple: load the fine-tuned model and call its generate() method to translate. Just make sure the source (EN) and target (RO) sequences are each handled with the appropriate tokenizer.
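Here is a rough sketch of that inference step; the checkpoint path "./checkpoint-128" is an assumption and depends on where Seq2SeqTrainer saved your model:

from transformers import AutoTokenizer, BartForConditionalGeneration, PreTrainedTokenizerFast

# Assumed checkpoint path: use whichever checkpoint the trainer wrote for you
model = BartForConditionalGeneration.from_pretrained("./checkpoint-128")
en_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")            # source (EN) tokenizer
ro_tokenizer = PreTrainedTokenizerFast(tokenizer_file="./ro_tokenizer.json")  # target (RO) tokenizer

inputs = en_tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_length=120)
print(ro_tokenizer.decode(output_ids[0], skip_special_tokens=True))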
Tokenization may look like a basic operation in natural language processing (NLP), but it is a critical step that should not be overlooked. Huggingface makes it so convenient that it is easy to forget the underlying principles and simply rely on pre-trained models. When you want to train a new model yourself, however, understanding the tokenization process and its impact on downstream tasks is essential, so it is worth being familiar with and mastering this basic operation.
Code of this article: https://github.com/AlaFalaki/tutorial_notebooks/blob/main/translation/hf_bart_translation.ipynb