2024 is a year of rapid development for large language models (LLMs). In LLM training, alignment methods are a key technique. They include supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which relies on human preferences. These methods have played a crucial role in the development of LLMs, but they require a large amount of manually annotated data. Faced with this challenge, fine-tuning has become a vibrant area of research, with researchers actively working on methods that make more effective use of human data. Progress in alignment methods should in turn drive further advances in LLM technology.

Researchers at the University of California, Los Angeles (UCLA) recently introduced a technique called SPIN (Self-Play fIne-tuNing). SPIN borrows the self-play mechanism that proved successful in game-playing systems such as AlphaGo Zero and AlphaZero and lets an LLM (large language model) play against itself. This removes the need for additional expert annotators, whether humans or more advanced models (such as GPT-4). SPIN training proceeds over a series of iterations: in each one, a new language model is trained to distinguish responses generated by its own previous iteration from human-written responses. The ultimate goal is a language model whose responses are indistinguishable from human responses. The research aims to improve the self-learning ability of language models and bring their output closer to human expression.


Self-play is a learning technique in which an agent plays against copies of itself, which raises the difficulty and complexity of the learning environment. The agent interacts with different versions of itself and improves its capabilities as a result. AlphaGo Zero is a well-known success story for self-play.

Self-play has proven effective in multi-agent reinforcement learning (MARL), but applying it to improve large language models (LLMs) is new. Applied to LLMs, self-play is intended to further improve their ability to generate coherent, information-rich text.

Self-play can be applied in competitive or cooperative settings. In competition, copies of the algorithm compete with each other to achieve a goal; in cooperation, copies work together toward a common goal. It can be combined with supervised learning, reinforcement learning, and other techniques to improve performance.


SPIN works like a two-player game. In this game:

The main model (the new LLM) learns to distinguish responses generated by the LLM from responses written by humans. In each iteration, the main model is the LLM that is actively being trained to get better at telling the two apart.

The opponent model (the old LLM) is tasked with generating responses that resemble those written by humans. It is the LLM from the previous iteration, and it generates its output through self-play, based on what it has learned so far. Its goal is to produce responses so realistic that the new LLM cannot tell they were machine-generated.

This process closely resembles a GAN, but it is not the same: in SPIN both players are instances of the same LLM, taken from successive iterations.

The dynamics of SPIN rely on a supervised fine-tuning (SFT) dataset made up of input (x) and output (y) pairs. These examples are annotated by humans and serve as the basis for training the main model to recognize human-like responses. Public SFT datasets include Dolly15K, Baize, and Ultrachat.

Training of the main model

To train the main model to tell LLM responses from human responses, SPIN uses an objective function that measures the gap in expected value between the real data and the responses produced by the opponent model. The main model aims to maximize this gap: it assigns high values to prompts paired with responses from the real data and low values to prompts paired with responses generated by the opponent model. In practice, the objective is written as an equivalent minimization problem over a loss function.

The main model's job is then to minimize this loss, which measures the difference between the values assigned to pairs from the real data and the values assigned to pairs built from the opponent model's responses. Throughout training, the main model adjusts its parameters to reduce the loss, and the process continues until it can reliably distinguish LLM responses from human responses.
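The loss described above can be sketched on plain numbers. This is a minimal illustration, assuming (as in the SPIN paper's final objective) that the value assigned to a response is a scaled log-probability ratio between the main model and the opponent, and that the gap is passed through the logistic loss log(1 + exp(-t)). The function names and the example log-probabilities are hypothetical.

```python
import math


def logistic_loss(t: float) -> float:
    """Numerically stable log(1 + exp(-t))."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))


def spin_loss(logp_new_real, logp_old_real, logp_new_syn, logp_old_syn, lam=0.1):
    """Loss for one (prompt, real response, synthetic response) triple.

    Every argument is a log-probability log p(response | prompt):
    'new' under the main model being trained, 'old' under the frozen opponent.
    """
    # How much more the main model favors the real response than the opponent does
    real_gain = logp_new_real - logp_old_real
    # The same quantity for the opponent's synthetic response
    syn_gain = logp_new_syn - logp_old_syn
    return logistic_loss(lam * (real_gain - syn_gain))


# At the start of an iteration both models are identical, so the margin is zero
start = spin_loss(-12.0, -12.0, -9.0, -9.0)
# Later, the real response has gained probability and the synthetic one has lost it
later = spin_loss(-10.0, -12.0, -11.0, -9.0)
print(start, later)
```

The loss falls as the main model shifts probability toward the real response and away from the synthetic one, which is the "maximize the gap" behavior expressed as a minimization.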

Update of the opponent model

Updating the opponent model builds on the improved main model, which has learned during training to distinguish real data from language-model responses. Once the main model has improved within the function class it is drawn from, the opponent model's parameters need to be updated as well. Given the same prompts, the main model uses its learned discrimination to assign a value to each response.

The opponent's goal is to improve the language model so that the main model can no longer tell its responses from the real data. This requires a procedure for adjusting the language model's parameters: maximize the main model's evaluation of the language model's responses while keeping the update stable. It is a balancing act, ensuring that the improved model does not stray too far from the original language model.
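One common way to express "maximize the evaluation without straying too far" is a KL-regularized update. For a small, discrete set of candidate responses, its closed-form solution reweights the old probabilities by exp(value / λ). The sketch below illustrates that trade-off under this assumption. The probabilities, values, and the helper name `regularized_update` are hypothetical.

```python
import math


def regularized_update(old_probs, values, lam):
    """Reweight old response probabilities by exp(value / lam) and renormalize.

    This maximizes expected value minus lam * KL(new || old) over a finite
    set of responses. A large lam keeps the result close to old_probs.
    """
    weights = [p * math.exp(v / lam) for p, v in zip(old_probs, values)]
    total = sum(weights)
    return [w / total for w in weights]


old_probs = [0.5, 0.3, 0.2]   # opponent's probabilities for three candidate responses
values = [0.0, 1.0, 2.0]      # main model's evaluation of each response

mild = regularized_update(old_probs, values, lam=10.0)   # strong regularization
sharp = regularized_update(old_probs, values, lam=0.5)   # weak regularization
print(mild)
print(sharp)
```

With a large λ the updated distribution barely moves, while a small λ concentrates probability on the highest-valued response.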

That may sound a bit confusing, so let's briefly summarize:

There is only one model during training, but it plays two roles: the model from the previous round (the old LLM, or opponent model) and the main model (the one being trained). The output of the previous-round model serves as the point of comparison for optimizing the model currently being trained. Because SPIN needs an already-trained model to act as the opponent, it is only suitable for further fine-tuning a model that has already been trained.
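The summary above can be written as a small control-flow skeleton. This is only a sketch of the round structure: `generate` and `train` are hypothetical callables standing in for response generation and fine-tuning, and the toy example replaces the LLM with a single number so the loop can run anywhere.

```python
def spin_outer_loop(initial_model, prompts, generate, train, num_iterations=3):
    """Run self-play rounds and return the model after each round.

    generate(model, prompt) -> synthetic response produced by the opponent
    train(model, opponent, synthetic) -> updated main model
    """
    model = initial_model
    history = [model]
    for _ in range(num_iterations):
        opponent = model  # the previous round's model plays the opponent
        synthetic = [generate(opponent, prompt) for prompt in prompts]
        model = train(model, opponent, synthetic)
        history.append(model)
    return history


# Toy stand-ins: the "model" is a quality score and human data has quality 1.0
def toy_generate(model, prompt):
    return model


def toy_train(model, opponent, synthetic):
    mean_quality = sum(synthetic) / len(synthetic)
    return model + 0.5 * (1.0 - mean_quality)


history = spin_outer_loop(0.0, ["prompt a", "prompt b"], toy_generate, toy_train)
print(history)
```

Each round narrows the gap between the synthetic responses and the human data, and the gains shrink as the two become harder to tell apart.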

SPIN algorithm

SPIN generates synthetic data from pre-trained models. This synthetic data is then used to fine-tune the model on new tasks.

[Figure: pseudocode of the SPIN algorithm from the original paper]

The above is the pseudocode of the SPIN algorithm from the original paper. It can be a bit hard to follow, so we reproduce it in Python to better explain how it works.

1. Initialize the parameters and the SFT dataset

The original paper uses Zephyr-7B-SFT-Full as the base model. For the dataset, the authors used Ultrachat200k, a subset of the larger UltraChat corpus, which consists of approximately 1.4 million conversations generated using OpenAI's Turbo API. They randomly sampled 50k prompts and used the base model to generate synthetic responses.

# Import necessary libraries
from datasets import load_dataset
import pandas as pd

# Load the Ultrachat 200k dataset
ultrachat_dataset = load_dataset("HuggingFaceH4/ultrachat_200k")

# Convert every split to a pandas DataFrame and concatenate them
combined_df = pd.concat(
    [ultrachat_dataset[key].to_pandas() for key in ultrachat_dataset.keys()]
)

# Shuffle the combined DataFrame and reset the index
combined_df = combined_df.sample(frac=1, random_state=123).reset_index(drop=True)

# Select the first 50,000 rows from the shuffled DataFrame
ultrachat_50k_sample = combined_df.head(50000)

The authors' prompt template is "### Instruction: {prompt}\n\n### Response: ".


# List for storing each template
templates_data = []

for index, row in ultrachat_50k_sample.iterrows():
    messages = row['messages']

    # Check if there are at least two messages (user and assistant)
    if len(messages) >= 2:
        user_message = messages[0]['content']
        assistant_message = messages[1]['content']

        # Create the template
        instruction_response_template = f"### Instruction: {user_message}\n\n### Response: {assistant_message}"

        # Append the template to the list
        templates_data.append({'Template': instruction_response_template})

# Create a new DataFrame with the generated templates (ground truth)
ground_truth_df = pd.DataFrame(templates_data)
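The template string used in the code above can also be factored into two small helpers, so that the same prompt prefix is reused for the ground-truth examples and for generation. This is a sketch, and the helper names `build_prompt` and `build_example` are hypothetical.

```python
PROMPT_TEMPLATE = "### Instruction: {prompt}\n\n### Response: "


def build_prompt(prompt: str) -> str:
    """Return the text the model sees before it writes a response."""
    return PROMPT_TEMPLATE.format(prompt=prompt)


def build_example(prompt: str, response: str) -> str:
    """Return a full example: the templated prompt followed by a response."""
    return build_prompt(prompt) + response


example = build_example("Name a prime number.", "Seven is a prime number.")
print(example)
```

Keeping the prompt prefix identical for both the human response and the generated one means the two differ only in the response part.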

This gives data similar to the following:

[Figure: sample rows of the ground-truth DataFrame]

The SPIN algorithm iteratively updates the parameters of a language model (LLM) to make it consistent with the ground-truth response. This process continues until it is difficult to distinguish the generated response from the ground truth, thus achieving a high level of similarity (reduced loss).

The SPIN algorithm has two loops. The inner loop runs over the samples in use, and the outer loop runs for a total of 3 iterations, because the authors found that the model's performance stopped changing after that. The Alignment Handbook library serves as the codebase for the fine-tuning method, combined with DeepSpeed to reduce training costs. They trained Zephyr-7B-SFT-Full with the RMSProp optimizer and no weight decay in all iterations, as is common when fine-tuning LLMs. The global batch size is set to 64, using bfloat16 precision. The peak learning rate is 5e-7 for iterations 0 and 1 and decays to 1e-7 for iterations 2 and 3, as the loop approaches the end of self-play fine-tuning. Finally, β = 0.1 is chosen and the maximum sequence length is set to 2048 tokens. These settings are set up as follows:
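The settings listed above can be collected in one place, independent of any training library. This is a minimal sketch: the class name `SpinConfig` and the function `peak_learning_rate` are hypothetical, and the values are the ones quoted in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpinConfig:
    """Hyperparameters quoted from the text above."""
    global_batch_size: int = 64
    max_seq_length: int = 2048
    beta: float = 0.1
    weight_decay: float = 0.0
    precision: str = "bfloat16"


def peak_learning_rate(iteration: int) -> float:
    """Peak learning rate for a given self-play iteration (0-based)."""
    return 5e-7 if iteration < 2 else 1e-7


config = SpinConfig()
schedule = [peak_learning_rate(i) for i in range(4)]
print(config)
print(schedule)
```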

# Importing the PyTorch library
import torch

# Importing the DeepSpeed library for distributed training
import deepspeed

# Importing the AutoTokenizer and AutoModelForCausalLM classes from the transformers library
from transformers import AutoTokenizer, AutoModelForCausalLM

# Loading the zephyr-7b-sft-full model from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("alignment-handbook/zephyr-7b-sft-full")
model = AutoModelForCausalLM.from_pretrained("alignment-handbook/zephyr-7b-sft-full")

# Defining the RMSprop optimizer without weight decay, at the peak learning rate
base_optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-7, weight_decay=0.0)

# Learning rate per outer iteration: 5e-7 for iterations 0 and 1, then 1e-7
scheduler = torch.optim.lr_scheduler.LambdaLR(base_optimizer, lambda epoch: 1.0 if epoch < 2 else 0.2)

# Initializing DeepSpeed from a config dictionary (global batch size 64, bfloat16)
deepspeed_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
}
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=base_optimizer,
    model_parameters=model.parameters(),
    config=deepspeed_config,
)

# Setting hyperparameters for training
num_epochs = 3
max_seq_length = 2048
beta = 0.1

2. Generate synthetic data (SPIN algorithm inner loop)

This inner loop generates the responses that training will later bring in line with the real data. Below is the code for one training batch:

# Rows of the zephyr-sft DataFrame (outputs that will be improved while training)
generated_rows = []

# Looping through each row in the 'ultrachat_50k_sample' DataFrame
for index, row in ultrachat_50k_sample.iterrows():
    # Extracting the 'prompt' column value from the current row
    prompt = row['prompt']

    # Generating output for the current prompt using the Zephyr model
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, max_new_tokens=200, num_beams=5, no_repeat_ngram_size=2, top_k=50, top_p=0.95)

    # Decoding only the newly generated tokens to human-readable text
    generated_text = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

    # Collecting the current prompt and its generated output
    generated_rows.append({'prompt': prompt, 'generated_output': generated_text})

# DataFrame.append was removed in pandas 2.0, so the DataFrame is built once from the list
zephyr_sft_output = pd.DataFrame(generated_rows, columns=['prompt', 'generated_output'])

Below is an example of the ground truth for a prompt next to the model output:

[Figure: ground-truth response and model output for a sample prompt]

The new DataFrame zephyr_sft_output contains the prompts and the corresponding outputs generated by the base model Zephyr-7B-SFT-Full.

3. Update Rules

Before coding the minimization problem, it is crucial to understand how to calculate the conditional probability distribution of the output generated by the LLM. The original paper treats generation as a Markov process, in which the conditional probability distribution $p_\theta(y \mid x)$ of a response $y = (y_1, \dots, y_n)$ decomposes as:

$$p_\theta(y \mid x) = \prod_{j=1}^{n} p_\theta(y_j \mid x, y_{<j})$$

This decomposition means that the probability of an output sequence given an input sequence is the product, over the output tokens, of each token's probability given the input sequence and the output tokens before it. For example, if the input sequence is "I enjoy" and the full sequence is "I enjoy reading books", the conditional probability of the output can be calculated as:

$$p(\text{reading books} \mid \text{I enjoy}) = p(\text{reading} \mid \text{I enjoy}) \cdot p(\text{books} \mid \text{I enjoy reading})$$
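The "I enjoy reading books" example can be worked through with numbers. The per-token probabilities below are hypothetical, chosen only to make the arithmetic easy to follow. The sketch also shows the log-space form, which is what implementations typically use because a product of many small probabilities underflows quickly.

```python
import math


def sequence_probability(token_probs):
    """Multiply the per-token conditional probabilities."""
    return math.prod(token_probs)


def sequence_log_probability(token_probs):
    """Sum the per-token log-probabilities (equivalent, but numerically safer)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical values for p("reading" | "I enjoy") and p("books" | "I enjoy reading")
token_probs = [0.5, 0.25]

prob = sequence_probability(token_probs)
log_prob = sequence_log_probability(token_probs)
print(prob, log_prob)
```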

This conditional probability is used to score both the ground-truth response and the Zephyr LLM response, and those scores feed the loss function. But first we need to implement the conditional probability function.

import torch.nn.functional as F

# Conditional log-probability log p(response | prompt) under a model
def compute_conditional_log_prob(tokenizer, model, prompt, response):
    device = next(model.parameters()).device

    # Tokenize the prompt alone and the prompt followed by the response
    # (assumes the prompt tokens form a prefix of the full token sequence)
    prompt_length = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)

    # The logits at position j score the token at position j + 1
    logits = model(input_ids=full_ids).logits[:, :-1, :]
    log_probs = F.log_softmax(logits.float(), dim=-1)
    token_log_probs = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Sum the log-probabilities of the response tokens only; working in log space
    # avoids the underflow caused by multiplying many small probabilities
    return token_log_probs[:, prompt_length - 1:].sum()

The loss function contains four important conditional probabilities: those of the ground-truth response $y$ and of the previously created synthetic response $y'$, each evaluated under the model being trained ($p_\theta$) and under the previous-iteration model ($p_{\theta_t}$):

$$L_{\mathrm{SPIN}} = \mathbb{E}\left[\ell\left(\lambda \log \frac{p_\theta(y \mid x)}{p_{\theta_t}(y \mid x)} - \lambda \log \frac{p_\theta(y' \mid x)}{p_{\theta_t}(y' \mid x)}\right)\right], \qquad \ell(t) = \log\left(1 + e^{-t}\right)$$

def LSPIN_loss(model, updated_model, tokenizer, prompt, ground_truth, synthetic, lambda_val=0.01):
    # 'model' is the frozen previous-iteration model (theta_t);
    # 'updated_model' is the model being trained (theta)
    with torch.no_grad():
        log_p_theta_t_ground_truth = compute_conditional_log_prob(tokenizer, model, prompt, ground_truth)
        log_p_theta_t_synthetic = compute_conditional_log_prob(tokenizer, model, prompt, synthetic)
    log_p_theta_ground_truth = compute_conditional_log_prob(tokenizer, updated_model, prompt, ground_truth)
    log_p_theta_synthetic = compute_conditional_log_prob(tokenizer, updated_model, prompt, synthetic)

    # Log likelihood ratios (the reference values are moved to the training device)
    device = log_p_theta_ground_truth.device
    log_lr_ground_truth = log_p_theta_ground_truth - log_p_theta_t_ground_truth.to(device)
    log_lr_synthetic = log_p_theta_synthetic - log_p_theta_t_synthetic.to(device)

    # Logistic loss log(1 + exp(-t)) applied to the scaled margin
    return F.softplus(-lambda_val * (log_lr_ground_truth - log_lr_synthetic))

# Frozen copy of the model that plays the opponent
initial_model = AutoModelForCausalLM.from_pretrained("alignment-handbook/zephyr-7b-sft-full")
initial_model.eval().requires_grad_(False)

# Training loop
for epoch in range(num_epochs):
    # The opponent takes the weights from the end of the previous iteration
    initial_model.load_state_dict(model.module.state_dict())

    # Initialize total loss for the epoch
    total_loss = 0.0

    # Generating synthetic data (inner loop): rerun the code from step 2 here
    # so that zephyr_sft_output holds the responses for this iteration

    # Computing loss using LSPIN function
    for (index1, row1), (index2, row2) in zip(ultrachat_50k_sample.iterrows(), zephyr_sft_output.iterrows()):
        prompt = row1['prompt']
        ground_truth = row1['messages'][1]['content']
        generated_output = row2['generated_output']

        # Compute LSPIN loss
        loss = LSPIN_loss(initial_model, model, tokenizer, prompt, ground_truth, generated_output, lambda_val=beta)

        # Accumulate the loss
        total_loss += loss.item()

        # Backward pass and parameter update through the DeepSpeed engine
        model.backward(loss)
        model.step()

    # Update the learning rate
    scheduler.step()

    # Update the value of beta
    if epoch == 2:
        beta = 5.0

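We run 3 epochs, which trains the model and produces the final version of the Zephyr SFT LLM. At the time of writing, the official implementation had not yet been open-sourced on GitHub. This version should, to some extent, be able to produce output that resembles human responses. Let's look at how it runs: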

[Figures: three images illustrating the training run]

