Fine-Tune Open-Source LLMs Using Lamini - Analytics Vidhya

Joseph Gordon-Levitt

Apr 12, 2025 am 10:20 AM

Recently, with the rise of large language models and AI, we have seen innumerable advancements in natural language processing. Models for text, code, and image/video generation have achieved near human-like reasoning and performance, and they perform exceptionally well on general knowledge questions. However, models like GPT-4o, Llama 2, Claude, and Gemini are trained largely on publicly available datasets, so they often fail to answer domain- or subject-specific questions that matter for many organizational tasks.

Fine-tuning helps developers and businesses adapt a pre-trained model to a domain-specific dataset so that it achieves high accuracy and coherence on domain-related queries. Because pre-trained models have already learned general language patterns from vast public data, fine-tuning improves performance without requiring the extensive computing resources of training from scratch.

This blog will examine why we should fine-tune pre-trained models and how to do it with the Lamini platform, which lets us fine-tune and evaluate models without many computational resources.

So, let’s get started!

Learning Objectives

To explore the need to fine-tune open-source LLMs using Lamini
To find out how Lamini is used and how instruction tuning shapes fine-tuned models
To get a hands-on understanding of the end-to-end process of fine-tuning models

This article was published as a part of the Data Science Blogathon.

Learning Objectives
Why Should One Fine-Tune Large Language Models?
How to Fine-Tune Open-Source LLMs Using Lamini?
- Data Preparation
- Tokenize the Dataset
- Fine-Tuning Process
- Setting up an Environment
- Load Dataset
- Set Up Training to Fine-Tune the Model
Conclusion
Frequently Asked Questions

Why Should One Fine-Tune Large Language Models?

Pre-trained models are primarily trained on vast general data and therefore often lack context or domain-specific knowledge. They can also hallucinate and produce inaccurate or incoherent outputs. Popular chatbots built on large language models, such as ChatGPT, Gemini, and Bing Chat, have repeatedly shown that pre-trained models are prone to such inaccuracies. This is where fine-tuning comes to the rescue: it helps adapt pre-trained LLMs to subject-specific tasks and questions effectively. Other ways to align models with your objectives include prompt engineering and few-shot prompting.

Still, fine-tuning often comes out ahead on performance metrics for domain-specific tasks. Methods such as parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) have made fine-tuning cheaper and helped developers build better models. Let’s look at how fine-tuning fits into a large language model workflow.

import pandas as pd

# Load the fine-tuning dataset (one JSON object per line)
filename = "lamini_docs.json"
instruction_dataset_df = pd.read_json(filename, lines=True)
instruction_dataset_df  # displays the DataFrame when run in a notebook

# Load it into a Python dictionary of columns
examples = instruction_dataset_df.to_dict()

# Prepare a sample for fine-tuning by concatenating the input and output fields
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

# Use a prompt template to create an instruction-tuned dataset for fine-tuning
prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""

The above code shows how instruction tuning uses a prompt template to turn raw question-answer pairs into a dataset suitable for fine-tuning. With such a custom dataset, we can fine-tune the pre-trained model for a specific use case.
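To make the role of the template concrete, here is a minimal sketch of filling it with one record. The sample record and the helper name format_example are made up for illustration; only the template text comes from the code above.

```python
# Hypothetical sample record; the keys mirror the question/answer columns used above.
sample = {
    "question": "What is Lamini?",
    "answer": "Lamini is a platform for fine-tuning LLMs.",
}

prompt_template_qa = """### Question:
{question}

### Answer:
{answer}"""


def format_example(record: dict) -> str:
    """Fill the prompt template with one question/answer pair."""
    return prompt_template_qa.format(
        question=record["question"], answer=record["answer"]
    )


formatted = format_example(sample)
print(formatted)
```

Every training example ends up with the same visible structure, which is what lets the model learn where a question stops and an answer starts.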

The next section will examine how Lamini can help fine-tune large language models (LLMs) for custom datasets.

How to Fine-Tune Open-Source LLMs Using Lamini?

The Lamini platform enables users to fine-tune and deploy models seamlessly without much cost or hardware setup. Lamini provides an end-to-end stack to develop, train, tune, and deploy models according to user convenience and model requirements. It also provides its own hosted GPU computing network to train models cost-effectively.

Lamini’s memory tuning tools and compute optimization help train and tune models with high accuracy while controlling costs. Models can be hosted anywhere, on a private cloud or through Lamini’s GPU network. Next, we will see a step-by-step guide to preparing data and fine-tuning large language models (LLMs) using the Lamini platform.

Data Preparation

Generally, preparing data for any fine-tuning task means selecting a domain-specific dataset and then cleaning, prompt-formatting, tokenizing, and storing it. After loading the dataset, we preprocess it to convert it into an instruction-tuned dataset. We format each sample from the dataset into an instruction, question, and answer format to better fine-tune it for our use cases. Check out the source of the dataset using the link given here. Let’s look at a code example of instruction tuning, followed by tokenization, for training using the Lamini platform.

import pandas as pd

# load the dataset and store it as an instruction dataset
filename = "lamini_docs.json"
instruction_dataset_df = pd.read_json(filename, lines=True)
examples = instruction_dataset_df.to_dict()

# concatenate the input and output fields of the first sample
if "question" in examples and "answer" in examples:
  text = examples["question"][0] + examples["answer"][0]
elif "instruction" in examples and "response" in examples:
  text = examples["instruction"][0] + examples["response"][0]
elif "input" in examples and "output" in examples:
  text = examples["input"][0] + examples["output"][0]
else:
  text = examples["text"][0]

# the template ends at "### Answer:" so the answer is stored separately
prompt_template = """### Question:
{question}

### Answer:"""

# Store fine-tuning examples in an instruction format
num_examples = len(examples["question"])
finetuning_dataset = []
for i in range(num_examples):
  question = examples["question"][i]
  answer = examples["answer"][i]
  text_with_prompt_template = prompt_template.format(question=question)
  finetuning_dataset.append({"question": text_with_prompt_template,
                             "answer": answer})

In the above example, we have formatted each question with the prompt template and collected the question-answer pairs in finetuning_dataset, which can then be saved to a separate file for tokenization and padding before training the LLM.
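One simple way to persist the formatted examples is the JSON Lines format, with one record per line. The sketch below uses only the standard library. The two records, the helper names save_jsonl and load_jsonl, and the output file name are placeholders chosen for illustration, not part of the original pipeline.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as finetuning_dataset above.
finetuning_dataset = [
    {"question": "### Question:\nWhat is Lamini?\n\n### Answer:",
     "answer": "A fine-tuning platform."},
    {"question": "### Question:\nWhat is a token?\n\n### Answer:",
     "answer": "A unit of text."},
]


def save_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder location; any writable path works.
output_path = os.path.join(tempfile.gettempdir(), "lamini_docs_processed.jsonl")
save_jsonl(finetuning_dataset, output_path)
reloaded = load_jsonl(output_path)
print(len(reloaded), "records written to", output_path)
```

Reading the file back right after writing it is a cheap check that nothing was lost or mangled before tokenization.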

Tokenize the Dataset

# Tokenization of the dataset with padding and truncation
# Note: `tokenizer` is a module-level AutoTokenizer, created in the setup section below
def tokenize_function(examples):
    # concatenate the input and output fields of the sample
    if "question" in examples and "answer" in examples:
        text = examples["question"][0] + examples["answer"][0]
    elif "input" in examples and "output" in examples:
        text = examples["input"][0] + examples["output"][0]
    else:
        text = examples["text"][0]

    # padding: reuse the end-of-sequence token as the pad token
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        padding=True,
    )

    # cap the sequence length at 2048 tokens
    max_length = min(
        tokenized_inputs["input_ids"].shape[1],
        2048
    )
    # truncation of the text, dropping tokens from the left
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=max_length
    )

    return tokenized_inputs

The above code tokenizes a dataset sample with padding and left-side truncation, producing preprocessed inputs that can be used for fine-tuning pre-trained models. Now that the dataset is ready, we will look into the training and evaluation of models using the Lamini platform.
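To see what padding and left-side truncation do to the token IDs themselves, here is a toy sketch on plain Python lists. It does not call the tokenizer. The helper names, the pad ID of 0, and right-side padding are all assumptions made for illustration.

```python
def truncate_left(token_ids, max_length):
    """Keep at most the last max_length tokens, dropping the oldest ones first."""
    start = max(0, len(token_ids) - max_length)
    return list(token_ids[start:])


def pad_right(token_ids, target_length, pad_id):
    """Append pad_id until the sequence reaches target_length."""
    return list(token_ids) + [pad_id] * (target_length - len(token_ids))


def prepare_batch(sequences, max_length, pad_id):
    """Truncate every sequence, then pad all of them to the longest survivor."""
    truncated = [truncate_left(seq, max_length) for seq in sequences]
    longest = max(len(seq) for seq in truncated)
    return [pad_right(seq, longest, pad_id) for seq in truncated]


PAD_ID = 0  # hypothetical pad token ID
batch = prepare_batch([[5, 6, 7, 8, 9], [3, 4]], max_length=4, pad_id=PAD_ID)
print(batch)
```

The first sequence loses its oldest token and the second gains two pad tokens, so both rows end up the same length and can be stacked into one array.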

Fine-Tuning Process

Now that we have a dataset prepared in an instruction-tuning format, we will load the dataset into the environment and fine-tune the pre-trained LLM model using Lamini’s easy-to-use training techniques.

Setting up an Environment

To begin fine-tuning open-source LLMs using Lamini, we must first ensure that our code environment has suitable resources and libraries installed. You need a machine with sufficient GPU resources and the necessary libraries, such as transformers, datasets, torch, and pandas. You must also securely load credentials like api_url and api_key, typically from environment files; packages like dotenv can load these variables. After preparing the environment, load the dataset and models for training.

import os
import logging

import lamini
from lamini import Lamini

# read the Lamini endpoint and key from environment variables
lamini.api_url = os.getenv("POWERML__PRODUCTION__URL")
lamini.api_key = os.getenv("POWERML__PRODUCTION__KEY")

# import the remaining libraries used in this walkthrough
# (config and utilities are assumed to be local helper modules shipped with the article's code)
import datasets
import tempfile
import random
import config
import yaml
import time
import torch
import transformers
import pandas as pd
import jsonlines

# load the transformer classes and the helper utilities
from utilities import *
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
from transformers import TrainingArguments
from llama import BasicModelRunner

logger = logging.getLogger(__name__)
global_config = None
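Since os.getenv silently returns None when a variable is missing, it can help to fail early with a clear message. The sketch below is one way to do that with the standard library. The helper name load_lamini_credentials and the demo values are hypothetical; only the two variable names come from the code above. If a .env file is used, python-dotenv’s load_dotenv() would be called first to populate os.environ.

```python
import os


def load_lamini_credentials(env=os.environ):
    """Return (api_url, api_key), raising if either variable is missing or empty."""
    names = ("POWERML__PRODUCTION__URL", "POWERML__PRODUCTION__KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[names[0]], env[names[1]]


# Demonstration with a stand-in mapping; real code would call
# load_lamini_credentials() with no arguments to read os.environ.
demo_env = {
    "POWERML__PRODUCTION__URL": "https://example.invalid/api",
    "POWERML__PRODUCTION__KEY": "not-a-real-key",
}
api_url, api_key = load_lamini_credentials(demo_env)
print("Credentials loaded for", api_url)
```

Keeping the key out of the source file and checking for it at startup avoids both accidental leaks and confusing authentication errors later in the run.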

Load Dataset

After setting up logging for monitoring and debugging, prepare your dataset using datasets or other data handling libraries like jsonlines and pandas. After loading the dataset, we will set up a tokenizer and model with training configurations for the training process.

# Option 1: load the dataset from your local system
dataset_name = "lamini_docs.jsonl"
dataset_path = f"/content/{dataset_name}"
use_hf = False

# Option 2: load the dataset from the Hugging Face Hub (overrides option 1)
dataset_path = "lamini/lamini_docs"
use_hf = True

Set up model, training config, and tokenizer

Next, we select the model to fine-tune, “EleutherAI/pythia-70m,” and define its configuration under training_config, specifying the pre-trained model name and dataset path. We initialize the AutoTokenizer with the model’s tokenizer and set the padding token to the end-of-sequence token. Then, we tokenize the data and split it into training and testing datasets using a custom function, tokenize_and_split_data. Finally, we instantiate the base model using AutoModelForCausalLM, enabling it to perform causal language modeling tasks. The code below also selects the compute device (GPU or CPU) for the fine-tuning process.

# model name
model_name = "EleutherAI/pythia-70m"

# training config
training_config = {
    "model": {
        "pretrained_name": model_name,
        "max_length" : 2048
    },
    "datasets": {
        "use_hf": use_hf,
        "path": dataset_path
    },
    "verbose": True
}

# setting up auto tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
train_dataset, test_dataset = tokenize_and_split_data(training_config, tokenizer)

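# set up the base model for causal language modeling
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# select a GPU if one is available, otherwise fall back to the CPU
device_count = torch.cuda.device_count()
if device_count > 0:
    logger.debug("Select GPU device")
    device = torch.device("cuda")
else:
    logger.debug("Select CPU device")
    device = torch.device("cpu")

# move the model to the selected device
base_model.to(device)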
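The custom tokenize_and_split_data function is not shown in this article. As a rough idea of the splitting half of such a helper, here is a minimal standard-library sketch. The function name split_train_test, the 10% test fraction, and the seed are assumptions for illustration and may differ from what the real helper does.

```python
import random


def split_train_test(examples, test_size=0.1, seed=123):
    """Shuffle a copy of the examples and split off a test fraction."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical stand-in for tokenized examples.
examples = [{"id": i} for i in range(20)]
train_split, test_split = split_train_test(examples)
print(len(train_split), "training examples,", len(test_split), "test examples")
```

Fixing the seed matters here: evaluation loss is only comparable between runs if the held-out examples stay the same.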

Set Up Training to Fine-Tune the Model

Finally, we set up the training arguments with hyperparameters. These include the learning rate, epochs, batch size, output directory, eval steps, save steps, warmup steps, and evaluation and logging strategy, which together control how the model is fine-tuned on the custom training dataset.

max_steps = 3

# trained model name
trained_model_name = f"lamini_docs_{max_steps}_steps"
output_dir = trained_model_name

training_args = TrainingArguments(
  # Learning rate
  learning_rate=1.0e-5,
  # Number of training epochs
  num_train_epochs=1,

  # Max steps to train for (each step is a batch of data)
  # Overrides num_train_epochs, if not -1
  max_steps=max_steps,

  # Batch size for training
  per_device_train_batch_size=1,

  # Directory to save model checkpoints
  output_dir=output_dir,

  # Other arguments
  overwrite_output_dir=False, # Overwrite the content of the output directory
  disable_tqdm=False, # Disable progress bars
  eval_steps=120, # Number of update steps between two evaluations
  save_steps=120, # After # steps model is saved
  warmup_steps=1, # Number of warmup steps for learning rate scheduler
  per_device_eval_batch_size=1, # Batch size for evaluation
  evaluation_strategy="steps",
  logging_strategy="steps",
  logging_steps=1,
  optim="adafactor",
  gradient_accumulation_steps = 4,
  gradient_checkpointing=False,

  # Parameters for early stopping
  load_best_model_at_end=True,
  save_total_limit=1,
  metric_for_best_model="eval_loss",
  greater_is_better=False
)
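A quick bit of arithmetic shows what these settings mean in practice. With gradient accumulation, each optimizer step consumes several small batches, so the effective batch size is larger than per_device_train_batch_size. The sketch below assumes a single device; the helper names are made up for illustration.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def examples_seen(max_steps, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Total examples processed after max_steps optimizer updates."""
    return max_steps * effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )


# Values taken from the training arguments above; one device is assumed.
batch = effective_batch_size(per_device_batch_size=1, gradient_accumulation_steps=4)
seen = examples_seen(max_steps=3, per_device_batch_size=1, gradient_accumulation_steps=4)
print("Effective batch size:", batch)
print("Examples seen in training:", seen)
```

With max_steps set to 3, the run touches only a dozen examples, and eval_steps=120 and save_steps=120 are never reached. That makes this configuration a quick smoke test of the pipeline, not a full training run.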

After setting the training arguments, we estimate the number of floating-point operations (FLOPs) the model needs, based on the input size and gradient accumulation steps, which gives insight into the computational load. We also assess memory usage by estimating the model’s footprint in gigabytes. Once these calculations are complete, a Trainer is initialized with the base model, the FLOPs estimate, the total training steps, and the prepared datasets for training and evaluation. This setup enables resource utilization monitoring, which is critical for efficiently handling large-scale model fine-tuning. At the end of training, the fine-tuned model is ready for deployment on the cloud to serve users as an API.

# model parameters
model_flops = (
  base_model.floating_point_ops(
    {
       "input_ids": torch.zeros(
           (1, training_config["model"]["max_length"])
      )
    }
  )
  * training_args.gradient_accumulation_steps
)

print(base_model)
print("Memory footprint", base_model.get_memory_footprint() / 1e9, "GB")
print("Flops", model_flops / 1e9, "GFLOPs")

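# Set up a trainer
# Note: this Trainer takes model_flops and total_steps, which the stock
# transformers.Trainer does not accept; it is assumed to be a custom
# subclass provided by the utilities module imported earlier.
trainer = Trainer(
    model=base_model,
    model_flops=model_flops,
    total_steps=max_steps,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)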
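For intuition about the numbers printed above, here is a back-of-the-envelope sketch. It uses the common rule of thumb of roughly 6 × tokens × parameters for one forward and backward pass, which, as far as we know, is the same form of estimate floating_point_ops uses by default. The parameter count is a round figure suggested by the model name, not a measured value, and 4 bytes per parameter assumes 32-bit weights.

```python
def estimate_training_flops(num_parameters, num_tokens):
    """Approximate forward + backward cost as 6 * tokens * parameters."""
    return 6 * num_tokens * num_parameters


def estimate_memory_gb(num_parameters, bytes_per_parameter=4):
    """Approximate weight storage in gigabytes (4 bytes assumes float32)."""
    return num_parameters * bytes_per_parameter / 1e9


NOMINAL_PARAMETERS = 70_000_000  # assumption: round figure, not a measured count
MAX_LENGTH = 2048                # matches training_config["model"]["max_length"]
GRAD_ACCUM = 4                   # matches gradient_accumulation_steps

flops_per_update = estimate_training_flops(NOMINAL_PARAMETERS, MAX_LENGTH) * GRAD_ACCUM
memory_gb = estimate_memory_gb(NOMINAL_PARAMETERS)

print("Estimated GFLOPs per optimizer update:", flops_per_update / 1e9)
print("Estimated weight memory (GB):", memory_gb)
```

The real values reported by the model will differ somewhat, for example because embedding parameters may be excluded from the count, but the order of magnitude should be similar.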

Conclusion

In conclusion, this article provides an in-depth guide to understanding the need to fine-tune LLMs using the Lamini platform. It gives a comprehensive overview of why we must fine-tune the model for custom datasets and business use cases and the benefits of using Lamini tools. We also saw a step-by-step guide to fine-tuning the model using a custom dataset and LLM with tools from Lamini. Let’s summarise critical takeaways from the blog.

Key Takeaways

Understanding when fine-tuning is needed, compared with prompt engineering and retrieval-augmented generation methods.
Utilizing platforms like Lamini for easy-to-use hardware setup and deployment of fine-tuned models that serve user requirements.
Preparing data for the fine-tuning task and setting up a pipeline to train a base model using a wide range of hyperparameters.

Explore the code behind this article on GitHub.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. How to fine-tune my models?

A. The fine-tuning process starts with understanding context-specific requirements, followed by dataset preparation, tokenization, and training setup, including hardware requirements, training configs, and training arguments. Finally, a training job is run to produce the model.

Q2. What does fine-tuning of LLMs mean?

A. Fine-tuning an LLM means further training a base model on a specific custom dataset so that it generates accurate and context-relevant outputs for the queries of a given use case.

Q3. What is Lamini in LLM fine-tuning?

A. Lamini provides integrated language model fine-tuning, inference, and GPU setup for seamless, efficient, and cost-effective LLM development.
