So you've been working with a large language model (LLM), and the first thing you probably tried was getting your use case to work with basic prompt engineering. You write clear prompts, follow the Guidelines to Effective Prompting, and try to be as direct and unambiguous as possible. For complex problems, you ask the model to reason step by step. If you want a complete guide to building reliable, production-ready LLM features with effective prompt engineering, I put together a detailed walkthrough that covers pretty much everything.

But what happens when that doesn't quite work? You move on to in-context learning. As I explained in The Magic of In-Context Learning, this basically means you add examples to help the model understand what you're trying to achieve. For more on in-context learning techniques and when to use them, I wrote a practical guide that might help.

Now, if those steps still aren't getting you where you need to be, it's probably time to think about fine-tuning. I covered this in Fine-Tuning 101. Fine-tuning lets you customize the LLM to handle your specific use case much more effectively. Depending on the resources you have and how precise you need the model to be, you can go with either full fine-tuning or parameter-efficient fine-tuning. Approaches like LoRA are worth a look if you want to reduce costs while still getting strong performance.
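To make that trade-off concrete, here's a rough sketch of what a parameter-efficient setup could look like using the Hugging Face peft library. Treat it as a point of comparison only, under my own assumptions (peft installed, default LoRA target modules for T5); the rest of this cookbook uses full fine-tuning and never touches this code.

# A minimal LoRA sketch for comparison only; this cookbook itself does full fine-tuning
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Only small low-rank adapter matrices are trained; the base weights stay frozen
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # seq2seq objective for FLAN-T5
    r=8,                              # rank of the adapter matrices
    lora_alpha=32,                    # scaling factor
    lora_dropout=0.1
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically a small fraction of the total weights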


In this cookbook, I'm going to focus specifically on full fine-tuning. I'll walk you through exactly how to apply it to your task, step by step. Let me show you what I learned when I tackled this myself.

Setup

Let's start by setting up our environment. I'm going to follow the same setup I outlined in Running an LLM Locally on Your Own Server: A Practical Guide. If you're not sure about any part of the process, just go back to that post for details.

# Import the necessary packages
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

# Load the FLAN-T5 model
model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Few-shot prompt: in-context examples followed by a new question
input_text = """
Answer the following geography questions using the format shown in the context. 
Answer with a single sentence containing the city’s name, country, population, and three famous landmarks. 

Follow the pattern below:

Q: Tell me about Paris.  
A: Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.

Q: Describe New York.  
A: New York is a city in the United States with a population of 8.5 million, known for landmarks such as the Statue of Liberty, Central Park, and Times Square.

Q: What can you say about Tokyo?  
A: Tokyo is a city in Japan with a population of 14 million, known for landmarks such as the Tokyo Tower, Shibuya Crossing, and the Meiji Shrine.

Q: Tell me some information about Sydney.  
A: Sydney is a city in Australia with a population of 5.3 million, known for landmarks such as the Sydney Opera House, the Harbour Bridge, and Bondi Beach.

Q: Could you give me some details about Cairo?  
A: Cairo is a city in Egypt with a population of 9.5 million, known for landmarks such as the Pyramids of Giza, the Sphinx, and the Egyptian Museum.

Now, describe Vancouver in the same format.
"""

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

#  Generate response
outputs = model.generate(inputs.input_ids, max_length=50)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Vancouver is a city in Canada.

So here's what I discovered. When the task is to return a city's name, country, population, and three famous landmarks in a single sentence, in-context learning hit its limit: even with five examples in the prompt, the model only told us that Vancouver is a city in Canada. This is where fine-tuning comes in. Let's see if we can get better outcomes. The first thing we need to do is prepare the training data.

Preparing the Training Data

Here's something interesting I tried. We're going to use another LLM, a larger one, to create the training data we need. For this example, I'll generate 100 labeled input-output pairs for fine-tuning our model and see if that's enough. This approach is particularly useful when you need to build a dataset quickly; honestly, it saves a ton of time compared to manual data collection. But here's the thing: you really need to make sure the generated data is accurate and diverse enough to support solid training.

Advantages of Using an LLM

  • Speed: An LLM can generate thousands of examples in just minutes. I remember when I first tried this, I couldn't believe how fast it was.

  • Consistency: The model keeps formatting consistent across your entire dataset. Less variability, fewer manual corrections.

  • Adaptability: You can easily adjust your prompts to generate more diverse outputs. Want to vary question phrasing? Add facts about different countries? It's all doable.

Process Using GPT-4

  1. Define a Template for the Desired Output First, you need to create a template for the type of question-answer pairs you want. Let's say your task involves cities. You might include fields like the city's name, country, population, and key landmarks.

  2. Provide Few-Shot Examples to the LLM Use few-shot learning to guide the model. Start with a small set of manually written examples, then prompt the model to generate more pairs in that same style.

  3. Use the LLM to Generate Multiple Examples Once you have your template ready, the LLM can generate more question-answer pairs. By using API calls and looping over a list of geographic entities, you can generate as many examples as you need. I found that batching these requests worked really well.

  4. Review and Refine the Data Now, while LLMs can produce structured outputs, the quality can vary. You absolutely need to review the generated data (a quick validation sketch follows the generation code below) to ensure:

    • Accuracy: Facts like capitals, populations, and landmarks are actually correct. I caught several errors in my first batch.

    • Format Consistency: All answers follow your defined template.

    • Diversity: Your dataset includes a variety of questions and phrasing. This one's easy to overlook but super important.

Actually, to further improve quality, you might want to explore techniques for building robust retrieval-augmented generation (RAG) systems. These approaches help you curate and validate the most relevant context for both training and evaluation.

# Import the necessary Python libraries
from dotenv import load_dotenv, find_dotenv
from openai import OpenAI
import json
import time

# Load the OPENAI_API_KEY from local .env file
_ = load_dotenv(find_dotenv())

# Instantiate the OpenAI client
client = OpenAI()

# List of cities for generating question-answer pairs
cities = [
    "Paris", "Tokyo", "New York", "Sydney", "Cairo", "Rio de Janeiro",
    "London", "Berlin", "Dubai", "Rome", "Beijing", "Bangkok", "Moscow",
    "Toronto", "Los Angeles", "Cape Town", "Mumbai", "Seoul", "Buenos Aires",
    "Istanbul", "Mexico City", "Jakarta", "Shanghai", "Lagos", "Madrid",
    "Lisbon", "Stockholm", "Vienna", "Prague", "Warsaw", "Helsinki", "Oslo",
    "Brussels", "Zurich", "Kuala Lumpur", "Singapore", "Manila", "Lima",
    "Santiago", "Bogotá", "Nairobi", "Havana", "San Francisco", "Chicago",
    "Venice", "Florence", "Edinburgh", "Glasgow", "Dublin", "Athens",
    "Melbourne", "Perth", "Hong Kong", "Doha", "Casablanca", "Tehran",
    "Bucharest", "Munich", "Barcelona", "Kyoto", "Kolkata", "Amman",
    "Lyon", "Nice", "Marseille", "Tel Aviv", "Jerusalem", "Geneva", 
    "Ho Chi Minh City", "Phnom Penh", "Yangon", "Colombo", "Riyadh",
    "Abu Dhabi", "Addis Ababa", "Seville", "Bilbao", "Porto", "Bratislava",
    "Ljubljana", "Tallinn", "Riga", "Vilnius", "Belgrade", "Sarajevo",
    "Skopje", "Tirana", "Baku", "Yerevan", "Tashkent", "Almaty", "Ulaanbaatar",
    "Karachi", "Islamabad", "Helsinki", "Chennai", "Kigali", "Antananarivo",
    "Bangui", "San Juan"
]

# Function to create a prompt for a specific city
def create_prompt(city):
    return f"""
    Your task is to provide question-and-answer pairs about cities following this format:
    {{
      "input": "[A unique way to ask for a description of the city]",
      "output": "[City Name] is a city in [Country] with a population of [Population], known for landmarks such as [Landmark 1], [Landmark 2], and [Landmark 3]."
    }}

    Here are a few examples:
    {{
      "input": "Tell me about Paris.",
      "output": "Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral."
    }}
    {{
      "input": "Can you provide information on Tokyo?",
      "output": "Tokyo is a city in Japan with a population of 37 million, known for landmarks such as the Tokyo Tower, Shibuya Crossing, and Meiji Shrine."
    }}
    {{
      "input": "What can you tell me about New York?",
      "output": "New York is a city in the United States with a population of 8.4 million, known for landmarks such as the Statue of Liberty, Central Park, and Times Square."
    }}

    Now, generate a similar question-answer pair for the city {city}.
    """

# Function to generate a Q&A pair using GPT-4 for a given city
def generate_city_qa(city):
    prompt = create_prompt(city)
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=200
        )
        return json.loads(response.choices[0].message.content.strip())
    except Exception as e:
        print(f"Error generating data for {city}: {e}")
        return None
        
# Generate and save Q&A pairs incrementally to a JSONL file
with open("city_qna.jsonl", "w") as f:
    for i, city in enumerate(cities):
        qa_pair = generate_city_qa(city)
        if qa_pair:
            f.write(json.dumps(qa_pair) + "\n")
            # Print the first 3
            if i < 3:
                print(qa_pair)
            else:
                print(".", end="")
        time.sleep(1)  # Add delay to manage rate limits

print("\n\nCity Q&A pairs saved to city_qna.jsonl")
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}
{'input': 'What should I know about Tokyo?', 'output': 'Tokyo is a city in Japan with a population of 37 million, known for landmarks such as the Tokyo Skytree, Senso-ji Temple, and the Imperial Palace.'}
{'input': 'What do you know about New York?', 'output': 'New York is a city in the United States with a population of 8.4 million, known for landmarks such as the Empire State Building, Brooklyn Bridge, and Wall Street.'}
.................................................................................................

City Q&A pairs saved to city_qna.jsonl
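
Before moving on, it's worth running a quick automated pass over the generated file, as promised in step 4 above. The sketch below is my own addition: the regular expression and the checks are assumptions about what "format consistency" should mean here, and factual accuracy still needs a human review.

# Quick validation pass over the generated data (illustrative checks, not part of the original pipeline)
import json
import re

# Expected answer shape: "<City> is a city in <Country> with a population of <N>,
# known for landmarks such as <A>, <B>, and <C>."
pattern = re.compile(
    r"^.+ is a city in .+ with a population of .+, "
    r"known for landmarks such as .+, .+, and .+\.$"
)

seen_inputs = set()
bad_format, duplicate_questions = [], []

with open("city_qna.jsonl") as f:
    for line in f:
        pair = json.loads(line)
        if not pattern.match(pair["output"]):
            bad_format.append(pair["output"])
        if pair["input"] in seen_inputs:
            duplicate_questions.append(pair["input"])
        seen_inputs.add(pair["input"])

print(f"{len(bad_format)} answers deviate from the template")
print(f"{len(duplicate_questions)} duplicate questions")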

Full Fine-Tuning

Alright, now that we've prepared our training data and saved it in a JSONL file, we're ready to move on to the actual fine-tuning. The next step is to configure the training setup and adapt the model to our task. Let me walk you through exactly how I did this.

Load the Dataset from the JSONL File

We'll use the Hugging Face datasets library to load our JSONL file. Pretty straightforward.

from datasets import load_dataset

# Load your dataset from the JSONL file
dataset = load_dataset("json", data_files="city_qna.jsonl")

# Check the dataset structure
print(dataset["train"][0])
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}

Preprocess the Data

Since Flan-T5 is a seq2seq model, we need to tokenize both the input and output appropriately. This took me a while to get right the first time.

from transformers import AutoTokenizer

# Load the tokenizer for Flan-T5
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Tokenize the dataset
def preprocess_data(examples):
    # Extract inputs and outputs as lists from the dictionary
    inputs = examples["input"]
    outputs = examples["output"]

    # Tokenize inputs and outputs with padding and truncation
    model_inputs = tokenizer(inputs, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(outputs, max_length=128, padding="max_length", truncation=True).input_ids

    # Replace padding token IDs with -100 to ignore them in the loss function
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels

    return model_inputs

# Use the map function to apply the preprocessing to the whole dataset
tokenized_dataset = dataset["train"].map(preprocess_data, batched=True)

Set Up the Model for Fine-Tuning

If you want a quick refresher on the core building blocks here, check out my guide on understanding the transformer architecture that powers models like Flan-T5.

from transformers import AutoModelForSeq2SeqLM

# Load the Flan-T5 model
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

Define the Training Arguments

Now we need to set the training parameters: the number of epochs, the batch size, the learning rate, and so on. I experimented with these quite a bit to find what worked best.

from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="/home/ubuntu/flan-t5-city-tuning",  # Output directory
    eval_strategy="no",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=1,  # Only keep the most recent checkpoint
    logging_dir="./logs",  # Directory for logs
    logging_steps=10,
    push_to_hub=False  # Set this to True if you want to push to Hugging Face Hub
)

Create a Trainer and Train the Model

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer
)

# Start fine-tuning
trainer.train()
[39/39 02:59, Epoch 3/3]
Step Training Loss
10 1.516000
20 1.537800
30 1.412000

TrainOutput(global_step=39, training_loss=1.4650684992472331, metrics={'train_runtime': 184.0522, 'train_samples_per_second': 1.63, 'train_steps_per_second': 0.212, 'total_flos': 51356801433600.0, 'train_loss': 1.4650684992472331, 'epoch': 3.0})

Evaluate the Model Qualitatively (Human Evaluation)

# Load the fine-tuned model
model_name = "/home/ubuntu/flan-t5-city-tuning/checkpoint-39"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Zero-shot prompt (no in-context examples this time)
input_text = "Describe the city of Vancouver"

# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")

#  Generate response
outputs = model.generate(inputs.input_ids, max_length=50)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Vancouver is a city in Canada with a population of 2.8 million, known for landmarks such as the Vancouver Bridge, Vancouver City Hall, and Vancouver International Airport.

Great! Now the model provides exactly the format we were aiming for. Sure, some of the information isn't entirely accurate, but the structure is much improved. This result shows significant progress. And remember, we achieved this without any in-context learning at all. That's pretty impressive if you ask me.
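
To get a slightly broader picture than a single prompt, here's a small spot-check over a few cities that weren't in the training list. The city choices are just my own illustrative picks, and judging whether the facts are right is still a manual step.

# Spot-check the fine-tuned model on a few held-out cities (illustrative selection)
held_out_cities = ["Montreal", "Osaka", "Amsterdam", "Copenhagen"]

for city in held_out_cities:
    prompt = f"Describe the city of {city}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(inputs.input_ids, max_length=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))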

Conclusion

In this post, I demonstrated how to fine-tune a large language model using a structured dataset that we generated with the help of another LLM. We started by preparing our training data, emphasized why accuracy and diversity matter so much, and walked through the entire fine-tuning process.

The results showed real improvement in the model's ability to provide responses in the format we wanted. Yes, some factual errors persisted, but the structured output clearly shows that fine-tuning can significantly enhance a model's ability to meet specific requirements. And it did this even without in-context learning.

Fine-tuning really does offer a powerful way to tailor a model to specialized tasks. But you do need to carefully review both the data and outputs to ensure factual correctness and diversity. With the right approach though, fine-tuning can unlock new levels of precision and customization. I've seen this work time and time again in my own projects, and honestly, once you get the hang of it, it becomes an invaluable tool in your toolkit.