Sometimes you have perfectly good reasons to run language models on your own hardware. Maybe your data can’t leave your network. Maybe you’re tired of API caps. Or maybe you just want full control over how your model behaves.

As noted in the article on 9 principles for reliable, scalable AI agent systems, “for critical agents… you need three backups. Two cloud, one on-premises.” I’ve been running models locally for a while now, and the benefits are real. No rate limits, no surprise bills, and no wondering where your data is going.

Today I want to show you exactly how to get FLAN-T5 running on your own Ubuntu server. This is the same setup I’ve used in personal projects where data privacy was non-negotiable.

Here's what we're going to build together:

  • Install PyTorch, Transformers, and NVIDIA drivers on Ubuntu

  • Load and run FLAN-T5-base for text generation

  • Measure latency and throughput for CPU and GPU

  • Improve outputs with zero-shot, one-shot, and few-shot prompting

  • Validate your setup with reproducible acceptance tests

Prerequisites: You should be comfortable with Ubuntu, SSH, and Python virtual environments. You'll need access to an Ubuntu 22.04 server with 8 to 16 GB RAM. If you have an NVIDIA GPU with 4+ GB VRAM, even better.

Out of scope: We won't cover API serving, fine-tuning, or production deployment. This guide focuses on getting a single, working inference pipeline up and running.

What We're Building

A command-line inference script that accepts a prompt, tokenizes it, runs it through FLAN-T5-base, and returns the generated text. The script logs latency and token counts for performance monitoring.

System flow:

Prompt → Tokenizer → Model (CPU/GPU) → Decode → Output + Logs

Deliverable: A Python script (run_flan.py) that runs inference on any prompt and exits with code 0 on success.

Success criteria:

  • FLAN-T5-base generates 32 tokens in less than 2 seconds on g5.xlarge, less than 15 seconds on t3.xlarge

  • Script returns exit code 0 on 5 predefined prompts

  • Logs capture prompt length, output length, and latency

Choosing the Right Hardware

FLAN-T5-base has about 250M parameters. In fp32, the weights alone need roughly 1 GB of memory; in fp16, roughly 0.5 GB. But here's the thing: you need to budget 2 to 3 times that for KV cache and activations. So plan for about 1.5 to 2 GB of VRAM or RAM per concurrent sequence.
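
Here's a quick back-of-the-envelope sketch of that arithmetic (the parameter count is approximate, and the 2 to 3x multiplier is a rule of thumb, not a measurement):

params = 250e6  # approximate FLAN-T5-base parameter count
for name, bytes_per_param in [("fp32", 4), ("fp16", 2)]:
    weights_gb = params * bytes_per_param / 1e9
    # Budget 2-3x the weights for KV cache and activations
    print(f"{name}: weights ~{weights_gb:.2f} GB, "
          f"plan for ~{2 * weights_gb:.1f} to {3 * weights_gb:.1f} GB total")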

CPU baseline:

  • AWS t3.large (2 vCPU, 8 GB RAM): About 10 to 15 seconds per 32-token generation

  • AWS t3.xlarge (4 vCPU, 16 GB RAM): About 5 to 10 seconds per 32-token generation

GPU acceleration:

  • AWS g5.xlarge (1x A10G, 24 GB VRAM): Less than 2 seconds per 32-token generation

  • Supports fp16 and batch inference for higher throughput

I'd recommend starting with CPU to validate the pipeline, then moving to GPU if latency becomes critical. Actually, when I first experimented with this setup on a personal project, I ran everything on CPU for weeks before realizing I needed the speed boost.

Setup & Installation

1. Access Your Server

SSH into your Ubuntu 22.04 server:

ssh -i ~/.ssh/id_rsa ubuntu@SERVER_IP

2. Update System Packages

Update Ubuntu and install build essentials:

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl wget ca-certificates
sudo apt install -y python3 python3-venv python3-pip

3. Create a Python Virtual Environment

Create and activate a virtual environment:

python3 -m venv ~/llm-venv
source ~/llm-venv/bin/activate

4. Upgrade pip and wheel

Ensure clean installs:

python -m pip install --upgrade pip wheel

5. Install PyTorch

For CPU:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

For GPU (CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

6. Install NVIDIA Drivers (GPU Only)

If you're using a GPU, install the NVIDIA drivers. The CUDA toolkit package is optional for inference, since the PyTorch cu121 wheels bundle their own CUDA runtime, but it's useful if you later compile custom kernels:

sudo ubuntu-drivers autoinstall
sudo apt install -y nvidia-cuda-toolkit

Reboot if a new driver was installed, then verify:

nvidia-smi

You should see GPU details and driver version.

7. Install Hugging Face Transformers

Install Transformers and dependencies:

pip install transformers accelerate sentencepiece safetensors

8. Install Jupyter Notebook (Optional)

If you prefer interactive development:

pip install jupyter

9. Pin Dependencies for Reproducibility

Create a requirements.txt with pinned versions:

torch==2.4.0+cu121
transformers==4.44.2
accelerate==0.34.2
sentencepiece==0.2.0
safetensors==0.4.5

Install from the file. Note that the +cu121 build lives on the PyTorch index, so pass it explicitly; for CPU-only setups, pin torch==2.4.0 instead and drop the extra index:

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu121

10. Verify Installation

Check PyTorch and CUDA:

import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA device name:", torch.cuda.get_device_name(0))

Validate tokenizer download:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
print("Tokenizer loaded successfully.")

Stop here if these checks fail. You need to resolve installation issues before proceeding. Trust me on this one: I've wasted hours debugging model issues that were actually installation problems.

11. Network Configuration

Secure your server with ufw:

sudo ufw allow 22/tcp
sudo ufw enable

If you're using Jupyter, bind to localhost only and access via SSH tunnel:

jupyter notebook --no-browser --port=8888

On your local machine:

ssh -L 8888:localhost:8888 ubuntu@SERVER_IP

Never bind Jupyter to 0.0.0.0 in production. Seriously, don't do it. I learned this the hard way in a previous role when someone found our unsecured notebook server.

How It Works: High-Level System Overview

FLAN-T5 is a sequence-to-sequence model. It takes a text prompt, encodes it into token IDs, generates output token IDs, and decodes them back to text.

Key integration points:

  • Tokenizer: Converts text to token IDs and back

  • Model: Runs inference on token IDs

  • GenerationConfig: Controls output length, sampling, and repetition

  • Device placement: Moves tensors to CPU or GPU

Why FLAN-T5-base?

  • Instruction-tuned for zero-shot tasks (summarization, Q&A, translation)

  • About 250M parameters fit on modest hardware

  • Apache-2.0 license allows commercial use

Downloading and Running Your First Model

Let's start with a minimal inference pipeline. This script loads FLAN-T5-base, tokenizes a prompt, generates output, and decodes it.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig
import torch

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

Tokenize a prompt and inspect tensor shapes:

prompt = "Question. What is the capital of Italy?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print("Input IDs shape:", inputs["input_ids"].shape)
print("Attention mask shape:", inputs["attention_mask"].shape)

Configure generation parameters for controlled output:

gen_cfg = GenerationConfig(
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1.05
)

Run inference and decode the output:

with torch.inference_mode():
    output_ids = model.generate(**inputs, generation_config=gen_cfg)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("Model output:", text)

Design choices:

  • max_new_tokens=64 limits latency and cost

  • do_sample=True enables sampling; without it, temperature and top_p are ignored

  • temperature=0.7 balances diversity and coherence

  • top_p=0.9 uses nucleus sampling for natural output

  • repetition_penalty=1.05 reduces repetitive phrases

Understanding Model Behavior: Base vs Instruction-Tuned

FLAN-T5-base is instruction-tuned. It follows explicit instructions like "Summarize" or "Translate." Base models like T5-base aren't instruction-tuned and often fail on zero-shot tasks.

Compare outputs on the same prompt:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

def run(model_id: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id).to("cpu")
    inputs = tok(prompt, return_tensors="pt")
    with torch.inference_mode():
        out = mdl.generate(**inputs, max_new_tokens=32)
    return tok.decode(out[0], skip_special_tokens=True)

prompt = "Instruction. List three benefits of version control in software projects."
print("t5-base:", run("t5-base", prompt))
print("flan-t5-base:", run("google/flan-t5-base", prompt))

Expected output:

  • t5-base: Gibberish or incomplete fragments

  • flan-t5-base: "1. Track changes 2. Collaborate 3. Revert errors"

Instruction-tuned models save you from fine-tuning for common tasks. This is a huge time-saver. When I was working on a text summarization project last year, switching from T5 to FLAN-T5 literally saved me weeks of fine-tuning work.

Improving Outputs with In-Context Learning

You'll run a prompt through a fast inference pipeline, then iterate on prompt quality with zero-shot, one-shot, and few-shot patterns. For a deeper dive into how in-context learning can dramatically improve your model's accuracy and flexibility, check out our in-context learning guide.
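
The examples below reuse the pipeline from earlier. Here's a small helper, assuming tokenizer, model, device, and gen_cfg are defined as above; each prompt that follows can then be run with print(generate(prompt)):

def generate(prompt: str) -> str:
    # Tokenize, generate with the shared config, and decode
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.inference_mode():
        out = model.generate(**inputs, generation_config=gen_cfg)
    return tokenizer.decode(out[0], skip_special_tokens=True)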

Zero-Shot Prompting

Provide a task with no examples:

prompt = "Instruction. Summarize the following review in one sentence. Review. The coffee was strong, the staff friendly, but the place was a bit noisy."

Output: "Strong coffee, friendly staff, noisy environment."

Token count: About 30 input, about 10 output

One-Shot Prompting

Provide one example to teach the model a format:

one_shot = """Task. Write a JSON object with keys title and summary.
Input. 'Deploy a Python app on Ubuntu.'
Output. {"title": "Deploy a Python App", "summary": "Install Python, set up a venv, configure service, then monitor."}
Input. 'Set up a Redis cache for Django.'
Output."""

Output: {"title": "Set Up Redis Cache", "summary": "Install Redis, configure Django, test cache."}

Token count: About 60 input, about 20 output

Few-Shot Prompting

Provide multiple examples to reinforce patterns:

few_shot = """Task. Convert a sentence to a short title.
Input. 'Troubleshoot slow PostgreSQL queries.'
Output. 'Fix Slow PostgreSQL Queries'
Input. 'Implement OAuth2 login with FastAPI.'
Output. 'FastAPI OAuth2 Login'
Input. 'Harden Ubuntu SSH for production.'
Output. 'Harden SSH on Ubuntu'
Input. 'Automate backups with S3 lifecycle rules.'
Output. 'Automate S3 Backups'
Input. 'Audit API calls with structured logs.'
Output."""

Output: "Audit API Calls"

Token count: About 120 input, about 5 output

Trade-offs:

  • Zero-shot: Fast, low token cost, less control

  • One-shot: Moderate cost, teaches format

  • Few-shot: High token cost, strongest control

Actually, let me share something interesting. In a personal project where I was categorizing support tickets, I found that two well-chosen examples worked better than five mediocre ones. Quality beats quantity every time.

If you want to refine prompts further and boost reliability, explore these prompt engineering techniques for reliable LLM outputs. If outputs still fall short, evaluate fine-tuning or parameter-efficient fine-tuning. Choose the least invasive method that meets your quality and cost goals. For a step-by-step walkthrough of full fine-tuning with Hugging Face, follow our detailed guide.

Run and Validate Your Self-Hosted LLM

End-to-End CLI Script

This script runs inference from the command line and logs performance:

import sys
import time
import logging
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

logging.basicConfig(level=logging.INFO)

def main():
    model_id = "google/flan-t5-base"
    tok = AutoTokenizer.from_pretrained(model_id)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

    inp = " ".join(sys.argv[1:]) or "Question: What is the capital of Italy?"
    inputs = tok(inp, return_tensors="pt").to(device)

    # Time the generation so latency and token counts reach the logs
    t0 = time.time()
    with torch.inference_mode():
        out = mdl.generate(**inputs, max_new_tokens=64)
    latency = time.time() - t0

    print(tok.decode(out[0], skip_special_tokens=True))
    logging.info("Prompt tokens: %d | Output tokens: %d | Latency: %.2fs",
                 inputs["input_ids"].shape[1], out.shape[1], latency)

if __name__ == "__main__":
    main()

Save as run_flan.py and run:

python run_flan.py "Question. What is the capital of France?"

Measure Latency

Track inference time for performance monitoring:

import time

t0 = time.time()
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)
latency = time.time() - t0
print("Latency (seconds):", latency)

Batch Inference

Process multiple prompts in one pass:

prompts = [
    "Question: Who wrote The Pragmatic Programmer?",
    "Instruction: Translate to French. How are you today?",
    "Instruction: Summarize. Kubernetes manages containers across nodes."
]
]

enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.inference_mode():
    out = model.generate(**enc, max_new_tokens=64)

for i, o in enumerate(out):
    print(f"Prompt {i} output:", tokenizer.decode(o, skip_special_tokens=True))

Enable Logging

Log prompt and output lengths for debugging:

import logging
from transformers.utils import logging as hf_logging

logging.basicConfig(level=logging.INFO)
hf_logging.set_verbosity_info()

logging.info(f"Prompt length: {inputs['input_ids'].shape[1]}")
logging.info(f"Output length: {output_ids.shape[1]}")

Security note: Redact sensitive data in logs. Use log rotation (logrotate) and set retention policies. I once had a script that logged customer prompts to debug an issue. Bad idea. Really bad idea. Always sanitize your logs.
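
If you'd rather handle rotation inside Python, here's a minimal sketch using the standard library's RotatingFileHandler, reusing inputs from earlier; the file name and size limits are placeholder values:

import logging
from logging.handlers import RotatingFileHandler

# Keep ~10 MB per file, 5 rotated files; tune to your retention policy
handler = RotatingFileHandler("inference.log", maxBytes=10_000_000, backupCount=5)
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("Prompt tokens: %d", inputs["input_ids"].shape[1])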

Acceptance Tests

Validate your setup with 5 canonical prompts:

  1. "Question. What is the capital of Italy?" should return "Rome"

  2. "Instruction. Translate to Spanish. Hello." should return "Hola"

  3. "Instruction. Summarize. AI is transforming industries." should return "AI transforms industries"

  4. "Task. List two benefits of Docker." should return "1. Portability 2. Isolation"

  5. "Question. Who invented Python?" should return "Guido van Rossum"

Run each prompt and verify (a minimal automated harness follows the checklist):

  • Exit code 0

  • Output matches expected pattern

  • Latency within thresholds (CPU: less than 15s, GPU: less than 2s)
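
Here's a sketch of that harness. It assumes run_flan.py sits in the current directory and that a case-insensitive substring match (my choice of expected substrings below) is good enough:

import subprocess
import sys
import time

CASES = [
    ("Question: What is the capital of Italy?", "rome"),
    ("Instruction: Translate to Spanish. Hello.", "hola"),
    ("Instruction: Summarize. AI is transforming industries.", "ai"),
    ("Task: List two benefits of Docker.", "portability"),
    ("Question: Who invented Python?", "guido"),
]

failures = 0
for prompt, expected in CASES:
    t0 = time.time()
    result = subprocess.run(
        [sys.executable, "run_flan.py", prompt],
        capture_output=True, text=True,
    )
    latency = time.time() - t0
    ok = result.returncode == 0 and expected in result.stdout.lower()
    print(f"{'PASS' if ok else 'FAIL'} ({latency:.1f}s): {prompt}")
    failures += 0 if ok else 1

sys.exit(1 if failures else 0)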

Scaling Considerations

GPU Optimization

Use fp16 for faster inference and lower memory:

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
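To confirm the saving, you can check the model's in-memory footprint; get_memory_footprint is a method on recent transformers model classes:

print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")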

Enable TF32 on Ampere GPUs:

torch.backends.cuda.matmul.allow_tf32 = True

Warning: Don't use fp16 on CPU. It will slow down inference. Found this out the hard way when I tried to optimize a CPU deployment and made it 3x slower.

Batch Processing

Increase throughput by processing multiple prompts at once:

batch = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
).to(device)

with torch.inference_mode():
    outputs = model.generate(**batch, max_new_tokens=64)

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print("Batch outputs:", decoded)

Trade-off: Larger batches increase throughput but also increase latency per prompt.

If you plan to scale beyond prompts and need retrieval to ground generations, consider implementing vector store retrieval for RAG systems. This helps reduce hallucinations and improves factual accuracy at scale.

Offline and Air-Gapped Environments

Download models once and cache them:

export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache
export HF_HUB_OFFLINE=1
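
You can also enforce offline loading at the call site. A small sketch, assuming the model was already downloaded into the cache above:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# local_files_only raises an error instead of hitting the network if files are missing
tok = AutoTokenizer.from_pretrained("google/flan-t5-base", local_files_only=True)
mdl = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", local_files_only=True)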

For proxy environments, set HTTP_PROXY and HTTPS_PROXY before downloading models.

Advanced Topics

Deterministic Inference

Set seeds for reproducible outputs:

torch.manual_seed(42)

Limitation: GPU inference isn't fully deterministic. Use torch.use_deterministic_algorithms(True) for stricter control, but expect slower performance.
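
Here's a sketch of the stricter setup. Per the PyTorch docs, cuBLAS also needs CUBLAS_WORKSPACE_CONFIG set before the first CUDA call:

import os
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for deterministic cuBLAS
torch.manual_seed(42)
torch.use_deterministic_algorithms(True)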

Parameter-Efficient Fine-Tuning

If you need consistent schemas, domain-specific style, or complex transformations, in-context learning might not be enough. When you hit the limits of prompting, evaluate parameter-efficient fine-tuning like LoRA using PEFT; check out our hands-on PEFT and LoRA guide. This reduces training cost and memory while giving strong control over output behavior.

License and Compliance

FLAN-T5-base is licensed under Apache-2.0, which allows commercial use. Review the model card on Hugging Face for restrictions. For production deployments, confirm internal approval and audit requirements.

Conclusion and Next Steps

So there you have it. You've deployed FLAN-T5-base on Ubuntu, run inference, and validated outputs. You learned to measure latency, batch prompts, and improve quality with in-context learning.

Key architectural decisions:

  • Seq2seq model for instruction-following tasks

  • CPU-first baseline for cost control

  • Instruction-tuned model to avoid fine-tuning

Next steps:

  • Target 5x latency reduction with GPU and fp16

  • Wrap the script with FastAPI and enforce JWT auth

  • Explore building reliable LLM workflows with LangChain

  • Implement log rotation and PII redaction for production

  • Evaluate LoRA fine-tuning for domain-specific tasks

You now have a working, self-hosted LLM pipeline. And honestly, that's no small achievement. The first time I got a model running locally, generating coherent text on my own hardware, it felt like magic. Now it's your turn to scale it, secure it, and deploy it with confidence.