How to Run Your Own Self-Hosted LLM on a Server: A Practical Guide
Ship a secure self-hosted LLM on Ubuntu: size the hardware, install PyTorch and Transformers, run FLAN-T5 inference, measure latency and throughput, and improve outputs with prompting.
Sometimes you have perfectly good reasons to run language models on your own hardware. Maybe your data can’t leave your network. Maybe you’re tired of API caps. Or maybe you just want full control over how your model behaves.
As noted in the article on 9 principles for reliable, scalable AI agent systems, “for critical agents… you need three backups. Two cloud, one on-premises.” I’ve been running models locally for a while now, and the benefits are real. No rate limits, no surprise bills, and no wondering where your data is going.
Today I want to show you exactly how to get FLAN-T5 running on your own Ubuntu server. This is the same setup I’ve used in personal projects where data privacy was non-negotiable.

Here's what we're going to build together:
Install PyTorch, Transformers, and NVIDIA drivers on Ubuntu
Load and run FLAN-T5-base for text generation
Measure latency and throughput for CPU and GPU
Improve outputs with zero-shot, one-shot, and few-shot prompting
Validate your setup with reproducible acceptance tests
Prerequisites: You should be comfortable with Ubuntu, SSH, and Python virtual environments. You'll need access to an Ubuntu 22.04 server with 8 to 16 GB RAM. If you have an NVIDIA GPU with 4+ GB VRAM, even better.
Out of scope: We won't cover API serving, fine-tuning, or production deployment. This guide focuses on getting a single, working inference pipeline up and running.
What We're Building
A command-line inference script that accepts a prompt, tokenizes it, runs it through FLAN-T5-base, and returns the generated text. The script logs latency and token counts for performance monitoring.
System flow:
Prompt → Tokenizer → Model (CPU/GPU) → Decode → Output + Logs
Deliverable: A Python script (run_flan.py) that runs inference on any prompt and exits with code 0 on success.
Success criteria:
FLAN-T5-base generates 32 tokens in less than 2 seconds on g5.xlarge, less than 15 seconds on t3.xlarge
Script returns exit code 0 on 5 predefined prompts
Logs capture prompt length, output length, and latency
Choosing the Right Hardware
FLAN-T5-base has roughly 250M parameters. In fp32, the weights take about 1 GB of memory; in fp16, about 0.5 GB. But here's the thing: you need to budget extra headroom, roughly 2 to 3 times the weight footprint, for the KV cache and activations. So plan for about 1.5 to 2 GB of VRAM or RAM per concurrent sequence.
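If you want a quick sanity check on this sizing math, the arithmetic fits in a few lines of Python. The numbers below are the assumptions stated above (about 250M parameters, 2 to 3 times overhead), not measured values:
# Back-of-the-envelope memory sizing for FLAN-T5-base.
# Assumptions: ~250M parameters and a rough 2-3x per-sequence overhead
# for activations and KV cache on top of the weights.
PARAMS = 250_000_000
for dtype, nbytes in {"fp32": 4, "fp16": 2}.items():
    weights_gb = PARAMS * nbytes / 1e9
    budget_low, budget_high = 2 * weights_gb, 3 * weights_gb
    print(f"{dtype}: weights ~{weights_gb:.2f} GB, "
          f"rough per-sequence budget ~{budget_low:.1f}-{budget_high:.1f} GB")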
CPU baseline:
AWS t3.large (2 vCPU, 8 GB RAM): About 10 to 15 seconds per 32-token generation
AWS t3.xlarge (4 vCPU, 16 GB RAM): About 5 to 10 seconds per 32-token generation
GPU acceleration:
AWS g5.xlarge (1x A10G, 24 GB VRAM): Less than 2 seconds per 32-token generation
Supports fp16 and batch inference for higher throughput
I'd recommend starting with CPU to validate the pipeline, then moving to GPU if latency becomes critical. Actually, when I first experimented with this setup on a personal project, I ran everything on CPU for weeks before realizing I needed the speed boost.
Setup & Installation
1. Access Your Server
SSH into your Ubuntu 22.04 server:
ssh -i ~/.ssh/id_rsa ubuntu@SERVER_IP
2. Update System Packages
Update Ubuntu and install build essentials:
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl wget ca-certificates
sudo apt install -y python3 python3-venv python3-pip
3. Create a Python Virtual Environment
Create and activate a virtual environment:
python3 -m venv ~/llm-venv
source ~/llm-venv/bin/activate
4. Upgrade pip and wheel
Ensure clean installs:
python -m pip install --upgrade pip wheel
5. Install PyTorch
For CPU:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
For GPU (CUDA 12.1):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
6. Install NVIDIA Drivers (GPU Only)
If you're using a GPU, install the NVIDIA drivers. The pip-installed PyTorch wheels bundle their own CUDA runtime, so the system CUDA toolkit below is optional:
sudo ubuntu-drivers autoinstall
sudo apt install -y nvidia-cuda-toolkit
Reboot if the driver was just installed, then verify:
nvidia-smi
You should see GPU details and driver version.
7. Install Hugging Face Transformers
Install Transformers and dependencies:
pip install transformers accelerate sentencepiece safetensors
8. Install Jupyter Notebook (Optional)
If you prefer interactive development:
pip install jupyter
9. Pin Dependencies for Reproducibility
Create a requirements.txt with pinned versions:
torch==2.4.0+cu121
transformers==4.44.2
accelerate==0.34.2
sentencepiece==0.2.0
safetensors==0.4.5
Install from the file, using the same --index-url as in step 5 so pip can resolve the +cu121 (or +cpu) builds:
pip install -r requirements.txt
10. Verify Installation
Check PyTorch and CUDA:
import torch
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device count:", torch.cuda.device_count())
    print("CUDA device name:", torch.cuda.get_device_name(0))
Validate tokenizer download:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
print("Tokenizer loaded successfully.")
Stop here if these checks fail. You need to resolve installation issues before proceeding. Trust me on this one: I've wasted hours debugging model issues that were actually installation problems.
11. Network Configuration
Secure your server with ufw:
sudo ufw allow 22/tcp
sudo ufw enable
If you're using Jupyter, bind to localhost only and access via SSH tunnel:
jupyter notebook --no-browser --port=8888
On your local machine:
ssh -L 8888:localhost:8888 ubuntu@SERVER_IP
Never bind Jupyter to 0.0.0.0 in production. Seriously, don't do it. I learned this the hard way in a previous role when someone found our unsecured notebook server.
How It Works: High-Level System Overview
FLAN-T5 is a sequence-to-sequence model. It takes a text prompt, encodes it into token IDs, generates output token IDs, and decodes them back to text.
Key integration points:
Tokenizer: Converts text to token IDs and back
Model: Runs inference on token IDs
GenerationConfig: Controls output length, sampling, and repetition
Device placement: Moves tensors to CPU or GPU
Why FLAN-T5-base?
Instruction-tuned for zero-shot tasks (summarization, Q&A, translation)
About 250M parameters fit on modest hardware
Apache-2.0 license allows commercial use
Downloading and Running Your First Model
Let's start with a minimal inference pipeline. This script loads FLAN-T5-base, tokenizes a prompt, generates output, and decodes it.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig
import torch
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
Tokenize a prompt and inspect tensor shapes:
prompt = "Question. What is the capital of Italy?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
print("Input IDs shape:", inputs["input_ids"].shape)
print("Attention mask shape:", inputs["attention_mask"].shape)
Configure generation parameters for controlled output:
gen_cfg = GenerationConfig(
    max_new_tokens=64,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1.05
)
Run inference and decode the output:
with torch.inference_mode():
    output_ids = model.generate(**inputs, generation_config=gen_cfg)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Model output:", text)
Design choices:
max_new_tokens=64 limits latency and cost
do_sample=True enables sampling; without it, temperature and top_p are ignored
temperature=0.7 balances diversity and coherence
top_p=0.9 uses nucleus sampling for natural output
repetition_penalty=1.05 reduces repetitive phrases
Understanding Model Behavior: Base vs Instruction-Tuned
FLAN-T5-base is instruction-tuned. It follows explicit instructions like "Summarize" or "Translate." Base models like T5-base aren't instruction-tuned and often fail on zero-shot tasks.
Compare outputs on the same prompt:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
def run(model_id: str, prompt: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id).to("cpu")
    inputs = tok(prompt, return_tensors="pt")
    with torch.inference_mode():
        out = mdl.generate(**inputs, max_new_tokens=32)
    return tok.decode(out[0], skip_special_tokens=True)
prompt = "Instruction. List three benefits of version control in software projects."
print("t5-base:", run("t5-base", prompt))
print("flan-t5-base:", run("google/flan-t5-base", prompt))
Expected output:
t5-base: Gibberish or incomplete fragments
flan-t5-base: "1. Track changes 2. Collaborate 3. Revert errors"
Instruction-tuned models save you from fine-tuning for common tasks. This is a huge time-saver. When I was working on a text summarization project last year, switching from T5 to FLAN-T5 literally saved me weeks of fine-tuning work.
Improving Outputs with In-Context Learning
You'll run a prompt through a fast inference pipeline, then iterate on prompt quality with zero-shot, one-shot, and few-shot patterns. For a deeper dive into how in-context learning can dramatically improve your model's accuracy and flexibility, check out our in-context learning guide.
Zero-Shot Prompting
Provide a task with no examples:
prompt = "Instruction. Summarize the following review in one sentence. Review. The coffee was strong, the staff friendly, but the place was a bit noisy."
Output: "Strong coffee, friendly staff, noisy environment."
Token count: About 30 input, about 10 output
One-Shot Prompting
Provide one example to teach the model a format:
one_shot = """Task. Write a JSON object with keys title and summary.
Input. 'Deploy a Python app on Ubuntu.'
Output. {"title": "Deploy a Python App", "summary": "Install Python, set up a venv, configure service, then monitor."}
Input. 'Set up a Redis cache for Django.'
Output."""
Output: {"title": "Set Up Redis Cache", "summary": "Install Redis, configure Django, test cache."}
Token count: About 60 input, about 20 output
Few-Shot Prompting
Provide multiple examples to reinforce patterns:
few_shot = """Task. Convert a sentence to a short title.
Input. 'Troubleshoot slow PostgreSQL queries.'
Output. 'Fix Slow PostgreSQL Queries'
Input. 'Implement OAuth2 login with FastAPI.'
Output. 'FastAPI OAuth2 Login'
Input. 'Harden Ubuntu SSH for production.'
Output. 'Harden SSH on Ubuntu'
Input. 'Automate backups with S3 lifecycle rules.'
Output. 'Automate S3 Backups'
Input. 'Audit API calls with structured logs.'
Output."""
Output: "Audit API Calls"
Token count: About 120 input, about 5 output
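To compare the three patterns on your own box, you can reuse the tokenizer, model, and device from the earlier pipeline along with the prompt, one_shot, and few_shot strings above. A minimal sketch; the run_prompt helper is just for illustration:
# Assumes tokenizer, model, device, prompt, one_shot, and few_shot
# are already defined in the session, as in the sections above.
def run_prompt(p: str) -> str:
    enc = tokenizer(p, return_tensors="pt").to(device)
    with torch.inference_mode():
        out = model.generate(**enc, max_new_tokens=64)
    print(f"Input tokens: {enc['input_ids'].shape[1]}, output tokens: {out.shape[1]}")
    return tokenizer.decode(out[0], skip_special_tokens=True)

for name, p in [("zero-shot", prompt), ("one-shot", one_shot), ("few-shot", few_shot)]:
    print(name, "->", run_prompt(p))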
Trade-offs:
Zero-shot: Fast, low token cost, less control
One-shot: Moderate cost, teaches format
Few-shot: High token cost, strongest control
Actually, let me share something interesting. In a personal project where I was categorizing support tickets, I found that two well-chosen examples worked better than five mediocre ones. Quality beats quantity every time.
If you want to refine prompts further and boost reliability, explore these prompt engineering techniques for reliable LLM outputs. If outputs still fall short, evaluate fine-tuning or parameter-efficient fine-tuning. Choose the least invasive method that meets your quality and cost goals. For a step-by-step walkthrough of full fine-tuning with Hugging Face, follow our detailed guide.
Run and Validate Your Self-Hosted LLM
End-to-End CLI Script
This script runs inference from the command line and logs performance:
import sys
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
def main():
    model_id = "google/flan-t5-base"
    tok = AutoTokenizer.from_pretrained(model_id)
    mdl = AutoModelForSeq2SeqLM.from_pretrained(model_id).to("cuda" if torch.cuda.is_available() else "cpu")
    inp = " ".join(sys.argv[1:]) or "Question. What is the capital of Italy?"
    inputs = tok(inp, return_tensors="pt").to(mdl.device)
    with torch.inference_mode():
        out = mdl.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))

if __name__ == "__main__":
    main()
Save as run_flan.py and run:
python run_flan.py "Question. What is the capital of France?"
Measure Latency
Track inference time for performance monitoring:
import time
t0 = time.time()
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)
latency = time.time() - t0
print("Latency (seconds):", latency)
Batch Inference
Process multiple prompts in one pass:
prompts = [
    "Question. Who wrote The Pragmatic Programmer?",
    "Instruction. Translate to French. How are you today?",
    "Instruction. Summarize. Kubernetes manages containers across nodes."
]
enc = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(device)
with torch.inference_mode():
    out = model.generate(**enc, max_new_tokens=64)
for i, o in enumerate(out):
    print(f"Prompt {i} output:", tokenizer.decode(o, skip_special_tokens=True))
Enable Logging
Log prompt and output lengths for debugging:
import logging
from transformers.utils import logging as hf_logging
logging.basicConfig(level=logging.INFO)
hf_logging.set_verbosity_info()
logging.info(f"Prompt length: {inputs['input_ids'].shape[1]}")
logging.info(f"Output length: {output_ids.shape[1]}")
Security note: Redact sensitive data in logs. Use log rotation (logrotate) and set retention policies. I once had a script that logged customer prompts to debug an issue. Bad idea. Really bad idea. Always sanitize your logs.
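What counts as sensitive depends on your data. As one minimal sketch, reusing the logging setup above, here's a regex filter that masks email addresses before they reach the logs; the pattern and the redact helper are illustrative, not a complete PII solution:
import re

# Illustrative only: masks email addresses. Real redaction needs patterns
# (or a dedicated library) matched to the data you actually handle.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

logging.info("Prompt: %s", redact("Summarize the ticket from jane.doe@example.com"))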
Acceptance Tests
Validate your setup with 5 canonical prompts:
"Question. What is the capital of Italy?" should return "Rome"
"Instruction. Translate to Spanish. Hello." should return "Hola"
"Instruction. Summarize. AI is transforming industries." should return "AI transforms industries"
"Task. List two benefits of Docker." should return "1. Portability 2. Isolation"
"Question. Who invented Python?" should return "Guido van Rossum"
Run each prompt and verify:
Exit code 0
Output matches expected pattern
Latency within thresholds (CPU: less than 15s, GPU: less than 2s)
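A small driver script can automate these checks. Here's a sketch, assuming run_flan.py from above sits in the current directory and that loose substring matching is good enough for acceptance. Note that the elapsed time below includes model load on every run, so keep the threshold coarse and read pure inference latency from the script's own logs:
# Acceptance-test sketch: run run_flan.py on the five prompts and check
# exit code, an expected substring, and a coarse wall-clock threshold.
import subprocess
import sys
import time

CASES = [
    ("Question. What is the capital of Italy?", "Rome"),
    ("Instruction. Translate to Spanish. Hello.", "Hola"),
    ("Instruction. Summarize. AI is transforming industries.", "AI"),
    ("Task. List two benefits of Docker.", "Portability"),
    ("Question. Who invented Python?", "Guido"),
]
MAX_SECONDS = 60  # coarse: includes model load time on every run

failures = 0
for prompt, expected in CASES:
    t0 = time.time()
    result = subprocess.run([sys.executable, "run_flan.py", prompt],
                            capture_output=True, text=True)
    elapsed = time.time() - t0
    ok = (result.returncode == 0
          and expected.lower() in result.stdout.lower()
          and elapsed <= MAX_SECONDS)
    print(f"{'PASS' if ok else 'FAIL'} ({elapsed:.1f}s): {prompt}")
    if not ok:
        failures += 1

sys.exit(1 if failures else 0)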
Scaling Considerations
GPU Optimization
Use fp16 for faster inference and lower memory:
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
Enable TF32 on Ampere GPUs:
torch.backends.cuda.matmul.allow_tf32 = True
Warning: Don't use fp16 on CPU. It will slow down inference. Found this out the hard way when I tried to optimize a CPU deployment and made it 3x slower.
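Before committing to fp16, it's worth measuring the difference on your own GPU. A rough sketch, assuming CUDA is available; the first generate call per model is a warm-up and isn't timed, and absolute numbers vary by card:
import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Rough fp32 vs fp16 generation timing on GPU. Assumes CUDA is available.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Question. What is the capital of Italy?", return_tensors="pt").to("cuda")

for dtype in (torch.float32, torch.float16):
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=dtype).to("cuda")
    with torch.inference_mode():
        model.generate(**inputs, max_new_tokens=64)  # warm-up, not timed
        torch.cuda.synchronize()
        t0 = time.time()
        model.generate(**inputs, max_new_tokens=64)
        torch.cuda.synchronize()
    print(f"{dtype}: {time.time() - t0:.2f} s")
    del model
    torch.cuda.empty_cache()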
Batch Processing
Increase throughput by processing multiple prompts at once:
batch = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
).to(device)
with torch.inference_mode():
    outputs = model.generate(**batch, max_new_tokens=64)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print("Batch outputs:", decoded)
Trade-off: Larger batches increase throughput but also increase latency per prompt.
If you plan to scale beyond prompts and need retrieval to ground generations, consider implementing vector store retrieval for RAG systems. This helps reduce hallucinations and improves factual accuracy at scale.
Offline and Air-Gapped Environments
Download the model once while the server still has network access, then point the Hugging Face cache at a fixed location and enable offline mode for later runs:
export HF_HOME=/path/to/cache
export TRANSFORMERS_CACHE=/path/to/cache
export HF_HUB_OFFLINE=1
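The offline flag only helps once the weights are already cached, so do a one-time warm-up download before exporting HF_HUB_OFFLINE=1. A minimal sketch:
# One-time warm-up: populate the local cache while still online.
# Run before exporting HF_HUB_OFFLINE=1.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-base"
AutoTokenizer.from_pretrained(model_name)
AutoModelForSeq2SeqLM.from_pretrained(model_name)
print("Model and tokenizer cached under HF_HOME.")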
For proxy environments, set HTTP_PROXY and HTTPS_PROXY before downloading models.
Advanced Topics
Deterministic Inference
Set seeds for reproducible outputs:
torch.manual_seed(42)
Limitation: GPU inference isn't fully deterministic. Use torch.use_deterministic_algorithms(True) for stricter control, but expect slower performance.
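A quick way to verify reproducibility on your setup: fix the seed, generate twice with the sampling config from earlier, and compare. This sketch reuses model, tokenizer, inputs, and gen_cfg from above; on GPU, expect occasional mismatches unless you enable deterministic algorithms:
# Reproducibility check: same seed and sampling config should give the same text.
# Reuses model, tokenizer, inputs, and gen_cfg from the earlier pipeline.
outputs = []
for _ in range(2):
    torch.manual_seed(42)
    with torch.inference_mode():
        ids = model.generate(**inputs, generation_config=gen_cfg)
    outputs.append(tokenizer.decode(ids[0], skip_special_tokens=True))

print("Identical outputs:", outputs[0] == outputs[1])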
Parameter-Efficient Fine-Tuning
If you need consistent schemas, domain-specific style, or complex transformations, in-context learning might not be enough. Actually, let me put it this way. When you hit the limits of prompting, evaluate parameter-efficient fine-tuning like LoRA using PEFT. Check out our hands-on PEFT and LoRA guide. This reduces training cost and memory while giving strong control over output behavior.
License and Compliance
FLAN-T5-base is licensed under Apache-2.0, which allows commercial use. Review the model card on Hugging Face for restrictions. For production deployments, confirm internal approval and audit requirements.
Conclusion and Next Steps
So there you have it. You've deployed FLAN-T5-base on Ubuntu, run inference, and validated outputs. You learned to measure latency, batch prompts, and improve quality with in-context learning.
Key architectural decisions:
Seq2seq model for instruction-following tasks
CPU-first baseline for cost control
Instruction-tuned model to avoid fine-tuning
Next steps:
Target 5x latency reduction with GPU and fp16
Wrap the script with FastAPI and enforce JWT auth
Implement log rotation and PII redaction for production
Evaluate LoRA fine-tuning for domain-specific tasks
You now have a working, self-hosted LLM pipeline. And honestly, that's no small achievement. The first time I got a model running locally, generating coherent text on my own hardware, it felt like magic. Now it's your turn to scale it, secure it, and deploy it with confidence.