I've been working with PDF extraction for a while now, and let me tell you, it's one of those problems that seem simple until you actually try to solve them. You know what I mean? You think "just extract the text" and then you end up with scrambled tables and paragraphs that jump between columns like they're playing hopscotch.

So here's what I figured out: combine PyMuPDF for rendering with GPT-4o's vision capabilities. This approach actually preserves your headings, tables, lists, even those annoying multi-column layouts that usually turn into word soup.


Prerequisites are pretty straightforward: Python 3.10+, an OpenAI API key, and either Colab or a local environment. Cost-wise, you're looking at about $0.01–0.05 per page at 200 DPI. Processing time runs around 5–10 seconds per page, depending on how complex your content is.

One thing to note—this works best with digital-native PDFs. If you're dealing with scanned documents that have no embedded text, the whole thing relies on vision capabilities, which... well, let's just say the accuracy takes a hit with dense or low-quality scans.
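
If you're not sure which camp your PDF falls into, here's a quick check using the same PyMuPDF calls the pipeline relies on. It counts how many pages carry an embedded text layer; the 25-character and 50% thresholds are my own rough assumptions, so tune them for your documents.

import fitz  # PyMuPDF

def has_embedded_text(pdf_path: str, min_chars_per_page: int = 25) -> bool:
    """Rough check: does the PDF carry a usable text layer on most pages?"""
    doc = fitz.open(pdf_path)
    # Count pages whose text layer has at least min_chars_per_page characters
    pages_with_text = sum(
        1 for page in doc if len(page.get_text("text").strip()) >= min_chars_per_page
    )
    total = len(doc)
    doc.close()
    # Treat the PDF as digital-native if at least half its pages have real text
    return total > 0 and pages_with_text / total >= 0.5

# has_embedded_text("sample.pdf")  # replace with your own file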

Why This Approach Works

Pure OCR tools like Tesseract recover the characters but throw away the structure. I learned this the hard way after spending hours trying to fix broken table outputs. Tables break apart, headings get flattened into regular text, and don't even get me started on multi-column layouts: they scramble into an unreadable mess that makes you question your life choices.

And text-only extraction isn't much better. Sure, you get the words, but you lose all those visual cues. The borders, the column flow, the actual structure that makes a document readable.

What I discovered is that this hybrid approach feeds GPT-4o both pieces of the puzzle—the embedded text for accuracy and the page image for layout understanding. The model then reconstructs everything faithfully. You actually get Markdown that mirrors the original document's hierarchy and reading order. If you've been struggling with invisible characters or weird tokenization issues breaking your pipeline, check out our guide on tokenization pitfalls and invisible characters in prompts and RAG for some normalization strategies that saved my sanity.

For those of you looking to extract structured data directly from documents (invoices, forms, that sort of thing), we've got a walkthrough on building a structured data extraction pipeline with LLMs that might be helpful.

Now, about trade-offs—GPT-4o definitely costs more than traditional OCR. At 200 DPI, a 10-page document will run you somewhere between $0.10 and $0.50. But honestly? The fidelity is worth it. For large batches though, you'll want parallel processing, which I'll cover later.

How It Works

The process breaks down into four main steps:

  1. Render pages to images – PyMuPDF converts each page to PNG at 200 DPI. This preserves the visual layout exactly as it appears.

  2. Extract embedded text – PyMuPDF pulls any text layer from the PDF. This gives us the accurate character data we need.

  3. Transcribe with GPT-4o – We send both the image and text to GPT-4o with a structured prompt. The model spits out clean Markdown.

  4. Assemble final document – Take all those per-page Markdown chunks and concatenate them into one file.

Setup

First, let's get our dependencies sorted. Install everything in one go:

!pip install --quiet pymupdf pillow "openai>=1.40.0,<2"

Next up, your OpenAI API key. If you're using Colab, here's what you need:

import os
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Working locally? Just export the key in your shell:

export OPENAI_API_KEY='your-key-here'

Configuration

I like to keep all my settings in one place. Makes it easier to tweak things later without hunting through the code:

CONFIG = {
    "dpi": 200,
    "model": "gpt-4o",
    "temperature": 0.0,
    "max_retries": 5,
    "initial_backoff": 2.0,
}

Step 1: Render Pages to Images

PyMuPDF renders each page as a PNG at 200 DPI. Why 200 DPI? I tested a bunch of different resolutions, and this one hits the sweet spot between quality and payload size. Go higher and you're just burning money without much improvement. Trust me on this one.

import logging
from pathlib import Path
from typing import List

import fitz  # PyMuPDF
from PIL import Image

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")

def ensure_dir(path: Path) -> None:
    """Create directory if it doesn't exist."""
    path.mkdir(parents=True, exist_ok=True)

def prepare_output_dirs(pdf_path: Path):
    """Set up output directories for images, text, and cache."""
    pdf_stem = pdf_path.stem
    base_dir = Path("output") / pdf_stem
    images_dir = base_dir / "images"
    txt_dir = base_dir / "txt"
    cache_dir = base_dir / ".cache"
    ensure_dir(images_dir)
    ensure_dir(txt_dir)
    ensure_dir(cache_dir)
    return base_dir, images_dir, txt_dir, cache_dir

def convert_pages_to_images(pdf_path: Path, images_dir: Path, dpi: int = 200) -> List[Path]:
    """Render each PDF page as a PNG at the specified DPI."""
    doc = fitz.open(pdf_path)
    images = []
    scale = dpi / 72  # PDF default is 72 DPI
    matrix = fitz.Matrix(scale, scale)
    
    for page_index in range(len(doc)):
        page = doc[page_index]
        # Render without alpha channel to reduce payload size
        pix = page.get_pixmap(matrix=matrix, alpha=False)
        
        # Guard against blank or failed renders
        if pix.width == 0 or pix.height == 0:
            logging.warning(f"Page {page_index+1}: Rendering failed or empty page, skipping.")
            continue
        
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        out_path = images_dir / f"page_{page_index + 1:03d}.png"
        img.save(out_path, format="PNG", optimize=True)
        images.append(out_path)
        logging.info(f"Saved image: {out_path} ({out_path.stat().st_size // 1024} KB)")
    
    doc.close()
    return images
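
If you want to check the quality-versus-payload trade-off on your own documents, here's a small sketch that reuses convert_pages_to_images to render the same PDF at a few resolutions and report the total image size on disk. The "sample.pdf" path and the DPI list are placeholders for whatever you want to test.

def compare_dpi_settings(pdf_path: str, dpis=(100, 150, 200, 300)) -> None:
    """Render the same PDF at several DPIs and print the total PNG payload for each."""
    for dpi in dpis:
        out_dir = Path("output") / "dpi_test" / str(dpi)
        ensure_dir(out_dir)
        images = convert_pages_to_images(Path(pdf_path), out_dir, dpi=dpi)
        total_kb = sum(p.stat().st_size for p in images) // 1024
        print(f"{dpi} DPI: {len(images)} pages, {total_kb} KB total")

# compare_dpi_settings("sample.pdf")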

Step 2: Extract Embedded Text

This step pulls any text layer from the PDF. It gives GPT-4o the accurate character data it needs to anchor its transcription. Skip this and you're basically asking the model to work blindfolded—not ideal.

def extract_page_texts(pdf_path: Path, txt_dir: Path) -> List[Path]:
    """Extract embedded text from each page and save as .txt files."""
    doc = fitz.open(pdf_path)
    txt_files = []
    
    for page_index in range(len(doc)):
        page = doc[page_index]
        text = page.get_text("text") or ""
        text = text.replace("\r\n", "\n").strip()
        
        # An empty file signals "no embedded text" so the transcription step relies on the image alone
        if not text:
            logging.warning(f"Page {page_index + 1}: no embedded text found; relying on the page image.")
        
        out_path = txt_dir / f"page_{page_index + 1:03d}.txt"
        out_path.write_text(text, encoding="utf-8")
        txt_files.append(out_path)
        logging.info(f"Saved text: {out_path}")
    
    doc.close()
    return txt_files

Step 3: Transcribe with GPT-4o

Here's where the magic happens. We send both the page image and the extracted text to GPT-4o. The model uses the image to understand layout and the text for accuracy. It's actually pretty clever how it combines both signals—like having two different perspectives on the same problem.

import base64
import time
from openai import OpenAI

# Initialize client with environment variable
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def encode_image_to_data_url(image_path: Path) -> str:
    """Encode PNG as base64 data URL for API input."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

SYSTEM_PROMPT = (
    "You are a meticulous document transcription engine. "
    "Transcribe each page into clean, well-structured Markdown. "
    "Preserve headings, lists, tables, and reading order. "
    "Use the extracted text for accuracy but follow the visual layout from the image. "
    "Do not hallucinate content. If content is illegible, mark it clearly. "
    "If extracted text is empty or indicates no embedded text, rely solely on the image."
)

USER_TEMPLATE = (
    "Use both the page image and the extracted text below. "
    "Reconstruct the document faithfully into Markdown.\n\n"
    "Extracted text:\n\n{page_text}"
)

def transcribe_page_with_gpt4o(
    client: OpenAI,
    image_path: Path,
    text_path: Path,
    model: str = "gpt-4o",
    temperature: float = 0.0,
    max_retries: int = 5,
    initial_backoff: float = 2.0,
) -> str:
    """Send multimodal request to GPT-4o for Markdown transcription with retry logic."""
    data_url = encode_image_to_data_url(image_path)
    page_text = text_path.read_text(encoding="utf-8")
    user_text = USER_TEMPLATE.format(page_text=page_text if page_text else "[No embedded text]")
    
    for attempt in range(max_retries):
        try:
            resp = client.chat.completions.create(
                model=model,
                temperature=temperature,
                messages=[
                    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": user_text},
                            {"type": "image_url", "image_url": {"url": data_url}},
                        ],
                    },
                ],
            )
            # content can technically be None; normalize to an empty string
            content = resp.choices[0].message.content or ""
            return content.strip()
        except Exception as e:
            if attempt == max_retries - 1:
                logging.error(
                    f"GPT-4o request failed on final attempt ({max_retries}/{max_retries}): "
                    f"{type(e).__name__} - {e}"
                )
                raise
            wait = initial_backoff * (2 ** attempt)
            logging.warning(
                f"GPT-4o request failed (attempt {attempt + 1}/{max_retries}): {type(e).__name__} - {e}. "
                f"Retrying in {wait:.1f}s."
            )
            time.sleep(wait)

Step 4: Assemble the Final Document

Now we just loop through all the pages, transcribe each one, and stitch them together into one Markdown file:

def process_pdf(pdf_path_str: str, dpi: int = 200) -> Path:
    """End-to-end pipeline: render, extract, transcribe, assemble."""
    pdf_path = Path(pdf_path_str).expanduser().resolve()
    base_dir, images_dir, txt_dir, _ = prepare_output_dirs(pdf_path)
    
    # Render and extract
    image_files = convert_pages_to_images(pdf_path, images_dir, dpi=dpi)
    text_files = extract_page_texts(pdf_path, txt_dir)
    
    if len(image_files) != len(text_files):
        raise RuntimeError("Mismatch between number of images and text files.")
    
    # Transcribe each page
    page_markdowns = []
    for idx, (img, txt) in enumerate(zip(image_files, text_files), start=1):
        logging.info(f"Transcribing page {idx}/{len(image_files)}: {img.name}")
        try:
            md = transcribe_page_with_gpt4o(client, img, txt, model=CONFIG["model"], temperature=CONFIG["temperature"])
        except Exception as e:
            logging.error(f"Transcription failed for page {idx}: {e}")
            md = "[Transcription failed for this page.]"
        page_markdowns.append(f"---\n\n## Page {idx}\n\n{md}\n")
    
    # Write final transcript
    transcript_path = base_dir / "transcript.md"
    transcript_path.write_text("\n".join(page_markdowns), encoding="utf-8")
    logging.info(f"Wrote transcript: {transcript_path}")
    return transcript_path

Run and Validate

Time to test this thing out. If you're in Colab, upload a sample PDF:

from google.colab import files
uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

Then run the pipeline:

transcript = process_pdf(pdf_path, dpi=CONFIG["dpi"])
print(f"Transcript saved to: {transcript}")

Let's inspect what we got:

from itertools import islice

def list_dir(p: Path, limit: int = 10) -> None:
    """List up to `limit` files in a directory."""
    files_list = sorted(p.glob("*"))
    for f in islice(files_list, 0, limit):
        print(f.name)
    if len(files_list) > limit:
        print(f"... and {len(files_list) - limit} more")

base = Path("output") / Path(pdf_path).stem
print("Images:")
list_dir(base / "images")
print("\nText files:")
list_dir(base / "txt")

Want to see one of the rendered pages?

from IPython.display import display

img_path = base / "images" / "page_001.png"
img = Image.open(img_path)
display(img)

Check out a sample of the extracted text:

txt_path = base / "txt" / "page_001.txt"
print(txt_path.read_text(encoding="utf-8")[:1000])

And here are the first 80 lines of your final Markdown:

transcript_md = (base / "transcript.md").read_text(encoding="utf-8")
print("\n".join(transcript_md.splitlines()[:80]))

I always add some basic validation too. Better to catch issues early:

# Every page gets a "## Page N" header, so also check that no pages fell back to the failure placeholder
assert "#" in transcript_md, "No headings found in transcript"
assert "[Transcription failed for this page.]" not in transcript_md, "One or more pages failed to transcribe"
logging.info("Validation passed: headings present and no failed pages.")

What You Get

You end up with a per-page Markdown transcription assembled into one clean document. Headings map correctly to their original styles, lists stay intact, tables remain coherent, and—this is the big one—multi-column content actually reads in the right order.

Let me be more specific here. You get clean Markdown that you can immediately drop into your documentation system, knowledge base, whatever you're building. No post-processing, no cleanup, no fixing broken formatting. It just works.

Actually, if you're curious about how LLMs handle context and why managing memory matters for large documents, our article on context rot and why LLMs "forget" as their memory grows dives into that whole mess.

Next Steps

  • Parallel processing: Use concurrent.futures.ThreadPoolExecutor to transcribe multiple pages at once. In my experience, this cuts processing time by 60–70%. Game changer for larger documents. There's a sketch of this right after the list.

  • Payload optimization: Downscale images that are above your byte-size threshold. Keeps you under API limits and saves money. See the second sketch below.

  • Observability: Add structured logging—error types, status codes, per-page cost estimates. Trust me, when something breaks at 2 AM, you'll be glad you have proper monitoring.

  • Deployment: Wrap this whole thing in a FastAPI endpoint or throw together a Streamlit app. Makes it way easier for your team to use without having to touch the code. A bare-bones FastAPI sketch closes out the post.
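
Here's a minimal sketch of the parallel-processing idea, assuming the page images and text files from Steps 1 and 2 already exist. It swaps the sequential loop in process_pdf for a ThreadPoolExecutor while keeping pages in order; transcribe_pages_parallel is my own helper name, and max_workers is just a starting point you'll want to tune against your rate limits.

from concurrent.futures import ThreadPoolExecutor

def transcribe_pages_parallel(image_files, text_files, max_workers: int = 4):
    """Transcribe pages concurrently while preserving page order."""
    def worker(pair):
        img, txt = pair
        try:
            return transcribe_page_with_gpt4o(
                client, img, txt,
                model=CONFIG["model"], temperature=CONFIG["temperature"],
            )
        except Exception as e:
            logging.error(f"Transcription failed for {img.name}: {e}")
            return "[Transcription failed for this page.]"

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order, so pages stay in sequence
        return list(executor.map(worker, zip(image_files, text_files)))

# Drop-in replacement for the sequential loop inside process_pdf:
# page_markdowns = [
#     f"---\n\n## Page {i}\n\n{md}\n"
#     for i, md in enumerate(transcribe_pages_parallel(image_files, text_files), start=1)
# ]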
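
And a sketch of the payload idea: downscale any rendered page whose PNG is larger than a byte threshold before encoding it. The 4 MB threshold, the 0.75 scale factor, and the 600-pixel floor are illustrative assumptions, not documented API limits.

MAX_IMAGE_BYTES = 4 * 1024 * 1024  # illustrative threshold, not an official limit

def shrink_if_oversized(image_path: Path, max_bytes: int = MAX_IMAGE_BYTES, scale: float = 0.75) -> Path:
    """Repeatedly downscale a PNG in place until it fits under max_bytes or hits a size floor."""
    while image_path.stat().st_size > max_bytes:
        with Image.open(image_path) as img:
            new_size = (int(img.width * scale), int(img.height * scale))
            if min(new_size) < 600:  # stop before the page becomes unreadable
                break
            resized = img.resize(new_size, Image.LANCZOS)
        resized.save(image_path, format="PNG", optimize=True)
    return image_path

# Call shrink_if_oversized(image_path) just before encode_image_to_data_url in transcribe_page_with_gpt4o.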
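
Finally, a bare-bones sketch of the deployment idea with FastAPI. The endpoint name and the temp-file handling are my own choices, and you'd want authentication, size limits, and background jobs before handing this to a team.

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import PlainTextResponse
import tempfile

app = FastAPI()

@app.post("/transcribe", response_class=PlainTextResponse)
async def transcribe(pdf: UploadFile = File(...)) -> str:
    # Persist the upload to a temp file so PyMuPDF can open it by path
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(await pdf.read())
        tmp_path = tmp.name
    transcript_path = process_pdf(tmp_path, dpi=CONFIG["dpi"])
    return transcript_path.read_text(encoding="utf-8")

# Run locally with: uvicorn app:app --reload  (assuming this file is saved as app.py)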