Running SmolDocling Serverless Inference

Jun 07, 2025

Vision-language models are quickly replacing traditional OCR systems, offering a deeper understanding of document structure and content. SmolDocling is one such model: a compact vision-language model designed for high-accuracy structural extraction from documents. However, running inference for these models still requires considerable compute, often tied to expensive GPU servers.

Serverless infrastructure provides a more practical solution. This post walks through a complete example that uses Modal to download the SmolDocling model, store its weights in a persistent volume, and run inference in a fully serverless environment. This approach reduces costs, improves scalability, and removes the need to manage dedicated hardware.

Why serverless inference?

  • Document parsing is a transient task; maintaining always-on servers is inefficient.
  • Scalability is provided through automatic allocation of GPU-enabled containers when needed.
  • Costs are reduced by billing only for active inference runs.
  • Infrastructure complexity is minimized since Modal handles container orchestration and resource allocation.

Prerequisites

The guide assumes the following are available:

  • A Modal account, with the Modal CLI installed and authenticated
  • An S3-compatible object storage bucket (for example, Cloudflare R2) and its access credentials
  • Python installed locally, with access to Hugging Face for downloading the model weights
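If the Modal CLI is not installed yet, local setup is a one-time step (a minimal sketch, assuming pip is available; modal setup opens a browser to authenticate the CLI against your Modal account):

pip install modal
modal setup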

High-Level Architecture


A user uploads a PDF through an API, which stores the file in an S3-compatible bucket. The API then passes the object key to a serverless GPU function running on Modal. The function processes the document with SmolDocling and writes the extracted output back to the same bucket.
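As a rough sketch of the API side of this flow, a deployed Modal app can be invoked by name from any Python backend. The snippet below is illustrative only: it assumes the app and function names defined in Step 2, that the app has been deployed with modal deploy, and that the PDF has already been uploaded to the bucket. trigger_parse is a hypothetical helper, not part of the code that follows.

import modal

def trigger_parse(s3_key: str) -> str:
    """Start an asynchronous SmolDocling parse for an already-uploaded PDF key."""
    # App and function names match those defined in Step 2 below.
    fn = modal.Function.from_name("smoldocling-inference", "run_inference")
    call = fn.spawn(s3_key)   # non-blocking; returns a FunctionCall handle
    return call.object_id     # keep this ID to check on the job later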

Step 1: Download and save the model in a persistent volume

The first step is to fetch the SmolDocling model from Hugging Face and store it in a persistent volume on Modal. This ensures that the model weights are downloaded only once and can be reused during inference runs.

The code below defines a small setup application to perform this task:

import os, shutil
import modal
 
from huggingface_hub import snapshot_download
 
MODEL_ID     = "ds4sd/SmolDocling-256M-preview"
VOLUME_NAME  = "smoldocling-weights"
TARGET_DIR   = "/models"
 
model_volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=True)
 
app = modal.App("smoldocling-setup")
 
image = modal.Image.debian_slim().pip_install("huggingface_hub", "transformers")
 
@app.function(
    image=image,
    volumes={TARGET_DIR: model_volume},
    timeout=600,
)
def download_model():
    """Fetches model from HF and copies into the persistent volume."""
    print(f"Downloading model: {MODEL_ID}")
    local_dir = snapshot_download(MODEL_ID)
 
    dst = os.path.join(TARGET_DIR, MODEL_ID.split("/")[-1])
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(local_dir, dst)
    model_volume.commit()  # make sure the copied files are persisted to the volume
    print("Model copied to volume.")

To trigger the model download locally:

@app.local_entrypoint()
def main():
    download_model.remote()

Run the script using the Modal CLI:

modal run download_model.py

Once executed, the model weights will be available in the specified volume and ready for use during inference.
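To confirm the weights were written to the volume, its contents can be listed with the Modal CLI (assuming the volume name used above):

modal volume ls smoldocling-weights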

Step 2: Run inference in a serverless GPU container

With the model saved in a persistent volume, the next step is to set up a serverless GPU function that runs inference on uploaded PDFs. Modal handles container creation, GPU provisioning, and automatic cleanup.

This section configures a Modal function that:

  • Loads the model from the persistent volume
  • Accepts input PDFs from an S3-compatible bucket
  • Processes each page with ONNX Runtime and SmolDocling
  • Saves the extracted structure as JSON and images back to the bucket

Setup and configuration:

import logging
import time
import modal
 
# volume and model config
VOLUME_NAME = "smoldocling-weights"
MODEL_DIR = "/models/SmolDocling-256M-preview"
 
# S3-compatible bucket (e.g., Cloudflare R2)
S3_BUCKET_NAME = "<R2_BUCKET_NAME>"
S3_URL = "<R2_URL>"
 
# modal setup
app = modal.App("smoldocling-inference")
volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=False)
 
# base image with GPU drivers and dependencies
cuda_version = "12.9.0"
flavor = "cudnn-runtime"
operating_sys = "ubuntu24.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"
 
image = (
    modal.Image
    .from_registry(f"nvidia/cuda:{tag}", add_python="3.12")
    .pip_install("onnxruntime-gpu", "transformers", "jinja2", "docling-core", "pymupdf", "pillow")
)
 
# mount cloud storage
s3_volume = modal.CloudBucketMount(
    bucket_name=S3_BUCKET_NAME,
    bucket_endpoint_url=S3_URL,
    secret=modal.Secret.from_name("cloudflare-r2"),
)
 
# logging setup
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)
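The cloudflare-r2 secret referenced by the CloudBucketMount must already exist in the Modal workspace. For R2, Modal expects S3-style credentials, and the bucket endpoint URL is typically of the form https://<ACCOUNT_ID>.r2.cloudflarestorage.com. A sketch of creating the secret from the CLI, with placeholders standing in for your own R2 access key pair:

modal secret create cloudflare-r2 \
    AWS_ACCESS_KEY_ID=<R2_ACCESS_KEY_ID> \
    AWS_SECRET_ACCESS_KEY=<R2_SECRET_ACCESS_KEY>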

Next, define the Modal function that runs inference. It processes each page of the PDF and extracts structured content in Markdown format, along with page images and the raw DocTags, all written back to the bucket.

The inference logic is adapted from the official SmolDocling repository on Hugging Face and modified to support this serverless deployment.

@app.function(
    image=image,
    volumes={"/models": volume, "/s3": s3_volume},
    gpu="T4",
    timeout=600,
)
def run_inference(s3_key: str):
    import json, os
    from pathlib import Path
    import numpy as np, onnxruntime, fitz
    from PIL import Image
    from transformers import AutoConfig, AutoProcessor
    from docling_core.types.doc import DoclingDocument
    from docling_core.types.doc.document import DocTagsDocument
 
    start_time = time.time()
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["ORT_CUDA_USE_MAX_WORKSPACE"] = "1"
    os.environ["CUDA_MODULE_LOADING"] = "LAZY"
    os.environ["ORT_DISABLE_MEM_PATTERN"] = "1"
 
    model_path = MODEL_DIR
    config = AutoConfig.from_pretrained(model_path)
    processor = AutoProcessor.from_pretrained(model_path)
 
    # load ONNX model sessions
    sess_options = onnxruntime.SessionOptions()
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
 
    vision_session = onnxruntime.InferenceSession(
        f"{model_path}/onnx/vision_encoder.onnx",
        sess_options,
        providers=["CUDAExecutionProvider"],
    )
    embed_session = onnxruntime.InferenceSession(
        f"{model_path}/onnx/embed_tokens.onnx", providers=["CUDAExecutionProvider"]
    )
    decoder_session = onnxruntime.InferenceSession(
        f"{model_path}/onnx/decoder_model_merged.onnx", providers=["CUDAExecutionProvider"]
    )
 
    # load config values
    num_key_value_heads = config.text_config.num_key_value_heads
    head_dim = config.text_config.head_dim
    num_hidden_layers = config.text_config.num_hidden_layers
    eos_token_id = config.text_config.eos_token_id
    image_token_id = config.image_token_id
    end_of_utterance_id = processor.tokenizer.convert_tokens_to_ids("<end_of_utterance>")
 
    # read PDF from storage
    input_path = f"/s3/{s3_key}"
    filename = Path(s3_key).stem
    with open(input_path, "rb") as f:
        pdf_bytes = f.read()
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
 
    logger.info("Processing %d pages", len(doc))
    pages_output = []
    image_output_dir = Path(f"/s3/output/{filename}/images")
    image_output_dir.mkdir(parents=True, exist_ok=True)
 
    # loop through PDF pages
    for i, page in enumerate(doc):
        page_start_time = time.time()
 
        # convert page to image
        pix = page.get_pixmap(dpi=200)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        image_path = image_output_dir / f"page_{i+1:03}.png"
        img.save(image_path)
 
        # prepare prompt
        messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Convert this page to docling."}]}]
        prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(text=prompt, images=[img], return_tensors="np")
 
        # initialize input variables
        batch_size = inputs["input_ids"].shape[0]
        # start with an empty KV cache; the decoder returns an updated cache each step
        past_key_values = {
            f"past_key_values.{layer}.{kv}": np.zeros(
                [batch_size, num_key_value_heads, 0, head_dim], dtype=np.float32
            )
            for layer in range(num_hidden_layers)
            for kv in ("key", "value")
        }
        image_features = None
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]
        position_ids = np.cumsum(attention_mask, axis=-1)
        generated_tokens = np.array([[]], dtype=np.int64)
 
        # inference loop
        for _ in range(8192):
            inputs_embeds = embed_session.run(None, {"input_ids": input_ids})[0]
 
            if image_features is None:
                image_features = vision_session.run(
                    ["image_features"],
                    {
                        "pixel_values": inputs["pixel_values"],
                        "pixel_attention_mask": inputs["pixel_attention_mask"].astype(np.bool_),
                    },
                )[0]
                inputs_embeds[inputs["input_ids"] == image_token_id] = (
                    image_features.reshape(-1, image_features.shape[-1])
                )
 
            logits, *present_key_values = decoder_session.run(
                None,
                dict(
                    inputs_embeds=inputs_embeds,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    **past_key_values,
                ),
            )
 
            # greedy decoding: take the argmax as the next token and feed only it
            # back in; the cached keys/values carry the earlier context
            input_ids = logits[:, -1].argmax(-1, keepdims=True)
            attention_mask = np.ones_like(input_ids)
            position_ids = position_ids[:, -1:] + 1
            for j, key in enumerate(past_key_values):
                past_key_values[key] = present_key_values[j]
 
            generated_tokens = np.concatenate([generated_tokens, input_ids], axis=-1)
            if (input_ids == eos_token_id).all() or (input_ids == end_of_utterance_id).all():
                break
 
        # post-process output
        doctags = processor.batch_decode(generated_tokens, skip_special_tokens=False)[0].lstrip()
        doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [img])
        structured_doc = DoclingDocument.load_from_doctags(doctags_doc, document_name=f"Page {i+1}")
 
        pages_output.append({
            "page_no": i + 1,
            "doctags": doctags,
            "markdown": structured_doc.export_to_markdown(),
            "image_path": image_path.as_posix(),
        })
 
        logger.info("Processed page %d in %.2f seconds", i + 1, time.time() - page_start_time)
 
    # save results to S3
    result = {"pages": pages_output}
    result_path = Path(f"/s3/output/{filename}/results.json")
    result_path.write_text(json.dumps(result, indent=2))
    logger.info("Saved output to %s", result_path)
    logger.info("Job completed in %.2f seconds", time.time() - start_time)

To trigger inference manually from a local entrypoint:

@app.local_entrypoint()
def main():
    run_inference.remote("<R2_KEY>")  # replace with the uploaded PDF key

Run the inference using the Modal CLI:

modal run smoldocling_inference.py
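Note that modal run executes the app ephemerally. For the API-driven flow described in the architecture section, the app would instead be deployed so its functions remain invokable by name from other services:

modal deploy smoldocling_inference.py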

While the inference job is running, logs can be monitored in real time through the Modal CLI or the Modal dashboard. Once it completes, the processed results appear in the configured S3 bucket under the output/ prefix.
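For reference, the bucket layout produced by the function above looks like this, where <filename> is the stem of the uploaded PDF key and the page count depends on the document:

output/<filename>/
├── results.json        # one entry per page: page_no, doctags, markdown, image_path
└── images/
    ├── page_001.png
    ├── page_002.png
    └── ...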


This setup demonstrates how SmolDocling can be deployed in a fully serverless environment using Modal. By separating model download and inference steps, and leveraging persistent volumes and GPU-backed containers, document parsing becomes efficient, scalable, and cost-effective. The same pattern can be adapted for other vision-language models with minimal changes.