Using Moondream with Transformers

This guide shows you how to run Moondream directly with Hugging Face Transformers, giving you maximum control over model execution and parameters.

Prerequisites

First, you'll need to install the core dependencies:

pip install transformers torch pillow einops
System Requirements
  • RAM: 8GB+ (16GB recommended)
  • Storage: 5GB for model weights
  • GPU: Recommended but not required (4GB+ VRAM)
  • Python: 3.8 or higher

Platform-Specific Setup

# Install pyvips for faster image processing
pip install pyvips-binary pyvips

Basic Usage

Here's a simple example demonstrating the core Moondream capabilities:

from transformers import AutoModelForCausalLM
from PIL import Image
 
# Load the model
 
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    # Uncomment for GPU acceleration (requires `pip install accelerate`):
    # device_map={"": "cuda"},
)
 
# Load your image
 
image = Image.open("path/to/your/image.jpg")
 
# 1. Image Captioning
 
print("Short caption:")
print(model.caption(image, length="short")["caption"])
 
print("Detailed caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    print(t, end="", flush=True)
print()  # newline after the streamed caption
 
# 2. Visual Question Answering
 
print("Asking questions about the image:")
print(model.query(image, "How many people are in the image?")["answer"])
 
# 3. Object Detection
 
print("Detecting objects:")
objects = model.detect(image, "face")["objects"]
print(f"Found {len(objects)} face(s)")
 
# 4. Visual Pointing
 
print("Locating objects:")
points = model.point(image, "person")["points"]
print(f"Found {len(points)} person(s)")

Advanced Features

GPU Acceleration

To enable GPU acceleration:

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},  # Use "cuda" for NVIDIA GPUs
)
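
If the same script needs to run on machines with and without a GPU, you can select the device at runtime instead of hard-coding "cuda". A minimal sketch (the "mps" branch for Apple Silicon is an assumption; verify it against your model revision):

import torch
from transformers import AutoModelForCausalLM

# Choose the best available device; fall back to CPU when no GPU is present.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"  # Apple Silicon (assumption: supported by your revision)
else:
    device = "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": device},
)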

Multiple Model Instances

If you have enough VRAM (4-5GB per instance), you can run multiple instances of the model for parallel processing:

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)
 
model2 = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "cuda"},
)
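
For example, you can dispatch requests to the two instances from separate threads so they run concurrently. A minimal sketch (the image paths are placeholders):

from concurrent.futures import ThreadPoolExecutor
from PIL import Image

image1 = Image.open("path/to/image1.jpg")
image2 = Image.open("path/to/image2.jpg")

# Submit one caption request to each instance; PyTorch releases the GIL
# during GPU work, so two threads are enough for simple parallelism.
with ThreadPoolExecutor(max_workers=2) as pool:
    future1 = pool.submit(model.caption, image1, length="short")
    future2 = pool.submit(model2.caption, image2, length="short")
    print(future1.result()["caption"])
    print(future2.result()["caption"])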

Efficient Image Encoding

For multiple operations on the same image, encode it once to save processing time:

image = Image.open("path/to/your/image.jpg")
encoded_image = model.encode_image(image)
 
# Reuse the encoded image for each inference
 
print(model.caption(encoded_image, length="short")["caption"])
print(model.query(encoded_image, "How many people are in the image?")["answer"])

API Reference

Captioning

model.caption(image, length="normal", stream=False)
Parameter   Type                         Description
image       PIL.Image or encoded image   The image to process
length      str                          Caption detail level: "short" or "normal"
stream      bool                         Whether to stream the response token by token

Visual Question Answering

model.query(image, question, stream=False)
Parameter   Type                         Description
image       PIL.Image or encoded image   The image to process
question    str                          The question to ask about the image
stream      bool                         Whether to stream the response token by token

Object Detection

model.detect(image, object_name)
Parameter     Type                         Description
image         PIL.Image or encoded image   The image to process
object_name   str                          The type of object to detect
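
The returned "objects" list contains one entry per match. In recent Moondream releases each entry is a dictionary of normalized bounding-box coordinates; treat the exact keys below as an assumption and check them against your model revision (this sketch reuses the `image` loaded earlier):

# Each detection is assumed to carry x_min/y_min/x_max/y_max in the 0-1 range.
objects = model.detect(image, "face")["objects"]
for obj in objects:
    print(
        f"Face from ({obj['x_min']:.2f}, {obj['y_min']:.2f}) "
        f"to ({obj['x_max']:.2f}, {obj['y_max']:.2f})"
    )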

Visual Pointing

model.point(image, object_name)
Parameter     Type                         Description
image         PIL.Image or encoded image   The image to process
object_name   str                          The type of object to locate
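
The returned "points" list gives one location per match, typically as normalized coordinates. A short sketch (the "x"/"y" keys are an assumption; verify against your revision, and note it reuses the PIL `image` loaded earlier):

# Points are assumed to be normalized to the 0-1 range; scale by the image
# size to get pixel coordinates.
points = model.point(image, "person")["points"]
width, height = image.size
for p in points:
    print(f"Person near pixel ({p['x'] * width:.0f}, {p['y'] * height:.0f})")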

Performance Optimization

Best Practices
  • Use GPU acceleration when possible
  • Reuse encoded images for multiple operations
  • For batch processing, pre-load the model once
  • Process images in batches rather than loading/unloading the model repeatedly
  • Resize very large images to reasonable dimensions before processing (see the sketch below)
  • Use quantization for deployment on memory-constrained devices
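
As an example of the resizing tip, you can cap the longest side of an image before passing it to the model. A minimal sketch (the 1024-pixel limit is an arbitrary choice, not a Moondream requirement):

from PIL import Image

def load_resized(path, max_side=1024):
    # Downscale the image so its longest side is at most max_side pixels,
    # preserving the aspect ratio; smaller images are returned unchanged.
    image = Image.open(path)
    if max(image.size) > max_side:
        scale = max_side / max(image.size)
        image = image.resize(
            (int(image.width * scale), int(image.height * scale)),
            Image.LANCZOS,
        )
    return image

image = load_resized("path/to/your/image.jpg")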

Troubleshooting

Common Issues
  • Out of Memory: Reduce the image size or use a lighter model variant
  • Slow Performance: Enable GPU acceleration and reuse encoded images
  • Library Errors: Ensure all dependencies are installed correctly
  • Unexpected Results: Check image formatting and question clarity

Next Steps

Now that you understand how to use Moondream with Transformers, you might want to:

  • Try advanced prompting techniques
  • Integrate Moondream into your own applications
  • Create custom pipelines for specialized tasks
  • Explore our recipes for common use cases