Home >Backend Development >Python Tutorial >Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-M Model

Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-M Model

Susan Sarandon
Susan SarandonOriginal
2025-01-24 14:10:10251browse

This article showcases SmolVLM-500M-Instruct, a cutting-edge, compact vision-to-text model. Despite its relatively small size (500 million parameters), it demonstrates impressive capabilities.

Here's the Python code:

<code class="language-python">import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import warnings

warnings.filterwarnings("ignore", message="Some kwargs in processor config are unused")

def describe_image(image_path):
    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-500M-Instruct")

    image = Image.open(image_path)

    prompt = "Describe the image content in detail.  Provide a concise textual response."
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")

    with torch.no_grad():
        outputs = model.generate(
            pixel_values=inputs["pixel_values"],
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=150,
            do_sample=True,
            temperature=0.7
        )

    description = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return description.strip()

if __name__ == "__main__":
    image_path = "images/bender.jpg"

    try:
        description = describe_image(image_path)
        print("Image Description:", description)
    except Exception as e:
        print(f"Error: {e}")</code>

This script leverages the Hugging Face Transformers library to generate a textual description from an image. It loads a pre-trained model and processor, processes the image, and outputs a descriptive text. Error handling is included.

The code is available here: https://www.php.cn/link/042886829869470b75f63dddfd7e9d9d

Using the following non-stock image (placed in the project's image directory):

Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-M Model

The model generates a description (the prompt and parameters can be adjusted for finer control): A robot, seated on a couch, is engrossed in reading a book. Bookshelves and a door are visible in the background. A white chair with a cushion is also in the scene.

The model's speed and efficiency are noteworthy compared to larger language models.

The above is the detailed content of Unlock the Magic of Images: A Quick and Easy Guide to Using the Cutting-Edge SmolVLM-M Model. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn