Phi-4-Multimodal: A Guide With Demo Project
This tutorial demonstrates building a multimodal language tutor using Microsoft's lightweight Phi-4-multimodal model. This AI-powered application leverages text, image, and audio processing for a comprehensive language learning experience.
Phi-4-Multimodal Overview:
Phi-4-multimodal is a lightweight model that processes text, images, and speech within a single architecture. Its 128K-token context length makes it well suited to real-time applications such as this tutor.
Step-by-Step Implementation:
1. Prerequisites:
Install necessary Python libraries:
pip install gradio transformers torch soundfile pillow flash-attn --no-build-isolation
Note: FlashAttention2 is recommended for optimal performance. On older GPUs that lack FlashAttention2 support, consider setting _attn_implementation="eager" during model initialization.
Import required libraries:
import io
import os

import gradio as gr
import requests
import soundfile as sf
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
2. Loading Phi-4-Multimodal:
Load the model and processor from Hugging Face:
model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
).cuda()

generation_config = GenerationConfig.from_pretrained(model_path)
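If FlashAttention2 is unavailable on your GPU, the same call should work with the eager attention implementation, per the note in the prerequisites:

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="eager",  # fallback for GPUs without FlashAttention2
)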
3. Core Functionalities:
- clean_response(response, instruction_keywords): Removes prompt text from the model's output.
- process_input(file, input_type, question): Handles text, image, and audio inputs, generating responses with the Phi-4-multimodal model. This function manages input processing, model inference, and response cleaning for each modality.
- process_text_translate(text, target_language) and process_text_grammar(text): Wrappers for translation and grammar correction, respectively, built on process_input. A sketch of how these helpers might fit together follows this list.
4. Gradio Interface:
A Gradio interface provides a user-friendly way to interact with the model. The interface is structured with tabs for text, image, and audio processing, each with appropriate input fields (text boxes, image upload, audio upload) and output displays. Buttons trigger the relevant processing functions.
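As an illustration, a minimal version of such a tabbed layout might look like this; the component labels and default prompts are placeholders, and the process_* helpers are the ones sketched above:

with gr.Blocks(title="Phi-4 Language Tutor") as demo:
    with gr.Tab("Text"):
        text_in = gr.Textbox(label="Text")
        lang_in = gr.Textbox(label="Target language", value="French")
        text_out = gr.Textbox(label="Result")
        gr.Button("Translate").click(process_text_translate, [text_in, lang_in], text_out)
        gr.Button("Correct grammar").click(process_text_grammar, [text_in], text_out)
    with gr.Tab("Image"):
        img_in = gr.Image(type="filepath", label="Upload image")
        img_q = gr.Textbox(label="Question", value="Extract the text in this image.")
        img_out = gr.Textbox(label="Result")
        gr.Button("Process image").click(lambda f, q: process_input(f, "Image", q), [img_in, img_q], img_out)
    with gr.Tab("Audio"):
        aud_in = gr.Audio(type="filepath", label="Upload audio")
        aud_q = gr.Textbox(label="Question", value="Transcribe this audio.")
        aud_out = gr.Textbox(label="Result")
        gr.Button("Process audio").click(lambda f, q: process_input(f, "Audio", q), [aud_in, aud_q], aud_out)

demo.launch()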
5. Testing and Results:
The tutorial includes example outputs demonstrating the model's capabilities in translation, grammar correction, image text extraction, and audio transcription/translation. These examples showcase the functionality of each module within the application.
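For instance, the text module can be exercised directly from Python before launching the interface (the inputs here are illustrative, not the tutorial's own examples):

print(process_text_translate("Where is the train station?", "German"))
print(process_text_grammar("She go to school yesterday."))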
Conclusion:
This tutorial provides a practical guide to building a robust multimodal language tutor using Phi-4-multimodal. The application's versatility and real-time capabilities highlight the potential of multimodal AI in enhancing language learning.