
so fast! Recognize video speech into text in just a few minutes with less than 10 lines of code

WBOY
2024-02-27 13:55:02


Hello everyone, I am Kite

Two years ago, converting audio and video files into text was hard to do well. Now it can be solved in just a few minutes.

It is said that, in order to obtain training data, some companies have crawled videos at scale from short-video platforms such as Douyin and Kuaishou, extracted the audio from those videos, and converted it into text for use as training corpus for large models.

If you need to convert video or audio files to text, you can try the open source solution introduced today. For example, you can search for the exact time points at which specific lines of dialogue appear in a film or TV program.
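As a sketch of that kind of time-point search: once a transcript exists as a list of (start, end, text) segments, finding where a line of dialogue appears is a one-liner. The `find_dialogue` function and the sample segments below are hypothetical, for illustration only.

```python
# Hypothetical example: find when a line of dialogue appears,
# given transcript segments as (start_seconds, end_seconds, text) tuples.
def find_dialogue(segments, keyword):
    """Return every (start, end, text) segment whose text contains keyword."""
    return [(start, end, text)
            for start, end, text in segments
            if keyword in text]

segments = [
    (0.00, 2.50, "Welcome back to the show."),
    (2.50, 6.10, "Success is never final, failure is never fatal."),
    (6.10, 8.00, "Thanks for watching."),
]

for start, end, text in find_dialogue(segments, "failure"):
    print("[%.2fs -> %.2fs] %s" % (start, end, text))
```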

Without further ado, let’s get to the point.

Whisper

This solution is OpenAI's open source Whisper, which is, of course, written in Python. You only need to install a few packages, write a few lines of code, and wait a while (depending on your machine's performance and the length of the audio or video), and the final text comes out. It's that simple.

GitHub repository: https://github.com/openai/whisper

Fast-Whisper

Although Whisper is already quite simple, it is still not streamlined enough for programmers, who tend to prefer simplicity and efficiency. Whisper is relatively easy to install and call, but you still need to install PyTorch, ffmpeg, and even Rust separately.

Hence Fast-Whisper, which is faster and more concise than Whisper. Fast-Whisper is not just a simple wrapper around Whisper; it is a reimplementation of OpenAI's Whisper model using CTranslate2, an efficient inference engine for Transformer models.

To summarize: it is faster than Whisper. The official claim is 4 to 8 times faster. It supports not only GPU but also CPU; even my beat-up Mac can run it.

GitHub repository: https://github.com/SYSTRAN/faster-whisper

It only takes two steps to use.

1. Install the dependency package:

```shell
pip install faster-whisper
```

2. Write the code:
```python
from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

Yes, it's that simple.

What can I do?

It happens that a friend of mine wants to make short videos posting "chicken soup" inspirational quotes, drawn from interviews with famous people. But he didn't want to watch each entire video; he just wanted the fastest way to get the text content and then read that instead, because reading text is much faster than watching a video, and text can be searched.

Let me just say: if you don't even have the patience to watch a complete video, how can you run an account well?

So I made one for him, using Fast-Whisper.

Client

The client uses Swift and only supports Mac.

  1. Select a video;
  2. Click "Extract Text"; the Python interface is called, and you wait a while;
  3. The parsed text is loaded, together with the start and end times at which each line appears;
  4. Select a start time and an end time;
  5. Click the "Export" button, and the video clip is exported.
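The article does not say how the export step is implemented; one plausible approach is to shell out to ffmpeg with stream copy. This is a hypothetical sketch that only builds the command list; `build_clip_command` and the file names are assumptions, not part of the original project.

```python
# Hypothetical sketch: clip a video between two timestamps by shelling
# out to ffmpeg. Building the command as a list avoids shell-quoting issues.
def build_clip_command(src, dst, start, end):
    """Build an ffmpeg command that copies src between start and end (seconds)."""
    return [
        "ffmpeg",
        "-i", src,
        "-ss", "%.2f" % start,          # clip start (seconds)
        "-t", "%.2f" % (end - start),   # clip duration
        "-c", "copy",                   # stream copy: fast, no re-encoding
        dst,
    ]

cmd = build_clip_command("interview.mp4", "clip.mp4", 12.5, 42.0)
print(" ".join(cmd))
```

Assuming ffmpeg is installed, the command could then be executed with `subprocess.run(cmd, check=True)`.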


Server

The server side is, of course, Python, wrapped with Flask to expose an HTTP interface.

```python
from flask import Flask, request, jsonify
from faster_whisper import WhisperModel

app = Flask(__name__)

model_size = "large-v2"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

@app.route('/transcribe', methods=['POST'])
def transcribe():
    # Get the file path from the request
    file_path = request.json.get('filePath')

    # Transcribe the file
    segments, info = model.transcribe(file_path, beam_size=5, initial_prompt="简体")

    segments_copy = []
    with open('segments.txt', 'w') as file:
        for segment in segments:
            line = "%.2fs|%.2fs|[%.2fs -> %.2fs]|%s" % (
                segment.start, segment.end, segment.start, segment.end, segment.text)
            segments_copy.append(line)
            file.write(line + '\n')

    # Prepare the response
    response_data = {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": []
    }
    for segment in segments_copy:
        response_data["segments"].append(segment)

    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=False)
```
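On the client side, each segment string returned by `/transcribe` has to be split back into fields. A minimal parser for that pipe-delimited format might look like this; `parse_segment` is a hypothetical helper, but the field layout follows the server code above.

```python
# Parse one segment line produced by the server, e.g.
# "1.00s|3.50s|[1.00s -> 3.50s]|some text"
def parse_segment(line):
    """Split a pipe-delimited segment line into (start, end, text)."""
    # maxsplit=3 keeps any "|" characters inside the transcript text intact
    start_s, end_s, _label, text = line.split("|", 3)
    return float(start_s.rstrip("s")), float(end_s.rstrip("s")), text

start, end, text = parse_segment("1.00s|3.50s|[1.00s -> 3.50s]|Hello there")
print(start, end, text)
```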


Statement:
This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to have it deleted.