Practical Approaches to Key Information Extraction (Part 1)

Patricia Arquette

Oct 04, 2024 pm 04:11 PM

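Hi, I'm Mrzaizai2k!

In this series, I'd like to share my approach to solving the key information extraction (KIE) problem for invoices. We'll look at how to leverage large language models (LLMs) such as ChatGPT and Qwen2 for information extraction. Then we'll look at how to use OCR models like PaddleOCR, zero-shot classification models, or Llama 3.1 to post-process the results.

Damn, this is exciting!

To take things a step further, we'll handle invoices in any format and any language. Yes, that's right. This is for real!

Analyze the Requirements

Let's say we need to build a service that extracts all relevant information from any type of invoice, in any language. Something similar to what you'd find on this sample website.

Here's the sample invoice image we'll be working with: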

[Image: sample invoice]

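Key Considerations

First, let's analyze the requirements in detail. This will help us decide on the right tech stack for our system. Certain technologies may work well but may not be ideal for every scenario. Here's what we need to prioritize, from top to bottom:

Launch the system quickly
Ensure accuracy
Work with limited compute resources
- (e.g., an RTX 3060 GPU with 12 GB VRAM, or even CPU only)
Keep processing time reasonable
- up to 1 minute per invoice on CPU, up to 10 seconds on GPU
Focus on extracting only useful and important details

Given these requirements, we won't do any fine-tuning. Instead, we'll combine and stack existing technologies to get fast, accurate results regardless of format or language.

As a benchmark, the sample website processes an invoice in about 3 to 4 seconds. So aiming for 10 seconds in our system is quite achievable.

The output format should match the one used on the sample website: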

[Image: output format on the sample website]

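ChatGPT

Alright, let's talk about our first tool: ChatGPT. You probably already know how easy it is to use. So why should you read this blog? Well, what if I told you I can help you optimize token usage and speed up processing? Not intrigued yet? Hang on, I'll explain how.

Basic Approach

Here's a basic code snippet. (Note: the code may not be perfect; it's about the idea rather than the exact implementation.) You can check out the full code in my repository, Multilanguage invoice ocr.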


import json
import os
from typing import Union

import numpy as np
from openai import OpenAI

# BaseExtractor (assumed to provide config loading and encode_image) and
# retry_on_failure live in the repository and are not shown here.


class OpenAIExtractor(BaseExtractor):
    def __init__(self, config_path: str = "config/config.yaml"):
        super().__init__(config_path)

        # Narrow the loaded config down to the OpenAI section
        self.config = self.config['openai']
        self.model = self.config['model_name']
        self.temperature = self.config['temperature']
        self.max_tokens = self.config['max_tokens']

        # Read the API key from the environment
        self.OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
        self.client = OpenAI(api_key=self.OPENAI_API_KEY)

    def _extract_invoice_llm(self, ocr_text, base64_image: str, invoice_template: str):
        # Send the OCR text, the template, and the image in a single chat request
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": """You are a helpful assistant that responds in JSON format with the invoice information in English. 
                                            Don't add any annotations there. Remember to close any bracket. Number, price and amount should be number, date should be convert to dd/mm/yyyy, 
                                            time should be convert to HH:mm:ss, currency should be 3 characters like VND, USD, EUR"""},
                {"role": "user", "content": [
                    {"type": "text", "text": f"From the image of the bill and the text from OCR, extract the information. The ocr text is: {ocr_text} \n. Return the key names as in the template is a MUST. The invoice template: \n {invoice_template}"},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
                ]}
            ],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content

    def extract_json(self, text: str) -> dict:
        # Keep only the outermost {...} in case the model adds text around the JSON
        start_index = text.find('{')
        end_index = text.rfind('}') + 1
        json_string = text[start_index:end_index]
        # Use json.loads instead of eval: eval would execute whatever the model returns,
        # and rewriting true/false/null by string replacement could corrupt field values
        return json.loads(json_string)

    @retry_on_failure(max_retries=3, delay=1.0)
    def extract_invoice(self, ocr_text, image: Union[str, np.ndarray], invoice_template: str) -> dict:
        # Encode the image, query the model, then parse the JSON in its reply
        base64_image = self.encode_image(image)
        invoice_info = self._extract_invoice_llm(ocr_text, base64_image,
                                                 invoice_template=invoice_template)
        invoice_info = self.extract_json(invoice_info)
        return invoice_info
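
The snippet relies on a retry_on_failure decorator that is not shown. The version below is only a guess at what such a decorator might look like, not the repository's implementation: it reruns the wrapped call when it raises (for example when the model returns malformed JSON) and re-raises the last error once the attempts are used up.

```python
import functools
import time


def retry_on_failure(max_retries: int = 3, delay: float = 1.0):
    """Hypothetical sketch: call the function up to max_retries times,
    sleeping `delay` seconds between failed attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as error:  # e.g. the model returned malformed JSON
                    last_error = error
                    if attempt < max_retries:
                        time.sleep(delay)
            # Every attempt failed: surface the last error to the caller
            raise last_error
        return wrapper
    return decorator
```

Retrying the whole extract_invoice call means a bad response costs one more API round trip, so a small max_retries keeps the worst-case latency bounded.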

Okay, let's see the result.



invoice {
    "invoice_info": {
        "amount": 32.0,
        "amount_change": 0,
        "amount_shipping": 0,
        "vatamount": 0,
        "amountexvat": 32.0,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "vatitems": [
            {
                "amount": 32.0,
                "amount_excl_vat": 32.0,
                "amount_incl_vat": 32.0,
                "amount_incl_excl_vat_estimated": false,
                "percentage": 0,
                "code": ""
            }
        ],
        "vat_context": "",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "Lunettes",
                        "description": "",
                        "amount": 22.0,
                        "amount_each": 22.0,
                        "amount_ex_vat": 22.0,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "Chapeau",
                        "description": "",
                        "amount": 10.0,
                        "amount_each": 10.0,
                        "amount_ex_vat": 10.0,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "paymentmethod": "CB EMV",
        "payment_auth_code": "",
        "payment_card_number": "",
        "payment_card_account_number": "",
        "payment_card_bank": "",
        "payment_card_issuer": "",
        "payment_due_date": "",
        "terminal_number": "",
        "document_subject": "",
        "package_number": "",
        "invoice_number": "",
        "receipt_number": "000130",
        "shop_number": "",
        "transaction_number": "000148",
        "transaction_reference": "",
        "order_number": "",
        "table_number": "",
        "table_group": "",
        "merchant_name": "G\u00e9ant Casino",
        "merchant_id": "",
        "merchant_coc_number": "",
        "merchant_vat_number": "",
        "merchant_bank_account_number": "",
        "merchant_bank_account_number_bic": "",
        "merchant_chain_liability_bank_account_number": "",
        "merchant_chain_liability_amount": 0,
        "merchant_bank_domestic_account_number": "",
        "merchant_bank_domestic_bank_code": "",
        "merchant_website": "",
        "merchant_email": "",
        "merchant_address": "Annecy",
        "merchant_phone": "04.50.88.20.00",
        "customer_name": "",
        "customer_address": "",
        "customer_phone": "",
        "customer_website": "",
        "customer_vat_number": "",
        "customer_coc_number": "",
        "customer_bank_account_number": "",
        "customer_bank_account_number_bic": "",
        "customer_email": "",
        "document_language": ""
    }
}
Test_Openai_Invoice Took 0:00:11.15

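The result is pretty solid, but the processing time is a problem: at 11.15 seconds it exceeds our 10-second limit. The output also contains a lot of empty fields, which not only increases processing time but can also introduce errors and consume more tokens, which essentially costs more money.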
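
To check each run against the time budget from the requirements (10 seconds on GPU, 1 minute on CPU), a small timing helper is enough. This is a minimal sketch; timed and within_budget are hypothetical names, not helpers from the repository.

```python
import functools
import time

GPU_BUDGET_SECONDS = 10.0  # from the requirements: up to 10 seconds on GPU
CPU_BUDGET_SECONDS = 60.0  # from the requirements: up to 1 minute on CPU


def timed(func):
    """Record how long each call takes in wrapper.last_elapsed (seconds)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        print(f"{func.__name__} took {wrapper.last_elapsed:.2f}s")
        return result
    wrapper.last_elapsed = None
    return wrapper


def within_budget(elapsed_seconds: float, on_gpu: bool = True) -> bool:
    """Compare a measured duration with the budget for the current hardware."""
    budget = GPU_BUDGET_SECONDS if on_gpu else CPU_BUDGET_SECONDS
    return elapsed_seconds <= budget
```

With the numbers from this post, within_budget(11.15) is False for the first run, which is exactly the problem described above.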

Advanced Approach

It turns out we only need a small tweak to fix this.

Just add the following sentence to your prompt:

"Only output fields that have values, and don't return any empty fields."

Voilà! Problem solved!


 invoice_info {
    "invoice_info": {
        "amount": 32,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "LUNETTES",
                        "description": "",
                        "amount": 22,
                        "amount_each": 22,
                        "amount_ex_vat": 22,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "CHAPEAU",
                        "description": "",
                        "amount": 10,
                        "amount_each": 10,
                        "amount_ex_vat": 10,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "invoice_number": "000130"
    }
}
Test_Openai_Invoice Took 0:00:05.79

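Wow, what a game changer! The result is now shorter and more precise, and processing time dropped from 11.15 seconds to 5.79 seconds. With that one-sentence tweak, we cut cost and processing time by about 50%. Pretty cool, right?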
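
Even with the extra prompt sentence, the output above still carries a few empty strings (for example "description": ""). As a safety net, you could also prune them after parsing. This is a minimal sketch and not part of the original code: prune_empty is a hypothetical helper, and it deliberately keeps 0 and False because those are real values.

```python
def _is_empty(value) -> bool:
    """Treat None, empty strings, empty lists, and empty dicts as empty."""
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def prune_empty(value):
    """Recursively drop empty fields from nested dicts and lists."""
    if isinstance(value, dict):
        pruned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in pruned.items() if not _is_empty(item)}
    if isinstance(value, list):
        pruned = [prune_empty(item) for item in value]
        return [item for item in pruned if not _is_empty(item)]
    return value
```

This only tidies the parsed result. The token and latency savings still come from the prompt change, because the model has to stop generating the empty fields in the first place.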

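In this case I'm using GPT-4o-mini, which works well, but in my experience Gemini Flash performs even better: it's faster, and it's free! Definitely worth checking out.

You can optimize further by shortening the template so that it focuses only on the fields that matter most for your specific requirements.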
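
Shortening the template can be as simple as filtering it by a set of field names. The sketch below assumes the template is a flat dictionary; the field names are taken from the output shown earlier, while FULL_TEMPLATE, IMPORTANT_FIELDS, and shorten_template are hypothetical and the choice of "important" fields is arbitrary.

```python
import json

# Illustrative excerpt of a template; the real one is much longer
FULL_TEMPLATE = {
    "amount": 0,
    "currency": "",
    "purchasedate": "",
    "purchasetime": "",
    "paymentmethod": "",
    "receipt_number": "",
    "merchant_name": "",
    "merchant_phone": "",
    "customer_name": "",
    "customer_email": "",
    "lines": [],
}

# Hypothetical selection: adjust to whatever your use case needs
IMPORTANT_FIELDS = {"amount", "currency", "purchasedate",
                    "merchant_name", "merchant_phone", "lines"}


def shorten_template(template: dict, keep: set) -> dict:
    """Keep only the whitelisted top-level fields, preserving their order."""
    return {key: value for key, value in template.items() if key in keep}


short_template = shorten_template(FULL_TEMPLATE, IMPORTANT_FIELDS)
# This string is what would be passed to the prompt as invoice_template
invoice_template = json.dumps({"invoice_info": short_template}, indent=2)
```

A shorter template reduces the input tokens as well as the output tokens, since the template is sent with every request.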

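PaddleOCR

The results look pretty good, but there are still a few missing fields we'd like to capture, such as the phone number or the cashier name. We could simply re-prompt ChatGPT, but relying solely on LLMs can be unpredictable: the results can vary from run to run. On top of that, the prompt template is quite long (since we're trying to extract all possible information for all formats), which can cause ChatGPT to "forget" certain details.

This is where PaddleOCR comes in. It enhances the LLM's vision capabilities by providing accurate OCR text, helping the model focus on exactly what needs to be extracted.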

In my previous prompt, I used this structure:


{"type": "text", "text": f"From the image of the bill and the text from OCR, extract the information. The ocr text is: {ocr_text} \n.

Previously, I set ocr_text = '', but now we'll populate it with the output from PaddleOCR. Since I'm unsure of the specific language for now, I'll use English (as it's the most commonly supported). In the next section, I'll guide you on detecting the language, so hang tight!

Here’s the updated code to integrate PaddleOCR:


import numpy as np
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang='en', show_log=False, use_angle_cls=True)
result = ocr.ocr(np.array(image), cls=True)  # cls=True runs the angle classifier

This is the OCR output.


    "Geant Casino ANNECY BIENVENUE DANS NOTRE MAGASIN Caisse014 Date28/06/2008 VOTRE MAGASIN VOUS ACCUEILLE DU LUNDI AU SAMEDI DE 8H30 A21H00 TEL.04.50.88.20.00 LUNETTES 22.00E CHAPEAU 10.00E =TOTAL2) 32.00E CB EMV 32.00E Si vous aviez la carte fidelite, vous auriez cumule 11SMILES Caissier:000148/Heure:17:46:26 Numero de ticket :000130 Rapidite,confort d'achat budget maitrise.. Scan' Express vous attend!! Merci de votre visite A bientot"

As you can see, the results are pretty good. In this case, the invoice is in French, which uses the same Latin script as English, so the output is decent (although accents are lost, e.g. "Geant" instead of "Géant"). However, if we were dealing with languages like Japanese or Chinese, the results wouldn't be as accurate.
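
The raw value returned by ocr.ocr is a nested list rather than a single string, so it has to be flattened before it goes into the prompt. The sketch below assumes the PaddleOCR 2.x result layout as I understand it: one entry per image, each entry a list of [box, (text, confidence)] items, or None when nothing is detected. ocr_result_to_text is a hypothetical helper and the sample data is made up.

```python
def ocr_result_to_text(result, min_confidence: float = 0.0) -> str:
    """Join the recognized text lines into one space-separated string."""
    texts = []
    for page in result or []:
        for line in page or []:  # a page can be None when no text was found
            _box, (text, confidence) = line
            if confidence >= min_confidence:
                texts.append(text)
    return " ".join(texts)


# Made-up sample in the assumed layout: [box, (text, confidence)]
box = [[0, 0], [10, 0], [10, 5], [0, 5]]
sample_result = [[
    [box, ("Geant Casino", 0.98)],
    [box, ("ANNECY", 0.95)],
    [box, ("~~", 0.20)],
]]
ocr_text = ocr_result_to_text(sample_result, min_confidence=0.5)
```

Filtering on confidence is optional. It trades a little recall for fewer garbage tokens in the prompt.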

Now, let’s see what happens when we combine the OCR output with ChatGPT.


 invoice_info {
    "invoice_info": {
        "amount": 32,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "LUNETTES",
                        "description": "",
                        "amount": 22,
                        "amount_each": 22,
                        "amount_ex_vat": 22,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "CHAPEAU",
                        "description": "",
                        "amount": 10,
                        "amount_each": 10,
                        "amount_ex_vat": 10,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "paymentmethod": "CB EMV",
        "receipt_number": "000130",
        "transaction_number": "000130",
        "merchant_name": "G\u00e9ant Casino",
        "customer_email": "",
        "customer_name": "",
        "customer_address": "",
        "customer_phone": ""
    }
}
Test_Openai_Invoice Took 0:00:06.78

Awesome! It uses a few more tokens and takes slightly longer (6.78 seconds versus 5.79 seconds), but it returns additional fields like paymentmethod, receipt_number, and merchant_name. That's a fair trade-off and totally acceptable!

Language Detection

Right now, we’re facing two major challenges. First, PaddleOCR cannot automatically detect the language, which significantly affects the OCR output, and ultimately impacts the entire result. Second, most LLMs perform best with English, so if the input is in another language, the quality of the results decreases.

To demonstrate, I’ll use a challenging example.

Here’s a Japanese invoice:

[Image: Japanese invoice]

Let’s see what happens if we fail to auto-detect the language and use lang='en' to extract OCR on this Japanese invoice.

The result


'TEL045-752-6131 E TOP&CIubQJMB-FJ 2003 20130902 LNo.0102 No0073 0011319-2x198 396 00327111 238 000805 VR-E--E 298 003276 9 -435 298 001093 398 000335 138 000112 7 2x158 316 A000191 92 29 t 2.111 100) 10.001 10.001 7.890'

As you can see, the result is pretty bad.

Now, let's detect the language using a zero-shot image classification model. In this case, I'm using "facebook/metaclip-b32-400m". It classifies the invoice image directly, which in my experience is one of the best ways to pick among the roughly 80 languages supported by PaddleOCR without needing fine-tuning while still maintaining accuracy.


from PIL import Image
from transformers import pipeline

# Both functions are methods of the OCR wrapper class (class body omitted here).

def initialize_language_detector(self):
    # Initialize the zero-shot image classification model
    self.image_classifier = pipeline(task="zero-shot-image-classification",
                                     model="facebook/metaclip-b32-400m",
                                     device=self.device,
                                     batch_size=8)

def _get_lang(self, image: Image.Image) -> str:
    # Build one candidate label per key of language_dict
    # (keys are language names, values are their abbreviations)
    candidate_labels = [f"language {key}" for key in self.language_dict]

    # Classify the image; the pipeline returns labels sorted by descending score
    outputs = self.image_classifier(image, candidate_labels=candidate_labels)
    outputs = [{"score": round(output["score"], 4), "label": output["label"]} for output in outputs]

    # Map each label back to its language abbreviation
    language_names = [entry['label'].replace('language ', '') for entry in outputs]
    scores = [entry['score'] for entry in outputs]
    abbreviations = [self.language_dict.get(language) for language in language_names]

    # Use the top-scoring language only if it clears the threshold;
    # otherwise default to English
    lang = 'en'
    if scores[0] > self.language_thresh:
        lang = abbreviations[0]
    print("The source language", abbreviations)
    return lang
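
The selection logic can be tried without loading the model. The sketch below takes output in the same shape the classifier returns (a list of score/label dictionaries) and applies the same threshold rule. pick_language, the example language_dict, and the scores are hypothetical, chosen for illustration only.

```python
def pick_language(outputs, language_dict, threshold=0.5, default="en"):
    """Return the abbreviation of the best-scoring label,
    or `default` when nothing scores above the threshold."""
    if not outputs:
        return default
    best = max(outputs, key=lambda entry: entry["score"])
    if best["score"] <= threshold:
        return default
    name = best["label"].replace("language ", "", 1)
    return language_dict.get(name, default)


# Hypothetical mapping and scores
language_dict = {"english": "en", "japanese": "ja", "french": "fr"}
confident = [
    {"score": 0.91, "label": "language japanese"},
    {"score": 0.06, "label": "language english"},
    {"score": 0.03, "label": "language french"},
]
unsure = [
    {"score": 0.40, "label": "language french"},
    {"score": 0.35, "label": "language english"},
]
detected = pick_language(confident, language_dict)  # clear winner
fallback = pick_language(unsure, language_dict)     # below threshold
```

Falling back to English when the classifier is unsure keeps the pipeline running on hard images, at the cost of the poor OCR shown earlier when the fallback is wrong.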

Let's see the result


Recognized Text: 
{'ori_text': '根岸 東急ストア TEL 045-752-6131 領収証 [TOP2C!UbO J3カード」 クレヅッ 卜でのお支払なら 200円で3ボイン卜 お得なカード! 是非こ入会下さい。 2013年09月02日(月) レジNO. 0102 NOO07さ と う 001131 スダフエウ卜チーネ 23 単198 1396 003271 オインイ年 ユウ10 4238 000805 ソマ一ク スモー一クサーモン 1298 003276 タカナン ナマクリーム35 1298 001093 ヌテラ スフレクト 1398 000335 バナサ 138 000112 アボト 2つ 単158 1316 A000191 タマネキ 429 合計 2,111 (内消費税等 100 現金 10001 お預り合計 110 001 お釣り 7 890', 
'ori_language': 'ja', 
'text': 'Negishi Tokyu Store TEL 045-752-6131 Receipt [TOP2C!UbO J3 Card] If you pay with a credit card, you can get 3 points for 200 yen.A great value card!Please join us. Monday, September 2, 2013 Cashier No. 0102 NOO07 Satou 001131 Sudafue Bucine 23 Single 198 1396 003271 Oinyen Yu 10 4238 000805 Soma Iku Smo Iku Salmon 1298 003276 Takanan Nama Cream 35 1 298 001093 Nutella Sprect 1398 000335 Banasa 138 000112 Aboto 2 AA 158 1316 A000191 Eggplant 429 Total 2,111 (including consumption tax, etc. 100 Cash 10001 Total deposited 110 001 Change 7 890', 
'language': 'en',}

The results are much better now! I also translated the original Japanese into English. With this approach, the output will significantly improve for other languages as well.
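
The record above has four keys: ori_text, ori_language, text, and language. The post does not show how the translation is done, so the sketch below leaves the translator as a function you pass in. build_ocr_record and the fake translator are hypothetical and only illustrate the shape of the record.

```python
from typing import Callable


def build_ocr_record(ori_text: str,
                     ori_language: str,
                     translate: Callable[[str, str, str], str],
                     target_language: str = "en") -> dict:
    """Package the OCR text with its translation, skipping the
    translation call when the text is already in the target language."""
    if ori_language == target_language:
        text = ori_text
    else:
        text = translate(ori_text, ori_language, target_language)
    return {
        "ori_text": ori_text,
        "ori_language": ori_language,
        "text": text,
        "language": target_language,
    }


def fake_translate(text: str, source: str, target: str) -> str:
    # Stand-in for a real translation backend (assumption: any
    # function with this signature can be plugged in)
    return f"[{source}->{target}] {text}"


record = build_ocr_record("領収証", "ja", fake_translate)
```

Keeping the original text next to the translation lets later steps fall back to it when the translation garbles names or numbers.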

Summary

In this blog, we explored how to extract key information from invoices by combining LLMs and OCR, while also optimizing processing time, minimizing token usage, and improving multilingual support. By incorporating PaddleOCR and a zero-shot language detection model, we boosted both accuracy and reliability across different formats and languages. I hope these examples help you grasp the full process, from initial concept to final implementation.

Reference:

Mrzaizai2k - Multilanguage invoice ocr

If you’d like to learn more, be sure to check out my other posts and give me a like! It would mean a lot to me. Thank you.

Real-Time Data Processing with MongoDB Change Streams and Python
Replay Attack: Let’s Learn
Reasons to Write
