关键信息提取的实用方法（第 1 部分）-Python教程-PHP中文网

首页

后端开发

Python教程

关键信息提取的实用方法（第 1 部分）

Patricia Arquette

Oct 04, 2024 pm 04:11 PM

大家好，我又是 Mrzaizai2k！

在这个系列中，我想分享我解决发票关键信息提取（KIE）问题的方法。我们将探索如何利用 ChatGPT 和 Qwen2 等大型语言模型 (LLM) 进行信息提取。然后，我们将深入研究使用 PaddleOCR、零样本分类模型或 Llama 3.1 等 OCR 模型来对结果进行后处理。

— 天哪，这太令人兴奋了！

为了更上一层楼，我们将处理任何格式和任何语言的发票。是的，没错——这是真的！

分析需求

假设您需要构建一项服务，以任何语言从任何类型的发票中提取所有相关信息。就像您在此示例网站上找到的内容一样。

这是我们将使用的示例发票图像：

Practical Approaches to Key Information Extraction (Part 1)

关键考虑因素

首先我们来详细分析一下需求。这将帮助我们为我们的系统选择正确的技术堆栈。虽然某些技术可能效果很好，但它们可能并不适合所有场景。以下是我们需要从上到下确定优先级的内容：

快速启动系统
确保准确性
使其在有限的计算资源上运行
- （例如，具有 12 GB VRAM 的 GPU RTX 3060 甚至 CPU）
保持合理的处理时间
- 每个发票在 CPU 上约 1 分钟，在 GPU 上约 10 秒
专注于仅提取有用且重要的细节

考虑到这些要求，我们不会进行任何微调。相反，我们将结合现有技术并将它们堆叠在一起，以便快速准确地获得任何格式和语言的结果。

作为基准，我注意到示例网站在大约 3-4 秒内处理发票。因此，在我们的系统中10秒的目标是完全可以实现的。

输出格式应与示例网站上使用的格式匹配：

Practical Approaches to Key Information Extraction (Part 1)

查特普特

好吧，我们来谈谈第一个工具：ChatGPT。您可能已经知道它的使用有多么简单。那么，为什么还要阅读这个博客呢？好吧，如果我告诉您我可以帮助您优化令牌使用并加快处理速度呢？感兴趣了吗？请稍等——我会解释如何做。

基本方法

这是一个基本的代码片段。（注意：代码可能并不完美——这更多的是关于想法而不是确切的实现）。您可以在我的存储库多语言发票 OCR 存储库中查看完整代码。


class OpenAIExtractor(BaseExtractor):
    def __init__(self, config_path: str = "config/config.yaml"):
        super().__init__(config_path)

        self.config = self.config['openai']
        self.model = self.config['model_name']
        self.temperature = self.config['temperature']
        self.max_tokens = self.config['max_tokens']

        self.OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
        from openai import OpenAI
        self.client = OpenAI(api_key=self.OPENAI_API_KEY)

    def _extract_invoice_llm(self, ocr_text, base64_image:str, invoice_template:str):
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": """You are a helpful assistant that responds in JSON format with the invoice information in English. 
                                            Don't add any annotations there. Remember to close any bracket. Number, price and amount should be number, date should be convert to dd/mm/yyyy, 
                                            time should be convert to HH:mm:ss, currency should be 3 chracters like VND, USD, EUR"""},
                {"role": "user", "content": [
                    {"type": "text", "text": f"From the image of the bill and the text from OCR, extract the information. The ocr text is: {ocr_text} \n. Return the key names as in the template is a MUST. The invoice template: \n {invoice_template}"},
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
                ]}
            ],
            temperature=self.temperature,
            max_tokens=self.max_tokens,
        )
        return response.choices[0].message.content

    def extract_json(self, text: str) -> dict:
        start_index = text.find('{')
        end_index = text.rfind('}') + 1
        json_string = text[start_index:end_index]
        json_string = json_string.replace('true', 'True').replace('false', 'False').replace('null', 'None')
        result = eval(json_string)
        return result

    @retry_on_failure(max_retries=3, delay=1.0)
    def extract_invoice(self, ocr_text, image: Union[str, np.ndarray], invoice_template:str) -> dict:
        base64_image = self.encode_image(image)
        invoice_info = self._extract_invoice_llm(ocr_text, base64_image, 
                                                 invoice_template=invoice_template)
        invoice_info = self.extract_json(invoice_info)
        return invoice_info

好的，让我们看看结果



invoice {
    "invoice_info": {
        "amount": 32.0,
        "amount_change": 0,
        "amount_shipping": 0,
        "vatamount": 0,
        "amountexvat": 32.0,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "vatitems": [
            {
                "amount": 32.0,
                "amount_excl_vat": 32.0,
                "amount_incl_vat": 32.0,
                "amount_incl_excl_vat_estimated": false,
                "percentage": 0,
                "code": ""
            }
        ],
        "vat_context": "",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "Lunettes",
                        "description": "",
                        "amount": 22.0,
                        "amount_each": 22.0,
                        "amount_ex_vat": 22.0,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "Chapeau",
                        "description": "",
                        "amount": 10.0,
                        "amount_each": 10.0,
                        "amount_ex_vat": 10.0,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "paymentmethod": "CB EMV",
        "payment_auth_code": "",
        "payment_card_number": "",
        "payment_card_account_number": "",
        "payment_card_bank": "",
        "payment_card_issuer": "",
        "payment_due_date": "",
        "terminal_number": "",
        "document_subject": "",
        "package_number": "",
        "invoice_number": "",
        "receipt_number": "000130",
        "shop_number": "",
        "transaction_number": "000148",
        "transaction_reference": "",
        "order_number": "",
        "table_number": "",
        "table_group": "",
        "merchant_name": "G\u00e9ant Casino",
        "merchant_id": "",
        "merchant_coc_number": "",
        "merchant_vat_number": "",
        "merchant_bank_account_number": "",
        "merchant_bank_account_number_bic": "",
        "merchant_chain_liability_bank_account_number": "",
        "merchant_chain_liability_amount": 0,
        "merchant_bank_domestic_account_number": "",
        "merchant_bank_domestic_bank_code": "",
        "merchant_website": "",
        "merchant_email": "",
        "merchant_address": "Annecy",
        "merchant_phone": "04.50.88.20.00",
        "customer_name": "",
        "customer_address": "",
        "customer_phone": "",
        "customer_website": "",
        "customer_vat_number": "",
        "customer_coc_number": "",
        "customer_bank_account_number": "",
        "customer_bank_account_number_bic": "",
        "customer_email": "",
        "document_language": ""
    }
}
Test_Openai_Invoice Took 0:00:11.15

结果非常可靠，但处理时间是一个问题——它超出了我们的 10 秒限制。您可能还会注意到，输出包含大量空字段，这不仅会增加处理时间，还会引入错误并消耗更多令牌 - 本质上会花费更多金钱。

先进的方法

事实证明，我们只需要一个小小的调整就可以解决这个问题。

只需将以下句子添加到您的提示中：

“只输出有值的字段，不返回任何空字段。”

瞧！问题解决了！


 invoice_info {
    "invoice_info": {
        "amount": 32,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "LUNETTES",
                        "description": "",
                        "amount": 22,
                        "amount_each": 22,
                        "amount_ex_vat": 22,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "CHAPEAU",
                        "description": "",
                        "amount": 10,
                        "amount_each": 10,
                        "amount_ex_vat": 10,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "invoice_number": "000130"
    }
}
Test_Openai_Invoice Took 0:00:05.79

哇，这真是一个改变游戏规则的人！现在结果更短、更精确，处理时间从 11.15 秒下降到仅 5.79 秒。通过这一单句调整，我们将成本和处理时间减少了约 50%。很酷吧？

在本例中，我使用的是 GPT-4o-mini，效果很好，但根据我的经验，Gemini Flash 表现更好 - 更快且免费！绝对值得一看。

您可以通过缩短模板来进一步优化，根据您的具体要求仅关注最重要的字段。

桨OCR

结果看起来相当不错，但仍然有一些缺失的字段，例如电话号码或收银员姓名，我们也想捕获这些字段。虽然我们可以简单地重新提示 ChatGPT，但仅依赖 LLM 可能是不可预测的 - 每次运行的结果可能会有所不同。此外，提示模板相当长（因为我们试图提取所有格式的所有可能信息），这可能会导致 ChatGPT “忘记”某些细节。

这就是 PaddleOCR 的用武之地 - 它通过提供精确的 OCR 文本来增强 LLM 的视觉能力，帮助模型准确地专注于需要提取的内容。

In my previous prompt, I used this structure:


{"type": "text", "text": f"From the image of the bill and the text from OCR, extract the information. The ocr text is: {ocr_text} \n.

Previously, I set ocr_text = '', but now we'll populate it with the output from PaddleOCR. Since I'm unsure of the specific language for now, I'll use English (as it's the most commonly supported). In the next part, I’ll guide you on detecting the language, so hang tight!

Here’s the updated code to integrate PaddleOCR:


ocr = PaddleOCR(lang='en', show_log=False, use_angle_cls=True, cls=True)
result = ocr.ocr(np.array(image))

This is the OCR output.


    "Geant Casino ANNECY BIENVENUE DANS NOTRE MAGASIN Caisse014 Date28/06/2008 VOTRE MAGASIN VOUS ACCUEILLE DU LUNDI AU SAMEDI DE 8H30 A21H00 TEL.04.50.88.20.00 LUNETTES 22.00E CHAPEAU 10.00E =TOTAL2) 32.00E CB EMV 32.00E Si vous aviez la carte fidelite, vous auriez cumule 11SMILES Caissier:000148/Heure:17:46:26 Numero de ticket :000130 Rapidite,confort d'achat budget maitrise.. Scan' Express vous attend!! Merci de votre visite A bientot"

As you can see, the results are pretty good. In this case, the invoice is in French, which looks similar to English, so the output is decent. However, if we were dealing with languages like Japanese or Chinese, the results wouldn't be as accurate.

Now, let’s see what happens when we combine the OCR output with ChatGPT.


 invoice_info {
    "invoice_info": {
        "amount": 32,
        "currency": "EUR",
        "purchasedate": "28/06/2008",
        "purchasetime": "17:46:26",
        "lines": [
            {
                "description": "",
                "lineitems": [
                    {
                        "title": "LUNETTES",
                        "description": "",
                        "amount": 22,
                        "amount_each": 22,
                        "amount_ex_vat": 22,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    },
                    {
                        "title": "CHAPEAU",
                        "description": "",
                        "amount": 10,
                        "amount_each": 10,
                        "amount_ex_vat": 10,
                        "vat_amount": 0,
                        "vat_percentage": 0,
                        "quantity": 1,
                        "unit_of_measurement": "",
                        "sku": "",
                        "vat_code": ""
                    }
                ]
            }
        ],
        "paymentmethod": "CB EMV",
        "receipt_number": "000130",
        "transaction_number": "000130",
        "merchant_name": "G\u00e9ant Casino",
        "customer_email": "",
        "customer_name": "",
        "customer_address": "",
        "customer_phone": ""
    }
}
Test_Openai_Invoice Took 0:00:06.78

Awesome! It uses a few more tokens and takes slightly longer, but it returns additional fields like payment_method, receipt_number, and cashier. That’s a fair trade-off and totally acceptable!

Language Detection

Right now, we’re facing two major challenges. First, PaddleOCR cannot automatically detect the language, which significantly affects the OCR output, and ultimately impacts the entire result. Second, most LLMs perform best with English, so if the input is in another language, the quality of the results decreases.

To demonstrate, I’ll use a challenging example.

Here’s a Japanese invoice:

Practical Approaches to Key Information Extraction (Part 1)

Let’s see what happens if we fail to auto-detect the language and use lang='en' to extract OCR on this Japanese invoice.

The result


'TEL045-752-6131 E TOP&CIubQJMB-FJ 2003 20130902 LNo.0102 No0073 0011319-2x198 396 00327111 238 000805 VR-E--E 298 003276 9 -435 298 001093 398 000335 138 000112 7 2x158 316 A000191 92 29 t 2.111 100) 10.001 10.001 7.890'

As you can see, the result is pretty bad.

Now, let’s detect the language using a zero-shot classification model. In this case, I’m using "facebook/metaclip-b32-400m". This is one of the best ways to detect around 80 languages supported by PaddleOCR without needing fine-tuning while still maintaining accuracy.


def initialize_language_detector(self):
        # Initialize the zero-shot image classification model
        self.image_classifier = pipeline(task="zero-shot-image-classification", 
                                         model="facebook/metaclip-b32-400m", 
                                         device=self.device,
                                         batch_size=8)

def _get_lang(self, image: Image.Image) -> str:
        # Define candidate labels for language classification
        candidate_labels = [f"language {key}" for key in self.language_dict]

        # Perform inference to classify the language
        outputs = self.image_classifier(image, candidate_labels=candidate_labels)
        outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]

        # Extract the language with the highest score
        language_names = [entry['label'].replace('language ', '') for entry in outputs]
        scores = [entry['score'] for entry in outputs]
        abbreviations = [self.language_dict.get(language) for language in language_names]

        first_abbreviation = abbreviations[0]
        lang = 'en'  # Default to English

        if scores[0] > self.language_thresh:
            lang = first_abbreviation
        print("The source language", abbreviations)
        return lang

Let's see the result


Recognized Text: 
{'ori_text': '根岸 東急ストア TEL 045-752-6131 領収証 [TOP2C!UbO J3カード」 クレヅッ 卜でのお支払なら 200円で3ボイン卜 お得なカード! 是非こ入会下さい。 2013年09月02日(月) レジNO. 0102 NOO07さ と う 001131 スダフエウ卜チーネ 23 単198 1396 003271 オインイ年 ユウ10 4238 000805 ソマ一ク スモー一クサーモン 1298 003276 タカナン ナマクリーム35 1298 001093 ヌテラ スフレクト 1398 000335 バナサ 138 000112 アボト 2つ 単158 1316 A000191 タマネキ 429 合計 2,111 (内消費税等 100 現金 10001 お預り合計 110 001 お釣り 7 890', 
'ori_language': 'ja', 
'text': 'Negishi Tokyu Store TEL 045-752-6131 Receipt [TOP2C!UbO J3 Card] If you pay with a credit card, you can get 3 points for 200 yen.A great value card!Please join us. Monday, September 2, 2013 Cashier No. 0102 NOO07 Satou 001131 Sudafue Bucine 23 Single 198 1396 003271 Oinyen Yu 10 4238 000805 Soma Iku Smo Iku Salmon 1298 003276 Takanan Nama Cream 35 1 298 001093 Nutella Sprect 1398 000335 Banasa 138 000112 Aboto 2 AA 158 1316 A000191 Eggplant 429 Total 2,111 (including consumption tax, etc. 100 Cash 10001 Total deposited 110 001 Change 7 890', 
'language': 'en',}

The results are much better now! I also translated the original Japanese into English. With this approach, the output will significantly improve for other languages as well.

Summary

In this blog, we explored how to extract key information from invoices by combining LLMs and OCR, while also optimizing processing time, minimizing token usage, and improving multilingual support. By incorporating PaddleOCR and a zero-shot language detection model, we boosted both accuracy and reliability across different formats and languages. I hope these examples help you grasp the full process, from initial concept to final implementation.