


When using OCR to digitize financial reports, you may encounter various approaches for detecting specific categories within those reports. For example, traditional methods like the Levenshtein algorithm can be used for string matching based on edit distance, making it effective for handling near matches, such as correcting typos or small variations in text.
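To make the edit-distance idea concrete, here is a minimal pure-Python sketch of Levenshtein distance. The function name `levenshtein` and the sample strings are illustrative and do not come from the original post.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


# An OCR misread of a single character is one edit away from the intended term
print(levenshtein("net proflt", "net profit"))  # 1
```

A small distance relative to the term length is what makes this approach suitable for typo-level variations.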
However, the challenge becomes more complex when you need to detect multiple categories in a single line of a report, especially when those categories may not appear exactly as expected or could overlap semantically.
In this post, we analyze a semantic matching approach using Facebook's LASER (Language-Agnostic SEntence Representations) embeddings, showcasing how it can effectively handle this task.
Problem
The objective is to identify specific financial terms (categories) in a given text line. Let’s assume we have a fixed set of predefined categories that represent all possible terms of interest, such as:
["revenues", "operating expenses", "operating profit", "depreciation", "interest", "net profit", "tax", "profit after tax", "metric 1"]
Given an input line like:
"operating profit, net profit and profit after tax"
we aim to detect which of these categories (referred to below as identifiers) appear in the line.
Semantic Matching with LASER
Instead of relying on exact or fuzzy text matches, we use semantic similarity. This approach leverages LASER embeddings to capture the semantic meaning of text and compares it using cosine similarity.
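As a rough illustration of the comparison step, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the helper names `cosine` and `normalize` are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def normalize(u):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in u))
    return [a / length for a in u]


line_vec = [3.0, 4.0, 0.0]
identifier_vec = [6.0, 8.0, 0.0]  # same direction as line_vec
unrelated_vec = [0.0, 0.0, 5.0]   # orthogonal to line_vec

print(cosine(line_vec, identifier_vec))
print(cosine(line_vec, unrelated_vec))

# For unit-length vectors the cosine reduces to a plain dot product
unit_dot = sum(a * b for a, b in zip(normalize(line_vec), normalize(identifier_vec)))
```

This is why normalized embeddings are convenient: once every vector has unit length, a dot product is enough to compare them.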
Implementation
Preprocessing the Text
Before embedding, the text is preprocessed by converting it to lowercase and stripping leading and trailing whitespace, so that identifiers and OCR lines are compared in a uniform form.
def preprocess(text):
    return text.lower().strip()
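OCR output often contains runs of spaces or tabs inside a line, which `strip()` alone does not touch. A possible variant that also collapses internal whitespace is sketched below; the name `preprocess_ocr` is hypothetical and this is not the function used in the post.

```python
import re


def preprocess_ocr(text: str) -> str:
    """Lowercase the text, collapse whitespace runs to one space, and trim the ends."""
    return re.sub(r"\s+", " ", text).strip().lower()


print(preprocess_ocr("  Operating   Profit,\tNet Profit "))
# operating profit, net profit
```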
Embedding Identifiers and Input Line
The LASER encoder generates normalized embeddings for both the list of identifiers and the input/OCR line.
identifier_embeddings = encoder.encode_sentences(identifiers, normalize_embeddings=True)
ocr_line_embedding = encoder.encode_sentences([ocr_line], normalize_embeddings=True)[0]
Ranking Identifiers by Specificity
Longer identifiers are prioritized by sorting them based on word count. This helps handle nested matches, where longer identifiers might subsume shorter ones (e.g., "profit after tax" subsumes "profit").
ranked_identifiers = sorted(identifiers, key=lambda x: len(x.split()), reverse=True)
ranked_embeddings = encoder.encode_sentences(ranked_identifiers, normalize_embeddings=True)
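The ranking can also be expressed as a reordering of embeddings that have already been computed, which avoids a second call to the encoder. The sketch below is a variation under that assumption, not the post's code; `rank_by_word_count` is a hypothetical helper and the placeholder strings stand in for real embedding vectors.

```python
def rank_by_word_count(identifiers, embeddings):
    """Return identifiers and their embeddings ordered from most to fewest words."""
    # sorted() is stable, so identifiers with equal word counts keep their original order
    order = sorted(
        range(len(identifiers)),
        key=lambda i: len(identifiers[i].split()),
        reverse=True,
    )
    return [identifiers[i] for i in order], [embeddings[i] for i in order]


identifiers = ["revenues", "operating profit", "tax", "profit after tax"]
# Placeholder strings stand in for the real embedding vectors
embeddings = ["e_revenues", "e_operating_profit", "e_tax", "e_profit_after_tax"]

ranked_identifiers, ranked_embeddings = rank_by_word_count(identifiers, embeddings)
print(ranked_identifiers)
# ['profit after tax', 'operating profit', 'revenues', 'tax']
```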
Calculating Similarity
Using cosine similarity, we measure how semantically similar each identifier is to the input line. Identifiers with similarity above a specified threshold are considered matches.
# cosine_similarity is assumed to be scikit-learn's implementation
from sklearn.metrics.pairwise import cosine_similarity

matches = []
threshold = 0.6
for idx, identifier_embedding in enumerate(ranked_embeddings):
    similarity = cosine_similarity([identifier_embedding], [ocr_line_embedding])[0][0]
    if similarity >= threshold:
        matches.append((ranked_identifiers[idx], similarity))
Resolving Nested Matches
To handle overlapping identifiers, longer matches are prioritized, ensuring shorter matches within them are excluded.
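One way this resolution step could look is sketched below. It is an assumption about the logic rather than the post's own code: `resolve_nested` is a hypothetical helper, and the similarity scores in the example are invented for illustration.

```python
def resolve_nested(matches):
    """Drop matches whose identifier occurs as a whole-word phrase inside a longer kept match."""
    # Process the most specific (longest) identifiers first
    ordered = sorted(matches, key=lambda m: len(m[0].split()), reverse=True)
    kept = []
    for identifier, score in ordered:
        # Padding with spaces makes the substring test respect word boundaries
        nested = any(f" {identifier} " in f" {longer} " for longer, _ in kept)
        if not nested:
            kept.append((identifier, score))
    return kept


# Illustrative scores only; they are not results from the original post
matches = [("tax", 0.65), ("profit after tax", 0.80), ("net profit", 0.70)]
print(resolve_nested(matches))
# [('profit after tax', 0.8), ('net profit', 0.7)]
```

Note the trade-off in this design: "tax" is dropped even if the line also mentions tax on its own, because the check looks only at the identifiers and not at their positions in the line.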
Results
When the code is executed on the example input line, the output is a list of the detected matches along with their similarity scores.
Considerations for Longer and Complex Inputs
This method works well in structured financial reports with multiple categories on a single line, provided there aren't too many categories or much unrelated text. Because the whole line is compressed into a single fixed-size embedding, every additional term or unrelated phrase dilutes the signal of any one category. Accuracy can therefore degrade with longer, complex inputs or unstructured user-generated text, and the method is less reliable for noisy or unpredictable inputs.
Conclusion
This post demonstrates how LASER embeddings can be a useful tool for detecting multiple categories in text. Is it the best option? Maybe not, but it is certainly one of the options worth considering, especially when dealing with complex scenarios where traditional matching techniques might fall short.
Full code
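The steps described in this post can be assembled roughly as follows. This is a sketch under stated assumptions, not the author's original listing: `detect_identifiers` and `make_laser_encode` are hypothetical names, the encoder is passed in as a callable so the matching logic can be tried without downloading a model, and the LASER wiring assumes the `laser_encoders` package with its `LaserEncoderPipeline` class and the language code `"eng_Latn"`.

```python
def preprocess(text):
    return text.lower().strip()


def detect_identifiers(ocr_line, identifiers, encode, threshold=0.6):
    """Return (identifier, similarity) pairs detected in ocr_line.

    encode is any callable mapping a list of strings to unit-length vectors.
    """
    # Rank identifiers so that longer, more specific ones come first
    ranked = sorted(
        (preprocess(i) for i in identifiers),
        key=lambda s: len(s.split()),
        reverse=True,
    )
    ranked_vectors = encode(ranked)
    line_vector = encode([preprocess(ocr_line)])[0]

    # For unit-length vectors the dot product equals the cosine similarity
    matches = []
    for identifier, vector in zip(ranked, ranked_vectors):
        similarity = float(sum(a * b for a, b in zip(vector, line_vector)))
        if similarity >= threshold:
            matches.append((identifier, similarity))

    # Drop shorter identifiers contained in an already accepted longer one
    kept = []
    for identifier, similarity in matches:
        if not any(f" {identifier} " in f" {longer} " for longer, _ in kept):
            kept.append((identifier, similarity))
    return kept


def make_laser_encode(lang="eng_Latn"):
    """Build an encode callable backed by LASER (assumes laser_encoders is installed)."""
    from laser_encoders import LaserEncoderPipeline  # imported lazily on purpose

    encoder = LaserEncoderPipeline(lang=lang)
    return lambda sentences: encoder.encode_sentences(sentences, normalize_embeddings=True)


# Example wiring (left commented out because it downloads the LASER model on first use):
# encode = make_laser_encode()
# line = "operating profit, net profit and profit after tax"
# print(detect_identifiers(line, ["net profit", "tax", "profit after tax"], encode))
```

Keeping the encoder behind a callable also makes it straightforward to compare LASER with another embedding model using the same threshold and resolution logic.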