NoisOCR: A Python Library for Simulating Post-OCR Noisy Texts

Susan Sarandon | Original | 2024-10-13 06:16:30

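NoisOCR is a Python library designed to simulate the noise found in text produced by OCR (optical character recognition). Such text can contain errors or annotations that reflect how hard it is to run OCR on low-quality documents or manuscripts. The library simulates typical post-OCR errors and splits text into sliding windows, with or without hyphenation. This can help when training neural network models for spelling correction.

GitHub repository: NoisOCR

PyPI: NoisOCR on PyPI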

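Features

Sliding window: splits long text into smaller segments without breaking words.
Sliding window with hyphenation: uses hyphenation to fit words within the character limit.
Text error simulation: adds random errors to simulate low-accuracy post-OCR text.
Text annotation simulation: inserts annotations like those in the BRESSAY dataset to mark words or phrases in the text.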

Installation

NoisOCR can be installed with pip:

pip install noisocr

Usage examples

1. Sliding window

This function splits text into segments of limited size while keeping words intact.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window(text, max_window_size)

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing', 
#   ...
#   'type and scrambled it to make a type specimen', 
#   'book.'
# ]

2. Sliding window with hyphenation

This function tries to fit words that would exceed the per-window character limit by inserting hyphens where needed. It supports multiple languages through the PyHyphen package.

import noisocr

text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50

windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')

# Output:
# [
#   'Lorem Ipsum is simply dummy text of the printing ',        
#   'typesetting industry. Lorem Ipsum has been the in-', 
#   ...
#   'scrambled it to make a type specimen book.'
# ]
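
The fitting step can be illustrated on its own. The following is a minimal sketch, not the library's code: `split_to_fit` is a hypothetical helper, and the syllable lists are written by hand here (in NoisOCR they come from PyHyphen). It splits a word at the last syllable boundary that still leaves room for the hyphen.

```python
def split_to_fit(syllables, room):
    """Split a word (given as syllables) so the head plus a hyphen fits in `room` characters.

    Returns (head, tail). The head is "" when not even the first syllable fits.
    """
    word = "".join(syllables)
    if len(word) <= room:
        # The whole word fits, so no hyphen is needed.
        return word, ""
    head = ""
    for i, syllable in enumerate(syllables):
        # The +1 reserves one character for the trailing hyphen.
        if len(head) + len(syllable) + 1 > room:
            return (head + "-" if head else ""), "".join(syllables[i:])
        head += syllable
    return "", word


# Hand-written syllables, assumed for illustration only.
print(split_to_fit(["in", "dus", "try"], 3))
print(split_to_fit(["type", "set", "ting"], 8))
```

The tail returned by the helper would be carried over to the start of the next window.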

3. Text error simulation

The simulate_errors function adds random errors to a text, emulating problems commonly found in post-OCR text. The typo library generates the errors, such as swapped characters, missing spaces, and extra characters.

import noisocr

text = "Hello, world!"
# Errors are random, so the outputs below are examples only.
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!

4. Text annotation simulation

The annotation simulation function adds custom markers to a text, drawn from a set of annotations that includes those used in the BRESSAY dataset.

import noisocr

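text = "Hello, world!"
# Annotations are random, so the outputs below are examples only.
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
# Pass probability by keyword: the second positional parameter is annotations.
text_with_annotation = noisocr.simulate_annotation(text, probability=0.01)
# Output: Hello, world!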
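
Since the stated goal is training data for spelling correction, the examples above can be combined into (noisy, clean) pairs. This is a minimal sketch under assumptions: `make_training_pairs` and `drop_first_e` are hypothetical names, and the noise function is a trivial stand-in so the example runs without extra packages.

```python
def make_training_pairs(windows, noise_fn):
    """Pair each clean window with a noisy copy produced by `noise_fn`."""
    return [(noise_fn(window), window) for window in windows]


def drop_first_e(text):
    # Hypothetical stand-in noise function: removes the first "e" in the text.
    return text.replace("e", "", 1)


clean_windows = ["the printing and", "typesetting industry"]
pairs = make_training_pairs(clean_windows, drop_first_e)
for noisy, clean in pairs:
    print(f"{noisy!r} -> {clean!r}")
```

With NoisOCR installed, `clean_windows` could come from `noisocr.sliding_window` and `noise_fn` could wrap `noisocr.simulate_errors`, for example `lambda w: noisocr.simulate_errors(w, interactions=2)`.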

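Code overview

The core functionality of NoisOCR builds on other libraries: typo for error simulation and hyphen (PyHyphen) for hyphenating words in different languages. The main functions are described below.

1. simulate_annotation function

The simulate_annotation function picks a random word from the text and, with the given probability, annotates it using a template chosen from a defined set of annotations. A text with only one word is returned unchanged.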

import random

annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##', 
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()

    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text

    if random.random() < probability:
        annotation = random.choice(annotations)
        if 'text' in annotation:
            annotated_text = annotation.replace('text', target_word)
        else:
            annotated_text = annotation

        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text
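
One detail of the listing above is worth a closer look: `text.replace(target_word, annotated_text, 1)` replaces the first place where the chosen word's characters occur, which may be inside a longer word earlier in the text. The sketch below shows an alternative that annotates by word position. It is not the library's code, and `annotate_word_at` is a hypothetical helper.

```python
def annotate_word_at(text, index, template):
    """Apply an annotation template to the word at position `index`."""
    words = text.split()
    if 'text' in template:
        # Templates containing 'text' wrap the original word.
        words[index] = template.replace('text', words[index])
    else:
        # Other templates replace the word entirely.
        words[index] = template
    return " ".join(words)


sentence = "another other"
# Substring replacement hits "other" inside "another" first.
print(sentence.replace("other", "##other##", 1))
# Position-based replacement annotates the intended word.
print(annotate_word_at(sentence, 1, "##text##"))
```

Note that splitting and re-joining collapses runs of whitespace into single spaces, which may or may not matter for a given dataset.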

2. simulate_errors function

The simulate_errors function applies a series of errors to the text, each picked at random from the methods offered by the typo library.

import random
import typo

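def simulate_errors(text, interactions=3, seed=None):
    # Names of the typo.StrErrer methods that can be applied.
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char", "skipped_space", "random_space", "repeated_char", "unichar"]

    # Seed the global random module (from system entropy when no seed is given).
    # This runs on every recursive call, so a fixed seed picks the same method each time.
    if seed is not None:
        random.seed(seed)
    else:
        random.seed()

    # Apply one randomly chosen error method to the text.
    instance = typo.StrErrer(text)
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result

    # Recurse for the remaining rounds. One error is applied before this check,
    # so as written the function applies interactions + 1 errors in total.
    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)

    return text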
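
For comparison, here is a minimal sketch of the same idea written as a loop with a local random generator, so that a seed gives reproducible output and exactly `iterations` errors are applied. It is an illustration under assumptions, not the library's implementation: the three error operations are simplified stand-ins for typo's methods, and all function names are hypothetical.

```python
import random


def swap_adjacent(text, rng):
    # Swap two neighbouring characters.
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_char(text, rng):
    # Remove one character.
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def repeat_char(text, rng):
    # Duplicate one character.
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i] + text[i:]


def simulate_errors_iterative(text, iterations=3, seed=None):
    # A local generator leaves the global random state untouched.
    rng = random.Random(seed)
    operations = [swap_adjacent, drop_char, repeat_char]
    for _ in range(iterations):
        text = rng.choice(operations)(text, rng)
    return text


print(simulate_errors_iterative("Hello, world!", iterations=2, seed=42))
```

Because every random choice goes through the same seeded generator, two calls with the same seed return the same string.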

3. sliding_window and sliding_window_with_hyphenation functions

These functions split text into sliding windows, without or with hyphenation. The hyphenating variant is shown here.

from hyphen import Hyphenator

def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""

    for word in words:
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""

        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        remaining_word = word + " "
                        break
            else:
                current_window.append(temp_word)
                remaining_word = ""

            windows.append(" ".join(current_window))
            current_window = []

    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))

    return windows
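
The plain sliding_window function is not listed in this article. As a rough idea of how word-preserving windowing can work, here is a minimal sketch under assumptions. `sliding_window_sketch` is a hypothetical name and this is not the library's implementation.

```python
def sliding_window_sketch(text, window_size=80):
    """Greedily pack whole words into windows of at most `window_size` characters."""
    windows = []
    current = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > window_size:
            # The word does not fit: close the current window and start a new one.
            windows.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        windows.append(" ".join(current))
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in sliding_window_sketch(sample, 50):
    print(repr(window))
```

In this sketch a single word longer than `window_size` gets a window of its own and exceeds the limit, which is the case the hyphenating variant is meant to handle.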

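Conclusion

NoisOCR provides useful tools for anyone working on post-OCR text correction, making it easier to simulate real-world scenarios in which digitized text is prone to errors and annotations. Whether for automated testing, developing text correction models, or analyzing datasets such as BRESSAY, the library is a versatile, user-friendly solution.

Check out the NoisOCR project on GitHub and contribute to its improvement!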
