Advances in machine learning continue to drive the development of handwriting recognition. This article focuses on the handwriting recognition techniques and algorithms that currently perform well.
Capsule networks are one of the more recent neural network architectures and are regarded as an improvement on existing machine learning techniques.
Pooling layers in convolutional blocks reduce data dimensionality and provide the spatial invariance needed to identify and classify objects in images. A drawback of pooling is that much of the spatial information about an object's rotation, position, scale and other positional properties is lost in the process. As a result, image classification accuracy is high, but performance at locating the precise position of objects in the image is poor.
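The loss of positional information can be seen in a minimal sketch. The helper `max_pool_2x2` and the two toy images are hypothetical and only illustrate the idea: two inputs that differ in where a feature sits inside a pooling window produce the same pooled output.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max pooling over a list-of-lists image.

    Assumption for this sketch: height and width are both even.
    """
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            row.append(max(image[r][c], image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# The same bright pixel at two different positions inside the
# top-left 2x2 window (toy data, not taken from any dataset).
image_a = [[9, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
image_b = [[0, 0, 0, 0],
           [0, 9, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]

pooled_a = max_pool_2x2(image_a)
pooled_b = max_pool_2x2(image_b)
# After pooling, the two inputs can no longer be told apart.
print(pooled_a == pooled_b)
```

The pooled maps are identical even though the inputs are not, so the exact position of the feature within each window cannot be recovered afterwards.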
A capsule is a module of neurons that stores information about an object's position, rotation, scale and other properties in a high-dimensional vector space. Each dimension represents a particular characteristic of the object.
The kernels that generate feature maps and extract visual features work with dynamic routing, which combines the individual opinions of multiple groups of neurons (called capsules). This leads to equivariance between kernels and improved performance compared to CNNs.
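As a rough illustration of how capsule "opinions" can be combined, the sketch below follows the commonly described routing-by-agreement scheme: a squashing non-linearity keeps each output vector shorter than 1, and coupling coefficients are raised for lower capsules whose predictions agree with the current output. The function names, the pure-Python data layout and the three routing iterations are assumptions of this sketch, not details given in the article.

```python
import math


def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iters=3):
    """Routing by agreement (illustrative sketch).

    u_hat[i][j] is the prediction vector that lower capsule i makes
    for upper capsule j. Returns one output vector per upper capsule.
    """
    n_in, n_out = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits, start uniform
    v = []
    for _ in range(num_iters):
        # Each lower capsule distributes its vote across the upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit where a prediction agrees with the output (dot product).
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v
```

In this formulation the length of an output vector can be read as the confidence that the entity exists: an upper capsule whose inputs agree ends up with a longer vector than one whose inputs cancel out.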
RNNs and LSTMs (Long Short-Term Memory networks) are limited to processing one-dimensional sequential data, such as text, and cannot be extended directly to images.
Multidimensional Recurrent Neural Networks (MDRNNs) replace the single recurrent connection of a standard recurrent neural network with as many recurrent connections as there are dimensions in the data.
During the forward pass, at each point in the data sequence, the hidden layer of the network receives an external input together with its own activations from one step back along each dimension.
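The forward pass described above can be sketched for the two-dimensional case. This is a deliberately simplified illustration: the hidden state is a single scalar per position, the scan runs in one direction only (top-left to bottom-right), and the weights `w_in`, `w_up`, `w_left` and `bias` are hypothetical fixed values rather than learned parameters.

```python
import math


def mdrnn_forward_2d(x, w_in=0.5, w_up=0.3, w_left=0.3, bias=0.0):
    """Forward pass of a toy 2D recurrent layer over a list-of-lists input.

    At position (i, j) the hidden state depends on the external input
    x[i][j] and on the hidden states one step back along each dimension:
    h[i-1][j] (previous row) and h[i][j-1] (previous column).
    """
    rows, cols = len(x), len(x[0])
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            h_up = h[i - 1][j] if i > 0 else 0.0      # zero state outside the image
            h_left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = math.tanh(w_in * x[i][j]
                                + w_up * h_up
                                + w_left * h_left
                                + bias)
    return h


# A single non-zero pixel in the top-left corner influences every
# later position through the two recurrent connections.
hidden = mdrnn_forward_2d([[1.0, 0.0],
                           [0.0, 0.0]])
```

A practical multidimensional layer would use vector-valued states, LSTM-style gating and several scan directions; the sketch only shows how context flows along both axes.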
The main problem in a recognition system is converting a two-dimensional image into a one-dimensional label sequence. This is done by passing the input data through a hierarchy of MDRNN layers. Choosing the block heights appropriately collapses the 2D image gradually into a 1D sequence, which the output layer can then label.
Multidimensional recurrent neural networks are designed to make the model robust to local distortions along any combination of input dimensions, such as image rotation and shearing, ambiguous strokes and variations between handwriting styles, while allowing it to model multidimensional context flexibly.
Connectionist Temporal Classification (CTC) is an algorithm for tasks such as speech recognition and handwriting recognition, in which the entire input is mapped to an output class or text sequence.
Traditional recognition methods map images to the corresponding text, but we do not know how patches of the image align with characters. CTC sidesteps this problem: it does not need to know how specific parts of the speech audio or handwritten image align with specific characters.
The input to this algorithm is a vector representation of an image of handwritten text. There is no direct alignment between image pixel representation and character sequence. CTC aims to find this mapping by summing the probabilities of all possible alignments between them.
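Summing over all alignments is usually done with dynamic programming rather than by enumeration. The sketch below is a plain-Python version of the standard CTC forward recursion; the function name, the convention that index 0 is the blank symbol, and the list-of-lists probability layout are assumptions made for illustration. It works on raw probabilities for readability, whereas practical implementations work in log space to avoid underflow.

```python
def ctc_probability(probs, label, blank=0):
    """Total probability of all frame-level alignments that collapse to `label`.

    probs[t][k] is the probability of symbol k at time step t.
    label is a list of symbol indices (without blanks).
    """
    # Interleave blanks around the label: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for s in label:
        ext += [s, blank]
    n_steps, n_states = len(probs), len(ext)

    # alpha[s] = probability of all partial alignments ending in state s
    alpha = [0.0] * n_states
    alpha[0] = probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, n_steps):
        new_alpha = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skipping the blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)
```

During training, the negative logarithm of this quantity for the ground-truth text serves as the loss, so no character-level alignment has to be annotated.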
Models trained with CTC typically use a recurrent neural network to estimate the probabilities at each time step, because recurrent neural networks take the context in the input into account. The network outputs a character score for each sequence element, represented as a matrix.
For decoding we can use:
Best path decoding: predicts the text by concatenating the most likely character at each time step, which gives the best path. Repeated characters and blanks are then removed to produce the decoded text.
Beam search decoding: keeps the several output paths with the highest probability. Paths with lower probability are discarded to keep the beam size constant. The results are generally more accurate, and this method is often combined with a language model to give meaningful results.
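Both decoding strategies can be sketched in a few lines of plain Python. The function names, the convention that index 0 of `alphabet` stands for the CTC blank, and the default beam width are assumptions of this sketch; the beam search shown is a basic prefix search without a language model.

```python
from collections import defaultdict


def best_path_decode(probs, alphabet, blank=0):
    """Take the most likely symbol per frame, then merge repeats and drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for k in best:
        if k != prev and k != blank:
            out.append(alphabet[k])
        prev = k
    return "".join(out)


def beam_search_decode(probs, alphabet, beam_width=3, blank=0):
    """Basic CTC prefix beam search (no language model)."""
    # prefix -> (prob. of paths ending in blank, prob. of paths ending in a symbol)
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for k, p in enumerate(frame):
                if k == blank:
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == k:
                    # A repeated symbol extends the prefix only after a blank;
                    # otherwise it collapses into the same prefix.
                    next_beams[prefix + (k,)][1] += p_b * p
                    next_beams[prefix][1] += p_nb * p
                else:
                    next_beams[prefix + (k,)][1] += (p_b + p_nb) * p
        # Keep only the most probable prefixes so the beam size stays constant.
        ranked = sorted(next_beams.items(),
                        key=lambda item: item[1][0] + item[1][1],
                        reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best = max(beams.items(), key=lambda item: item[1][0] + item[1][1])[0]
    return "".join(alphabet[k] for k in best)


# Toy example where the two decoders disagree: the blank wins every frame,
# but the alignments that collapse to "a" are more probable in total.
frames = [[0.6, 0.4], [0.6, 0.4]]
greedy_text = best_path_decode(frames, "-a")
beam_text = beam_search_decode(frames, "-a", beam_width=2)
```

Best path decoding looks at a single alignment, while beam search accumulates probability over the different alignments of each candidate text, which is why it can return a different and more probable result.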
The Transformer model takes a different approach and uses self-attention to attend to the entire sequence. With it, a non-recurrent method for handwriting recognition can be implemented.
The Transformer model combines multi-head self-attention layers at both the visual stage and the text stage, which lets it learn the language-model-like dependencies of the character sequence to be decoded. Because language knowledge is embedded in the model itself, no additional post-processing step with a language model is needed. It is also well suited to predicting outputs that are not part of the vocabulary.
This architecture has two parts:
Text transcriber, which outputs the decoded characters by jointly attending to visual and language-related features.
Visual feature encoder, designed to extract relevant information from handwritten text images by focusing on various character positions and their contextual information.
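Both parts are built on the same attention operation. The sketch below shows single-head scaled dot-product attention in plain Python; it omits the learned query, key and value projections and the multi-head split, and the function names are hypothetical. Passing the same sequence as queries, keys and values gives self-attention; passing character queries with visual keys and values corresponds to a transcriber attending to encoder features.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    Each query is compared with every key; the resulting weights mix the
    values, so every output position can draw on the whole sequence.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


# Self-attention: the sequence attends to itself (toy 3-step sequence).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = attention(sequence, sequence, sequence)
```

Because every position is processed against all others in one step, no recurrence over time steps is needed.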
Training handwriting recognition systems is often hampered by the scarcity of training data. To address this problem, this method uses pre-trained feature vectors of text as a starting point. State-of-the-art models use attention mechanisms in conjunction with RNNs to focus on the useful features at each time step.
The complete model architecture can be divided into four stages: normalizing the input text image, encoding the normalized image into a 2D visual feature map, decoding with a bidirectional LSTM for sequential modeling, and converting the decoder's output vector of contextual information into words.
This is a method for end-to-end handwriting recognition using an attention mechanism. It scans the entire page at once, so it does not rely on segmenting the text into characters or lines beforehand. The method uses a multidimensional LSTM (MDLSTM) architecture as a feature extractor, similar to the one described above. The only difference is the last layer, where the extracted feature maps are collapsed vertically and a softmax activation function is applied to identify the corresponding text.
The attention model used here is a hybrid of content-based attention and location-based attention. The decoder LSTM module takes the previous state, the attention map and the encoder features, and generates the final output character and the state vector for the next prediction.
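One way to picture a hybrid of content-based and location-based attention is to add two scores for each encoder position: one that compares the decoder state with the feature at that position, and one that looks at where the previous attention map was concentrated. The sketch below is a loose illustration under that assumption; the function name, the dot-product content score, the fixed smoothing `kernel` and `loc_scale` are all hypothetical and are not taken from the method described here.

```python
import math


def hybrid_attention(state, features, prev_weights,
                     kernel=(0.25, 0.5, 0.25), loc_scale=1.0):
    """Combine a content score and a location score into attention weights.

    state:        decoder state vector (same size as each feature vector)
    features:     list of encoder feature vectors, one per position
    prev_weights: attention weights from the previous decoding step
    """
    n = len(features)
    half = len(kernel) // 2
    scores = []
    for i in range(n):
        # Content-based term: similarity between the state and feature i.
        content = sum(s * f for s, f in zip(state, features[i]))
        # Location-based term: previous attention smoothed around position i.
        location = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < n:
                location += w * prev_weights[j]
        scores.append(content + loc_scale * location)

    # Normalize the scores into an attention map.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context vector is the attention-weighted sum of the features.
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(len(features[0]))]
    return weights, context
```

The location term lets the decoder prefer positions near where it attended in the previous step, which helps when several regions of the image look alike.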
This is a sequence-to-sequence model for handwritten text recognition based on the attention mechanism. The architecture contains three main parts:
Recurrent neural networks are most suitable for the temporal characteristics of the text. When paired with such a recurrent architecture, the attention mechanism plays a crucial role in focusing on the right features at each time step.
Synthetic handwriting generation can produce realistic handwritten text, which can be used to augment existing datasets.
Deep learning models require large amounts of data to train, and obtaining a large corpus of annotated handwritten images in different languages is a tedious task. We can solve this problem by using generative adversarial networks to generate training data.
ScrabbleGAN is a semi-supervised method for synthesizing handwritten text images. It relies on a generative model that can generate arbitrary-length word images using a fully convolutional network.