
A case study of using bidirectional LSTM model for text classification



A bidirectional LSTM is a recurrent neural network that reads a sequence in both the forward and backward directions, which makes it well suited to text classification. Below is a simple example demonstrating how to use a bidirectional LSTM for a text classification task.

First, we need to import the required libraries and modules:

import os  
import numpy as np  
from keras.preprocessing.text import Tokenizer  
from keras.preprocessing.sequence import pad_sequences  
from keras.models import Sequential  
from keras.layers import Dense, Embedding, Bidirectional, LSTM  
from sklearn.model_selection import train_test_split

Next, we need to prepare the dataset. Here we assume the dataset already exists at the specified path and contains three files: train.txt, dev.txt, and test.txt. Each line of each file holds a label and the corresponding text, separated by a tab. We can load the dataset with the following code:

def load_imdb_data(path):  
    assert os.path.exists(path)  
  
    def read_split(filename):  
        # Each line is "<label>\t<sentence>"; store as (sentence, label) pairs  
        dataset = []  
        with open(os.path.join(path, filename), "r") as fr:  
            for line in fr:  
                sentence_label, sentence = line.strip().lower().split("\t", maxsplit=1)  
                dataset.append((sentence, sentence_label))  
        return dataset  
  
    return read_split("train.txt"), read_split("dev.txt"), read_split("test.txt")

After loading the dataset, we can preprocess the text and turn it into integer sequences. Here we use Tokenizer to build a vocabulary over the training texts and map each word to an index, and then pad every index sequence to the same length so that it can be fed to the LSTM model.

max_features = 20000  
maxlen = 80  # cut texts after this number of words (among top max_features most common words)  
batch_size = 32  
  
print('Tokenize, pad & split data into training set and dev set')  
x_train = [sent for sent, label in trainset]  
y_train = [label for sent, label in trainset]  
x_dev = [sent for sent, label in devset]  
y_dev = [label for sent, label in devset]  
  
# Build the vocabulary on the training texts and map words to integer indices  
tokenizer = Tokenizer(num_words=max_features)  
tokenizer.fit_on_texts(x_train)  
x_train = pad_sequences(tokenizer.texts_to_sequences(x_train), maxlen=maxlen)  
x_dev = pad_sequences(tokenizer.texts_to_sequences(x_dev), maxlen=maxlen)  
  
# Labels are read from the files as strings; cast them to integers (assumed 0/1)  
y_train = np.array(y_train, dtype='int32')  
y_dev = np.array(y_dev, dtype='int32')

Next, we can build the bidirectional LSTM model. Here the Bidirectional wrapper runs two LSTM layers over the input, one reading the sequence forward and one reading it backward. The final outputs of the two directions are concatenated, giving a more expressive vector representation of the text. Finally, we use a fully connected layer for classification.

print('Build model...')  
model = Sequential()  
model.add(Embedding(max_features, 128, input_length=maxlen))  
# Bidirectional(LSTM(64)) yields a 128-dimensional vector: the concatenation  
# of the final forward and backward hidden states  
model.add(Bidirectional(LSTM(64)))  
model.add(Dense(1, activation='sigmoid'))  
  
print('Compile model...')  
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
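At this point it can be useful to inspect the architecture; Keras's model.summary() prints each layer with its output shape and parameter count:

model.summary()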

Now we can train the model. We will use the dev set as validation data so we can monitor for overfitting during training.

epochs = 10  
batch_size = 64  
  
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(x_dev, y_dev))
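Optionally, to guard against overfitting more actively than just watching the validation metrics, Keras provides an EarlyStopping callback. A minimal sketch (the patience value of 2 is an arbitrary choice for illustration):

from keras.callbacks import EarlyStopping  
  
# Stop once val_loss has not improved for 2 consecutive epochs, keeping the best weights  
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)  
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,  
                    validation_data=(x_dev, y_dev), callbacks=[early_stop])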

After training is completed, we can evaluate the model's performance on the test set.
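The test split must first be converted with the same tokenizer that was fitted on the training data, so that its word indices match the training vocabulary:

# Preprocess the test split exactly like the training and dev data  
x_test = pad_sequences(tokenizer.texts_to_sequences([sent for sent, label in testset]), maxlen=maxlen)  
y_test = np.array([label for sent, label in testset], dtype='int32')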

test_loss, test_acc = model.evaluate(x_test, y_test)  
print('Test accuracy:', test_acc)
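As a quick sanity check, the same tokenizer can also be used to classify a brand-new sentence. The review text below is made up for illustration:

sample = ["this movie was a wonderful surprise"]  # hypothetical review  
sample_seq = pad_sequences(tokenizer.texts_to_sequences(sample), maxlen=maxlen)  
print('Positive probability:', model.predict(sample_seq)[0][0])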

The above is a simple text classification example using a bidirectional LSTM model. You can also try adjusting the model's hyperparameters, such as the number of layers, the number of units per layer, or the optimizer, to get better performance. Or initialize the embedding layer with pre-trained word embeddings (such as Word2Vec or GloVe) to capture more semantic information.
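As a sketch of the pre-trained embedding idea, the snippet below builds an embedding matrix from a GloVe file and uses it as fixed weights in the Embedding layer. The file glove.6B.100d.txt is the standard 100-dimensional GloVe download; the local path is an assumption:

embedding_dim = 100  # must match the dimensionality of the GloVe file below  
  
# Parse GloVe vectors: each line is "word v1 v2 ... v100"  
embeddings_index = {}  
with open("glove.6B.100d.txt", "r", encoding="utf-8") as f:  # path is an assumption  
    for line in f:  
        values = line.split()  
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')  
  
# Rows follow the tokenizer's word indices; words without a GloVe vector stay zero  
embedding_matrix = np.zeros((max_features, embedding_dim))  
for word, i in tokenizer.word_index.items():  
    if i < max_features and word in embeddings_index:  
        embedding_matrix[i] = embeddings_index[word]  
  
# Drop-in replacement for the Embedding layer in the model above  
embedding_layer = Embedding(max_features, embedding_dim,  
                            weights=[embedding_matrix],  
                            input_length=maxlen, trainable=False)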

