A case study of using bidirectional LSTM model for text classification
A bidirectional LSTM is a recurrent neural network that reads a sequence in both directions, which makes it a common choice for text classification. Below is a simple example demonstrating how to use a bidirectional LSTM for a text classification task.
First, we need to import the required libraries and modules:
import os

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Bidirectional, LSTM
from sklearn.model_selection import train_test_split
Next, we need to prepare the dataset. Here we assume that the dataset already exists at the specified path and contains three files: train.txt, dev.txt and test.txt. Each line of a file holds a label and a text, separated by a tab. We can load the dataset using the following code:
def load_imdb_data(path):
    """Load the train/dev/test splits; each line is "label<TAB>sentence"."""
    assert os.path.exists(path)

    def read_split(filename):
        # Parse one file into a list of (sentence, label) pairs
        dataset = []
        with open(os.path.join(path, filename), "r", encoding="utf-8") as fr:
            for line in fr:
                sentence_label, sentence = line.strip().lower().split("\t", maxsplit=1)
                dataset.append((sentence, sentence_label))
        return dataset

    return read_split("train.txt"), read_split("dev.txt"), read_split("test.txt")
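To make the expected file format concrete, the sketch below parses a couple of in-memory lines the same way the loader does. The helper name `parse_line` and the sample lines are made up for illustration, and the "0"/"1" label values are an assumption, since the article does not show the actual files.

```python
def parse_line(line):
    """Split one 'label<TAB>sentence' line into (sentence, label).

    Hypothetical helper that mirrors the parsing step in load_imdb_data.
    """
    # maxsplit=1 splits only at the first tab, so tabs inside the text survive
    label, sentence = line.strip().lower().split("\t", maxsplit=1)
    return sentence, label


# Made-up sample lines; real files may use different label values
sample_lines = [
    "1\tA wonderful, moving film.\n",
    "0\tDull plot\tand flat acting.\n",
]

parsed = [parse_line(line) for line in sample_lines]
for sentence, label in parsed:
    print(repr(label), repr(sentence))
```

Note that the label comes back as a string, so it still has to be converted to a number before it can be used as a training target.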
After loading the dataset, we can tokenize the text and convert it to index sequences. Here we use Tokenizer to map each word to an integer index, and then pad (or truncate) every index sequence to the same length so that it can be fed to the LSTM model.
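max_features = 20000  # vocabulary size: keep only the most frequent words
maxlen = 80           # pad or cut every text to this number of words
batch_size = 32

# data_dir is a placeholder; point it at the directory holding the three files
data_dir = "data"
trainset, devset, testset = load_imdb_data(data_dir)

print('Tokenize, pad & split data into training, dev and test sets')
# Build the vocabulary from the training texts only
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts([sent for sent, _ in trainset])

def vectorize(dataset):
    """Turn (sentence, label) pairs into padded index sequences and a label array."""
    sents, labels = zip(*dataset)
    x = pad_sequences(tokenizer.texts_to_sequences(sents), maxlen=maxlen)
    # Assumes labels are stored as "0"/"1"; map other label strings to 0/1 first
    y = np.array([int(label) for label in labels])
    return x, y

x_train, y_train = vectorize(trainset)
x_dev, y_dev = vectorize(devset)
x_test, y_test = vectorize(testset)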
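The two preprocessing steps described above (word-to-index mapping, then padding to a fixed length) can be sketched in plain Python. This is not the Keras implementation: the helper names `build_word_index` and `texts_to_padded` are hypothetical, the tokenization is a bare whitespace split, and the example sentences are made up. It does follow the Keras defaults of reserving index 0 for padding and of padding and truncating at the front of the sequence.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices by descending frequency, starting at 1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to index lists of exactly maxlen entries."""
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-maxlen:]                            # truncate from the front
        rows.append([0] * (maxlen - len(ids)) + ids)   # pad at the front
    return rows


# Made-up example sentences
texts = ["the film was good", "the plot was thin and the acting was flat"]
word_index = build_word_index(texts)
padded = texts_to_padded(texts, word_index, maxlen=6)

for row in padded:
    print(row)
```

Padding at the front keeps the real words at the end of the sequence, closest to the last LSTM step that produces the forward output.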
Next, we can build a bidirectional LSTM model. The Bidirectional wrapper runs two LSTMs over the embedded sequence, one reading it forward and one reading it backward. The final outputs of the two LSTMs (64 units each) are concatenated into a single 128-dimensional vector that represents the text with context from both directions. Finally, we use a fully connected layer with a sigmoid activation for binary classification.
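print('Build model...')
model = Sequential()
# Map each word index to a 128-dimensional vector
model.add(Embedding(max_features, 128, input_length=maxlen))
# Forward and backward LSTMs with 64 units each; their outputs are concatenated
model.add(Bidirectional(LSTM(64)))
# Single sigmoid unit for binary classification
model.add(Dense(1, activation='sigmoid'))

print('Compile model...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])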
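To illustrate what the forward/backward concatenation amounts to, here is a minimal NumPy sketch of a bidirectional LSTM that returns only the final hidden states. It is a simplified illustration rather than the Keras implementation: the weights are random, the input is random noise standing in for an embedded sentence, and the function names `lstm_last_state` and `init_params` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_last_state(inputs, W, U, b, units):
    """Run one LSTM over inputs of shape (timesteps, features).

    Returns the final hidden state, a vector of length `units`.
    """
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in inputs:
        z = x_t @ W + h @ U + b            # all four gates at once, shape (4 * units,)
        i, f, g, o = np.split(z, 4)        # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                  # update the cell state
        h = o * np.tanh(c)                 # update the hidden state
    return h


def init_params(rng, features, units):
    """Random weights for illustration only; a real model learns these."""
    W = rng.normal(scale=0.1, size=(features, 4 * units))
    U = rng.normal(scale=0.1, size=(units, 4 * units))
    b = np.zeros(4 * units)
    return W, U, b


rng = np.random.default_rng(0)
timesteps, features, units = 80, 128, 64
sequence = rng.normal(size=(timesteps, features))  # stand-in for an embedded text

fwd_params = init_params(rng, features, units)
bwd_params = init_params(rng, features, units)

h_forward = lstm_last_state(sequence, *fwd_params, units)
h_backward = lstm_last_state(sequence[::-1], *bwd_params, units)  # reversed input
text_vector = np.concatenate([h_forward, h_backward])

print(text_vector.shape)
```

The two directions use separate weights, and the concatenated vector has twice as many entries as each LSTM has units, which is the input size the final Dense layer sees.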
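Now, we can train the model. We use the dev set as validation data so that we can watch for overfitting during training: if the validation loss starts rising while the training loss keeps falling, the model is beginning to memorize the training set.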
epochs = 10
batch_size = 64

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_dev, y_dev))
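One way to use the validation data is to look for the epoch with the lowest validation loss. The sketch below does this on a dictionary shaped like `history.history`; the numbers are made up purely to show the pattern and are not results from this model, and the helper name `best_epoch` is hypothetical.

```python
# Made-up numbers shaped like `history.history`; NOT results from this model
example_history = {
    "loss":     [0.62, 0.41, 0.29, 0.20, 0.13],
    "val_loss": [0.55, 0.44, 0.42, 0.47, 0.56],
}


def best_epoch(history_dict):
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history_dict["val_loss"]
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1


epoch = best_epoch(example_history)
print(f"Lowest validation loss at epoch {epoch}")
```

In this made-up run the training loss keeps falling while the validation loss turns upward after the third epoch, which is the usual sign of overfitting. Keras also provides an `EarlyStopping` callback that can stop training automatically based on a monitored value such as `val_loss`.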
After training is completed, we can evaluate the model's performance on the test set.
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)
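For a single sigmoid output, the accuracy metric boils down to thresholding the predicted probabilities at 0.5 and comparing them with the true labels. The NumPy sketch below shows that computation on made-up labels and probabilities; the function name `binary_accuracy` is chosen here for illustration.

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0]
y_prob = [0.91, 0.20, 0.45, 0.70, 0.62]

acc = binary_accuracy(y_true, y_prob)
print("Accuracy:", acc)
```

Here the third and fifth examples fall on the wrong side of the threshold, so three of the five predictions are correct.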
The above is a simple text classification example using a bidirectional LSTM model. You can also try adjusting the model's hyperparameters, such as the number of layers, the number of units per layer, or the optimizer, to get better performance. Another option is to initialize the embedding layer with pre-trained word embeddings (such as Word2Vec or GloVe) instead of learning it from scratch, so that the model starts with more semantic information.
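As a rough sketch of the pre-trained embedding idea, the code below builds an embedding matrix whose row i holds the vector of the word with index i. The word index and the three-dimensional vectors are tiny made-up stand-ins; in practice the index would come from the tokenizer and the vectors from a downloaded Word2Vec or GloVe file. The function name `build_embedding_matrix` is hypothetical.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, max_features, dim):
    """Return a (max_features, dim) matrix; row i is the vector for word index i.

    Row 0 (padding) and words without a pre-trained vector stay all zeros.
    """
    matrix = np.zeros((max_features, dim))
    for word, idx in word_index.items():
        if idx < max_features and word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Made-up stand-ins for a tokenizer's word index and a pre-trained vector file
word_index = {"the": 1, "film": 2, "good": 3, "zyzzyva": 4}
vectors = {
    "the": np.array([0.1, 0.2, 0.3]),
    "film": np.array([0.4, 0.5, 0.6]),
    "good": np.array([0.7, 0.8, 0.9]),
}

embedding_matrix = build_embedding_matrix(word_index, vectors, max_features=5, dim=3)
print(embedding_matrix)
```

The resulting matrix can then be used as the initial weights of the Embedding layer, for example through a constant initializer, with the layer either frozen or fine-tuned during training.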