From Decision Tree to Transformer: Comparison of Sentiment Analysis Models for Restaurant Reviews
Translator | Zhu Xianzhong
Reviewer | Sun Shujuan
This article demonstrates the effectiveness of various popular machine learning models and embedding techniques for sentiment analysis of Macedonian restaurant reviews. It explores and compares several classic machine learning models as well as modern deep learning techniques, including neural networks and Transformers. Experiments show that fine-tuned Transformer models and deep learning models using the latest OpenAI embeddings perform far better than the other methods.
Machine learning models for natural language processing have traditionally focused on popular languages, while far less research and application work exists for less commonly used languages. On the other hand, with the rise of e-commerce during the COVID-19 epidemic, less common languages such as Macedonian have also generated a large amount of data through online reviews. This provides an opportunity to develop and train machine learning models for sentiment analysis of Macedonian restaurant reviews. If successful, such models could help businesses better understand customer sentiment and improve their services. In this study, we address the challenges posed by this problem and explore and compare various sentiment analysis models, ranging from classic random forests to modern deep learning techniques and Transformers.
First, an outline of this article:
Machine learning models
Random Forest
Results and discussion

For languages written in Cyrillic script, users on the Internet often write in Latin script instead, which produces mixed data containing both Latin and Cyrillic text and creates an additional challenge. To address this, I used a dataset from a local restaurant with approximately 500 reviews that contains both Latin and Cyrillic scripts. The dataset also includes a small set of English comments, which helps evaluate performance on mixed data. In addition, online text may contain symbols, such as emoticons, that need to be removed. Preprocessing is therefore a crucial step before any text embedding is performed.
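To get a feel for the mixed Latin and Cyrillic data described above, it can help to check which script each review is written in before transliterating anything. The following is a minimal sketch using only the Python standard library; the helper names `script_counts` and `dominant_script` and the sample strings are hypothetical and are not part of the original dataset or code.

```python
import unicodedata


def script_counts(text):
    """Count Cyrillic and Latin letters in a string, ignoring everything else."""
    counts = {"cyrillic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, emojis, and whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    return counts


def dominant_script(text):
    """Return 'cyrillic', 'latin', 'mixed', or 'none' for one review."""
    c = script_counts(text)
    if c["cyrillic"] == 0 and c["latin"] == 0:
        return "none"
    if c["cyrillic"] > 0 and c["latin"] > 0:
        return "mixed"
    return "cyrillic" if c["cyrillic"] > 0 else "latin"


# Hypothetical example reviews (not taken from the dataset)
samples = ["Одлична храна", "Odlicna hrana", "Одлична hrana!", "12345"]
labels = [dominant_script(s) for s in samples]
print(labels)
```

Applied to a whole review column, a tally of these labels gives a rough picture of how much of the data is Cyrillic, Latin, or mixed.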
Language is a uniquely human communication tool, and computers cannot interpret it without appropriate processing techniques. For machines to analyze and understand language, we need to represent complex semantic and lexical information in a form that can be processed computationally. A popular way to achieve this is to use vector representations. In recent years, in addition to language-specific representation models, multilingual models have emerged. These models can capture the semantic context of text across a wide range of languages.
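As a small illustration of why vector representations are useful, texts with similar meaning should end up with vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses hand-written three-dimensional toy vectors; they are hypothetical and far smaller than real sentence embeddings, which typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 by convention here
    return dot / (norm_u * norm_v)


# Toy "embeddings" (hypothetical values, chosen only for illustration)
good_food = [0.9, 0.1, 0.2]
tasty_meal = [0.8, 0.2, 0.1]
slow_service = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(good_food, tasty_meal)
sim_unrelated = cosine_similarity(good_food, slow_service)
print(sim_related > sim_unrelated)
```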
import pandas as pd
import numpy as np

# Load the tab-separated review data
df = pd.read_csv('/content/data.tsv', sep='\t')

# Note the distribution of the sentiment classes
df['sentiment'].value_counts()
# -------
# 0 337
# 1 322
# Name: sentiment, dtype: int64
Notice that the dataset contains an almost equal distribution of positive and negative classes. To remove emojis, I used the Python library emoji, which can easily strip emojis and other symbols.
!pip install emoji
import emoji

# Strip emojis from every comment
clt = []
for comm in df['comment'].to_numpy():
    clt.append(emoji.replace_emoji(comm, replace=""))
df['comment'] = clt
df.head()

To handle the mix of Cyrillic and Latin text, I converted all the text to one script and to the other, so that the machine learning models could be tested on both and their performance compared. I used the cyrtranslit library for this task. It supports most Cyrillic alphabets, such as Macedonian, Bulgarian, and Ukrainian.
import cyrtranslit

# Produce a Latin and a Cyrillic version of every comment
latin = []
cyrillic = []
for comm in df['comment'].to_numpy():
    latin.append(cyrtranslit.to_latin(comm, "mk"))
    cyrillic.append(cyrtranslit.to_cyrillic(comm, "mk"))
df['comment_cyrillic'] = cyrillic
df['comment_latin'] = latin
df.head()
Figure 1: Conversion output
For the embedding models I use, removing punctuation and stop words, along with other text cleaning, is generally not necessary. These models are designed to process natural language text, including punctuation, and can often capture the meaning of a sentence more accurately when it is left intact. This completes the text preprocessing.
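LASER (Language-Agnostic Sentence Representations) is a language-agnostic approach for generating high-quality multilingual sentence embeddings. The LASER model is based on a two-stage process. The first stage preprocesses the text, including tokenization, lowercasing, and subword (SentencePiece) segmentation. This part is language-specific. The second stage uses a multi-layer bidirectional LSTM to map the preprocessed input text to a fixed-length embedding.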
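To illustrate the second stage, here is a minimal sketch of how a variable number of per-token vectors can be reduced to one fixed-length sentence vector. LASER's encoder is generally described as max-pooling over its BiLSTM outputs; that detail is background knowledge rather than something stated in this article. The tiny hand-written "token states" below are hypothetical stand-ins for those outputs, not actual LASER values.

```python
def max_pool(token_states):
    """Element-wise maximum over per-token vectors, giving one fixed-length vector."""
    if not token_states:
        raise ValueError("need at least one token vector")
    # zip(*...) groups the values of each dimension across all tokens
    return [max(column) for column in zip(*token_states)]


# Hypothetical encoder outputs: one 3-dimensional vector per token
short_sentence = [[0.1, 0.7, -0.2],
                  [0.4, 0.2, 0.3]]
long_sentence = [[0.5, -0.1, 0.0],
                 [0.2, 0.9, -0.4],
                 [-0.3, 0.1, 0.8],
                 [0.0, 0.0, 0.1]]

short_vec = max_pool(short_sentence)
long_vec = max_pool(long_sentence)
print(len(short_vec), len(long_vec))
```

Both sentences yield a vector of the same length regardless of how many tokens they contain, which is what allows a downstream classifier to take the embeddings as input.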
!pip install laserembeddings
!python -m laserembeddings download-models
from laserembeddings import Laser

# Create the embeddings
laser = Laser()
embeddings_c = laser.embed_sentences(df['comment_cyrillic'].to_numpy(), lang='mk')
embeddings_l = laser.embed_sentences(df['comment_latin'].to_numpy(), lang='mk')

# Save the embeddings
np.save('/content/laser_multi_c.npy', embeddings_c)
np.save('/content/laser_multi_l.npy', embeddings_l)
!pip install tensorflow_text
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tensorflow_text  # imported for its side effect of registering required ops

# Load the MUSE model
module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3"
embed = hub.load(module_url)

# Embed the Cyrillic comments
sentences = df['comment_cyrillic'].to_numpy()
muse_c = embed(sentences)
muse_c = np.array(muse_c)

# Embed the Latin comments
sentences = df['comment_latin'].to_numpy()
muse_l = embed(sentences)
muse_l = np.array(muse_l)

# Save the embeddings
np.save('/content/muse_c.npy', muse_c)
np.save('/content/muse_l.npy', muse_l)
# openai.Embedding.create is the pre-1.0 interface of the openai package
!pip install "openai<1.0"
import openai

openai.api_key = 'YOUR_KEY_HERE'

# Request embeddings for the Cyrillic and Latin comments
embeds_c = openai.Embedding.create(input=df['comment_cyrillic'].to_numpy().tolist(), model='text-embedding-ada-002')['data']
embeds_l = openai.Embedding.create(input=df['comment_latin'].to_numpy().tolist(), model='text-embedding-ada-002')['data']

# Collect the embedding vectors into arrays
full_arr_c = np.array([e['embedding'] for e in embeds_c])
full_arr_l = np.array([e['embedding'] for e in embeds_l])

# Save the embeddings
np.save('/content/openai_ada_c.npy', full_arr_c)
np.save('/content/openai_ada_l.npy', full_arr_l)
from sklearn.model_selection import train_test_split

# Hold out 20% of the LASER Cyrillic embeddings for testing
X_train, X_test, y_train, y_test = train_test_split(
    embeddings_c, df['sentiment'], test_size=0.2, random_state=42
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Random forest with 100 trees
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
print(classification_report(y_test, rfc.predict(X_test)))
print(confusion_matrix(y_test, rfc.predict(X_test)))
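Each classifier in this article is scored with classification_report and confusion_matrix. As a reminder of what those numbers mean for a two-class problem, the following sketch computes the confusion matrix, precision, recall, F1, and accuracy by hand for the positive class. The helper `binary_metrics` and the label lists are hypothetical and are not results from the article's dataset.

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix and positive-class metrics for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        # Rows are true labels, columns are predicted labels (same layout as scikit-learn)
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical labels and predictions, for illustration only
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```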
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Gradient-boosted trees with a maximum depth of 15
xgb = XGBClassifier(max_depth=15)
xgb.fit(X_train, y_train)
print(classification_report(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, xgb.predict(X_test)))
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Support vector classifier with default settings
svc = SVC()
svc.fit(X_train, y_train)
print(classification_report(y_test, svc.predict(X_test)))
print(confusion_matrix(y_test, svc.predict(X_test)))
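Because the Cyrillic and Latin versions of each embedding are saved separately, the same classifier can be scored on each of them in a loop. The sketch below is one possible way to do that. It assumes stratified 5-fold cross-validation instead of the single 80/20 split used in this article (a swapped-in evaluation choice, not the author's procedure), and it uses synthetic arrays in place of the saved .npy files. `compare_embeddings` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def compare_embeddings(embedding_sets, labels, n_splits=5, seed=42):
    """Return the mean cross-validated accuracy for each named embedding matrix."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, X in embedding_sets.items():
        scores = cross_val_score(SVC(), X, labels, cv=cv, scoring="accuracy")
        results[name] = float(scores.mean())
    return results


# Synthetic stand-ins for real embeddings (hypothetical data, not real results)
rng = np.random.default_rng(0)
labels = np.array([0, 1] * 50)
informative = rng.normal(size=(100, 8)) + labels[:, None] * 3.0  # classes far apart
noise = rng.normal(size=(100, 8))                                # no class signal

results = compare_embeddings({"informative": informative, "noise": noise}, labels)
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

With real data, the dictionary would instead hold arrays loaded with np.load from files such as '/content/laser_multi_c.npy' and '/content/laser_multi_l.npy', and the labels would come from the sentiment column.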
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Small feed-forward network on top of 1024-dimensional embeddings
model = keras.Sequential()
model.add(keras.layers.Dense(256, activation='relu', input_shape=(1024,)))
model.add(keras.layers.Dropout(0.2))
model.add(keras.layers.Dense(128, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train, using the test set for validation
history = model.fit(X_train, y_train, epochs=11, validation_data=(X_test, y_test))
test_loss, test_acc = model.evaluate(X_test, y_test)
print('Test accuracy:', test_acc)

# Round the sigmoid outputs to 0/1 labels for the reports
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred.round()))
print(confusion_matrix(y_test, y_pred.round()))
import matplotlib.pyplot as plt

def plot_accuracy(history):
    # Plot training and validation accuracy per epoch
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(['Train', 'Validation'], loc='upper left')
    plt.show()

# Plot the curves for the network trained above
plot_accuracy(history)
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from sklearn.metrics import classification_report, confusion_matrix

# Create CSV files for the training and test sets, to be loaded as a dataset
df.rename(columns={"sentiment": "label"}, inplace=True)
train, test = train_test_split(df, test_size=0.2)
pd.DataFrame(train).to_csv('train.csv', index=False)
pd.DataFrame(test).to_csv('test.csv', index=False)

# Load the dataset
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

# Tokenize the text
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-uncased')
encoded_dataset = dataset.map(
    lambda t: tokenizer(t['comment_cyrillic'], truncation=True),
    batched=True,
    load_from_cache_file=False,
)

# Load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-multilingual-uncased', num_labels=2
)

# Fine-tune the model
arg = TrainingArguments(
    "mbert-sentiment-mk",
    learning_rate=5e-5,
    num_train_epochs=5,
    per_device_eval_batch_size=8,
    per_device_train_batch_size=8,
    seed=42,
    push_to_hub=True,  # requires being logged in to the Hugging Face Hub
)
trainer = Trainer(
    model=model,
    args=arg,
    tokenizer=tokenizer,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['test'],
)
trainer.train()

# Get the predictions
predictions = trainer.predict(encoded_dataset["test"])
preds = np.argmax(predictions.predictions, axis=-1)

# Evaluate
print(classification_report(predictions.label_ids, preds))
print(confusion_matrix(predictions.label_ids, preds))
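The final step of the Transformer code turns the model's raw output scores (logits) into class labels with argmax. When a confidence value is also wanted, the logits can first be passed through a softmax. The sketch below shows this for a single two-class prediction using only the standard library; `predict_label` and the example logits are hypothetical and are not output from the fine-tuned model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (predicted class index, probability of that class)."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# Hypothetical logits for one review: [negative score, positive score]
example_logits = [-1.2, 2.3]
label, confidence = predict_label(example_logits)
print(label, round(confidence, 3))
```

The predicted label is the same one argmax would give on the raw logits, since softmax preserves their ordering.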
In future work, it would be valuable to collect more data to further train and test the models, especially data with more diverse review topics and sources. Additionally, incorporating more features into the model, such as metadata (e.g., the reviewer's age, gender, and location) or temporal information (e.g., review time), may improve its accuracy. Finally, it would be interesting to extend the analysis to other less commonly used languages and compare the models' performance with that of the models trained on the Macedonian reviews.
Conclusion
This article demonstrates the effectiveness of various popular machine learning models and embedding techniques for sentiment analysis of Macedonian restaurant reviews. Several classic machine learning models, such as random forests and SVMs, are explored and compared, as well as modern deep learning techniques including neural networks and Transformers. The results show that fine-tuned Transformer models and deep learning models using the latest OpenAI embeddings outperform the other methods, with validation accuracy as high as 90%.
Translator Introduction
Zhu Xianzhong is a 51CTO community editor, 51CTO expert blogger, lecturer, computer teacher at a university in Weifang, and a veteran of the freelance programming world.
Original title: From Decision Trees to Transformers: Comparing Sentiment Analysis Models for Macedonian Restaurant Reviews, Author: Danilo Najkov