Getting Started with Natural Language Processing in Python
In today's world, a large amount of data is unstructured: text such as social media comments, browsing history, and customer feedback. Faced with this mass of text data, where do you even begin your analysis? Python's natural language processing (NLP) tools can help.
This tutorial guides you through the core concepts of NLP and shows how to analyze text data in Python. We will learn how to break text into smaller units (tokenization), normalize words to their root forms (stemming and lemmatization), and clean up documents in preparation for further analysis.
Let's get started!
This tutorial uses Python's NLTK library to perform all NLP operations on text. At the time of writing, we were using NLTK version 3.4. You can install the library using the pip command in the terminal:
<code class="language-bash">pip install nltk==3.4</code>
To check which version of NLTK is installed on your system, import the library into the Python interpreter and check the version:
<code class="language-python">import nltk print(nltk.__version__)</code>
In this tutorial, in order to perform certain operations in NLTK, you may need to download specific resources. We will describe each resource when needed.
However, if you want to avoid downloading resources one by one later in the tutorial, you can download them all at once now:
<code class="language-bash">python -m nltk.downloader all</code>
Computer systems cannot understand natural language on their own. The first step in processing natural language is to convert the raw text into tokens. A token is a contiguous sequence of characters that carries some meaning. How you break sentences into tokens is up to you. For example, a simple approach is to split a sentence on spaces to break it into individual words.
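For example, here is a minimal sketch of that naive whitespace approach on a made-up sample sentence; note how punctuation stays attached to the neighbouring words:
<code class="language-python"># Naive tokenization: split a sentence on whitespace only.
sample = "Hi, this is a nice hotel."  # hypothetical example sentence
print(sample.split())  # ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.']</code>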
In the NLTK library, you can use the word_tokenize() function to convert strings into tokens. However, you first need to download the punkt resource. Run the following command in the Python interpreter:
<code class="language-bash">nltk.download('punkt')</code>
Next, import word_tokenize from nltk.tokenize to use it:
<code class="language-python">from nltk.tokenize import word_tokenize print(word_tokenize("Hi, this is a nice hotel."))</code>The output of the
code is as follows:
<code class="language-bash">pip install nltk==3.4</code>
You will notice that word_tokenize not only splits the string on spaces, but also separates punctuation marks into their own tokens. Whether you keep or remove the punctuation depends on your analytical needs.
When dealing with natural language, you will often notice that the same word appears in various grammatical forms. For example, "go," "going," and "gone" are all forms of the same verb, "go".
While your project may need to preserve the various grammatical forms of a word, let's discuss how to convert different grammatical forms of the same word into its base form. There are two techniques you can use.
The first technique is stemming. Stemming is a simple algorithm that removes affixes from a word. NLTK offers a variety of stemming algorithms; in this tutorial, we will use the Porter algorithm.
We first import PorterStemmer from nltk.stem.porter. Next, we initialize a stemmer in the stemmer variable, and then use the .stem() method to find the stem of a word:
<code class="language-python">import nltk print(nltk.__version__)</code>
The output of the above code is "go". If you run the stemmer on the other forms of "go" listed above, you will notice that it returns the same stem, "go". However, because stemming is just a simple algorithm based on removing affixes, it fails for words that are used less frequently in the language.
For example, when you run the stemmer on the word "constitutes", it gives an unintuitive result:
<code class="language-bash">python -m nltk.downloader all</code>
You will notice that the output is "constitut".
This problem can be solved by using a more sophisticated approach that looks up the base form of a word in its given context. This process is called lemmatization. Lemmatization normalizes a word based on the context and vocabulary of the text. In NLTK, you can use the WordNetLemmatizer class to lemmatize sentences.
First, you need to download the wordnet resource from the NLTK downloader in the Python interpreter:
<code class="language-bash">nltk.download('punkt')</code>
After the download is complete, you need to import the WordNetLemmatizer
class and initialize it:
<code class="language-python">from nltk.tokenize import word_tokenize print(word_tokenize("Hi, this is a nice hotel."))</code>
To use the lemmatizer, call the .lemmatize() method. It accepts two arguments: the word and its context. In our example, we will use "v" (verb) as the context. We will explore the context further after viewing the output of the .lemmatize() method:
<code class="language-python">print(lem.lemmatize('constitutes', 'v'))</code>
You will notice that the .lemmatize() method correctly converts the word "constitutes" to its base form, "constitute". You will also notice that lemmatization takes longer than stemming because the algorithm is more complex.
Let's check how to programmatically determine the second argument of the .lemmatize() method. NLTK has a pos_tag() function that helps determine the context of a word in a sentence. However, you first need to download the averaged_perceptron_tagger resource:
<code class="language-bash">pip install nltk==3.4</code>
Next, import the pos_tag()
function and run it on the sentence:
<code class="language-python">import nltk print(nltk.__version__)</code>
You will notice that the output is a list of pairs. Each pair contains a token and its tag, which indicates the context of the token in the overall text. Notice that the tag for a punctuation mark is the punctuation mark itself:
<code class="language-bash">python -m nltk.downloader all</code>
How do you decode the context of each token? A complete list of all tags and their corresponding meanings is available on the web. Note that the tags of all nouns begin with "N" and the tags of all verbs begin with "V". We can use this information in the second argument of the .lemmatize() method:
<code class="language-bash">nltk.download('punkt')</code>
The output of the above code is as follows:
<code class="language-python">from nltk.tokenize import word_tokenize print(word_tokenize("Hi, this is a nice hotel."))</code>
This output is as expected: "constitutes" and "magistrates" have been converted to "constitute" and "magistrate" respectively.
The next step in preparing the data is to clean it up and remove anything that does not add meaning to your analysis. Specifically, we will look at how to remove punctuation marks and stop words from the analysis.
Removing punctuation marks is a fairly simple task. The string library's punctuation object contains all the punctuation marks in English:
<code class="language-python">import string

print(string.punctuation)</code>
The output of this code snippet is as follows:
<code class="language-python">from nltk.stem.porter import PorterStemmer stemmer = PorterStemmer() print(stemmer.stem("going"))</code>
To remove punctuation marks from the tokens, you can simply run:
<code class="language-python">print(stemmer.stem("constitutes"))</code>
Next, we will focus on how to remove stop words. Stop words are commonly used words in a language, such as "I", "a", and "the", which add little meaning when analyzing text. We will therefore remove stop words from our analysis. First, download the stopwords resource from the NLTK downloader:
<code class="language-bash">nltk.download('wordnet')</code>
After the download is complete, import stopwords from nltk.corpus and use the words() method with "english" as the argument. This is a list of 179 stop words in English:
<code class="language-python">from nltk.stem.wordnet import WordNetLemmatizer lem = WordNetLemmatizer()</code>
We can combine the lemmatization example with the concepts discussed in this section to create the following clean_data() function. Additionally, we will convert each word to lowercase before checking whether it is in the stop word list. This way, we can still catch a stop word when it appears capitalized at the beginning of a sentence:
<code class="language-python">print(lem.lemmatize('constitutes', 'v'))</code>
The output of this example is as follows:
<code class="language-bash">nltk.download('averaged_perceptron_tagger')</code>
As you can see, punctuation and stop words have been removed.
Now that you are familiar with the basic cleaning techniques in NLP, let's try to find the frequency of words in a text. For this exercise, we will use the text of the fairy tale "The Mouse, the Bird, and the Sausage", which is freely available on Project Gutenberg. We will store the text of this fairy tale in a string, text.
First, we tokenize text and then clean it up using the clean_data function defined above:
<code class="language-bash">pip install nltk==3.4</code>
To find the frequency distribution of words in the text, you can use the FreqDist class of NLTK. Initialize the class with the tokens as the argument, then use the .most_common() method to find the common terms. In this case, let's find the top ten terms:
<code class="language-python">import nltk print(nltk.__version__)</code>
This prints the ten most frequent terms in the fairy tale as a list of (word, count) pairs. As expected, the three most common terms are the three main characters of the story.
When analyzing text, raw word frequency may not be the most important measure. Typically, the next step in NLP is to compute TF-IDF (term frequency-inverse document frequency), a statistic that indicates how important a word is within a collection of documents.
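As a rough illustration of what that next step might look like, here is a minimal sketch using scikit-learn's TfidfVectorizer (a separate library, not part of NLTK, and assuming a recent scikit-learn release) on a few made-up documents:
<code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; in practice, these would be your cleaned documents.
documents = [
    "the bird fetched wood for the fire",
    "the mouse carried water from the well",
    "the sausage cooked the dinner",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Each row is a document and each column a term; higher scores mark terms
# that are distinctive for a document relative to the whole collection.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())</code>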
In this tutorial, we took a first look at natural language processing in Python. We converted text into tokens, reduced words to their root forms, and finally cleaned the text to remove any parts that did not add meaning to the analysis.
While we looked at simple NLP tasks in this tutorial, there are many other techniques to explore. For example, we might want to perform topic modeling on text data, with the goal of finding the common topics a text may be discussing. A more complex task in NLP is implementing a sentiment analysis model to determine the emotion behind a text.
Any comments or questions? Feel free to contact me on Twitter.
Natural Language Processing (NLP) and Natural Language Understanding (NLU) are two subfields of artificial intelligence that are often confused. NLP is the broader concept, encompassing all methods for interacting with computers using natural language, including both understanding and generating human language. NLU, on the other hand, is a subset of NLP that focuses specifically on the understanding aspect: using algorithms to understand and interpret human language in valuable ways.
Improving the accuracy of an NLP model involves a variety of strategies. First, you can use more training data; the more data your model learns from, the better it is likely to perform. Second, consider using different NLP techniques. For example, if you are using bag of words (BoW), you may want to try Term Frequency-Inverse Document Frequency (TF-IDF) or Word2Vec. Finally, fine-tuning the model's parameters can also lead to significant improvements.
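For instance, a minimal sketch of training Word2Vec embeddings with the gensim library (an assumption on our part, using the gensim 4.x API and a toy corpus) might look like this:
<code class="language-python">from gensim.models import Word2Vec

# Toy corpus: a list of already-tokenized sentences (made-up data).
sentences = [
    ["the", "bird", "fetched", "wood"],
    ["the", "mouse", "carried", "water"],
    ["the", "sausage", "cooked", "dinner"],
]

# Train a small model; the parameters here are illustrative, not tuned.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=1)

# Look up the learned vector for a word.
print(model.wv["bird"])</code>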
NLP has wide applications in the real world, including language translation, sentiment analysis, chatbots, voice assistants such as Siri and Alexa, text summarization, and email spam detection.
Tokenization is the process of breaking text down into individual words or tokens. It is a key step in NLP because it allows a model to understand and analyze text. In Python, you can use the word_tokenize function of the NLTK library to perform tokenization.
Stop words are common words that are often filtered out during the preprocessing phase of NLP because they do not carry much meaningful information. Examples include "is", "the", and "and". Removing these words can help improve the performance of an NLP model.
Handling multiple languages in NLP can be challenging due to differences in grammar, syntax, and vocabulary. However, Python's NLTK library supports multiple languages. You can also use a language detection library like langdetect to identify the language of a text and then process it accordingly.
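For instance, a minimal sketch with langdetect (assuming it has been installed with pip install langdetect) could look like this:
<code class="language-python">from langdetect import detect

# Detect the language of a couple of made-up snippets.
print(detect("Hi, this is a nice hotel."))    # expected: 'en'
print(detect("Ceci est un très bel hôtel."))  # expected: 'fr'</code>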
Stemming and lemmatization are both techniques for reducing words to their stem or root form. The main difference between them is that stemming often creates non-existent words, whereas lemmatization reduces a word to its linguistically correct root form.
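To see the difference in practice, here is a small comparison sketch using NLTK's PorterStemmer and WordNetLemmatizer (exact stems can vary slightly between NLTK versions):
<code class="language-python">from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming may produce a non-word; lemmatization returns a dictionary form.
print(stemmer.stem("studies"))               # e.g. 'studi' (not a real word)
print(lemmatizer.lemmatize("studies", "n"))  # 'study'</code>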
Sentiment analysis involves determining the emotion expressed in a text. It can be done using various NLP techniques; for example, you can easily perform sentiment analysis using the TextBlob library in Python.
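A minimal sketch with TextBlob (assuming it is installed with pip install textblob) might look like this:
<code class="language-python">from textblob import TextBlob

# Polarity ranges from -1 (negative) to 1 (positive);
# subjectivity ranges from 0 (objective) to 1 (subjective).
review = TextBlob("Hi, this is a nice hotel.")
print(review.sentiment)</code>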
An n-gram is a contiguous sequence of n items from a given sample of text or speech. N-grams are used in NLP to predict the next item in a sequence. For example, with bigrams (n = 2), you consider pairs of consecutive words for analysis or prediction.
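For example, here is a quick sketch of extracting bigrams with NLTK's ngrams helper:
<code class="language-python">from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Hi, this is a nice hotel.")

# Bigrams: every pair of consecutive tokens.
print(list(ngrams(tokens, 2)))
# [('Hi', ','), (',', 'this'), ('this', 'is'), ('is', 'a'),
#  ('a', 'nice'), ('nice', 'hotel'), ('hotel', '.')]</code>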
Text classification involves sorting text into predefined categories. It can be done using a variety of NLP techniques and machine learning algorithms. For example, you can use bag of words or TF-IDF for feature extraction and then feed these features into a machine learning model for classification.
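As a rough sketch of such a pipeline, here is a toy example with scikit-learn (the tiny training set and its labels are made up purely for illustration):
<code class="language-python">from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples.
texts = ["this hotel is nice", "great stay and friendly staff",
         "terrible room", "awful service and dirty lobby"]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF features feeding a Naive Bayes classifier.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(texts, labels)

print(classifier.predict(["the staff was friendly and the room was nice"]))</code>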