Introducing the Natural Language Toolkit (NLTK)

Natural language processing (NLP) is the automatic or semi-automatic processing of human language. NLP is closely related to linguistics and has links to research in cognitive science, psychology, physiology, and mathematics. In the computer science domain in particular, NLP is related to compiler techniques, formal language theory, human-computer interaction, machine learning, and theorem proving.

In this tutorial I'm going to walk you through an interesting Python platform for NLP called the Natural Language Toolkit (NLTK). Before we see how to work with this platform, let me first tell you what NLTK is.

What Is NLTK?

The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. The platform was originally released by Steven Bird and Edward Loper in conjunction with a computational linguistics course at the University of Pennsylvania in 2001. There is an accompanying book for the platform called Natural Language Processing with Python.

Installing NLTK

Let's now install NLTK to start experimenting with natural language processing. It will be fun!

Installing NLTK is very simple. I'm using Windows 10, so in my Command Prompt I type the following command:

pip install nltk
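The library itself is only part of the picture: the tokenizer models and corpora used in the rest of this tutorial are downloaded separately. Assuming you have an internet connection, a common way to fetch them is:

import nltk

nltk.download('punkt')      # models used by sent_tokenize() and word_tokenize()
nltk.download('stopwords')  # the stop word lists used below
nltk.download('gutenberg')  # the Project Gutenberg sample texts used below

Tokenization

Tokenization is the process of breaking text down into smaller pieces, such as sentences or words. NLTK gives us the sent_tokenize() and word_tokenize() methods for this. Let's start by splitting text into sentences using the sent_tokenize() method.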

Consider the following text.

"Python is a very high-level programming language. Python is interpreted."<br>

Now let's use the same text and pass it through the word_tokenize() method to split it into individual words.

from nltk.tokenize import word_tokenize
text = "Python is a very high-level programming language. Python is interpreted."
print(word_tokenize(text))

Here is the output:

['Python', 'is', 'a', 'very', 'high-level', 'programming', 'language', '.', 'Python', 'is', 'interpreted', '.']

As you can see from the output, punctuation marks are also considered to be words.
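If you'd rather drop those punctuation tokens, one simple option (a sketch, not part of the original example) is to keep only the tokens made up entirely of letters:

from nltk.tokenize import word_tokenize

text = "Python is a very high-level programming language. Python is interpreted."

# str.isalpha() is False for punctuation and for any token containing
# non-letters, so this drops 'high-level' as well as the '.' tokens.
words = [w for w in word_tokenize(text) if w.isalpha()]
print(words)

Keep the caveat in the comment in mind: hyphenated words disappear along with the punctuation.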

Stop Words

Sometimes we need to filter out words that add little value to an analysis so that the remaining data is easier for the computer to work with. In natural language processing (NLP), such words are called stop words. They carry little meaning on their own, so we usually want to remove them.

NLTK provides us with some stop words to start with. To see those words, use the following script:

from nltk.corpus import stopwords
print(set(stopwords.words('english')))

In which case you will get the following output (yours may be ordered differently, since a set is unordered, and the exact list varies between NLTK versions):

{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', ...}

What we did here is print out a set (an unordered collection of items) of stop words for the English language. If you are working with another language, German for example, you define it as follows:

from nltk.corpus import stopwords
print(set(stopwords.words('german')))

How can we remove the stop words from our own text? The example below shows how we can perform this task:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "In this tutorial, I'm learning NLTK. It is an interesting platform."
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)

new_sentence = []

for word in words:
    if word not in stop_words:
        new_sentence.append(word)

print(new_sentence)

The output of the above script is:

['In', 'tutorial', ',', 'I', "'m", 'learning', 'NLTK', '.', 'It', 'interesting', 'platform', '.']

Notice that the capitalized tokens 'In', 'I', and 'It' survive the filter: the stop word list is all lowercase, and the membership test is case-sensitive.

So what the word_tokenize() function does is:

Tokenize a string to split off punctuation other than periods
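If you want the filter to catch capitalized stop words such as 'In' and 'It' as well, a small variation (my sketch, not from the original script) is to lowercase each token before the membership test:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "In this tutorial, I'm learning NLTK. It is an interesting platform."
stop_words = set(stopwords.words('english'))

# Compare the lowercased token, so capitalized stop words are removed too.
new_sentence = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print(new_sentence)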

Searching

Let's say we have a text file saved as, say, NLTK.txt, in the same directory as our script. We would like to look for (search) the word language in it. We can simply do this using the NLTK platform as follows:

"Python is a very high-level programming language. Python is interpreted."<br>

In which case you will get output along the following lines:

[concordance output: every occurrence of the word 'language', each shown with some surrounding context]

Notice that concordance() prints every occurrence of the word language, in addition to some context. Before that, as shown in the script above, we tokenize the read file and then convert it into an nltk.Text object.
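You can try the same idea without an external file by using one of NLTK's bundled Gutenberg texts (covered in the next section); here is a quick sketch using carroll-alice.txt:

import nltk
from nltk.corpus import gutenberg

# Build an nltk.Text from a bundled corpus instead of a local file.
text = nltk.Text(gutenberg.words('carroll-alice.txt'))

# concordance() matching is case-insensitive, so this finds 'Alice' too.
text.concordance('alice')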

I just want to note that the first time I ran the program, I got the following error, which seems to be related to the encoding the console uses:

[a UnicodeEncodeError traceback, raised while printing the concordance output]

What I simply did to solve this issue was to run the following command in my console before running the program: chcp 65001 (which switches the Windows console code page to UTF-8).
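If you'd rather not depend on the console's code page, an alternative (my suggestion, not from the original article) is to reconfigure standard output from inside the script, which works on Python 3.7+:

import sys

# Force UTF-8 output regardless of the console's default encoding.
sys.stdout.reconfigure(encoding='utf-8')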

The Gutenberg Corpus

As mentioned in Wikipedia:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

NLTK contains a small selection of texts from Project Gutenberg. To see the included files from Project Gutenberg, we do the following:

from nltk.corpus import gutenberg

print(gutenberg.fileids())

The output of the above script will be as follows:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

If we want to find the number of words for the text file bryant-stories.txt for instance, we can do the following:

from nltk.corpus import gutenberg

print(len(gutenberg.words('bryant-stories.txt')))

The above script should return the following number of words: 55563.
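Beyond words(), the gutenberg corpus reader also exposes raw(), which returns the text as a single string, and sents(), which returns it as a list of sentences, so you can compute a few more statistics in the same style; here is a short sketch:

from nltk.corpus import gutenberg

fileid = 'bryant-stories.txt'
num_chars = len(gutenberg.raw(fileid))    # number of characters
num_words = len(gutenberg.words(fileid))  # number of words (tokens)
num_sents = len(gutenberg.sents(fileid))  # number of sentences
print(num_chars, num_words, num_sents)

# Rough average word length in characters (tokens include punctuation).
print(round(num_chars / num_words))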

Conclusion

As we have seen in this tutorial, the NLTK platform provides us with a powerful tool for working with natural language processing (NLP). I have only scratched the surface in this tutorial. If you would like to go deeper into using NLTK for different NLP tasks, you can refer to NLTK's accompanying book: Natural Language Processing with Python.

This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.
