Introducing the Natural Language Toolkit (NLTK)

William Shakespeare · Original · 2025-03-01 10:05:09 · 197 views

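Natural language processing (NLP) is the automatic or semi-automatic processing of human language. NLP is closely related to linguistics and has links to research in cognitive science, psychology, physiology, and mathematics. In computer science in particular, NLP is related to compiler techniques, formal language theory, human-computer interaction, machine learning, and theorem proving. This Quora question shows the different advantages of NLP.

In this tutorial, I will walk you through an interesting NLP platform called the Natural Language Toolkit (NLTK). Before we see how to work with this platform, let me first tell you what NLTK is.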

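What is NLTK?

The platform was originally released by Steven Bird and Edward Loper in 2001, in conjunction with a computational linguistics course at the University of Pennsylvania. There is an accompanying book for the platform, Natural Language Processing with Python.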

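Installing NLTK

NLTK can be installed with pip:

    pip install nltk

The tokenizer models and corpora used in the examples below are separate downloads, which can be fetched with nltk.download().

Tokenization

Consider the following text:

    "Python is a very high-level programming language. Python is interpreted."

Let's tokenize it using the word_tokenize() method:

    from nltk.tokenize import word_tokenize

    text = "Python is a very high-level programming language. Python is interpreted."
    print(word_tokenize(text))

Output:

    ['Python', 'is', 'a', 'very', 'high-level', 'programming', 'language', '.', 'Python', 'is', 'interpreted', '.']

You can see from the output that punctuation marks are also treated as words. According to its documentation, word_tokenize() tokenizes a string so that punctuation is split off from the words.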
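Because punctuation marks come back as tokens, a word count taken straight from the token list is inflated. The sketch below shows one way to keep only word-like tokens. To stay runnable without NLTK's data files it uses a rough regular-expression tokenizer, `simple_tokenize`, which is a hypothetical stand-in and not how `word_tokenize()` works internally. `words_only` is likewise an illustrative helper; it accepts any list of tokens, including the output of `word_tokenize()`.

```python
import re

# Hypothetical stand-in for word_tokenize(): a token is either a word
# (internal hyphens kept) or a single punctuation character.
# NLTK's real tokenizer handles many more cases than this pattern.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def words_only(tokens):
    """Drop tokens that contain no letter or digit, i.e. pure punctuation."""
    return [token for token in tokens if any(ch.isalnum() for ch in token)]


text = "Python is a very high-level programming language. Python is interpreted."
tokens = simple_tokenize(text)

print(len(tokens))              # count including punctuation tokens
print(len(words_only(tokens)))  # count of word-like tokens only
```

For this sentence the two counts differ by two, one for each full stop.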

Stop words

NLTK ships lists of stop words for several languages. The English list can be printed as follows:

    from nltk.corpus import stopwords

    print(set(stopwords.words('english')))

And the German list:

    from nltk.corpus import stopwords

    print(set(stopwords.words('german')))

How can we remove the stop words from our own text? The example below shows how we can perform this task:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    text = "In this tutorial, I'm learning NLTK. It is an interesting platform."
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)

    # Keep only the tokens that are not in the stop-word list.
    new_sentence = []
    for word in words:
        if word not in stop_words:
            new_sentence.append(word)

    print(new_sentence)

The script prints the tokens that remain after filtering. The comparison is case-sensitive and the stop-word list is lowercase, so capitalized words such as "In" and "It" are kept. Punctuation tokens are kept as well.
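NLTK's English stop-word list is lowercase, so one way to also catch capitalized stop words is to compare the lowercase form of each token against the list. The following is a minimal sketch of that idea. The helper name `remove_stop_words` and the small hand-written inputs are assumptions made so that the example runs without NLTK's data files.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [token for token in tokens if token.lower() not in stop_words]


# Hypothetical, hand-written inputs. In practice you would use
# stop_words = set(stopwords.words('english')) and tokens = word_tokenize(text).
sample_stop_words = {"in", "this", "i", "it", "is", "an"}
sample_tokens = ["In", "this", "tutorial", ",", "I", "'m", "learning", "NLTK", ".",
                 "It", "is", "an", "interesting", "platform", "."]

print(remove_stop_words(sample_tokens, sample_stop_words))
```

The original casing of the surviving tokens is preserved; only the comparison is done in lowercase.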
Search

Let's say we have a text file (the tutorial's sample file is available for download from Dropbox) and we would like to look for (search) the word language in it. With NLTK we can do this by reading the file, tokenizing its contents, converting the tokens into an nltk.Text object, and calling concordance() on that object.

Notice that concordance() returns every occurrence of the word language, together with some of the surrounding context.
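The search steps described above can be sketched roughly as follows. The helper names `read_text` and `show_concordance` are hypothetical, and UTF-8 is an assumption about how the file is encoded. `nltk.word_tokenize()`, `nltk.Text`, and `Text.concordance()` are the NLTK calls the text refers to.

```python
from pathlib import Path


def read_text(path, encoding="utf-8"):
    """Read a whole text file using an explicit encoding (UTF-8 is assumed)."""
    return Path(path).read_text(encoding=encoding)


def show_concordance(path, word):
    """Print every occurrence of `word` in the file, with surrounding context."""
    # Imported here so that read_text() stays usable without NLTK installed.
    import nltk

    tokens = nltk.word_tokenize(read_text(path))  # needs NLTK's tokenizer data
    nltk.Text(tokens).concordance(word)
```

A call would look like `show_concordance("sample.txt", "language")`, where `sample.txt` is a placeholder file name.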

I just want to note that the first time I ran the program, I got an error that seems to be related to the encoding the console uses. What I simply did to solve this issue was to run the following command in my console before running the program:

    chcp 65001
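On Python 3.7 and later, a similar effect can sometimes be achieved from inside the program instead of from the console. This is an alternative to `chcp 65001`, not what the tutorial did, and it assumes the error was a `UnicodeEncodeError` raised while printing. The helper name `make_stdout_tolerant` is hypothetical.

```python
import sys


def make_stdout_tolerant():
    """Ask stdout to replace unencodable characters instead of raising.

    Returns True if the setting was applied, False if the current stdout
    object does not support reconfiguration.
    """
    # reconfigure() exists on standard text streams since Python 3.7;
    # replaced streams (some IDE consoles, test harnesses) may lack it.
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(errors="replace")
        return True
    return False


make_stdout_tolerant()
```

With `errors="replace"`, characters the console encoding cannot represent are printed as `?` rather than stopping the program.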

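The Gutenberg Corpus

As mentioned in Wikipedia:

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 3 October 2015, Project Gutenberg reached 50,000 items in its collection.

NLTK contains a small selection of texts from Project Gutenberg. To see which files are included, we can list the file IDs of the corpus with nltk.corpus.gutenberg.fileids(); the result is a list of file names.

If we want to find the number of words in one of these text files, bryant-stories.txt for instance, we can take the length of what nltk.corpus.gutenberg.words() returns for that file.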
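Both Gutenberg steps can be combined in a small helper. This is a sketch; the function name `gutenberg_summary` is hypothetical, and it assumes the `gutenberg` corpus has already been downloaded.

```python
def gutenberg_summary(fileid="bryant-stories.txt"):
    """Return (list of bundled file ids, number of tokens in `fileid`)."""
    # Imported here so the definition loads even where the corpus is missing.
    from nltk.corpus import gutenberg

    file_ids = gutenberg.fileids()
    word_count = len(gutenberg.words(fileid))
    return file_ids, word_count
```

Calling `gutenberg_summary()` returns the file names together with the count for `bryant-stories.txt`. Note that the sequence returned by `words()` includes punctuation tokens, so the count is a token count, not a strict word count.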

I have only scratched the surface of NLTK in this tutorial. If you would like to go deeper into using NLTK for different NLP tasks, you can refer to NLTK's accompanying book: Natural Language Processing with Python.

This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.
