How do you build a system for extracting structured information and data from unstructured text? Which methods are suited to this kind of task? Which corpora are suitable for the work? Can such a model be trained and evaluated?
Information extraction, especially structured information extraction, can be compared to filling database records: a fixed relation binds the pieces of data together. For unstructured data such as natural language, obtaining such relations means searching for the particular relationships between entities and recording them using data structures such as strings and tuples.
For example, take the sentence We saw the yellow dog. Following the idea of chunking, the last three words are grouped into an NP, and the three words inside it are tagged DT/JJ/NN respectively; saw is tagged VBD; and We forms an NP of its own. The NP covering the last three words is the chunk (the larger unit). To achieve this, you can use NLTK's chunk grammar, which is similar to regular expressions, to chunk the sentence.
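Here is a minimal sketch of that exact example (the second rule, {&lt;PRP&gt;}, which chunks the pronoun We as an NP, is an assumption added for illustration; the article's own grammar below uses only the first rule):
<pre class="sourceCode python">import nltk

sentence = [('We','PRP'), ('saw','VBD'), ('the','DT'), ('yellow','JJ'), ('dog','NN')]
grammar = r"""
NP: {<DT>?<JJ>*<NN>}   # optional determiner + any adjectives + noun
    {<PRP>}            # a lone pronoun is also an NP (illustrative assumption)
"""
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
# (S (NP We/PRP) saw/VBD (NP the/DT yellow/JJ dog/NN))</pre>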
When writing such a grammar, just pay attention to three points:
The grammar is a string such as "NP: {&lt;DT&gt;?&lt;JJ&gt;*&lt;NN&gt;}": the label names the chunk type, and the braces { } enclose the pattern that defines a sub-chunk under it.
Each &lt;...&gt; matches a single part-of-speech tag.
The operators ? * + keep their regular-expression meanings.
<pre class="sourceCode python">import nltk

sentence = [('the','DT'), ('little','JJ'), ('yellow','JJ'), ('dog','NN'), ('barked','VBD')]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)  # build the rule
result = cp.parse(sentence)      # perform the chunking
print(result)
result.draw()                    # draw the tree via matplotlib</pre>
Chinking uses reversed braces, written }...{, to cut a matching sequence back out of a chunk:
<pre class="sourceCode python">import nltk

sentence = [('the','DT'), ('little','JJ'), ('yellow','JJ'), ('dog','NN'),
            ('barked','VBD'), ('at','IN'), ('the','DT'), ('cat','NN')]
grammar = r"""NP: {<.*>+}        # chunk everything into one NP
                  }<VBD|IN>+{    # then chink (remove) runs of VBD or IN
"""  # when adding a chink rule, the newline in the grammar string must be kept
cp = nltk.RegexpParser(grammar)  # build the rules
result = cp.parse(sentence)      # perform the chunking
print(result)</pre>
The loop parameter of the RegexpParser function can be set to 2, so that the rules are applied multiple times and no chunks are missed.
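A minimal sketch, reusing the grammar and sentence from the chinking example above:
<pre class="sourceCode python">cp = nltk.RegexpParser(grammar, loop=2)  # apply the chunk rules twice
print(cp.parse(sentence))</pre>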
nltk.tree.Tree is, as the name suggests, a tree structure. nltk.Tree lets you build trees by combining subtrees, and it provides node queries and tree drawing.
<pre class="sourceCode python">tree1 = nltk.Tree('NP', ['Alick'])
print(tree1)
tree2 = nltk.Tree('N', ['Alick', 'Rabbit'])
print(tree2)
tree3 = nltk.Tree('S', [tree1, tree2])
print(tree3.label())  # inspect the label of the tree's root node
tree3.draw()</pre>
IOB tags
In the IOB format every token carries one of three chunk tags: B-XX (begins a chunk of type XX), I-XX (inside a chunk), or O (outside any chunk).
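A minimal sketch of the format using NLTK's converters (the tiny example tree is an assumption for illustration):
<pre class="sourceCode python">import nltk

# a chunked sentence: (S (NP the/DT dog/NN) barked/VBD)
sent = nltk.Tree('S', [nltk.Tree('NP', [('the','DT'), ('dog','NN')]), ('barked','VBD')])
print(nltk.chunk.tree2conlltags(sent))
# [('the', 'DT', 'B-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]
print(nltk.chunk.conlltags2tree(nltk.chunk.tree2conlltags(sent)))  # round-trips back to the tree</pre>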
Developing and evaluating chunkers
<pre class="sourceCode python"># run this code under Python 2
import nltk
from nltk.corpus import conll2000

print conll2000.chunked_sents('train.txt')[99]  # inspect one sentence that is already chunked
text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
"""
result = nltk.chunk.conllstr2tree(text, chunk_types=['NP'])</pre>
For the rules defined earlier, the parser cp can be scored with cp.evaluate() against the chunked sentences of conll2000 to test its accuracy (see the sketch below). Using the unigram tagger we learned before, we can also segment noun phrases into chunks and test the accuracy.
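A minimal sketch of that evaluation, assuming cp is the RegexpParser built earlier; evaluate expects a list of gold-standard chunk trees:
<pre class="sourceCode python">from nltk.corpus import conll2000

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))  # reports IOB accuracy, precision, recall and F-measure</pre>
The unigram chunker follows: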
<pre class="sourceCode python">class UnigramChunker(nltk.ChunkParserI):
    """Unigram chunker: learns, from a set of training sentences, the most
    likely chunk (IOB) tag for each part-of-speech tag, then uses that
    information to chunk new sentences."""

    def __init__(self, train_sents):
        """:param train_sents: a list of Tree objects"""
        train_data = []
        for sent in train_sents:
            # convert the Tree into a list of IOB triples [(word, tag, IOB-tag), ...]
            conlltags = nltk.chunk.tree2conlltags(sent)
            # keep the (POS tag, IOB tag) pair for each word
            ti_list = [(t, i) for w, t, i in conlltags]
            train_data.append(ti_list)
        # train a unigram tagger on the (POS tag, IOB tag) pairs
        self.__tagger = nltk.UnigramTagger(train_data)

    def parse(self, tokens):
        """Chunk one sentence.
        :param tokens: a list of (word, POS tag) pairs
        :return: a Tree object
        """
        # pull out the POS tags
        tags = [tag for (word, tag) in tokens]
        # predict an IOB tag for each POS tag
        ti_list = self.__tagger.tag(tags)
        # pull out the predicted IOB tags
        iob_tags = [iob_tag for (tag, iob_tag) in ti_list]
        # combine everything into CoNLL triples
        conlltags = [(word, pos, iob_tag)
                     for ((word, pos), iob_tag) in zip(tokens, iob_tags)]
        return nltk.chunk.conlltags2tree(conlltags)

test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])

unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))</pre>
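Once trained, the chunker can be applied to any POS-tagged sentence. A minimal usage sketch (the example sentence is an assumption):
<pre class="sourceCode python">tagged = [('the','DT'), ('little','JJ'), ('yellow','JJ'), ('dog','NN'), ('barked','VBD')]
print(unigram_chunker.parse(tagged))
# expected output along the lines of:
# (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD)</pre>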
Named entity recognition and information extraction
nltk.ne_chunk(tagged_sent[, binary=False]). If binary is set to True, named entities are tagged simply as NE; otherwise they receive finer-grained labels such as PERSON, ORGANIZATION and GPE.
<pre class="sourceCode python">sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))</pre>
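A minimal follow-up sketch reusing the same sent: dropping binary=True yields the typed labels.
<pre class="sourceCode python">print(nltk.ne_chunk(sent))  # entities now carry types such as PERSON, ORGANIZATION, GPE</pre>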
Once the named entities have been determined, the relations that hold between them can be extracted.
Relationship extraction
<pre class="sourceCode python"># run this under Python 2
import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)</pre>
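Under Python 3 with a recent NLTK, the same extraction should look roughly like this (a sketch, assuming show_raw_rtuple has been renamed rtuple, as in current NLTK releases):
<pre class="sourceCode python">import re
import nltk

# match 'in' that is not followed by a gerund, e.g. "X, based in Y"
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))</pre>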