How to build a system?

PHP中文网
2017-06-20 11:00:18

How do you build a system that extracts structured information and data from unstructured text? Which methods are used for this task? Which corpora are suitable for it? Can models be trained and evaluated?

Information extraction, and structured information extraction in particular, can be compared to filling in database records: each record binds related pieces of data through a relation. For unstructured data such as natural language, obtaining these relations means finding the specific relations that hold between entities and recording them in a data structure such as a string or a tuple.

Entity recognition: chunking technology

For example, take the sentence "We saw the yellow dog". With chunking, the last three words are grouped into an NP chunk whose words are tagged DT, JJ, and NN respectively; "saw" is tagged VBD, and "We" forms an NP of its own. The NP covering the last three words is the chunk (the larger unit). To achieve this, you can use NLTK's built-in chunk grammar, which resembles regular expressions, to chunk a sentence.

Construction of chunking syntax

Just pay attention to three points:

  • Basic chunking:

    chunk label: {tag pattern inside the chunk} (for example, a string such as "NP: {<DT>?<JJ>*<NN>}"). The symbols ?, *, and + keep their regular-expression meaning.

import nltk

sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('bark','VBD')]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)  # build the chunker from the rule
result = cp.parse(sentence)      # chunk the sentence
print(result)

result.draw()  # open a window that draws the tree (uses Tkinter)


  • Chinking: you can define a gap (chink) for a sequence of tokens that should not be included in a chunk:

    }<VBD|NN>+{

    import nltk

    sentence = [('the','DT'),('little','JJ'),('yellow','JJ'),('dog','NN'),('bark','VBD'),('at','IN'),('the','DT'),('cat','NN')]
    grammar = """NP:
                {<DT>?<JJ>*<NN>}
                }<VBD|NN>+{
                """  # add a chink; the line breaks must be preserved
    cp = nltk.RegexpParser(grammar)  # build the chunker from the rules
    result = cp.parse(sentence)      # chunk the sentence
    print(result)


  • Rules can be applied recursively, which matches the recursive nesting found in language structure. For example:

    VP: {<…>*} PP: {<NN><…>}

    In this case, the loop parameter of RegexpParser can be set to 2 so that the rules are applied several times and no chunks are missed.
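As a rough illustration of cascaded rules and the loop parameter, the sketch below borrows the grammar and sentence from the cascaded-chunker example in the NLTK book. They are assumptions for illustration, not the rules shown above, and count_label is a hypothetical helper.

```python
import nltk

# Cascaded grammar: later rules refer to chunk labels built by earlier rules.
grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # determiners, adjectives and nouns
  PP: {<IN><NP>}                # preposition followed by an NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # verb followed by its arguments
  CLAUSE: {<NP><VP>}            # NP followed by a VP
"""

sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]


def count_label(tree, label):
    """Hypothetical helper: count the subtrees carrying a given label."""
    return sum(1 for _ in tree.subtrees(lambda t: t.label() == label))


single = nltk.RegexpParser(grammar).parse(sentence)          # one pass over the rules
looped = nltk.RegexpParser(grammar, loop=2).parse(sentence)  # two passes over the rules

print(single)
print(looped)
print(count_label(single, "VP"), count_label(looped, "VP"))
```

A second pass gives the VP rule a chance to use CLAUSE chunks that only exist after the first pass, which is why looping can pick up chunks that a single pass leaves out.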

Tree diagram

If you call print(type(result)) to view the type, you will find that it is nltk.tree.Tree. As the name suggests, this is a tree structure. nltk.Tree implements trees and supports combining subtrees, querying nodes, and drawing the tree.
<pre class="sourceCode python">import nltk

tree1 = nltk.Tree('NP', ['Alick'])
print(tree1)
tree2 = nltk.Tree('N', ['Alick', 'Rabbit'])
print(tree2)
tree3 = nltk.Tree('S', [tree1, tree2])
print(tree3.label())  # view the label of the root node
tree3.draw()</pre>
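The node queries mentioned above can be sketched as follows. The words and labels are made up for illustration; label(), leaves(), subtrees(), and Tree.fromstring() are the nltk.Tree methods used.

```python
from nltk import Tree

noun_phrase = Tree("NP", ["the", "dog"])
verb_phrase = Tree("VP", ["barked"])
s = Tree("S", [noun_phrase, verb_phrase])  # combine two subtrees under a new root

print(s)             # bracketed text form of the tree
print(s.label())     # label of the root node
print(s.leaves())    # the words, left to right
print(s[0].label())  # children can be indexed like a list

# Walk every subtree (root first) and collect the labels.
labels = [t.label() for t in s.subtrees()]
print(labels)

# The same tree can also be read back from its bracketed form.
same = Tree.fromstring("(S (NP the dog) (VP barked))")
print(s == same)
```

This avoids draw(), so it runs without opening a window.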

IOB tags

I, O, and B stand for inside, outside, and begin respectively (the first letter of each English word). For chunk types such as the NP mentioned above, you only need to add the prefix B- or I- (for example B-NP, I-NP); tokens that belong to no chunk are tagged O. This makes the tokens that fall outside the rules explicit, similar to adding the gaps (chinks) above.
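A minimal sketch of the IOB representation, assuming the same NP rule used earlier in this article: nltk.chunk.tree2conlltags converts a chunk tree into (word, POS, IOB) triples, and nltk.chunk.conlltags2tree converts them back. The sample sentence is made up for illustration.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Chunk the sentence, then flatten the tree into IOB triples.
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(sentence)
iob = nltk.chunk.tree2conlltags(tree)

for word, pos, iob_tag in iob:
    print(word, pos, iob_tag)

# The triples carry enough information to rebuild the chunk tree.
rebuilt = nltk.chunk.conlltags2tree(iob)
print(rebuilt)
```

The first token of each NP gets B-NP, the remaining tokens of that chunk get I-NP, and "barked" and "at" get O.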

Developing and evaluating chunkers


NLTK already provides chunkers, which reduces the need to write rules by hand. It also provides corpora that have already been chunked, which serve as a reference when we build our own rules.

# Requires the corpus: nltk.download('conll2000')
import nltk
from nltk.corpus import conll2000

print(conll2000.chunked_sents('train.txt')[99])  # view one sentence that is already chunked

text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
"""
result = nltk.chunk.conllstr2tree(text, chunk_types=['NP'])  # keep only the NP chunks

For the previously defined chunker cp, you can call cp.evaluate(conll2000.chunked_sents('train.txt')[99:100]) to test its accuracy; evaluate expects a list of chunked sentences, and recent NLTK releases name this method accuracy. Using the unigram tagger we learned before, we can also chunk noun phrases and test the accuracy.
<pre class="sourceCode python">import nltk
from nltk.corpus import conll2000


class UnigramChunker(nltk.ChunkParserI):
    """
    Unigram chunker: learns from the training sentences the most likely
    chunk (IOB) tag for each part-of-speech tag, then uses it for chunking.
    """

    def __init__(self, train_sents):
        """
        Constructor.
        :param train_sents: list of Tree objects
        """
        train_data = []
        for sent in train_sents:
            # Convert the Tree into IOB triples [(word, tag, IOB-tag), ...]
            conlltags = nltk.chunk.tree2conlltags(sent)
            # Keep the (POS tag, IOB tag) pairs
            ti_list = [(t, i) for w, t, i in conlltags]
            train_data.append(ti_list)
        # Train a unigram tagger on the pairs
        self.__tagger = nltk.UnigramTagger(train_data)

    def parse(self, tokens):
        """
        Chunk a sentence.
        :param tokens: list of POS-tagged words
        :return: Tree object
        """
        # Extract the POS tags
        tags = [tag for (word, tag) in tokens]
        # Assign an IOB tag to each POS tag
        ti_list = self.__tagger.tag(tags)
        # Extract the IOB tags
        iob_tags = [iob_tag for (tag, iob_tag) in ti_list]
        # Combine into CoNLL triples
        conlltags = [(word, pos, iob_tag) for ((word, pos), iob_tag) in zip(tokens, iob_tags)]
        return nltk.chunk.conlltags2tree(conlltags)


test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
unigram_chunker = UnigramChunker(train_sents)
print(unigram_chunker.evaluate(test_sents))</pre>
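To see what the evaluation measures without downloading a corpus, here is a small sketch that scores a rule-based chunker against a hand-written gold standard. The gold text is a shortened, assumed variant of the CoNLL-style sample above, and the scoring uses nltk.chunk.util.ChunkScore directly rather than the chunker's own evaluate method.

```python
import nltk
from nltk.chunk.util import ChunkScore

# Tiny gold standard in CoNLL IOB format (lines must start at column 0).
gold_text = """he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
. . O"""

gold = nltk.chunk.conllstr2tree(gold_text, chunk_types=["NP"])

# Chunker under test: the simple NP rule used earlier in this article.
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
guess = cp.parse(gold.leaves())  # re-chunk the same tagged tokens

score = ChunkScore()
score.score(gold, guess)         # compare gold chunks with guessed chunks

print(gold)
print(guess)
print(score.precision(), score.recall())
```

The rule only matches "the position" exactly: it misses the pronoun "he" and splits "vice chairman" into two chunks, so one of three guessed chunks is correct and one of three gold chunks is found.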

Named entity recognition and information extraction

Named entity: a definite noun phrase that refers to a specific type of individual, such as a date, a person, or an organization.

Training a classifier yourself would be a real headache (ˉ▽ ̄~)~~. NLTK provides a trained classifier: nltk.ne_chunk(tagged_sent[, binary=False]). If binary is set to True, named entities are only tagged as NE; otherwise the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.
<pre class="sourceCode python">import nltk

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))</pre>
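The result of nltk.ne_chunk is a Tree whose subtrees are the named entities. The sketch below shows one way to list them; the tree is built by hand as a stand-in for real ne_chunk output (so no model download is needed), and named_entities is a hypothetical helper.

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.ne_chunk(tagged_sent).
ne_tree = Tree("S", [
    Tree("PERSON", [("Alice", "NNP")]),
    ("works", "VBZ"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Carlyle", "NNP"), ("Group", "NNP")]),
    ("in", "IN"),
    Tree("GPE", [("Washington", "NNP")]),
    (".", "."),
])


def named_entities(tree):
    """Hypothetical helper: return (label, text) for each entity subtree."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):  # plain tokens are (word, tag) tuples
            text = " ".join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities


entities = named_entities(ne_tree)
print(entities)
```

With binary=True every entity subtree would carry the label NE instead of a category.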

Once the named entities have been identified, relation extraction can be performed to extract information. One approach is to find all triples (X, a, Y), where X and Y are named entities and a is the string that expresses the relation between them. For example:

# Requires the corpus: nltk.download('ieer')
import re
import nltk

IN = re.compile(r'.*\bin\b(?!\b.+ing)')  # "in", not followed by a word ending in -ing
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))  # called show_raw_rtuple in NLTK 2
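The triple idea can also be sketched in plain Python without the ieer corpus. The token list and the extract_triples helper are hypothetical and much simpler than nltk.sem.extract_rels; only the IN pattern is taken from the example above.

```python
import re

IN = re.compile(r".*\bin\b(?!\b.+ing)")

# Hypothetical pre-tagged text: entities are (label, text) tuples,
# all other tokens are plain strings.
tokens = [("ORG", "WHYY"), "in", ("LOC", "Philadelphia"), ",",
          ("ORG", "Freedom Forum"), "based", "in", ("LOC", "Arlington")]


def extract_triples(tokens, subj_label, obj_label, pattern):
    """Return (X, a, Y) for neighbouring entities whose filler text matches."""
    triples = []
    positions = [i for i, tok in enumerate(tokens) if isinstance(tok, tuple)]
    for left, right in zip(positions, positions[1:]):
        subj, obj = tokens[left], tokens[right]
        filler = " ".join(tokens[left + 1:right])  # the words between the two entities
        if subj[0] == subj_label and obj[0] == obj_label and pattern.match(filler):
            triples.append((subj[1], filler, obj[1]))
    return triples


triples = extract_triples(tokens, "ORG", "LOC", IN)
print(triples)
```

Only ORG followed by LOC pairs are kept, so the pair (Philadelphia, Freedom Forum) is skipped even though the two entities are adjacent.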
