Detailed example of word vector embedding

Word vector embedding requires efficient processing of a large text corpus; word2vec is built for exactly that. The naive approach feeds words into the learning system as one-hot encodings: each word becomes a vector as long as the vocabulary, with a 1 at the word's position and 0 everywhere else. These vectors are extremely high-dimensional and say nothing about the semantic relations between words.

Co-occurrence representations address the semantic problem: traverse a large corpus, count the words that appear within a certain distance of each word, and represent every word by the normalized counts of its nearby words. Words that occur in similar contexts then get similar representations, reflecting similar semantics. Applying PCA or a comparable method to these co-occurrence vectors yields a denser representation that performs well, but it requires keeping track of the full co-occurrence matrix, whose width and height both equal the vocabulary size.

In 2013, Mikolov, Tomas, et al. proposed a method that computes word representations from context: "Efficient estimation of word representations in vector space" (arXiv preprint arXiv:1301.3781, 2013). The skip-gram model starts from random representations and uses a simple classifier to predict a context word from the current word; the prediction error is propagated through both the classifier weights and the word representations, and both are adjusted to reduce it. Trained on a large corpus, the resulting representation vectors approximate compressed co-occurrence vectors.
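As a concrete illustration (not part of the original article), here is a small sketch of the two baseline representations described above, one-hot vectors and normalized co-occurrence counts, on a toy corpus:

import numpy as np

corpus = 'the cat sat on the mat'.split()
vocabulary = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocabulary)}

# One-hot: a vector as long as the vocabulary with a single 1.
one_hot_cat = np.eye(len(vocabulary))[index['cat']]

# Co-occurrence counts within a distance of 1, one row per word.
cooccurrence = np.zeros((len(vocabulary), len(vocabulary)))
for i, word in enumerate(corpus):
    for j in range(max(0, i - 1), min(len(corpus), i + 2)):
        if i != j:
            cooccurrence[index[word], index[corpus[j]]] += 1

# Normalize rows so that words in similar contexts get similar vectors.
cooccurrence /= cooccurrence.sum(axis=1, keepdims=True)

For a real vocabulary this matrix has vocabulary_size rows and columns, which is exactly the cost the skip-gram model avoids.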

The dataset is an English Wikipedia dump. The full dump files contain the complete revision history of every page; the dump of current page versions used here is about 100 GB.

The pipeline: download the dump file and extract the words of each page, count word occurrences to build a vocabulary of the most common words, and then encode the extracted pages with that vocabulary. The files are read line by line and the results are written to disk immediately, and checkpoints are saved between the steps so that a program crash does not force the whole pipeline to run again.
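The Wikipedia class below imports a download helper from a helpers module that the article does not show. A minimal sketch of such a helper, assuming it only fetches the URL into the cache directory when the file is not already there and returns the local path:

import os
import urllib.request

def download(url, cache_dir):
    # Download url into cache_dir (unless already present) and return the local path.
    os.makedirs(cache_dir, exist_ok=True)
    filename = os.path.basename(url)
    filepath = os.path.join(cache_dir, filename)
    if not os.path.isfile(filepath):
        urllib.request.urlretrieve(url, filepath)
    return filepath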

__iter__ iterates over the pages as lists of word indices. encode returns the vocabulary index of a string word, and decode returns the string word for a vocabulary index. _read_pages extracts the words from the Wikipedia dump file (compressed XML) and saves them to a pages file with one line of space-separated words per page. The bz2 module's open function reads the file, and the intermediate results are compressed as well. A regular expression captures any sequence of consecutive letters as well as individual special characters. _build_vocabulary counts the words in the pages file and writes the most frequent ones to a file; one-hot encoding requires such a vocabulary, and the pages are encoded as vocabulary indices. Misspellings and extremely uncommon words are dropped, so the vocabulary contains only the vocabulary_size - 1 most common words. All words that are not in the vocabulary are mapped to a placeholder token at index 0 and do not receive word vectors of their own.

Training samples are formed dynamically, so large amounts of data can be handled without the classifier keeping everything in memory. The skip-gram model predicts the context words of the current word: traverse the text, take the current word as data and its surrounding words as targets, and create one training sample per pair. With a context size of R, each word produces 2R samples, one for each of the R words to its left and right. Semantically, close context matters most, so fewer training samples are created for far-away context words by randomly choosing the context size of each word from the range [1, D=10]. Training pairs are thus formed according to the skip-gram model, and a NumPy-based generator turns this numeric stream into batches of data.
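Neither skipgrams nor batched, which the training script imports, is shown in the article. A minimal sketch of both, assuming skipgrams yields (current word, context word) index pairs using the randomly chosen window size described above, and batched groups the pairs into fixed-size NumPy arrays:

import random
import numpy as np

def skipgrams(pages, max_context):
    # For each word, emit (current, context) pairs for a randomly sized window.
    for words in pages:
        for index, current in enumerate(words):
            context = random.randint(1, max_context)
            for target in words[max(0, index - context): index]:
                yield current, target
            for target in words[index + 1: index + context + 1]:
                yield current, target

def batched(iterator, batch_size):
    # Group the stream of example pairs into NumPy arrays of batch_size elements.
    iterator = iter(iterator)
    while True:
        data = np.zeros(batch_size, dtype=np.int32)
        target = np.zeros(batch_size, dtype=np.int32)
        try:
            for index in range(batch_size):
                data[index], target[index] = next(iterator)
        except StopIteration:
            return
        yield data, target

When the example stream runs out, batched simply returns, which ends the training loop further below.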

Initially, words are represented by random vectors. A classifier predicts the context word from the current word's intermediate representation; the errors are propagated back to fine-tune both the classifier weights and the input word representations. The model is optimized with MomentumOptimizer, which is not particularly clever but very efficient.
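The EmbeddingModel class below decorates its graph-building methods with lazy_property, another helper the article does not show. A minimal sketch, assuming it is the usual caching property decorator, so each part of the graph is constructed only once:

import functools

def lazy_property(function):
    # Cache the return value on first access so that reading the
    # property again reuses the same TensorFlow graph nodes.
    attribute = '_lazy_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper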

The classifier is the core of the model. Modelling the full distribution over context words with a softmax classifier is expensive; noise-contrastive estimation loss performs excellently at a fraction of the cost. tf.nn.nce_loss uses randomly sampled negative (contrast) examples and thereby approximates the softmax classifier.

When training ends, the final word vectors are written to a file. Training on a subset of the Wikipedia corpus for about five hours on an ordinary CPU yields the embeddings as a NumPy array; to train on the complete corpus, point the loader at the corresponding dump URL. The AttrDict class behaves like a Python dict whose keys are also accessible as attributes.
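AttrDict comes from the same helpers module and is not shown either. A minimal sketch, assuming it is simply a dict that exposes its keys as attributes:

class AttrDict(dict):
    # A dict whose keys can also be read and written as attributes,
    # e.g. params.batch_size instead of params['batch_size'].
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value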

import bz2
import collections
import os
import re

from lxml import etree

from helpers import download


class Wikipedia:

    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, url, cache_dir, vocabulary_size=10000):
        self._cache_dir = os.path.expanduser(cache_dir)
        self._pages_path = os.path.join(self._cache_dir, 'pages.bz2')
        self._vocabulary_path = os.path.join(self._cache_dir, 'vocabulary.bz2')
        if not os.path.isfile(self._pages_path):
            print('Read pages')
            self._read_pages(url)
        if not os.path.isfile(self._vocabulary_path):
            print('Build vocabulary')
            self._build_vocabulary(vocabulary_size)
        with bz2.open(self._vocabulary_path, 'rt') as vocabulary:
            print('Read vocabulary')
            self._vocabulary = [x.strip() for x in vocabulary]
        self._indices = {x: i for i, x in enumerate(self._vocabulary)}

    def __iter__(self):
        # Iterate over the pages as lists of vocabulary indices.
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                words = [self.encode(x) for x in words]
                yield words

    @property
    def vocabulary_size(self):
        return len(self._vocabulary)

    def encode(self, word):
        # Words missing from the vocabulary map to the placeholder at index 0.
        return self._indices.get(word, 0)

    def decode(self, index):
        return self._vocabulary[index]

    def _read_pages(self, url):
        # Extract words from the compressed XML dump and write one
        # space-separated page per line to the (compressed) pages file.
        wikipedia_path = download(url, self._cache_dir)
        with bz2.open(wikipedia_path) as wikipedia, \
                bz2.open(self._pages_path, 'wt') as pages:
            for _, element in etree.iterparse(wikipedia, tag='{*}page'):
                if element.find('./{*}redirect') is not None:
                    continue
                page = element.findtext('./{*}revision/{*}text')
                words = self._tokenize(page)
                pages.write(' '.join(words) + '\n')
                element.clear()

    def _build_vocabulary(self, vocabulary_size):
        # Count word frequencies over all pages and keep the most common words.
        counter = collections.Counter()
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                counter.update(words)
        # Index 0 is a placeholder token that stands in for every word
        # that did not make it into the vocabulary.
        common = ['<unk>'] + [x[0] for x in counter.most_common(vocabulary_size - 1)]
        with bz2.open(self._vocabulary_path, 'wt') as vocabulary:
            for word in common:
                vocabulary.write(word + '\n')

    @classmethod
    def _tokenize(cls, page):
        words = cls.TOKEN_REGEX.findall(page)
        words = [x.lower() for x in words]
        return words

import tensorflow as tf

from helpers import lazy_property


class EmbeddingModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        # Touch the lazy properties so the graph is built up front.
        self.embeddings
        self.cost
        self.optimize

    @lazy_property
    def embeddings(self):
        # Word representations start out as random vectors.
        initial = tf.random_uniform(
            [self.params.vocabulary_size, self.params.embedding_size],
            -1.0, 1.0)
        return tf.Variable(initial)

    @lazy_property
    def optimize(self):
        # MomentumOptimizer: not very sophisticated, but efficient.
        optimizer = tf.train.MomentumOptimizer(
            self.params.learning_rate, self.params.momentum)
        return optimizer.minimize(self.cost)

    @lazy_property
    def cost(self):
        # Current representations of the input words.
        embedded = tf.nn.embedding_lookup(self.embeddings, self.data)
        # Weights and bias of the softmax classifier that nce_loss approximates.
        weight = tf.Variable(tf.truncated_normal(
            [self.params.vocabulary_size, self.params.embedding_size],
            stddev=1.0 / self.params.embedding_size ** 0.5))
        bias = tf.Variable(tf.zeros([self.params.vocabulary_size]))
        target = tf.expand_dims(self.target, 1)
        # Keyword arguments avoid relying on the positional order of
        # tf.nn.nce_loss, which changed between TensorFlow releases.
        return tf.reduce_mean(tf.nn.nce_loss(
            weights=weight, biases=bias, labels=target, inputs=embedded,
            num_sampled=self.params.contrastive_examples,
            num_classes=self.params.vocabulary_size))

import collections

import tensorflow as tf
import numpy as np

from batched import batched
from EmbeddingModel import EmbeddingModel
from skipgrams import skipgrams
from Wikipedia import Wikipedia
from helpers import AttrDict

WIKI_DOWNLOAD_DIR = './wikipedia'

params = AttrDict(
    vocabulary_size=10000,
    max_context=10,
    embedding_size=200,
    contrastive_examples=100,
    learning_rate=0.5,
    momentum=0.5,
    batch_size=1000,
)

data = tf.placeholder(tf.int32, [None])
target = tf.placeholder(tf.int32, [None])
model = EmbeddingModel(data, target, params)

corpus = Wikipedia(
    'https://dumps.wikimedia.org/enwiki/20160501/'
    'enwiki-20160501-pages-meta-current1.xml-p000000010p000030303.bz2',
    WIKI_DOWNLOAD_DIR,
    params.vocabulary_size)
examples = skipgrams(corpus, params.max_context)
batches = batched(examples, params.batch_size)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
average = collections.deque(maxlen=100)
for index, batch in enumerate(batches):
    feed_dict = {data: batch[0], target: batch[1]}
    cost, _ = sess.run([model.cost, model.optimize], feed_dict)
    average.append(cost)
    # Print a running average of the last 100 batch costs.
    print('{}: {:5.1f}'.format(index + 1, sum(average) / len(average)))
    if index > 100000:
        break

embeddings = sess.run(model.embeddings)
np.save(WIKI_DOWNLOAD_DIR + '/embeddings.npy', embeddings)
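
After the script finishes, the saved embeddings can be inspected directly. A hedged example (not from the article) that looks up the nearest neighbours of a word by cosine similarity, reusing the corpus object defined above for encoding and decoding:

import numpy as np

embeddings = np.load(WIKI_DOWNLOAD_DIR + '/embeddings.npy')
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(word, count=10):
    # Return the vocabulary words closest to `word` by cosine similarity.
    vector = normalized[corpus.encode(word)]
    similarity = normalized @ vector
    best = np.argsort(-similarity)[1:count + 1]  # skip the word itself
    return [corpus.decode(i) for i in best]

print(most_similar('history'))  # example query; any in-vocabulary word works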
