Learning TensorFlow: word2vec

1. word2vec
2. Skip-gram training process
   2.1. nce_loss
   2.2. Training steps
3. Code
4. Applications
5. Model reuse

This series of TensorFlow study notes is based on CS 20SI: Tensorflow for Deep Learning Research. There are also several books on TensorFlow, such as TensorFlow实战 and TensorFlow解析. Thanks to everyone who has shared their knowledge.

word2vec

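Turning words into vectors makes abstract words numeric, so that they can be used in models (mathematical formulas). word2vec is a word-embedding encoding; its counterpart is one-hot encoding, which marks each word with 0/1 flags. word2vec can be understood as mapping words into a multi-dimensional space in which identical or similar words lie close together. In translation between languages, this mapping shows up as words with the same meaning occupying roughly corresponding positions in the spaces of their respective languages.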

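In short, word2vec maps each word to a fixed-length vector (a one-dimensional array of numbers), and downstream applications can then operate on these vectors.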
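
To make the contrast with one-hot encoding concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense vectors are made up for illustration (they are not trained embeddings); the point is only that one-hot codes are always orthogonal, while dense vectors can express similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot codes over a hypothetical 3-word vocabulary: every pair is orthogonal.
one_hot = {"king": [1, 0, 0], "queen": [0, 1, 0], "apple": [0, 0, 1]}

# Made-up dense vectors (NOT trained embeddings), chosen so that
# "king" and "queen" point in similar directions.
dense = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "apple": [-0.7, 0.2]}

print(cosine_similarity(one_hot["king"], one_hot["queen"]))  # orthogonal one-hot codes
print(cosine_similarity(dense["king"], dense["queen"]))      # similar words
print(cosine_similarity(dense["king"], dense["apple"]))      # dissimilar words
```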

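There are many ways to train word2vec; the two common model families are skip-gram and CBOW. Skip-gram starts from a center word, takes the skip words before and after it, and pairs each of them with the center word to form training samples. CBOW goes the other way: the c words before and after a center word form the context, the center word is the positive sample to predict, and every other word is a negative sample, NEG(w). CBOW is not covered in detail here; the focus is on training skip-gram.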
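
The difference between the two kinds of samples can be sketched on a toy sentence. The function names below are my own, and the window handling is one simple choice among many, not the only way to do it.

```python
def skip_gram_pairs(tokens, window):
    """Yield (center, context) pairs: the center word predicts each neighbor."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def cbow_samples(tokens, window):
    """Yield (context_words, center) samples: the neighbors predict the center."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        yield context, center

sentence = "the quick brown fox".split()
print(list(skip_gram_pairs(sentence, window=1)))
print(list(cbow_samples(sentence, window=1)))
```

With a window of 1, skip-gram turns each interior word into two separate pairs, while CBOW keeps both neighbors together as a single input.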

Skip-gram training process

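First, training a model requires samples, parameters, a loss function, and an optimizer. In skip-gram the samples are word pairs built from the center word and the skip words around it, the parameters are the weights of the neural network, the loss function is NCE, and the optimizer is mini-batch gradient descent. Apart from sample generation, which has to be written by hand, TensorFlow provides ready-made functions for all of these. The NCE loss gets special attention below; this is only my personal understanding after reading about it.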

nce_loss

Noise-contrastive estimation (NCE) loss. For details, see the paper, the official TensorFlow documentation, and the TensorFlow source code.

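NCE exists to speed up multi-class classification. With a full softmax, every sample needs a probability for every possible class (with 1,000,000 classes, that is 1,000,000 probabilities per sample). NCE instead randomly samples a few noise classes and uses logistic (binary) classification to tell the true class apart from the sampled ones, so only the true class and the sampled classes are scored for each sample.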
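
The idea can be sketched in plain Python. This is a simplified illustration and not what tf.nn.nce_loss computes exactly: the noise classes are drawn uniformly, the true class is excluded from the draw, and the correction by each class's expected sampling count that NCE applies is left out. All names and sizes are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sampled_logistic_loss(input_vec, weights, biases, true_class, num_sampled, rng):
    """Binary logistic loss over the true class plus a few sampled noise classes.

    Simplification: uniform sampling, and no expected-count correction.
    """
    num_classes = len(weights)
    candidates = [c for c in range(num_classes) if c != true_class]
    noise_classes = rng.sample(candidates, num_sampled)

    # The true class should score high ...
    true_logit = dot(input_vec, weights[true_class]) + biases[true_class]
    loss = -math.log(sigmoid(true_logit))
    # ... and each sampled noise class should score low.
    for c in noise_classes:
        noise_logit = dot(input_vec, weights[c]) + biases[c]
        loss += -math.log(1.0 - sigmoid(noise_logit))
    return loss, noise_classes

rng = random.Random(0)
num_classes, dim = 1000, 4
weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(num_classes)]
biases = [0.0] * num_classes
input_vec = [0.1, -0.2, 0.3, 0.05]

loss, noise = sampled_logistic_loss(input_vec, weights, biases,
                                    true_class=7, num_sampled=5, rng=rng)
print(len(noise) + 1, "classes scored instead of", num_classes)
print("loss:", loss)
```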

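A few points worth noting:

1. When num_true is greater than 1, each target class is assigned the target probability 1/num_true, so that the target probabilities of each example sum to 1.
2. Parameters:
weights: a tensor of shape [num_classes, dim], where dim is the number of features per sample (here, the embedding size).
biases: a tensor of shape [num_classes].
inputs: a tensor of shape [batch_size, dim]; the forward activations of the input network, i.e. the training data fed to the model (here, the embedded center words).
labels: an int64 tensor of shape [batch_size, num_true]; the target classes (the target words).
num_sampled: the number of classes randomly sampled per batch; in word2vec, the number of randomly drawn negative samples (words other than the true target).
num_classes: the number of possible classes.
num_true: the number of target (true) classes per training example.
sampled_values: the sampled candidates, as a tuple returned by a *_candidate_sampler function; if None, log_uniform_candidate_sampler is used.
remove_accidental_hits: whether to remove sampled classes that happen to equal a target class; this affects how the loss is computed.

inputs and labels are paired row by row.

3. The return value is a one-dimensional tensor of shape [batch_size] holding the NCE loss of each example.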

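sigmoid_cross_entropy_with_logits: multi-label; an example can belong to several classes at once. nce_loss applies this loss to the logits of the true and sampled classes.
softmax_cross_entropy_with_logits: multi-class; an example can belong to only one class.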
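
The difference between the two losses is easier to see in a few lines of plain Python. This sketch only mirrors the math; it is not TensorFlow's implementation, and the logits and labels are made-up values.

```python
import math

def sigmoid_cross_entropy(logits, labels):
    """Per-class losses; each class is an independent yes/no decision."""
    losses = []
    for x, z in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-x))
        losses.append(-(z * math.log(p) + (1 - z) * math.log(1 - p)))
    return losses

def softmax_cross_entropy(logits, labels):
    """A single loss; the classes compete for one probability distribution."""
    shifted = [x - max(logits) for x in logits]  # shift for numerical stability
    log_norm = math.log(sum(math.exp(x) for x in shifted))
    log_probs = [x - log_norm for x in shifted]
    return -sum(z * lp for z, lp in zip(labels, log_probs))

logits = [2.0, -1.0, 0.5]
multi_label = [1, 0, 1]   # the example belongs to classes 0 and 2
single_label = [1, 0, 0]  # the example belongs to class 0 only

print(sigmoid_cross_entropy(logits, multi_label))   # three per-class losses
print(softmax_cross_entropy(logits, single_label))  # one loss value
```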

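Training steps

The overall steps are: generate the samples by preprocessing, set up the inputs and outputs, set up the parameters, set up the loss function, and feed in the data.

Sample preprocessing

In skip-gram, a word2vec sample is a word pair, center_word -> target_word.

1. Read the training set, a text corpus in one or more files, into a list.
2. Count the word frequencies in the list, sort them in descending order, and build word-to-index_id pairs, i.e. map each word to an id.
3. From these pairs, build the dictionaries: the word-to-id mapping and the id-to-word mapping.
4. Generate the samples according to the skip size, iteratively. There are many ways to do this. One is to take each word of the list from step 1 in turn and randomly choose a target word from the 1 to skip words before it and after it, which gives one sample center_word -> target_word; the words are converted to ids along the way (using the dictionaries from steps 2 and 3). The random choice is meant to avoid getting stuck in a local optimum.
5. Step 4 yields a series of samples, list[center_word -> target_word], from which batches can be drawn and fed to the model.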
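
The five steps above can be sketched end to end on a toy corpus. This is a simplified stand-in written for this note, not the full preprocessing code shown later: the helper names, the tiny corpus, and the choice of one random neighbor per side are all assumptions.

```python
from collections import Counter
import random

def build_vocab(words, vocab_size):
    """Map the most frequent words to ids; id 0 is reserved for unknown words."""
    dictionary = {"UNK": 0}
    for word, _ in Counter(words).most_common(vocab_size - 1):
        dictionary[word] = len(dictionary)
    index_dictionary = {idx: word for word, idx in dictionary.items()}
    return dictionary, index_dictionary

def generate_pairs(ids, skip, rng):
    """For each center id, pick one random neighbor before it and one after it."""
    for i, center in enumerate(ids):
        before = ids[max(0, i - skip):i]
        after = ids[i + 1:i + skip + 1]
        for side in (before, after):
            if side:
                yield center, rng.choice(side)

def batches(pairs, batch_size):
    """Group the stream of pairs into lists of batch_size pairs."""
    batch = []
    for pair in pairs:
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []

# Step 1: the corpus as a list of words.
words = "the cat sat on the mat and the dog sat on the rug".split()
# Steps 2 and 3: frequency-ranked dictionaries.
dictionary, index_dictionary = build_vocab(words, vocab_size=6)
# Words to ids (unknown words become 0).
ids = [dictionary.get(w, 0) for w in words]
# Steps 4 and 5: word pairs, grouped into batches.
pair_stream = generate_pairs(ids, skip=2, rng=random.Random(0))
first_batch = next(batches(pair_stream, batch_size=4))

print(dictionary)
print(ids)
print(first_batch)
```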

Setting up the inputs and outputs

The model's input is the center word and its output is the target word; both use placeholders:

center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')

Set up the embedding matrix for the whole vocabulary:
embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name='embed_matrix')

Setting up the parameters

A single-layer network is used, with parameters w and b:
nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / (EMBED_SIZE ** 0.5)), name='nce_weight')
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

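Loss function

First look up the vectors for the batch:
embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

tf.nn.embedding_lookup uses the ids it is given to find the corresponding vectors in the embedding matrix (one row per id, counting from 0) and stacks them into a new matrix.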
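
For a single embedding matrix, the lookup behaves like row selection. The NumPy sketch below shows the analogous operation on a toy matrix; the sizes and values are made up, and it is meant as an analogy rather than a replacement for the TensorFlow call.

```python
import numpy as np

# Toy embedding matrix: 5 words in the vocabulary, 3 dimensions per word.
embed_matrix = np.arange(15, dtype=np.float32).reshape(5, 3)

# A batch of center-word ids (rows are counted from 0).
center_words = np.array([3, 0, 3])

# Row selection: one row of embed_matrix per id, stacked into a new matrix.
embed = embed_matrix[center_words]

print(embed.shape)  # (3, 3)
print(embed)
```

The same id may appear several times in a batch; each occurrence gets its own copy of the row.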

Computing the loss:
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                     biases=nce_bias,
                                     labels=target_words,
                                     inputs=embed,
                                     num_sampled=NUM_SAMPLED,  # number of negative samples
                                     num_classes=VOCAB_SIZE), name='loss')

The optimizer is usually mini-batch gradient descent:
optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
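
What the optimizer does at each step can be sketched on a toy loss. This is plain Python on a one-variable quadratic, chosen only for illustration; the learning rate of 0.1 is an assumption that suits this toy problem and is unrelated to the LEARNING_RATE used for word2vec.

```python
def gradient_descent(grad_fn, w, learning_rate, num_steps):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    for _ in range(num_steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0,
                           learning_rate=0.1, num_steps=100)
print(w_final)
```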

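Feeding the data and training iteratively

Read each batch of samples produced by the preprocessing step and feed it to the model:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    total_loss = 0.0  # running loss since the last report
    writer = tf.summary.FileWriter('./my_graph/no_frills/', sess.graph)
    for index in range(NUM_TRAIN_STEPS):
        centers, targets = next(batch_gen)  # keep drawing batches from the sample generator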
        loss_batch,_ = sess.run([loss,optimizer],feed_dict={center_words:centers,target_words:targets})

        total_loss += loss_batch
        if (index + 1) % SKIP_STEP == 0:
            print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
            total_loss = 0.0
    writer.close()

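The resulting word vectors

Once training is finished, save the word vectors, which are held in embed_matrix; embed_matrix.eval() returns a numpy.ndarray. So no matter how many layers are used in training, only the embedding matrix at the input needs to be kept.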
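
One simple way to keep the result is to write the array to disk with NumPy. In this sketch a random array stands in for the value of embed_matrix.eval(), and the file name and temporary directory are hypothetical.

```python
import os
import tempfile
import numpy as np

# Stand-in for the array returned by embed_matrix.eval() after training.
final_embed_matrix = np.random.uniform(-1.0, 1.0, size=(10, 4)).astype(np.float32)

# Save the matrix to disk; the location here is a temporary, hypothetical path.
path = os.path.join(tempfile.mkdtemp(), "embed_matrix.npy")
np.save(path, final_embed_matrix)

# Later, reload it; row i is the vector of the word whose id is i.
restored = np.load(path)
print(restored.shape)  # (10, 4)
```

The id-to-word dictionary from preprocessing has to be saved alongside it, otherwise the rows cannot be mapped back to words.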

Code

Putting the steps together, the complete implementation is shown below, including both preprocessing and training.

#coding=utf-8

from collections import Counter
import random
import os
import sys
sys.path.append('..')
import zipfile

import numpy as np
from six.moves import urllib
import tensorflow as tf

import utils

# Parameters for downloading data
DOWNLOAD_URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016
DATA_FOLDER = 'data/'
FILE_NAME = 'text8.zip'

def download(file_name, expected_bytes):
    """ Download the dataset text8 if it's not already downloaded """
    file_path = DATA_FOLDER + file_name
    if os.path.exists(file_path):
        print("Dataset ready")
        return file_path
    print(DOWNLOAD_URL + file_name)
    file_name, _ = urllib.request.urlretrieve(DOWNLOAD_URL + file_name, file_path)

    file_stat = os.stat(file_path)
    if file_stat.st_size == expected_bytes:
        print('Successfully downloaded the file', file_name)
    else:
        raise Exception('File ' + file_name +
                        ' might be corrupted. You should try downloading it with a browser.')
    return file_path

def read_data(file_path):
    """
    Read all the words into a list, splitting on whitespace.
    There should be about 17,005,207 tokens.
    """
    with zipfile.ZipFile(file_path) as f:
        words = tf.compat.as_str(f.read(f.namelist()[0])).split()
        # tf.compat.as_str() converts the input to a string
    return words

def build_vocab(words, vocab_size):
    """ Build vocabulary of VOCAB_SIZE most frequent words
    Keeps the vocab_size most frequent words and returns two dictionaries:
    word -> index and index -> word.
    """
    dictionary = dict()
    count = [('UNK', -1)]
    count.extend(Counter(words).most_common(vocab_size - 1))
    index = 0
    utils.make_dir('processed')
    with open('processed/vocab_1000.tsv', "w") as f:
        for word, _ in count:
            dictionary[word] = index
            if index < 1000:
                f.write(word + "\n")
            index += 1
    index_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, index_dictionary

def convert_words_to_index(words, dictionary):
    """
    Replace each word with its index in the dictionary (0, the 'UNK' index, for unknown words).
    """
    return [dictionary[word] if word in dictionary else 0 for word in words]

def generate_sample(index_words, context_window_size):
    """
    Generate training pairs for the skip-gram model.
    """
    for index, center in enumerate(index_words):
        context = random.randint(1, context_window_size)
        # Draw a random window size, so the targets may be just the nearest
        # neighbors or several words on each side of the center word.
        for target in index_words[max(0, index - context): index]:
            yield center, target
        # get the targets after the center word
        for target in index_words[index + 1: index + context + 1]:
            yield center, target

def get_batch(iterator, batch_size):
    """ Group a numerical stream into batches and yield them as Numpy arrays. """
    while True:
        center_batch = np.zeros(batch_size, dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for index in range(batch_size):
            center_batch[index], target_batch[index] = next(iterator)
            # next() pulls one generated word pair at a time from the iterator:
            # (center word index, index of a word within the window around it)
        yield center_batch, target_batch

def process_data(vocab_size, batch_size, skip_window):
    #file_path = download(FILE_NAME, EXPECTED_BYTES)
    file_path = "E:/workspace/DeepLearnings/tensorflow-learning/data/text8.zip"
    print(file_path)
    words = read_data(file_path)  # read every word of the training corpus into a list
    dictionary, _ = build_vocab(words, vocab_size)  # build the vocabulary of the given size from the word frequencies in the corpus
    index_words = convert_words_to_index(words, dictionary)  # replace each word in the corpus with its index, i.e. text -> numbers
    del words  # free the memory held by the word list
    single_gen = generate_sample(index_words, skip_window)  # build word pairs from each center word and the words within skip_window of it
    return get_batch(single_gen, batch_size)

def get_index_vocab(vocab_size):
    file_path = download(FILE_NAME, EXPECTED_BYTES)
    words = read_data(file_path)
    return build_vocab(words, vocab_size)

The training implementation:

#coding=utf-8

""" The no frills implementation of word2vec skip-gram model using NCE loss.
Author: Chip Huyen
Prepared for the class CS 20SI: "TensorFlow for Deep Learning Research"
cs20si.stanford.edu
"""

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

from process_data import process_data

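VOCAB_SIZE = 50000 # vocabulary size
BATCH_SIZE = 128 # number of samples in each batch
EMBED_SIZE = 128 # dimension of the word embedding vectors
SKIP_WINDOW = 1 # the context window: how many words to take on each side of the center word
NUM_SAMPLED = 64    # number of negative examples to sample
LEARNING_RATE = 1.0
NUM_TRAIN_STEPS = 20000
SKIP_STEP = 2000 # how many steps to run between loss reports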

def word2vec(batch_gen):
    """ Build the graph for word2vec model and train it """
    # Use placeholders for the input and output, i.e. the center words and the target words
    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
        target_words = tf.placeholder(tf.int32,shape=[BATCH_SIZE,1],name='target_words')

    # Embedding matrix of shape [vocabulary size, embedding size]
    with tf.name_scope('embedding_matrix'):
        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name='embed_matrix')

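    # Model
    # tf.nn.embedding_lookup uses the given ids to find the corresponding rows of the
    # embedding matrix (one row per id, counting from 0) and stacks them into a new matrix:
    # embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')
    # Set up w and b, then the loss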
    with tf.name_scope('loss'):
        embed = tf.nn.embedding_lookup(embed_matrix,center_words,name='embed')
        # Step 4: construct variables for NCE loss
        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],stddev=1.0 / (EMBED_SIZE ** 0.5)), name='nce_weight')
        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')
        # NCE loss
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                            biases=nce_bias,
                                            labels=target_words,
                                            inputs=embed,
                                            num_sampled=NUM_SAMPLED,  # number of negative samples
                                            num_classes=VOCAB_SIZE), name='loss')


    # Optimizer
    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

    # Training loop
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0  # running loss since the last report
        writer = tf.summary.FileWriter('./my_graph/no_frills/', sess.graph)
        for index in range(NUM_TRAIN_STEPS):
            centers, targets = next(batch_gen)
            loss_batch,_ = sess.run([loss,optimizer],feed_dict={center_words:centers,target_words:targets})

            total_loss += loss_batch
            if (index + 1) % SKIP_STEP == 0:
                print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                total_loss = 0.0
        writer.close()
        # After training, the result is embed_matrix; embed_matrix.eval() returns a numpy.ndarray
def main():
    batch_gen = process_data(VOCAB_SIZE, BATCH_SIZE, SKIP_WINDOW)
    word2vec(batch_gen)

if __name__ == '__main__':
    main()

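Applications

Once words have been converted to word vectors, each word is a one-dimensional array of numbers that can be used in many mathematical models.

For example, the vectors can serve as input features for other models; the similarity between words can be computed from their vectors; and vectors for phrases or sentences can be computed from word vectors (although there are also models that train phrase and sentence vectors directly).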
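
Two of these uses, finding similar words and building a sentence vector by averaging, can be sketched as follows. The vectors are made-up two-dimensional values, not the output of a trained model, and averaging is just one simple way to combine word vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, top_k=2):
    """Rank the other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

def sentence_vector(words, vectors):
    """Average the vectors of the words in a sentence, dimension by dimension."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Made-up 2-dimensional vectors, not the output of a trained model.
vectors = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.3],
    "car": [-0.2, 0.9],
    "truck": [-0.3, 0.8],
}

print(most_similar("cat", vectors))
print(sentence_vector(["cat", "dog"], vectors))
```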

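In short, word vectors can serve as an aid or a foundation for other applications. Their role is similar to that of the LDA topic model, which also yields a vector representation of words. A topic model, however, describes each word intuitively by its probability under different topics, whereas word vectors merely rearrange the words within a single space. Whether to use the LDA representation or word vectors depends on the application.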

Model reuse

The word2vec training code above does not use a class, which makes it hard to reuse. The model can be restructured as follows:

class SkipGramModel:
    def __init__(self, params):
       pass
    def _create_placeholders(self):
       pass
    def _create_embedding(self):
       pass
    def _create_loss(self):
       pass
    def _create_optimizer(self):
       pass
    def _create_summaries(self):
       pass
    def build_graph(self):
       pass

After that, simply fill in the corresponding logic in each method; see the detailed code implementation.
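
How the pieces might fit together can be sketched without any framework. In this hypothetical skeleton each method only records that it ran, to show one plausible order in which build_graph could call the steps; the class name and the params keys are assumptions, and the real methods would contain the TensorFlow code from the sections above.

```python
class SkipGramModelSketch:
    """Framework-free skeleton: each step only records that it ran."""

    def __init__(self, params):
        self.params = params  # e.g. vocabulary size, embedding size
        self.steps_run = []

    def _create_placeholders(self):
        self.steps_run.append("placeholders")

    def _create_embedding(self):
        self.steps_run.append("embedding")

    def _create_loss(self):
        self.steps_run.append("loss")

    def _create_optimizer(self):
        self.steps_run.append("optimizer")

    def _create_summaries(self):
        self.steps_run.append("summaries")

    def build_graph(self):
        """Run the construction steps in dependency order."""
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()
        self._create_summaries()

model = SkipGramModelSketch({"vocab_size": 50000, "embed_size": 128})
model.build_graph()
print(model.steps_run)
```

Keeping the hyperparameters in params instead of module-level constants is what lets several models with different settings be built from the same class.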
