Learning note cb012: LSTM simple implementation, complete implementation, Torch, and a word2vec + LSTM chatbot trained on a novel

Time: 2019-12-10

The most practical way to truly master an algorithm is to implement it completely yourself.

LSTM (long short-term memory) is a special kind of recurrent neural network whose neurons store historical memory, solving the problem that statistical approaches to natural language processing can only consider the most recent n words and ignore everything older. Applications: word representation (embedding), sequence-to-sequence learning, machine translation, speech recognition, and so on.

First, a binary adder based on LSTM ideas, implemented in a little over 100 lines of plain Python, from https://iamtrask.github.io/20… (Chinese translation: http://blog.csdn.net/zzukun/a…):

import copy, numpy as np
np.random.seed(0)
First, import copy and the numpy library for matrix operations, and fix the random seed so runs are reproducible.

def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

Declare the sigmoid activation function, a basic building block of neural networks. Commonly used activation functions are sigmoid, tanh, relu, etc.; sigmoid's value range is (0, 1) and tanh's is (-1, 1). x is a vector and the returned output is a vector.
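As a quick illustration of those ranges (a small sketch, assuming the definitions above have been executed):

    print sigmoid(0)                    # 0.5, the midpoint of the (0, 1) range
    print sigmoid(100), sigmoid(-100)   # ~1.0 and ~0.0, the two saturation ends
    print np.tanh(0), np.tanh(100)      # 0.0 and ~1.0; tanh covers (-1, 1)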

def sigmoid_output_to_derivative(output):
    return output*(1-output)

Declare the sigmoid derivative, expressed in terms of the function's output: sigmoid'(x) = output * (1 - output).
Idea of the adder: binary addition works bit by bit, keeping a sum bit and a carry bit at each position. For training we draw random samples c = a + b; inputting a and b and outputting c is the whole LSTM prediction process, and training learns the conversion matrices and weights that map the bits of a and b to the bits of c. For example, 5 + 9 = 14 is 00000101 + 00001001 = 00001110 in binary, and the carry produced at the lowest bit must be remembered when computing the next bit, which is exactly what the recurrent hidden state is for.

int2binary = {}
Declare a dictionary mapping integers to their binary representations; precomputing it once is faster than converting on the fly.

binary_dim = 8
largest_number = pow(2,binary_dim)
Declare the binary dimension, 8. Eight bits can express 2^8 = 256 different integers, stored as largest_number.

binary = np.unpackbits(
    np.array([range(largest_number)],dtype=np.uint8).T,axis=1)

for i in range(largest_number):
    int2binary[i] = binary[i]

Store the integer to binary conversion dictionary in advance.
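A quick sanity check of the lookup table (a small sketch; the printed arrays follow from np.unpackbits, which stores the most significant bit first):

    print int2binary[5]    # [0 0 0 0 0 1 0 1]
    print int2binary[9]    # [0 0 0 0 1 0 0 1]
    print int2binary[14]   # [0 0 0 0 1 1 1 0], i.e. 5 + 9, the kind of target the network must learn bit by bit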

alpha = 0.1
input_dim = 2
hidden_dim = 16
output_dim = 1
Set the parameters: alpha is the learning rate; input_dim is the input-layer dimension (2, since the inputs are one bit of a and one bit of b); hidden_dim is the hidden-layer dimension (16 hidden neurons); output_dim is the output-layer dimension (1, one bit of c). Accordingly the input-to-hidden weight matrix is 2x16, the hidden-to-output weight matrix is 16x1, and the hidden-to-hidden weight matrix is 16x16.

synapse_0 = 2*np.random.random((input_dim,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,output_dim)) - 1
synapse_h = 2*np.random.random((hidden_dim,hidden_dim)) - 1
np.random.random generates random floats in [0, 1); the transform 2x - 1 maps them into the range [-1, 1).

synapse_0_update = np.zeros_like(synapse_0)
synapse_1_update = np.zeros_like(synapse_1)
synapse_h_update = np.zeros_like(synapse_h)
Declare three accumulators for the weight updates (the deltas).

for j in range(10000):
Train for 10000 iterations.

    a_int = np.random.randint(largest_number/2)
    a = int2binary[a_int]
    b_int = np.random.randint(largest_number/2)
    b = int2binary[b_int]
    c_int = a_int + b_int
    c = int2binary[c_int]
Randomly generate a sample: binary a, b, c with c = a + b; a_int, b_int, c_int are the corresponding integer forms. Drawing a and b below largest_number/2 guarantees that the sum still fits in 8 bits.

    d = np.zeros_like(c)
d stores the model's prediction of c.

    overallError = 0
Accumulated error, used to observe how well the model is doing.
    layer_2_deltas = list()
Stores the residuals (deltas) of the second layer (the output layer); the formula for the output-layer residual is derived at http://deeplearning.stanford…

    layer_1_values = list()
    layer_1_values.append(np.zeros(hidden_dim))
Stores the first-layer (hidden-layer) outputs; a zero vector is appended to serve as the "previous time step" value for the first bit.

    for position in range(binary_dim):
Traverse each bit of the binary numbers, from the lowest-order bit upward.

        X = np.array([[a[binary_dim - position - 1],b[binary_dim - position - 1]]])
        y = np.array([[c[binary_dim - position - 1]]]).T
X and y are one bit position of the sample's input and output: X holds two values, the bits of a and b at this position, and y holds the corresponding bit of c. Each sample is split into its binary bits for training; carry-propagating binary addition suits LSTM-style long- and short-term memory training because each sample's 8 bits form a time series.
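For example, with a = 5 (00000101), b = 9 (00001001) and c = 14 (00001110): at position 0 the index binary_dim - position - 1 = 7 selects the lowest-order bits, so X = [[1, 1]] and y = [[0]], and the carry produced here is what the hidden state must remember for the next position.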

        layer_1 = sigmoid(np.dot(X,synapse_0) + np.dot(layer_1_values[-1],synapse_h))
Formula: h(t) = sigmoid(X · synapse_0 + h(t-1) · synapse_h), combining the current input with the previous hidden state.

        layer_2 = sigmoid(np.dot(layer_1,synapse_1))
Formula: layer_2 = sigmoid(layer_1 · synapse_1), the output layer.

        layer_2_error = y - layer_2
Calculate the error between the predicted value and the true value.

        layer_2_deltas.append((layer_2_error)*sigmoid_output_to_derivative(layer_2))
Backpropagation: compute this bit's output-layer delta and append it to layer_2_deltas.

        overallError += np.abs(layer_2_error[0])
Accumulate the total error for display and observation.

        d[binary_dim - position - 1] = np.round(layer_2[0][0])
Store the predicted bit at this position.

        layer_1_values.append(copy.deepcopy(layer_1))
Store this time step's hidden-layer values for the next bit and for backpropagation.

    future_layer_1_delta = np.zeros(hidden_dim)
Holds the hidden-layer delta coming from the next (later) time step; initialize it to zeros.

    for position in range(binary_dim):
Traverse each bit again, this time starting from the highest-order position, for backpropagation.

        X = np.array([[a[position],b[position]]])
Take out the X value; updates start from the most significant bit, propagating the error backward one time step per iteration.

        layer_1 = layer_1_values[-position-1]
Take out the hidden-layer output of the corresponding time step.

        prev_layer_1 = layer_1_values[-position-2]
Take out the hidden-layer output of the previous time step.

        layer_2_delta = layer_2_deltas[-position-1]
Take out the output-layer delta of the corresponding bit.

        layer_1_delta = (future_layer_1_delta.dot(synapse_h.T) + layer_2_delta.dot(synapse_1.T)) * sigmoid_output_to_derivative(layer_1)
The standard backpropagation formula for the hidden layer: deltas flow in both from the output layer and from the next time step's hidden layer.

        synapse_1_update += np.atleast_2d(layer_1).T.dot(layer_2_delta)
Accumulate the weight update: the gradient of a weight matrix is the product of this layer's output and the next layer's delta.

        synapse_h_update += np.atleast_2d(prev_layer_1).T.dot(layer_1_delta)
Accumulate the hidden-to-hidden update: the previous time step's hidden output times this time step's delta.

        synapse_0_update += X.T.dot(layer_1_delta)
Accumulate the input-layer weight update.

        future_layer_1_delta = layer_1_delta
Record this time step's hidden-layer delta for the next (earlier) iteration.

    synapse_0 += synapse_0_update * alpha
    synapse_1 += synapse_1_update * alpha
    synapse_h += synapse_h_update * alpha
Apply the accumulated weight updates, scaled by the learning rate alpha.

    synapse_0_update *= 0
    synapse_1_update *= 0
    synapse_h_update *= 0
Reset the update accumulators to zero.

    if(j % 1000 == 0):
        print "Error:" + str(overallError)
        print "Pred:" + str(d)
        print "True:" + str(c)
        out = 0
        for index,x in enumerate(reversed(d)):
            out += x*pow(2,index)
        print str(a_int) + " + " + str(b_int) + " = " + str(out)
        print "------------"

Every 1000 training iterations, output the total error so the convergence process can be observed at runtime.
This is the simplest LSTM-style implementation: it has no bias variables and only a bare-bones network. The complete implementation follows.

Next, a full LSTM implementation in Python. The code comes from https://github.com/nicodjimen…, the author's explanation is at http://nicodjimenez.github.io…, and the process is illustrated in the figures at http://colah.github.io/posts/.

import random
import numpy as np
import math

def sigmoid(x):
    return 1. / (1 + np.exp(-x))

Declare the sigmoid function.

def rand_arr(a, b, *args):
    np.random.seed(0)
    return np.random.rand(*args) * (b - a) + a

Generate a random matrix with values in [a, b); the shape is specified by args.
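One detail worth noting: rand_arr reseeds numpy on every call, so two calls with the same shape return the same matrix; the four weight matrices below therefore all start out identical. A hypothetical call rand_arr(-0.1, 0.1, 2, 3) returns a 2x3 matrix with entries uniform in [-0.1, 0.1).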

class LstmParam:
    def __init__(self, mem_cell_ct, x_dim):
        self.mem_cell_ct = mem_cell_ct
        self.x_dim = x_dim
        concat_len = x_dim + mem_cell_ct
        # weight matrices
        self.wg = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wi = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wf = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        self.wo = rand_arr(-0.1, 0.1, mem_cell_ct, concat_len)
        # bias terms
        self.bg = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bi = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bf = rand_arr(-0.1, 0.1, mem_cell_ct)
        self.bo = rand_arr(-0.1, 0.1, mem_cell_ct)
        # diffs (derivative of loss function w.r.t. all parameters)
        self.wg_diff = np.zeros((mem_cell_ct, concat_len))
        self.wi_diff = np.zeros((mem_cell_ct, concat_len))
        self.wf_diff = np.zeros((mem_cell_ct, concat_len))
        self.wo_diff = np.zeros((mem_cell_ct, concat_len))
        self.bg_diff = np.zeros(mem_cell_ct)
        self.bi_diff = np.zeros(mem_cell_ct)
        self.bf_diff = np.zeros(mem_cell_ct)
        self.bo_diff = np.zeros(mem_cell_ct)

The LstmParam class carries the parameters: mem_cell_ct is the number of LSTM memory cells, x_dim is the input dimension, and concat_len is the sum of the two. wg is the input-node weight matrix; wi, wf, wo are the input-gate, forget-gate, and output-gate weight matrices; bg, bi, bf, bo are the corresponding biases. wg_diff, wi_diff, wf_diff, wo_diff hold the gradients of the loss with respect to the weights, and bg_diff, bi_diff, bf_diff, bo_diff the gradients with respect to the biases. Initialization sizes everything by these dimensions and zeroes the gradient matrices.

    def apply_diff(self, lr = 1):
        self.wg -= lr * self.wg_diff
        self.wi -= lr * self.wi_diff
        self.wf -= lr * self.wf_diff
        self.wo -= lr * self.wo_diff
        self.bg -= lr * self.bg_diff
        self.bi -= lr * self.bi_diff
        self.bf -= lr * self.bf_diff
        self.bo -= lr * self.bo_diff
        # reset diffs to zero
        self.wg_diff = np.zeros_like(self.wg)
        self.wi_diff = np.zeros_like(self.wi)
        self.wf_diff = np.zeros_like(self.wf)
        self.wo_diff = np.zeros_like(self.wo)
        self.bg_diff = np.zeros_like(self.bg)
        self.bi_diff = np.zeros_like(self.bi)
        self.bf_diff = np.zeros_like(self.bf)
        self.bo_diff = np.zeros_like(self.bo)

apply_diff defines the weight update: subtract the learning-rate-scaled gradients, then reset the gradient matrices to zero.

class LstmState:
    def __init__(self, mem_cell_ct, x_dim):
        self.g = np.zeros(mem_cell_ct)
        self.i = np.zeros(mem_cell_ct)
        self.f = np.zeros(mem_cell_ct)
        self.o = np.zeros(mem_cell_ct)
        self.s = np.zeros(mem_cell_ct)
        self.h = np.zeros(mem_cell_ct)
        self.bottom_diff_h = np.zeros_like(self.h)
        self.bottom_diff_s = np.zeros_like(self.s)
        self.bottom_diff_x = np.zeros(x_dim)

LstmState stores the LSTM cell's state: g, i, f, o, s, h, where s is the internal state (the memory) and h is the hidden-layer output, plus the bottom diffs used during backpropagation.

class LstmNode:
    def __init__(self, lstm_param, lstm_state):
        # store reference to parameters and to activations
        self.state = lstm_state
        self.param = lstm_param
        # non-recurrent input to node
        self.x = None
        # non-recurrent input concatenated with recurrent input
        self.xc = None

An LstmNode corresponds to one input in the sequence: x is the input sample, and xc is x spliced with the recurrent input using hstack (hstack concatenates horizontally, vstack vertically).
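A one-line illustration of the splice (values hypothetical): np.hstack((np.array([1, 2]), np.array([3, 4]))) gives array([1, 2, 3, 4]), so xc holds x(t) followed by h(t-1).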

    def bottom_data_is(self, x, s_prev = None, h_prev = None):
        # if this is the first lstm node in the network
        if s_prev is None: s_prev = np.zeros_like(self.state.s)
        if h_prev is None: h_prev = np.zeros_like(self.state.h)
        # save data for use in backprop
        self.s_prev = s_prev
        self.h_prev = h_prev

        # concatenate x(t) and h(t-1)
        xc = np.hstack((x,  h_prev))
        self.state.g = np.tanh(np.dot(self.param.wg, xc) + self.param.bg)
        self.state.i = sigmoid(np.dot(self.param.wi, xc) + self.param.bi)
        self.state.f = sigmoid(np.dot(self.param.wf, xc) + self.param.bf)
        self.state.o = sigmoid(np.dot(self.param.wo, xc) + self.param.bo)
        self.state.s = self.state.g * self.state.i + s_prev * self.state.f
        self.state.h = self.state.s * self.state.o
        self.x = x
        self.xc = xc

Bottom and top are two directions: input samples enter from the bottom, and backpropagation flows from the top back down. bottom_data_is is the forward pass for one input: x and the previous output are spliced into one vector, then g, i, f, o are each computed with a W·xc + b formula, activated with tanh or sigmoid.
Each time step runs four small neural network layers (activation functions): the forget gate, which acts directly on the memory s; the input gate, which decides in what "proportion" the current input affects the memory; the input node (tanh), whose (-1, 1) range lets that influence be positive or negative; and the output gate, which shapes what each time step emits. The output depends not only on the input sample x and the previous time step's output but also on the memory s, a design meant to imitate the memory of biological neurons.
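Written out as formulas matching bottom_data_is above (with * meaning element-wise multiplication):

    g(t) = tanh(Wg · xc(t) + bg)       (input node)
    i(t) = sigmoid(Wi · xc(t) + bi)    (input gate)
    f(t) = sigmoid(Wf · xc(t) + bf)    (forget gate)
    o(t) = sigmoid(Wo · xc(t) + bo)    (output gate)
    s(t) = g(t) * i(t) + s(t-1) * f(t)
    h(t) = s(t) * o(t)

Note that this code computes h(t) = s(t) * o(t) directly; the common textbook variant inserts one more squashing step, h(t) = tanh(s(t)) * o(t).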

    def top_diff_is(self, top_diff_h, top_diff_s):
        # notice that top_diff_s is carried along the constant error carousel
        ds = self.state.o * top_diff_h + top_diff_s
        do = self.state.s * top_diff_h
        di = self.state.g * ds
        dg = self.state.i * ds
        df = self.s_prev * ds

        # diffs w.r.t. vector inside sigma / tanh function
        di_input = (1. - self.state.i) * self.state.i * di
        df_input = (1. - self.state.f) * self.state.f * df
        do_input = (1. - self.state.o) * self.state.o * do
        dg_input = (1. - self.state.g ** 2) * dg

        # diffs w.r.t. inputs
        self.param.wi_diff += np.outer(di_input, self.xc)
        self.param.wf_diff += np.outer(df_input, self.xc)
        self.param.wo_diff += np.outer(do_input, self.xc)
        self.param.wg_diff += np.outer(dg_input, self.xc)
        self.param.bi_diff += di_input
        self.param.bf_diff += df_input
        self.param.bo_diff += do_input
        self.param.bg_diff += dg_input

        # compute bottom diff
        dxc = np.zeros_like(self.xc)
        dxc += np.dot(self.param.wi.T, di_input)
        dxc += np.dot(self.param.wf.T, df_input)
        dxc += np.dot(self.param.wo.T, do_input)
        dxc += np.dot(self.param.wg.T, dg_input)

        # save bottom diffs
        self.state.bottom_diff_s = ds * self.state.f
        self.state.bottom_diff_x = dxc[:self.param.x_dim]
        self.state.bottom_diff_h = dxc[self.param.x_dim:]

Backpropagation is the core of the whole training process. Suppose the LSTM outputs the prediction h(t) at time t while the actual value is y(t), and let the loss at each step be l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2. The final goal is to minimize the total loss L with gradient descent: find the weights w at which a small change of w no longer decreases L, i.e. the gradient dL/dw is 0.

dL/dw measures how much L changes per unit change of w; dh_i(t)/dw measures how much the i-th memory cell's output at time t changes per unit change of w; and dL/dh_i(t) measures how much L changes per unit change of h_i(t). The product (dL/dh_i(t)) * (dh_i(t)/dw) is the contribution through the i-th memory cell at time step t, and summing over all i from 1 to M and all t from 1 to T gives the whole dL/dw.

dL/dh_i(t) itself sums local losses over the sequence, but h(t) only affects the local losses from time t onward, so define L(t) = ∑ l(s) for s from t to T, the remaining loss from time t.

Then L(t) = l(t) + L(t+1), so dL(t)/dh(t) = dl(t)/dh(t) + dL(t+1)/dh(t): the derivative at the current time step is obtained from the derivative at the next time step, recursing backwards from the last step, where simply dL(T)/dh(T) = dl(T)/dh(T).

class LstmNetwork():
    def __init__(self, lstm_param):
        self.lstm_param = lstm_param
        self.lstm_node_list = []
        # input sequence
        self.x_list = []

    def y_list_is(self, y_list, loss_layer):
        """
        Updates diffs by setting target sequence
        with corresponding loss layer.
        Will *NOT* update parameters.  To update parameters,
        call self.lstm_param.apply_diff()
        """
        assert len(y_list) == len(self.x_list)
        idx = len(self.x_list) - 1
        # first node only gets diffs from label ...
        loss = loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
        diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
        # here s is not affecting loss due to h(t+1), hence we set equal to zero
        diff_s = np.zeros(self.lstm_param.mem_cell_ct)
        self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
        idx -= 1

        ### ... following nodes also get diffs from next nodes, hence we add diffs to diff_h
        ### we also propagate error along constant error carousel using diff_s
        while idx >= 0:
            loss += loss_layer.loss(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h = loss_layer.bottom_diff(self.lstm_node_list[idx].state.h, y_list[idx])
            diff_h += self.lstm_node_list[idx + 1].state.bottom_diff_h
            diff_s = self.lstm_node_list[idx + 1].state.bottom_diff_s
            self.lstm_node_list[idx].top_diff_is(diff_h, diff_s)
            idx -= 1

        return loss

y_list_is traverses idx from the last time step down to the first: each node's diff_h is the loss layer's bottom_diff plus the next node's bottom_diff_h, and diff_s is the next node's bottom_diff_s; both are fed into top_diff_is.
loss_layer.bottom_diff:

def bottom_diff(self, pred, label):
    diff = np.zeros_like(pred)
    diff[0] = 2 * (pred[0] - label)
    return diff

Here l(t) = f(h(t), y(t)) = ||h(t) - y(t)||^2 is applied to the first element of the hidden vector only, and its derivative is l'(t) = 2 * (pred[0] - label).
Next, how the loss depends on the internal state s: when s(t) changes it affects the loss through h(t) and, via the next time step, through h(t+1); at the last time step there is no h(t+1), so diff_s starts at zero there, and dL(t)/ds(t) is deduced step by step from t+1 back to t using (dL(t)/dh(t)) * (dh(t)/ds(t)) plus the carried term.
In the code, self.state.h = self.state.s * self.state.o, i.e. h(t) = s(t) * o(t), so dh(t)/ds(t) = o(t), and dL(t)/dh(t) is top_diff_h.

About the name top_diff_is: bottom means the input side of a neural network layer and top the output side, the same terminology Caffe uses.
def top_diff_is(self, top_diff_h, top_diff_s):
top_diff_h is the current time step's dL(t)/dh(t), and top_diff_s is the dL(t)/ds(t) carried back from time step t+1.

    ds = self.state.o * top_diff_h + top_diff_s
    do = self.state.s * top_diff_h
    di = self.state.g * ds
    dg = self.state.i * ds
    df = self.s_prev * ds

The prefix d means the derivative of the error L with respect to that term.
ds is the current time step's dL(t)/ds(t): o(t) * top_diff_h plus the carried top_diff_s.
do is dL(t)/do(t): since h(t) = s(t) * o(t), dh(t)/do(t) = s(t), so dL(t)/do(t) = (dL(t)/dh(t)) * (dh(t)/do(t)) = top_diff_h * s(t).
di is dL(t)/di(t): since s(t) = f(t) * s(t-1) + i(t) * g(t), ds(t)/di(t) = g(t), so dL(t)/di(t) = (dL(t)/ds(t)) * (ds(t)/di(t)) = ds * g(t).
dg is dL(t)/dg(t) = (dL(t)/ds(t)) * (ds(t)/dg(t)) = ds * i(t).
df is dL(t)/df(t) = (dL(t)/ds(t)) * (ds(t)/df(t)) = ds * s(t-1).

    di_input = (1. - self.state.i) * self.state.i * di
    df_input = (1. - self.state.f) * self.state.f * df
    do_input = (1. - self.state.o) * self.state.o * do
    dg_input = (1. - self.state.g ** 2) * dg

These lines apply the activation-function derivatives (sigmoid derivative out * (1 - out), tanh derivative 1 - out^2). For example di_input = (1 - i) * i * di converts "how much L changes per unit change of the input gate's output" into "how much L changes per unit change of the gate's pre-activation input", i.e. dL(t)/di_input(t); dg_input uses the tanh derivative because g is tanh-activated.

    self.param.wi_diff += np.outer(di_input, self.xc)
    self.param.wf_diff += np.outer(df_input, self.xc)
    self.param.wo_diff += np.outer(do_input, self.xc)
    self.param.wg_diff += np.outer(dg_input, self.xc)
    self.param.bi_diff += di_input
    self.param.bf_diff += df_input
    self.param.bo_diff += do_input
    self.param.bg_diff += dg_input

The w*_diff matrices accumulate the weight gradients and the b*_diff vectors the bias gradients, to be consumed later by apply_diff.

    dxc = np.zeros_like(self.xc)
    dxc += np.dot(self.param.wi.T, di_input)
    dxc += np.dot(self.param.wf.T, df_input)
    dxc += np.dot(self.param.wo.T, do_input)
    dxc += np.dot(self.param.wg.T, dg_input)

Accumulate the diff of the spliced input xc: x and h enter the cell in four places (the four gates), so the four contributions are summed into dxc.

    self.state.bottom_diff_s = ds * self.state.f
    self.state.bottom_diff_x = dxc[:self.param.x_dim]
    self.state.bottom_diff_h = dxc[self.param.x_dim:]

bottom_diff_s = ds * f expresses that a change of s at time t-1 reaches the loss through the forget gate at time t. dxc is the diff of the horizontally merged [x, h] vector, so slicing it yields the two parts bottom_diff_x and bottom_diff_h.

    def x_list_clear(self):
        self.x_list = []

    def x_list_add(self, x):
        self.x_list.append(x)
        if len(self.x_list) > len(self.lstm_node_list):
            # need to add new lstm node, create new state mem
            lstm_state = LstmState(self.lstm_param.mem_cell_ct, self.lstm_param.x_dim)
            self.lstm_node_list.append(LstmNode(self.lstm_param, lstm_state))

        # get index of most recent x input
        idx = len(self.x_list) - 1
        if idx == 0:
            # no recurrent inputs yet
            self.lstm_node_list[idx].bottom_data_is(x)
        else:
            s_prev = self.lstm_node_list[idx - 1].state.s
            h_prev = self.lstm_node_list[idx - 1].state.h
            self.lstm_node_list[idx].bottom_data_is(x, s_prev, h_prev)

x_list_add feeds one input x: it creates an LstmNode (with fresh state) if needed, then runs the forward pass, passing the previous node's s and h as the recurrent inputs.
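A minimal usage sketch (hypothetical values, assuming the classes above):

    param = LstmParam(mem_cell_ct=100, x_dim=50)
    net = LstmNetwork(param)
    net.x_list_add(np.random.random(50))   # node 0: s_prev and h_prev default to zero vectors
    net.x_list_add(np.random.random(50))   # node 1: receives node 0's state.s and state.h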

def example_0():
    # learns to repeat simple sequence from random inputs
    np.random.seed(0)

    # parameters for input data dimension and lstm cell count
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)
    y_list = [-0.5, 0.2, 0.1, -0.5]
    input_val_arr = [np.random.random(x_dim) for _ in y_list]

    for cur_iter in range(100):
        print "cur iter: ", cur_iter
        for ind in range(len(y_list)):
            lstm_net.x_list_add(input_val_arr[ind])
            print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        print "loss: ", loss
        lstm_param.apply_diff(lr=0.1)
        lstm_net.x_list_clear()

Initialize LstmParam with 100 memory cells and input dimension x_dim = 50; initialize the LstmNetwork training model; generate 4 groups of 50 random numbers and train them against the targets [-0.5, 0.2, 0.1, -0.5]; each step feeds 50 random numbers and one y value, for 100 iterations.
Next, feed the LSTM a series of consecutive primes and have it estimate the next prime. A small test: generate the primes below 100, slide a window of 50 primes as x with the 51st prime as y, take 10 such samples, and train for 10000 iterations; the mean square loss drops from 0.17973 to 1.05172e-06, almost completely correct:

import numpy as np
import sys

from lstm import LstmParam, LstmNetwork

class ToyLossLayer:
    """
    Computes square loss with first element of hidden layer array.
    """
    @classmethod
    def loss(self, pred, label):
        return (pred[0] - label) ** 2

    @classmethod
    def bottom_diff(self, pred, label):
        diff = np.zeros_like(pred)
        diff[0] = 2 * (pred[0] - label)
        return diff

class Primes:
    def __init__(self):
        self.primes = list()
        for i in range(2, 100):
            is_prime = True
            for j in range(2, i-1):
                if i % j == 0:
                    is_prime = False
            if is_prime:
                self.primes.append(i)
        self.primes_count = len(self.primes)

    def get_sample(self, x_dim, y_dim, index):
        result = np.zeros((x_dim+y_dim))
        for i in range(index, index + x_dim + y_dim):
            result[i-index] = self.primes[i%self.primes_count]/100.0
        return result

def example_0():
    mem_cell_ct = 100
    x_dim = 50
    concat_len = x_dim + mem_cell_ct
    lstm_param = LstmParam(mem_cell_ct, x_dim)
    lstm_net = LstmNetwork(lstm_param)

    primes = Primes()
    x_list = []
    y_list = []
    for i in range(0, 10):
        sample = primes.get_sample(x_dim, 1, i)
        x = sample[0:x_dim]
        y = sample[x_dim:x_dim+1].tolist()[0]
        x_list.append(x)
        y_list.append(y)

    for cur_iter in range(10000):
        if cur_iter % 1000 == 0:
            print "y_list=", y_list
        for ind in range(len(y_list)):
            lstm_net.x_list_add(x_list[ind])
            if cur_iter % 1000 == 0:
                print "y_pred[%d] : %f" % (ind, lstm_net.lstm_node_list[ind].state.h[0])

        loss = lstm_net.y_list_is(y_list, ToyLossLayer)
        if cur_iter % 1000 == 0:
            print "loss: ", loss
        lstm_param.apply_diff(lr=0.01)
        lstm_net.x_list_clear()

if __name__ == "__main__":
    example_0()

All primes are divided by 100 so that every training value stays below 1, matching the small range of the model's output.
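For example (computable by hand, assuming the Primes class above): Primes().get_sample(3, 1, 0) returns [0.02, 0.03, 0.05, 0.07], the first three scaled primes as x and the fourth as the target y.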

Torch is a deep learning framework. 1) TensorFlow, pushed by Google, is the most popular today and suits both small experiments and large-scale computation; Python-based; its drawbacks are a relatively steep learning curve and average speed. 2) Torch, pushed by Facebook, is used for small-scale experiments and has many open-source applications; based on Lua, it is quick to get started with and well documented online; the drawback is that Lua is a niche language. 3) MXNet, pushed by Amazon, is mainly used for large-scale computation, based on Python and R; the drawback is that there are few open-source projects for it. 4) Caffe, from Berkeley, for large-scale computation, based on C++ and Python; the drawback is that development is not very convenient. 5) Theano, average speed, Python-based, generally well reviewed.

There are many implementation projects of LSTM on torch GitHub.

Install Torch on a Mac: https://github.com/torch/torc… .

git clone https://github.com/torch/dist… ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
If the Qt installation fails, install it yourself:

brew install cartr/qt4/qt
After installation, the activation line needs to be added to ~/.bash_profile manually:

. ~/torch/install/bin/torch-activate
After source ~/.bash_profile, execute th to use Torch.
Installing iTorch depends on:

brew install zeromq
brew install openssl
luarocks install luacrypto OPENSSL_DIR=/usr/local/opt/openssl/

git clone https://github.com/facebook/i…
cd iTorch
luarocks make
Now implement image recognition with a convolutional neural network.
Create pattern_recognition.lua:

require 'nn'
require 'paths'
if (not paths.filep("cifar10torchsmall.zip")) then
    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
    os.execute('unzip cifar10torchsmall.zip')
end
trainset = torch.load('cifar10-train.t7')
testset = torch.load('cifar10-test.t7')
classes = {'airplane', 'automobile', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck'}
setmetatable(trainset,
    {__index = function(t, i)
        return {t.data[i], t.label[i]}
    end}
);
trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor

function trainset:size()
    return self.data:size(1)
end
mean = {} -- store the mean, to normalize the test set in the future
stdv = {} -- store the standard-deviation for the future
for i=1,3 do -- over each image channel
    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation
    print('Channel ' .. i .. ', Mean: ' .. mean[i])
    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation
    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])
    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end
net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2)) -- a max-pooling operation over 2x2 windows
net:add(nn.SpatialConvolution(6, 16, 5, 5))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(16*5*5)) -- reshapes from a 3D tensor of 16x5x5 into a 1D tensor of 16*5*5
net:add(nn.Linear(16*5*5, 120)) -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(120, 84))
net:add(nn.ReLU()) -- non-linearity
net:add(nn.Linear(84, 10)) -- 10 is the number of output classes
net:add(nn.LogSoftMax()) -- converts the output to a log-probability; useful for classification
criterion = nn.ClassNLLCriterion()
trainer = nn.StochasticGradient(net, criterion)
trainer.learningRate = 0.001
trainer.maxIteration = 5
trainer:train(trainset)
testset.data = testset.data:double() -- convert from ByteTensor to DoubleTensor
for i=1,3 do -- over each image channel
    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction
    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling
end
predicted = net:forward(testset.data[100])
print(classes[testset.label[100]])
print(predicted:exp())
for i=1,predicted:size(1) do
    print(classes[i], predicted[i])
end
correct = 0
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true)  -- true means sort in descending order
    if groundtruth == indices[1] then
        correct = correct + 1
    end
end

print(correct, 100*correct/10000 .. ' % ')

class_performance = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
for i=1,10000 do
    local groundtruth = testset.label[i]
    local prediction = net:forward(testset.data[i])
    local confidences, indices = torch.sort(prediction, true)  -- true means sort in descending order
    if groundtruth == indices[1] then
        class_performance[groundtruth] = class_performance[groundtruth] + 1
    end
end

for i=1,#classes do
    print(classes[i], 100*class_performance[i]/1000 .. ' %')
end
Execute th pattern_recognition.lua.

First, the script downloads the cifar10torchsmall.zip sample set: 50000 training images and 10000 test images, all labeled, across 10 classes such as airplane and automobile. setmetatable binds an index operator to trainset so that trainset[i] returns {data, label}, and the data is converted to a DoubleTensor and normalized per channel. Then the convolutional neural network model is initialized: two convolution layers, two pooling layers, fully connected layers and a LogSoftMax layer, trained with learning rate 0.001 for five iterations. After training, predict test image No. 100 and print the overall accuracy and the per-class accuracy. See https://github.com/soumith/cv… .
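Tracing the tensor shapes explains the nn.View(16*5*5) above (assuming the 3x32x32 CIFAR-10 inputs):

    3x32x32 -> SpatialConvolution(3,6,5,5)  -> 6x28x28
            -> SpatialMaxPooling(2,2,2,2)   -> 6x14x14
            -> SpatialConvolution(6,16,5,5) -> 16x10x10
            -> SpatialMaxPooling(2,2,2,2)   -> 16x5x5
            -> View(16*5*5) -> 400 -> Linear(400,120) -> Linear(120,84) -> Linear(84,10)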

Torch supports GPU computation easily; it only requires small modifications to the code.

The popular seq2seq is basically implemented as an encoder-decoder built from LSTMs, and most open-source implementations are based on one-hot embeddings, which carry none of the rich information a word vector expresses. Here is a toy chatbot: a seq2seq-style model built on word2vec vectors with only a single LSTM unit.

Download the original text of the novel Zhen Huan Zhuan: search Baidu for the novel's txt, download it, transcode the file to UTF-8, and replace Windows carriage returns with \n for later processing.

Segment the text of Zhen Huan Zhuan into words with word_segment.py, a word segmentation tool that can be downloaded from GitHub at https://github.com/warmheart.

python ./word_segment.py zhenhuanzhuan.txt zhenhuanzhuan.segment
Generate word vectors with word2vec (source: https://github.com/warmheart.); run make to compile, then execute:

./word2vec -train ./zhenhuanzhuan.segment -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
This produces a vectors.bin file: the word vectors generated from the original text.
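For reference, the layout that load_vectors below parses (the word2vec -binary 1 format, as I understand it) is: a text header line "<vocab_size> <dim>", then for each word the word text terminated by a space, followed by dim 4-byte floats.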

Training code.

# -*- coding: utf-8 -*-

import sys
import math
import tflearn
import chardet
import numpy as np
import struct

seq = []

max_w = 50
float_size = 4
word_vector_dict = {}

def load_vectors(input):
    """Load word vectors from vectors.bin; return word_vector_dict,
    a dict whose keys are words and whose values are 200-dim vectors.
    """
    print "begin load vectors"

    input_file = open(input, "rb")

    # get vocabulary size and vector dimension
    words_and_size = input_file.readline()
    words_and_size = words_and_size.strip()
    words = long(words_and_size.split(' ')[0])
    size = long(words_and_size.split(' ')[1])
    print "words =", words
    print "size =", size

    for b in range(0, words):
        a = 0
        word = ''
        # read one word, up to the separating space
        while True:
            c = input_file.read(1)
            word = word + c
            if c == '' or c == ' ':
                break
            if a < max_w and c != '\n':
                a = a + 1
        word = word.strip()

        vector = []
        for index in range(0, size):
            m = input_file.read(float_size)
            (weight,) = struct.unpack('f', m)
            vector.append(weight)

        # save the word and its vector into the dict
        word_vector_dict[word.decode('utf-8')] = vector

    input_file.close()
    print "load vectors finish"

def init_seq():
    """Read the segmented text file and load the full word sequence.
    """
    file_object = open('zhenhuanzhuan.segment', 'r')
    vocab_dict = {}
    while True:
        line = file_object.readline()
        if line:
            for word in line.decode('utf-8').split(' '):
                if word_vector_dict.has_key(word):
                    seq.append(word_vector_dict[word])
        else:
            break
    file_object.close()

def vector_sqrtlen(vector):
    # Euclidean (L2) length of a vector
    length = 0
    for item in vector:
        length += item * item
    length = math.sqrt(length)
    return length

def vector_cosine(v1, v2):
    if len(v1) != len(v2):
        sys.exit(1)
    sqrtlen1 = vector_sqrtlen(v1)
    sqrtlen2 = vector_sqrtlen(v2)
    value = 0
    for item1, item2 in zip(v1, v2):
        value += item1 * item2
    return value / (sqrtlen1*sqrtlen2)
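For instance, vector_cosine([1, 0], [0, 1]) is 0.0 (orthogonal vectors) and vector_cosine([1, 1], [2, 2]) is 1.0 (same direction).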

def vector2word(vector):
    max_cos = -10000
    match_word = ''
    for word in word_vector_dict:
        v = word_vector_dict[word]
        cosine = vector_cosine(vector, v)
        if cosine > max_cos:
            max_cos = cosine
            match_word = word
    return (match_word, max_cos)

def main():
    load_vectors("./vectors.bin")
    init_seq()
    xlist = []
    ylist = []
    test_X = None
    #for i in range(len(seq)-100):
    for i in range(10):
        sequence = seq[i:i+20]
        xlist.append(sequence)
        ylist.append(seq[i+20])
        if test_X is None:
            test_X = np.array(sequence)
            (match_word, max_cos) = vector2word(seq[i+20])
            print "right answer=", match_word, max_cos

    X = np.array(xlist)
    Y = np.array(ylist)
    net = tflearn.input_data([None, 20, 200])
    net = tflearn.lstm(net, 200)
    net = tflearn.fully_connected(net, 200, activation='linear')
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1,
                             loss='mean_square')
    model = tflearn.DNN(net)
    model.fit(X, Y, n_epoch=500, batch_size=10, snapshot_epoch=False, show_metric=True)
    model.save("model")
    predict = model.predict([test_X])
    #print predict
    #for v in test_X:
    #    print vector2word(v)
    (match_word, max_cos) = vector2word(predict[0])
    print "predict=", match_word, max_cos

main()

load_vectors loads the word vectors from vectors.bin; init_seq loads the segmented text of Zhen Huan Zhuan into one long sequence; vector2word finds the closest word to a given vector by cosine similarity; and the model has only one LSTM unit.
After 500 epochs of training, the mean square loss drops to 0.33673, and the predicted next word matches with a cosine similarity of 0.941794432002.
With a powerful GPU, tuned parameters, training on the whole text, and the predict part modified to keep emitting the next word, the model can automatically produce prose in the style of the novel. This is based on tflearn; the seq2seq in tflearn's official examples directly calls tensorflow/python/ops/seq2seq.py, which is based on one-hot embedding and certainly loses the expressiveness of word vectors.