## LSTM–Digging Deep Part 1

#### Introduction

LSTM (**Long Short-Term Memory**) networks have gained a lot of recognition in the recent past. LSTMs are an interesting type of deep learning network, used in fairly complex problem domains such as language translation, automatic image captioning and text generation. They are designed specifically for sequence prediction problems. This post starts off with an introduction to LSTMs, their importance, their mechanics and the LSTM architectures, and closes with getting the most out of LSTM models.

LSTMs are primarily used to solve **sequence prediction problems**. There are many types of LSTM; the ones covered here are Vanilla, Stacked, CNN LSTM, Encoder-Decoder LSTM, Bidirectional LSTM and Generative LSTM.

To start with, let's understand what sequence prediction is and how it differs from other predictive modelling.

A sequence has an implicit order for its observations. The order is important, as it is what is useful in formulating the prediction problem and its solution.

An example of a simple sequence, where the task is to predict the next value:

Input : 1,2,3,4,5

Output: 6

**Sequence Classification** involves predicting a class label for a given input sequence, for example:

Input : 1,2,3,4,5

Output : Good

The objective here is to build a classification model using a labelled dataset, so that the model can be used to predict the class label of an unseen sequence.
This is called discrete sequence classification and is used in:

- DNA sequence classification - given a DNA sequence of A, C, G and T values, predict whether the sequence is coding or non-coding.
- Anomaly detection - given a sequence of observations, predict whether the sequence is anomalous or not.
- Sentiment analysis - given a sequence of text, predict the sentiment of the text.

**Sequence Generation**

Involves generating a new output sequence that has the same general characteristics as the other sequences in the corpus, for example:

Input : [1,3,5] [7,9,11]

Output: [3,5,7]

An RNN can be trained for sequence generation by processing real data sequences one step at a time and predicting what comes next.
Assuming the predictions are probabilistic, novel sequences can be generated from a trained network by iteratively sampling from the network's output distribution and then feeding the sample back in as the input at the next step. In other words, the network is made to treat its inventions as if they were real.
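The sampling loop described above can be sketched in a few lines. This is only an illustration: `NEXT_SYMBOL_PROBS` and `predict_distribution` are hypothetical stand-ins for a trained network, and unlike a real RNN this stand-in looks only at the last symbol rather than carrying an internal state over the whole history.

```python
import random

# Hypothetical stand-in for a trained network: maps the previous symbol
# to a probability distribution over the next symbol.
NEXT_SYMBOL_PROBS = {
    "a": {"a": 0.1, "b": 0.8, "c": 0.1},
    "b": {"a": 0.1, "b": 0.1, "c": 0.8},
    "c": {"a": 0.8, "b": 0.1, "c": 0.1},
}


def predict_distribution(previous_symbol):
    """Return the (assumed) output distribution for the next symbol."""
    return NEXT_SYMBOL_PROBS[previous_symbol]


def generate(seed_symbol, n_steps, rng):
    """Generate a sequence by sampling and feeding each sample back in."""
    sequence = [seed_symbol]
    for _ in range(n_steps):
        probs = predict_distribution(sequence[-1])
        symbols = list(probs.keys())
        weights = list(probs.values())
        # sample from the output distribution instead of taking the argmax
        sampled = rng.choices(symbols, weights=weights, k=1)[0]
        # the sample becomes the input at the next step
        sequence.append(sampled)
    return sequence


generated = generate("a", n_steps=10, rng=random.Random(0))
print(generated)
```

Because each step samples rather than picking the most likely symbol, different random seeds give different but similar-looking sequences.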

Applications of sequence generation are:

- Text generation
- Handwriting prediction
- Music generation

Sequence generation may also refer to generating a sequence given a single observation as input, for example image caption generation.

**Sequence to Sequence Prediction**

Involves predicting an output sequence given an input sequence.

Input: 1,2,3,4,5

Output: 6,7,8,9,10

When coding a deep neural network, the vectors used for input and output are required to have a fixed length. In real-life situations the sequences may or may not have a fixed length, for example in speech recognition and machine translation. Likewise, question answering can be seen as mapping a sequence of words representing the question to a sequence of words representing the answer.

Unlike predicting only the next value in the sequence, the predicted sequence may or may not have the same length as, or cover the same time span as, the input sequence. Sequence to sequence is abbreviated as seq2seq.

Seq2seq at its core uses an RNN to map a variable-length input to a variable-length output. If the input and output are time series, the problem is referred to as multi-step time series forecasting.

Some examples of sequence to sequence problems are:

- Multi-step time series forecasting - predict a sequence of observations for a range of future time steps.
- Text summarization - given a document or a text, predict a short sequence of text that describes the salient parts of the source document.
- Program execution - given a textual description or a mathematical equation, predict the sequence of characters that describes the correct output.

Multilayer perceptrons (MLPs) and neural networks in general are appealing for time series forecasting and sequence prediction because they are robust to noise, nonlinear by nature, and can have multivariate inputs and outputs. Applying an MLP to sequence prediction requires that the input sequence be divided into smaller overlapping subsequences, each of which is used to generate a prediction. The time steps of the input sequence become input features. The subsequences overlap to simulate a window being slid along the sequence in order to generate the output.
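As a minimal sketch of that windowing step, the helper below turns a sequence into overlapping input windows, each paired with the value that follows it. The name `sliding_windows` and the toy series are made up for illustration.

```python
def sliding_windows(sequence, window_size):
    """Split a sequence into overlapping (input window, next value) pairs."""
    pairs = []
    for start in range(len(sequence) - window_size):
        window = sequence[start:start + window_size]
        target = sequence[start + window_size]
        pairs.append((window, target))
    return pairs


series = [1, 2, 3, 4, 5, 6]
pairs = sliding_windows(series, window_size=3)
for window, target in pairs:
    print(window, "->", target)
# [1, 2, 3] -> 4
# [2, 3, 4] -> 5
# [3, 4, 5] -> 6
```

Each window becomes one fixed-size input row for the MLP, which is why the time steps end up being treated as ordinary input features.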

This can work well, but there are limitations:

1. **Stateless** - MLPs learn a fixed function approximation. Any outputs that are conditional on the context of the input sequence must be generalized and frozen into the network weights.

2. **Unaware of temporal structure** - Time steps are modelled as input features, meaning the network has no explicit handling or understanding of the temporal structure or order between observations.

3. **Messy scaling** - For problems that require modelling multiple parallel input sequences, the number of input features increases by a factor of the size of the sliding window, without any explicit separation of the time steps of each series.

4. **Fixed-size inputs** - The size of the sliding window is fixed and must be imposed on all inputs to the network.

5. **Fixed-size outputs** - The size of the output is fixed, and any outputs that don't conform must be forced.
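The "messy scaling" point can be made concrete with a small sketch. The two parallel series and the helper `flatten_window` are hypothetical and only show how windowed inputs get flattened for an MLP.

```python
def flatten_window(parallel_series, start, window_size):
    """Flatten one window from several parallel series into a single feature row."""
    features = []
    for series in parallel_series:
        features.extend(series[start:start + window_size])
    return features


temperature = [20, 21, 22, 23]
humidity = [50, 55, 60, 65]

features = flatten_window([temperature, humidity], start=0, window_size=3)
print(features)       # [20, 21, 22, 50, 55, 60]
print(len(features))  # 6 = 2 series x window of 3
```

The network only sees six unlabelled numbers. Nothing in the row says which values belong to which series or to which time step.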

**RNN to the rescue** - The LSTM network is a type of RNN, a special type of neural network designed specifically for sequence problems. Starting from a standard feed-forward network, an RNN can be thought of as the addition of loops to the architecture.

For example, in a given layer each neuron may pass its signal sideways in addition to forward to the next layer. The output of the network may feed back as an input to the network along with the next input vector, and so on.

The recurrent connections add state or memory to the network and allow it to learn and harness the ordered nature of the observations in the input sequences.

RNNs contain cycles that feed the network activations from a previous time step back in as inputs, to influence predictions at the current time step. These activations are stored in the internal states of the network, which can in principle hold long-term temporal contextual information. This mechanism allows RNNs to exploit a dynamically changing contextual window over the input sequence history.
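To illustrate the feedback loop, here is a minimal sketch of a recurrent layer with a single unit, using the common update h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). The weights are arbitrary illustrative values, not learned ones, and a real layer would use vectors and matrices rather than scalars.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer over a sequence and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        # the previous activation h is fed back in alongside the current input
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Same input at the last two steps, different input at the first step.
states_a = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.9, b=0.0)
states_b = rnn_forward([0.0, 0.0, 0.0], w_x=0.5, w_h=0.9, b=0.0)
print(states_a)
print(states_b)
```

Both sequences end with the same inputs, yet the final hidden states differ, because the first sequence still carries a trace of its first input in the state.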

The addition of sequence adds a new dimension to the function being approximated. Instead of mapping inputs to outputs alone, the network is capable of learning a mapping function for inputs over time to an output. The internal memory means outputs can be conditional on the recent context in the input sequence, not just on what has just been presented as input to the network. In a sense, this capability unlocks time series for neural networks.

LSTMs are able to solve many time series tasks that are unsolvable by feed-forward networks using fixed-size time windows. RNNs can learn and harness the temporal dependence in the data.

LSTMs have an internal state, are explicitly aware of the temporal structure in the inputs, are able to model multiple parallel input series separately, and can step through variable-length input sequences to produce variable-length output sequences, one observation at a time.

Like other RNNs, LSTMs have recurrent connections, so that the state from the neuron's activations at the previous time step is used as context for formulating the output. But unlike other RNNs, the LSTM has a different formulation that allows it to avoid the problems that prevent the training and scaling of other RNNs.

**The key technical challenge**, historically, with RNNs is how to train them effectively. Experiments showed how difficult this was: the weight update procedure resulted in weight changes that quickly became so small as to have no effect (vanishing gradients) or so large as to result in very large changes or even overflow (exploding gradients).

LSTM overcomes this by design. A standard RNN is limited in the range of contextual information it can access. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up as it cycles around the network's recurrent connections.
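A rough way to see the decay or blow-up is to multiply a signal by the same factor at every time step. This is a deliberately simplified scalar picture with made-up factors. In a real network the repeated factor is a matrix combined with activation derivatives, but the qualitative behaviour is similar.

```python
def influence_after(steps, recurrent_factor):
    """Size of an input's influence after being scaled by the same factor each step."""
    influence = 1.0
    for _ in range(steps):
        influence *= recurrent_factor
    return influence


for factor in (0.5, 1.0, 1.5):
    print(factor, influence_after(50, factor))
```

With a factor below 1 the influence is effectively zero after 50 steps, and with a factor above 1 it grows to hundreds of millions. Only a factor of exactly 1 leaves it unchanged.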

For the complete working of LSTMs, refer to this link: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

**Applications of LSTMs**

- Automatic image caption generation - A sequence generation problem. Given an image, the system should generate a caption describing it. A CNN is used to detect the objects in the image, then an LSTM turns the labels into a coherent sentence.
- Automatic translation of text - Given a text in one language, translate it into another language. The model must learn the translation of words and the context in which a translation is modified, and must support input and output sequences that may vary in length, both in general and with respect to each other. This is a classic sequence to sequence problem.
- Automatic handwriting generation - A sequence generation problem. Given a corpus of handwritten examples, new handwriting for a given word or phrase is generated.

Let's start with the Vanilla LSTM, which is defined as:

- Input layer
- Fully connected LSTM hidden layer
- Fully connected output layer

Properties of the Vanilla LSTM:

- Sequence classification conditional on multiple distributed input time steps
- Memory of precise input observations over thousands of time steps
- Sequence prediction as a function of prior time steps
- Robust to the insertion of random time steps in the input sequence
- Robust to the placement of signal data in the input sequence

### Basic LSTM (Vanilla)

    from random import randint

    from numpy import argmax, array
    from keras.models import Sequential
    from keras.layers import Dense, LSTM


    # generate a sequence of random integers in [0, n_features)
    def generate_sequence(length, n_features):
        return [randint(0, n_features - 1) for _ in range(length)]


    # one hot encode a sequence
    def one_hot_encode(sequence, n_features):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_features)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)


    # decode a one hot encoded sequence
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]


    # generate a random sequence
    sequence = generate_sequence(25, 100)
    print(sequence)
    # one hot encode
    encoded = one_hot_encode(sequence, 100)
    print(encoded)
    # one hot decode
    decoded = one_hot_decode(encoded)
    print(decoded)


    # generate one (X, y) example for the LSTM
    def generate_example(length, n_features, out_index):
        # generate sequence
        sequence = generate_sequence(length, n_features)
        # one hot encode
        encoded = one_hot_encode(sequence, n_features)
        # reshape to [samples, time steps, features]
        X = encoded.reshape(1, length, n_features)
        # select the output
        y = encoded[out_index].reshape(1, n_features)
        return X, y


    # Define and compile the model.
    # Start by reducing the sequence to 5 integers in the range 0-9, as 100
    # features can be too much. We will eventually get to 100.
    # A single hidden LSTM layer with 25 memory units is used, chosen with a
    # little trial and error.
    # The output layer is a Dense layer with 10 neurons, one per possible integer.
    # Softmax activation on the output layer lets the network learn and output
    # a distribution over the possible output values.
    # Log loss (categorical cross-entropy) is used for training, as it suits
    # multiclass classification, together with the efficient Adam optimizer.
    # Accuracy is reported each epoch to give an idea of the skill of the model
    # in addition to the loss.
    # So each input is 5 time steps of 10 one-hot features, and the output is
    # 10 probabilities, one for each integer 0-9.
    length = 5
    n_features = 10
    out_index = 2

    model = Sequential()
    model.add(LSTM(25, input_shape=(length, n_features)))
    model.add(Dense(n_features, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    # fit the model on one freshly generated example at a time
    for i in range(10000):
        X, y = generate_example(length, n_features, out_index)
        model.fit(X, y, epochs=1, verbose=2)

    # evaluate the model on 100 new random examples
    correct = 0
    for i in range(100):
        X, y = generate_example(length, n_features, out_index)
        yhat = model.predict(X)
        if one_hot_decode(yhat) == one_hot_decode(y):
            correct += 1
    print('Accuracy: %d%%' % correct)

    # predict on one new random sequence
    X, y = generate_example(length, n_features, out_index)
    yhat = model.predict(X)
    print('Sequence: %s' % [one_hot_decode(x) for x in X])
    print('Expected: %s' % one_hot_decode(y))
    print('Predicted: %s' % one_hot_decode(yhat))

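Above is a simple example of a Vanilla LSTM. A point to note is how the value of y is set:

`y = encoded[out_index].reshape(1, n_features)`

Here `out_index` selects which time step of the input sequence y is taken from, that is, how far ahead or back in the sequence the target value sits.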
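To make the role of `out_index` concrete, here is a small sketch with a fixed sequence instead of a random one. The helper `one_hot` is a plain-list stand-in for the post's `one_hot_encode`, and the sequence values are made up.

```python
def one_hot(value, n_features):
    """One hot encode a single integer as a plain list."""
    vector = [0] * n_features
    vector[value] = 1
    return vector


sequence = [4, 7, 1, 9, 3]
n_features = 10
out_index = 2

encoded = [one_hot(value, n_features) for value in sequence]
# the target is the encoded observation at position out_index
y = encoded[out_index]
print(y)           # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(y.index(1))  # 1, i.e. sequence[2]
```

With `out_index = 2` the network has to echo back the third value it saw, which means remembering it while the remaining two time steps are processed.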

Before delving into more complex scenarios, let's take another example with some real data. The example below uses a dataset of the number of air passengers travelling daily.

The idea here is to build an LSTM to predict the number of air passengers in the future.

This example uses a single-layer, or vanilla, LSTM.

    import math

    import numpy
    import pandas
    import matplotlib.pyplot as plt
    from keras.models import Sequential
    from keras.layers import Dense, LSTM
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import mean_squared_error

    # fix the random seed for reproducibility
    numpy.random.seed(11)

    # load the passenger counts (second column of the CSV)
    air_passengers = pandas.read_csv('a1.csv', usecols=[1], engine='python', skipfooter=3)
    air_passengers = air_passengers.values
    air_passengers = air_passengers.astype('float32')

    # scale the values to the range [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_air_passengers = scaler.fit_transform(air_passengers)

    # split into train (67%) and test sets, keeping the time order
    train_size = int(len(scaled_air_passengers) * 0.67)
    test_size = len(scaled_air_passengers) - train_size
    train = scaled_air_passengers[0:train_size, :]
    test = scaled_air_passengers[train_size:len(scaled_air_passengers), :]
    print(len(train), len(test))
    print(scaled_air_passengers.shape)


    # The data fed to the LSTM gives a y value that depends on the past sequence.
    # How far back to look is a conversation for another day; this example
    # looks back over the previous 3 steps.
    def create_dataset(dataset, look_back=1):
        dataX, dataY = [], []
        for i in range(len(dataset) - look_back - 1):
            a = dataset[i:(i + look_back), 0]
            dataX.append(a)
            dataY.append(dataset[i + look_back, 0])
        return numpy.array(dataX), numpy.array(dataY)


    look_back = 3
    trainX, trainY = create_dataset(train, look_back)
    testX, testY = create_dataset(test, look_back)

    # reshape input to be [samples, time steps, features]
    trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
    testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))

    # single LSTM layer with 4 memory units and one output value
    model = Sequential()
    model.add(LSTM(4, input_shape=(look_back, 1)))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(trainX, trainY, epochs=100, batch_size=1, verbose=2)

    # make predictions
    trainPredict = model.predict(trainX)
    testPredict = model.predict(testX)

    # invert the scaling (once) so the errors are in passenger units
    trainPredict = scaler.inverse_transform(trainPredict)
    trainY = scaler.inverse_transform(trainY.reshape(-1, 1))
    testPredict = scaler.inverse_transform(testPredict)
    testY = scaler.inverse_transform(testY.reshape(-1, 1))

    # root mean squared error on train and test
    trainScore = math.sqrt(mean_squared_error(trainY[:, 0], trainPredict[:, 0]))
    print('Train Score: %.2f RMSE' % (trainScore))
    testScore = math.sqrt(mean_squared_error(testY[:, 0], testPredict[:, 0]))
    print('Test Score: %.2f RMSE' % (testScore))

    # shift train predictions for plotting
    trainPredictPlot = numpy.empty_like(scaled_air_passengers)
    trainPredictPlot[:, :] = numpy.nan
    trainPredictPlot[look_back:len(trainPredict) + look_back, :] = trainPredict

    # shift test predictions for plotting
    testPredictPlot = numpy.empty_like(scaled_air_passengers)
    testPredictPlot[:, :] = numpy.nan
    testPredictPlot[len(trainPredict) + (look_back * 2) + 1:len(scaled_air_passengers) - 1, :] = testPredict

    # plot baseline and predictions
    plt.plot(scaler.inverse_transform(scaled_air_passengers))
    plt.plot(trainPredictPlot)
    plt.plot(testPredictPlot)
    plt.show()
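To see what the windowing in `create_dataset` produces, here is a list-based analogue run on a toy series. `make_windows` and the toy values are illustrative only; the loop bounds mirror the ones used in the example above.

```python
def make_windows(values, look_back=1):
    """List-based analogue of create_dataset: look_back inputs, next value as target."""
    data_x, data_y = [], []
    for i in range(len(values) - look_back - 1):
        data_x.append(values[i:i + look_back])
        data_y.append(values[i + look_back])
    return data_x, data_y


toy = [10, 20, 30, 40, 50, 60]
toy_x, toy_y = make_windows(toy, look_back=3)
for window, target in zip(toy_x, toy_y):
    print(window, "->", target)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
```

Note that the extra `- 1` in the loop bound means the last possible window, `[30, 40, 50] -> 60`, is not produced. The plotting offsets in the example account for this.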

In Part II we get into Stacked LSTMs, CNN LSTMs and deeper LSTMs.

Find the complete code on GitHub here: https://github.com/ajayso/LSTM-Digging-Deep