Sunday, 8 September 2019

Azure Data Share - Quick Facts



Azure Data Share is a new service for sharing data between organizations. It is available in only 3 regions as of now, though that number is expected to grow. Under the hood it is a PaaS-based sharing service that exposes selected data from Azure storage to selected parties, with a few features thrown in:
-Versioning
-Tracking and auditing capabilities, such as who accessed the data and when.
-Subscription to a data share.
-Policies around data usage.
-Uses the same security primitives as Azure.
Quick Facts: Data can be shared with, and received from, any other organization that is on Azure.
The sharing and receiving possibility in/out are limited to

Sharing by invitation works best with Microsoft's suite of authentication schemes, more specifically Azure's, so sharing with parties that use other forms of authentication is not possible.
Typical use cases
  1.  Data Sharing between organizations.
  2.  Finding an easier solution than FTP, drop box, or other forms of sharing.

It would be good to see Data Share support Azure databases in the future, although that would multiply the complexity and would compete with existing Azure offerings (SQL Azure Sync, ADF, etc.).

Saturday, 7 September 2019

Blockchain and MDM Synergies



As the team at Morning Blaze continues to build an intelligent way of doing Data Management, the world around us has one harder topic to consider: Blockchain.
Data today is highly distributed by nature. In banking, healthcare, transportation, energy, manufacturing, and other sectors, the trend is toward decentralized locations and teams managing local data. But it is a trend that comes with the potential for chaos, especially for master data, where accuracy, security, and conformity are essential.
Most organizations will have moved toward building Master Data Management capabilities as an organic path of technology growth, writing tons of ETL data pipelines on the cloud of their choice or on premises. ETL data pipelines and alternative technologies can be costly, and after a multi-man-year project organizations still do not seem to arrive at the gospel truth of the “Single Source of Truth”.
A Quick Catch-up on Blockchain
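Blockchain is a distributed ledger that is cryptographically secured and effectively immutable. Each new block that is added to the chain carries a unique identifier (a cryptographic hash) of the previous block, so the chain can be verified link by link. Copies of the ledger are held across many nodes rather than in one place (hence the term ‘distributed’).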
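The chaining idea can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular blockchain platform is implemented: the helper names (`block_hash`, `make_block`, `verify_chain`), the block fields and the payloads are all hypothetical, and consensus, networking and signatures are left out entirely.

```python
import hashlib
import json


def block_hash(index, prev_hash, payload):
    """Return the SHA-256 identifier of a block's contents."""
    body = json.dumps(
        {"index": index, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def make_block(index, prev_hash, payload):
    """Create a block that records the identifier of its predecessor."""
    return {
        "index": index,
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": block_hash(index, prev_hash, payload),
    }


def verify_chain(chain):
    """Check every block's own hash and its link to the previous block."""
    for i, block in enumerate(chain):
        expected = block_hash(block["index"], block["prev_hash"], block["payload"])
        if block["hash"] != expected:
            return False  # contents were changed after the block was created
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False  # link to the previous block is broken
    return True


# Build a tiny three-block chain (hypothetical payloads).
chain = [make_block(0, "0" * 64, {"note": "genesis"})]
for n, payload in enumerate([{"customer": "A-001"}, {"customer": "B-042"}], start=1):
    chain.append(make_block(n, chain[-1]["hash"], payload))

print(verify_chain(chain))   # True for the untouched chain
chain[1]["payload"]["customer"] = "tampered"
print(verify_chain(chain))   # False once a block is altered
```

Because every block stores the hash of the one before it, changing any historical record invalidates the chain from that point on, which is the property that makes the ledger attractive as a shared record.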
So, while this is obviously useful for transactions and has traditional financial institutions scrambling because Blockchain can cut out the middleman, Blockchain’s native safety features make it a great choice for what’s known as ‘the single source of truth’ as well.
How can Blockchain help in Master Data Management?
Master Data Management (MDM) depends on creating a consensus truth for the enterprise; take an e-commerce business as the example. If Enterprise A merges with Enterprise B, their big stores of master data need to merge as well. It’s critical that the process reliably matches customer records when it should, while carefully avoiding false matches. The business depends on the accuracy of the master data matching process.
Traditionally, matching has meant linking the records within the two different databases, based on identifiers like Customer Name, Address, date of birth, drivers’ license information, and so on. The MDM system could write the linkage information to a central database accessible from different locations. But having a single copy of the linkage data in a single location has meant that admins need to take special care to ensure that the data is highly available and secure. Private blockchain networks (also called ‘permissioned networks’) offer an intriguing alternative.
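The traditional matching step described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the field names, the weights, the 0.85 threshold and the sample records are hypothetical, and string similarity from Python's standard `difflib` stands in for the far more sophisticated matching a real MDM product would use.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(value).lower().split())


def similarity(a, b):
    """String similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a, rec_b, weights):
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(
        w * similarity(rec_a[field], rec_b[field]) for field, w in weights.items()
    ) / total


def link_records(db_a, db_b, weights, threshold=0.85):
    """Return (id_a, id_b, score) for every pair scoring at or above the threshold."""
    links = []
    for a in db_a:
        for b in db_b:
            score = match_score(a, b, weights)
            if score >= threshold:
                links.append((a["id"], b["id"], round(score, 3)))
    return links


# Hypothetical customer records from two enterprises.
enterprise_a = [
    {"id": "A1", "name": "John  Smith", "address": "12 Baker Street", "dob": "1980-04-02"},
    {"id": "A2", "name": "Priya Nair", "address": "7 Lake Road", "dob": "1991-11-23"},
]
enterprise_b = [
    {"id": "B7", "name": "john smith", "address": "12 Baker St.", "dob": "1980-04-02"},
    {"id": "B9", "name": "Peter Naylor", "address": "70 Hill Road", "dob": "1975-01-15"},
]

weights = {"name": 0.4, "address": 0.3, "dob": 0.3}
links = link_records(enterprise_a, enterprise_b, weights)
print(links)
```

The resulting list of links is the "linkage information" the text refers to: in a traditional setup it would be written to a central database, which is exactly the single point that has to be kept available and secure.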
A better MDM solution with Blockchain
The digital ledger has much more to offer. Over time, large enterprises will adopt distributed ledger models to record and manage biographic and biometric data. Imagine hospitals, banks, and governments all wanting to maintain their master data on the blockchain. Those organizations will need ways to match and link that data across private networks.
The submission here is: “One can build an MDM by leveraging a distributed ledger, without necessarily going down the path of ETL and data pipelines.”
Consider Enterprise A and Enterprise B. If they each maintain their customer records, how will they combine those records in the event of a merger?
The enterprises could first create a business network using the blockchain technology. That offers an advantage because data sharing then happens on the blockchain network as opposed to being centralized. Once the teams create the network and begin sharing data on the network, sophisticated algorithms kick in to perform matching and linking — and the linking information is also stored natively on the blockchain.
Teams could also choose whether each node should maintain its own copy of the linkage information on the ledger. If not, the node can simply consume the linkage information that’s maintained elsewhere on the network. That option keeps transaction activity from swamping any nodes that might have less compute power or connectivity, while helping to ensure that the linkage data is stored redundantly across multiple nodes.
Hopefully, the e-commerce example makes a compelling argument for the potential advantages of MDM on the blockchain, but the gains don’t stop there. Consider…
✓  Data reconciliation: When every participating business unit is part of the blockchain network, there’s no longer a need to move data between the business units. With traditional MDM, data movement can consume an enormous amount of time and energy.
✓  Cost and Trust: Maintaining a central infrastructure is expensive and prone to security compromise. With the blockchain system, transactions aren’t committed without the consensus of the whole system.
✓  Organizational efficiency: The blockchain eliminates the need for complex reconciliation between different nodes, whether the nodes are branch banks, health clinics, distribution centers, or other peers in the system.
✓  Disintermediation: Eliminates central intermediaries and reduces the fear of arbitrage within the ecosystem.
✓  Transparency: Enables audit trails to be established for assets and transactions, minimizing disputes.
Next Steps….
Like all big data, master data offers important opportunities for machine learning analytics. Obviously, embedded analytics of anonymized master data can yield powerful insights, but machine learning can also play a role further upstream.
Morning Blaze is finding ways to apply machine learning to the matching process itself, to ensure even higher confidence in the linkages between records.
Ultimately, the goal is to make Master Data Management as easy and intuitive as possible. New tools will give non-technical users across industries the ability to manage master data flexibly, efficiently, securely — and with perfect confidence.


Saturday, 2 February 2019

LSTM for Stock Markets


Trying out LSTM for stock markets

Introduction

LSTM Basics got us to the point of understanding a simple LSTM. In real-life scenarios the picture gets more complicated, and a simple one-layer LSTM just doesn't do the job. Usually a multi-layer LSTM is required, where each layer does part of the job and then sends its output to the next layer, and so on. Building a deep RNN by stacking multiple recurrent hidden layers allows the hidden state at each level to operate at a different timescale. We start with the Stacked LSTM, an LSTM model comprising multiple hidden LSTM layers, an architecture in which the depth of the network matters more than the number of memory cells in a given layer.

The implementation example needs to be close to real-life scenarios, and stock market prediction is always intriguing. Technical stock analysis holds that the complete intelligence for predicting the movement of a stock lies in OHLC (Open, High, Low and Close) and maybe volume.

Stock market data is time series data by nature. Quandl has been used as the data source. There are some key aspects to understand when predicting the trend or price here. Most analysts would look at a moving average or exponential moving average over a certain period to define the trend. A more realistic approach would be to also include trend indicators and price indicators in the dataset.

One-Step-Ahead Prediction via Averaging: we try to predict the future stock market price (for example, $x_{t+1}$) as the average of the previously observed stock market prices within a fixed-size window (for example, $x_{t-N}, \ldots, x_t$; say, the previous 100 days).
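As a rough sketch of this averaging idea, the function below predicts each day's price as the plain mean of the preceding `window` prices. The function name, the toy prices and the window of 3 days (instead of 100) are assumptions chosen for readability, not part of the notebook.

```python
def moving_average_predictions(prices, window):
    """Predict each price as the mean of the `window` prices before it.

    Returns a list aligned with `prices`; entries without enough
    history are None.
    """
    preds = [None] * len(prices)
    for t in range(window, len(prices)):
        preds[t] = sum(prices[t - window:t]) / window
    return preds


# Hypothetical closing prices.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
preds = moving_average_predictions(closes, window=3)
print(preds)  # [None, None, None, 11.0, 12.0]
```

On a steadily rising series the prediction always lags the actual price, which is the main weakness of a plain average as a predictor.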

Exponential Moving Average: calculate $x_{t+1}$ as $x_{t+1} = EMA_t = \gamma \times EMA_{t-1} + (1-\gamma)\, x_t$, where $EMA_0 = 0$ and $EMA$ is the exponential moving average value we maintain over time. The equation computes the exponential moving average at time step $t$ and uses it as the one-step-ahead prediction for $t+1$. $\gamma$ decides how much the previous EMA, which is also the most recent prediction, contributes to the new one.
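The recurrence can be written out directly. The sketch below is a literal reading of the formula with EMA starting at 0; the function name, the toy prices and the choice of γ = 0.5 are assumptions for illustration only.

```python
def ema_predictions(prices, gamma=0.5):
    """One-step-ahead EMA predictions.

    preds[t] is the prediction for prices[t], i.e. the EMA of everything
    observed before that day, starting from an EMA of 0.
    The second return value is the prediction for the next, unseen day.
    """
    ema = 0.0
    preds = []
    for price in prices:
        preds.append(ema)  # prediction made before seeing `price`
        ema = gamma * ema + (1.0 - gamma) * price
    return preds, ema


# Hypothetical closing prices.
closes = [10.0, 12.0, 14.0]
preds, next_pred = ema_predictions(closes, gamma=0.5)
print(preds)      # [0.0, 5.0, 8.5]
print(next_pred)  # 11.25
```

Starting from 0 makes the first few predictions far too low; in practice one might seed the EMA with the first observed price instead, but that would be a departure from the formula as stated.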

For the sake of simplicity we are going to use a single feature, Close, and not include any averaging technique for now.

GitHub: https://github.com/ajayso/LSTM_StockMarkets

In [89]:
#Imports
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM,Dense
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import quandl
from pandas import DataFrame
In [101]:
# Data Read 
# Quandl is used to read Open, High, Low, ...
quandl.ApiConfig.api_key = "YOUR_API_KEY"  # Replace with your own API key from quandl.com
auto = quandl.get("NSE/NIFTY_AUTO")
c_auto = auto.dropna() # Some cleansing ....
c_auto.hist(bins=50, figsize=(20,15))
plt.show()
# Ideally one would do a complete EDA on the data to arrive at a complete understanding of the relationships between features.

The data pulled from Quandl is an index of automobile companies in India. The NIFTY Auto Index is designed to reflect the behaviour and performance of the Indian automobile sector. It is computed using the free-float market capitalization method, with a base date of January 1, 2004 indexed to a base value of 1000.

The auto data has open, high, low, close, shares traded and turnover. There are a couple of NA entries, which have been cleaned. The goal is to build an RNN model with LSTM cells to predict prices for the NIFTY Auto index. For the sake of simplicity only the close feature is used for now; the other features will be included later. Before delving into the code, a quick recap: an RNN is a type of neural network with a self-loop in its hidden layer, which enables it to use the previous state when learning the current state from a new input. RNNs are suited to processing sequential data. The LSTM is a specially designed unit that helps an RNN better memorize long-term context.

In [109]:
# Some more plot
c_auto['Close'].plot(figsize=(20,10), linewidth=5, fontsize=20) # Trying to get a sense how is this index doing
plt.xlabel('Year', fontsize=20);
c_auto.to_csv("auto.csv", sep=',', encoding='utf-8') # Saving for future use 
In [131]:
# Using the close for LSTM to start with 
data = pd.read_csv('auto.csv')
cl = data.Close
cl = cl.values # Convert to Array
In [132]:
# Scaling is a necessary evil for the algorithm to perform better
scl = MinMaxScaler()
cl = cl.reshape(cl.shape[0],1)
cl = scl.fit_transform(cl)
cl
Out[132]:
array([[0.02639892],
       [0.01790945],
       [0.01775437],
       ...,
       [0.64306885],
       [0.62752588],
       [0.63235074]])

Data Preparation

The idea behind LSTM is to learn from a set of input sequences to predict the right output sequence; a key choice is how far back the algorithm has to look to arrive at the output. The data is a time series of length N, defined as p0, p1, ..., pN, where pi is the closing price on day i. We use a sliding window of fixed width w. In the code below the window moves right by one day at a time, so consecutive windows overlap, and the value that immediately follows each window is its target. In this example we use a 15-day sliding window. Under realistic conditions we would use moving or exponential averages of the close, but as mentioned earlier, the raw close has been used for the sake of simplicity.
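To make the windowing concrete, here is a small sketch on a toy series. The `make_windows` helper and its `stride` parameter are hypothetical and not part of the notebook: a stride of 1 gives overlapping windows, as the `BuildData` function in the next cell does, while a stride equal to the window width gives non-overlapping ones. Unlike `BuildData`, this sketch also uses the final value of the series as a target.

```python
import numpy as np


def make_windows(series, lookback, stride=1):
    """Split a 1-D series into (window, next value) pairs.

    stride=1 gives overlapping windows; stride=lookback gives
    non-overlapping ones.
    """
    X, y = [], []
    for start in range(0, len(series) - lookback, stride):
        X.append(series[start:start + lookback])
        y.append(series[start + lookback])
    return np.array(X), np.array(y)


prices = np.arange(10, dtype=float)  # toy series: 0.0, 1.0, ..., 9.0

X_overlap, y_overlap = make_windows(prices, lookback=3, stride=1)
X_disjoint, y_disjoint = make_windows(prices, lookback=3, stride=3)

print(X_overlap.shape, y_overlap[:3])  # (7, 3) [3. 4. 5.]
print(X_disjoint.shape, y_disjoint)    # (3, 3) [3. 6. 9.]
```

Overlapping windows produce many more training samples from the same series, which matters when only a few thousand daily closes are available.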

In [113]:
# The key function for building out the data for LSTM consumption.
def BuildData(data, lb):
    X,y =[],[]
    for i in range(len(data) - lb - 1):
        X.append(data[i: (i + lb), 0])
        y.append(data[(i + lb),0])
    return (np.array(X), np.array(y))
X,y = BuildData(cl,15) # 15  days back
X_train, X_test = X [:int(X.shape[0] * 0.8)] , X [int(X.shape[0] * 0.8):]
y_train, y_test = y [:int(y.shape[0] * 0.8)] , y [int(y.shape[0] * 0.8):]
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[113]:
((1402, 15), (351, 15), (1402,), (351,))
In [114]:
# LSTM Stacked 2 layers ,with 512 memory cells, 1 Dense to conclude the output.
# Optimizer adam and loss is mse.

model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(15, 1)))
model.add(LSTM(512))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
In [115]:
# Build test, train set and finally fit into the model
# Note the Model.fit may take a long time depending on the compute and GPU capacity feel free to bring down the memory cells and 
# epochs
# Current configuration took ~ 3 hrs on 32 gb 1 gpu based machine.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1],1)
X_test  = X_test.reshape(X_test.shape[0], X_test.shape[1],1)
history = model.fit(X_train,y_train, epochs = 300, validation_data = (X_test, y_test),shuffle=False)
In [116]:
X_test[0]
Out[116]:
array([[0.87985502],
       [0.87369183],
       [0.87312318],
       [0.86555848],
       [0.86898184],
       [0.87175039],
       [0.87260049],
       [0.86511046],
       [0.86341601],
       [0.8423589 ],
       [0.83532263],
       [0.8360636 ],
       [0.83439787],
       [0.83119278],
       [0.84735609]])
In [117]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
Out[117]:
[<matplotlib.lines.Line2D at 0x1de2bf064a8>]
In [142]:
Xt = model.predict(X_test)
plt.plot(scl.inverse_transform(y_test.reshape(-1,1)))
plt.plot(scl.inverse_transform(Xt))
# The predicted is quite in line with actual. The thing one can ascertain from this curve in the later part is that the index is 
# trending downwards...
Out[142]:
[<matplotlib.lines.Line2D at 0x1de1c3008d0>]
In [147]:
# To get an idea of a specific predicted value
act = []
pred = []
i=350
Xt = model.predict(X_test[i].reshape(1,15,1))
print('predicted:{0}, actual:{1}'.format(scl.inverse_transform(Xt),scl.inverse_transform(y_test[i].reshape(-1,1))))
pred.append(scl.inverse_transform(Xt))
act.append(scl.inverse_transform(y_test[i].reshape(-1,1)))
# We see there is a difference however when we move into multiple features including trend indicators the gap reduces considerably
predicted:[[9081.99]], actual:[[8767.35]]
In [148]:
result_df = pd.DataFrame({'pred':list(np.reshape(pred, (-1))),'act':list(np.reshape(act, (-1)))})
In [149]:
result_df
Out[149]:
          pred      act
0  9081.990234  8767.35
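To put a number on the gap between predicted and actual values, a small error report can help. The sketch below is an illustration only: the `error_report` helper is hypothetical and not part of the notebook, and it is applied here to the single prediction printed above rather than to the full test set.

```python
import math


def error_report(actual, predicted):
    """Return MAE, RMSE and MAPE (in percent) for two equal-length lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    mape = 100.0 * sum(abs(d) / abs(a) for a, d in zip(actual, diffs)) / len(diffs)
    return {"mae": mae, "rmse": rmse, "mape": mape}


# The single prediction shown in the output above.
report = error_report(actual=[8767.35], predicted=[9081.99])
print({k: round(v, 2) for k, v in report.items()})
# {'mae': 314.64, 'rmse': 314.64, 'mape': 3.59}
```

For this one sample the model is off by roughly 315 index points, or about 3.6 percent; running the same report over the whole test set would give a fairer picture than a single point.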