Ajay Solanki: Big Data Analysis


Monday, 27 March 2017

Parallelization of R code using Azure Infrastructure

Exploring which machine learning algorithm fits the bill on a large data set is a daunting task. Moreover, these ML algorithms can run for hours, or even days in certain cases, so there is always a need for compute resources that are available on the fly. R is single threaded by nature. To support parallel constructs such as parallel for loops and apply functions, R ships with the parallel package, which supports both multi-core and cluster-based parallel execution. Clusters can be created with either the PSOCK or the FORK implementation.
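As a minimal sketch of the cluster-based route, the snippet below starts a two-worker PSOCK cluster with the parallel package and compares the result with a sequential run. The function boot_mean and the sample vector are hypothetical, chosen only for illustration.

```r
library(parallel)  # ships with base R

# Hypothetical CPU-bound task: mean of one bootstrap resample.
boot_mean <- function(seed, x) {
  set.seed(seed)
  mean(sample(x, replace = TRUE))
}

x <- c(2, 4, 6, 8, 10)

# PSOCK: start fresh R worker processes (available on every platform).
cl <- makeCluster(2, type = "PSOCK")
psock_result <- parLapply(cl, 1:8, boot_mean, x = x)
stopCluster(cl)

# Sequential run for comparison; the seeds are fixed per task,
# so both runs should give the same answers.
seq_result <- lapply(1:8, boot_mean, x = x)
identical(psock_result, seq_result)
```

PSOCK workers start with an empty workspace, so any data or packages they need must be passed as arguments or exported explicitly. FORK workers, available on Unix-alikes only, inherit the parent session instead.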

doAzureParallel is a lightweight R package built on top of the Azure Batch service (a job scheduler service) that allows Azure compute resources to be used from an R session. doAzureParallel supports the foreach parallel construct.

Getting started with doAzureParallel

The video below walks you through the basics of doAzureParallel.

doAzureParallel does not have parallel constructs for the apply functions. If you need them, you can use the parallel package on each node of the cluster and get the best of the parallel apply functions. With parallelism comes a degree of complexity around memory management and caching, so it helps to understand what FORK can do here. The video below explains how to use the parallel package and take the parallel execution of the code down to the core level.
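A small sketch of a forked parallel apply on a single node, using mclapply from the parallel package. The data frame big_table and the helper chunk_sum are hypothetical; the point is that forked children can read objects from the parent session without an explicit export.

```r
library(parallel)

# Hypothetical data set that already lives in the parent R session.
big_table <- data.frame(id = 1:1000, value = sqrt(1:1000))

# Forked children can read big_table directly; nothing is exported.
chunk_sum <- function(idx) sum(big_table$value[idx])

# Four chunks of 250 row indices each.
chunks <- split(big_table$id, rep(1:4, each = 250))

# mclapply forks only on Unix-alikes; on Windows mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
partial <- mclapply(chunks, chunk_sum, mc.cores = cores)

# Combine the per-chunk results and check against the direct sum.
total <- sum(unlist(partial))
all.equal(total, sum(big_table$value))
```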

Running Parallel constructs along with DoAzureParallel

Parallelization to MLR algorithms

doAzureParallel in its current form supports foreach; it needs to graduate to supporting the parallel apply functions. Taking the discussion to the next level, it would be lovely if doAzureParallel could run the mlr set of algorithms (classification, regression) in parallel. The current set of packages, parallelMap, BatchJobs and mlr, solves the problem of running the mlr algorithms in parallel. It’s pretty easy to see how a larger model, more iterations or a different choice of methods could result in unacceptably long run times. One could use multi-core or socket-level parallelism, but ideally it is better to take advantage of as much computing resource as is available.

Apparently the BatchJobs package does not support the Azure Batch service.

ParallelMap is now directly integrated into mlr, and this makes scaling to parallel back-ends seamless. Our choice of back-end is parameterized so we can write algorithms once and choose the parallel back-end depending on the resources we have available when we run the model. To illustrate this, we re-run the same model, but instead of running the model on a single node, we run it on a clustered environment running OpenLava, an open-source Platform LSF compatible workload manager now supported by BatchJobs.

The video below explains how to use parallelMap and mlr in a multi-core scenario along with doAzureParallel.

Demo codebase can be found here - https://github.com/ajayso/DoAzureParallel

Saturday, 28 January 2017

Real-time Financial Stocks Analysis Architecture

In the prior 2 posts, the focus was on using machine learning techniques like regression to predict gold buy / sell signals. The models that we built give an idea of how to get to a final buy or sell signal for gold, on the assumption that the data is clean and always available. Without relevant, clean data, the model predictions would be of zero relevance to the business.

Every big data analysis project that relies heavily on real-time data depends a great deal on the underlying software architecture, which is responsible for delivering the data to the models in an edible form. In this post I have attempted to put together the high-level architecture of a real-time stock analysis platform. It is the high-level architecture of the XTrade platform, which is currently in production for one of the customers.

Additionally, I have a starter kit available at https://github.com/ajayso/XeusTrade.git which comprises templates for all the components.

A Brief on XTrade - Day trading can be risky business, and human analysis of real-time data without intelligent insights can be detrimental. XTrade is a real-time stock technical analysis platform which ingests real-time stock feeds, industry data and news, and analyses them to provide predictions and correlations (weak and strong), called quants. XTrade interfaces with trading systems to execute the actions, or provides these actionable insights to an average trader or analyst.

The architectures of most real-time systems are in line with the lambda architecture. This post will focus on the speed (real-time processing) area of the lambda architecture.

Getting the Data In ……

Data comes in from multiple sources and in varying formats, and segregating the relevant data needs specialized software. XTrade had multiple data sources; below are some of the more relevant ones.

Stock feeds
Industry feeds
News data
Other relevant data

The requirement is to pull data feeds from the data sources at a specified frequency (in minutes). Stale data management (don't pull stale data) and transformation to a standard format, in this case JSON, are things Apache NiFi provides with absolute ease. Apache NiFi is the basic architectural building block for data ingest and transformation. NiFi has many processors, with the option of writing your own processors in Java.

Apache NiFi is an enterprise integration and dataflow automation tool that allows a user to send, receive, route, transform, and sort data, as needed, in an automated and configurable way. Similar tools exist, but NiFi is different because of its user-friendly drag-and-drop graphical user interface and the ease with which it can be customized on the fly for specific needs. Think of creating a simple flow chart of what you want to do with your data; that is how easy it is to create a dataflow in NIFi. It is also highly scalable and can run on something as simple as a laptop or clustered across many high-performance servers.

Below is an example of the XTrade NiFi data flows.

In the starter kit the NiFi folder has 2 templates which can be reused; these are data sources for individual stocks and news.

Implementation Details: The data flow pulls the feeds and processes them, after which they need to be put into a messaging system; in this case we have used Apache Kafka. JSON is the standard data format used within this architecture. One has the option of persisting these data feeds to HDFS or any other persistent data store. Apache NiFi runs on a cluster and is highly fault tolerant.

Messaging …..

A low-latency, reliable messaging system is really important. Apache Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Apache Kafka is used as the basic architectural building block for messaging in the architecture. Below is the high-level representation of the Kafka implementation for XTrade.

Implementation Details: Brokers are segregated based on the following

Broker (Individual Stocks): Since XTrade handles prediction and data insight for the individual stocks registered by customers, the decision was taken to have a separate broker for them to handle future scale-out requirements.
Broker (Industry): messaging system for industry stock prices and news.
Broker (Misc): messaging for exposure and risk management data coming in from customer systems and other public data sources.

Core Analysis and Data Decision Making…..

Fast processing of the data streams coming from the messaging layer can really help cut down the latency of the overall lifecycle. Stream processing, calling the analysis models in R and Spark, and sending back the predictions and data insights in a matter of minutes is key here. A platform with the flexibility to support multiple programming languages was the need of the hour.

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Implementation Details: Stock and industry data coming in from Kafka is fetched by Storm with a Kafka spout. Storm then processes the news and stock analysis by calling the R and Spark models, which emit the predictions back to a Kafka topic (Broker Results). In the next post I will detail the Spark and R models for stock technical analysis. The predictions are also written to Cassandra.

Ready skeletal code for Storm (an Eclipse project) written in Java can be found here.

Storm architecture for XTrade

Putting it all together….

The entire solution can be deployed to AWS or Azure. Managing the clusters across this distributed environment is a daunting task.

Apache Mesos looked to be a good option for this. Apache Mesos is a centralised fault-tolerant cluster manager. It’s designed for distributed computing environments to provide resource isolation and management across a cluster of slave nodes. It schedules CPU and memory resources across the cluster in much the same way the Linux Kernel schedules local resources. Mesos support exists for Hadoop, NiFi, Kafka, Storm and Cassandra.

Wednesday, 28 December 2016

Machine Learning Basics–Regression…Part II

When making financial investments in the capital markets in the form of equity, options or commodities, one often looks for ways to maximize profits. Most start with rudimentary approaches such as tips from analysts and eventually start digging below the surface to understand technical indicators. At a 100 level, the technical indicators do make a lot of sense. By taking these technical indicators and pushing the boundaries of analysis with multi-year data or real-time quant trading data, one can find many new opportunities to make money. Of course, a full-fledged equity market prediction needs multiple data sources, both historical and real-time.

Historical data, combined with some real-time market data and machine learning, can help achieve reasonable prediction accuracy.

Code on github can be found here https://github.com/ajayso/ML-Regression-Analysis.

Generalized Linear Models

In the earlier post we covered linear regression (LR). In LR:

1. We assumed that y data points have a normal distribution.

2. The mean of the y data points lies on the line.

Linear regression assumes a normal distribution of the data and fits a line on which the mean of the y data points lies. However, in certain situations we may want to work with other data distributions, for example Binomial, Poisson or Hypergeometric.

A Briefer on data distribution

A brief on data distribution can be found here

When delving into a broader set of data distributions, we need support for different types of distribution models. The next step is to generalize the model above. This can be done by generalizing three things: the distribution of y, the functions of the explanatory variables x, and finally the link between the explanatory variables and the mean of the distribution. This is the basic idea of Generalized Linear Models.

The generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models etc. The table below provides a good summary of GLMs
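To make the idea of choosing a distribution and a link concrete, here is a minimal sketch with base R's glm() on synthetic data. All variable names and coefficients are made up for illustration and are not taken from the gold data set.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 2)

# Gaussian response, identity link: ordinary linear regression.
y_gauss <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit_gauss <- glm(y_gauss ~ x, family = gaussian(link = "identity"))

# Binary response, logit link: logistic regression.
y_bin <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * x))
fit_bin <- glm(y_bin ~ x, family = binomial(link = "logit"))

# Count response, log link: Poisson regression.
y_count <- rpois(n, lambda = exp(0.3 + 0.8 * x))
fit_pois <- glm(y_count ~ x, family = poisson(link = "log"))

# The Gaussian GLM should reproduce the lm() coefficients.
all.equal(coef(fit_gauss), coef(lm(y_gauss ~ x)))
```

Only the family argument changes between the three fits; the linear predictor in x stays the same.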

Predicting Buy/Sell for Gold

Code Example for GLM

Problem Statement

Given historical price data up to and including a given day, the idea is to use the historical gold price together with primary technical indicators to predict whether the gold price fix on the following day will increase or decrease relative to the current day’s price fix. The predictions are based primarily on technical indicators calculated from historical price data for gold as well as for a variety of financial variables.

Feature Definition

The price data alone provides limited insight into future price movement. The idea here is to identify the feature set that best captures the movement of gold prices and provides information not just about past and current movements but also predicts the future movement with reasonably good accuracy.

Technical Indicators

Trendlines

The trend provides vital information, so capturing the current trend matters for the accuracy of the prediction of future prices. Trends can indicate continued price movement and, more importantly, can prove useful for spotting a trend reversal (uptrend or downtrend). However, gold prices cannot be predicted from the trend alone because of fluctuations driven by multiple other variables or factors; unearthing these can be the differentiator for a good model. Trend can be one of the variables, in addition to many others, that give the prediction model a high accuracy.

A simple linear regression with a least-squares cost function is used to fit the trend line over the last n days. The fitted equation is shown below.

y = -0.0394777790495269 * x + 1881.30223607803

y is the gold price, x is rate of change here.

In the R implementation, this is a user-defined function called slope.
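The post's own slope function is not reproduced here. As a rough sketch of the idea, a trend slope over the last n days can be taken from lm(), assuming the regressor is simply the day index within the window. The name trend_slope and the sample prices are hypothetical.

```r
# Hypothetical helper: least-squares slope of the last n prices,
# regressed on the day index 1..n (an assumption of this sketch).
trend_slope <- function(prices, n) {
  recent <- tail(prices, n)
  day <- seq_along(recent)
  fit <- lm(recent ~ day)
  unname(coef(fit)["day"])
}

prices <- c(1200, 1204, 1203, 1210, 1215, 1213, 1220)

# A positive slope suggests a short-term uptrend, a negative one a downtrend.
trend_slope(prices, n = 5)
```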

Rate of Change

Momentum measures the difference between the price on day x and the price n days before; the rate of change (ROC) expresses that difference relative to the earlier price.

ROC is calculated as - ROCn(x) = (P(x) – P(x-n)) / P(x-n)

ROC can indicate the trend: if ROC > 0 it is an uptrend, and if ROC < 0 it is a downtrend. ROC over a period gives a more definite view of a weakening trend or a trend reversal.

To calculate ROC in R we use the TTR package, which provides reasonably good functions for ROC.
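For readers without TTR installed, the n-day rate of change, (P(x) - P(x-n)) / P(x-n), can be sketched in base R as below. The helper roc_n is hypothetical. As far as I know, TTR's ROC() computes a log-based ("continuous") rate by default and needs type = "discrete" to match this formula, so treat that as an assumption to check against the TTR documentation.

```r
# Hypothetical helper: discrete n-day rate of change.
roc_n <- function(prices, n) {
  stopifnot(n >= 1, n < length(prices))
  # Price n days earlier; the first n days have no earlier price.
  lagged <- c(rep(NA, n), head(prices, -n))
  (prices - lagged) / lagged
}

prices <- c(100, 102, 101, 105, 110)

# Positive values point to an uptrend over the window, negative to a downtrend.
roc_n(prices, n = 2)
```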

Ratios

The ratio between ROC values calculated over different time intervals, in particular ROCn / ROCm where m < n, lends insight into how the change in price is itself changing over time.

In the R implementation, this is a user-defined function called ratios.
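A minimal sketch of the ROC ratio in base R follows. The helpers roc_n and roc_ratio are hypothetical stand-ins, not the post's ratios function.

```r
# Hypothetical helper: discrete n-day rate of change.
roc_n <- function(prices, n) {
  lagged <- c(rep(NA, n), head(prices, -n))
  (prices - lagged) / lagged
}

# Hypothetical helper: ratio of the longer-window ROC to the shorter-window ROC.
roc_ratio <- function(prices, n, m) {
  stopifnot(m < n, n < length(prices))
  roc_n(prices, n) / roc_n(prices, m)
}

prices <- c(100, 102, 101, 105, 110, 108, 112)

# The first n entries are NA because the n-day ROC is undefined there.
roc_ratio(prices, n = 4, m = 2)
```

When the shorter-window ROC is exactly zero the ratio is infinite or NaN, so a real feature pipeline would need to decide how to treat those days.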

Stochastic Oscillator

It is used to determine the overbought or oversold levels of a stock or commodity. Overbought means the price has increased significantly over a short period and may be artificially high; the underlying asset is overvalued and the market will soon adjust, bringing the price back down. For a better understanding of the stochastic oscillator refer here.

The stochastic oscillator assumes that in uptrends prices will close near the upper end of the recent price range, and in downtrends near the lower end. We adapt this to use the daily price fix rather than close prices.

The oscillator on day x over an n-day period is calculated as follows.

Ln = lowest price over past n days

Hn= highest price over past n days

P(x) = price on day x

%K = (P(x) – Ln)/ (Hn – Ln) x 100%

If %K is less than 20%, generate a buy signal; if it is greater than 80%, generate a sell signal.

In the R implementation, this is a user-defined function called Stochastic_Oscillator.
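The %K rule can be sketched in base R as follows. The functions stochastic_k and signal_from_k are hypothetical, not the post's Stochastic_Oscillator. This sketch assumes the n-day window includes day x itself, and labels everything between the 20% and 80% thresholds as Hold.

```r
# Hypothetical helper: %K over a rolling n-day window that includes the current day.
stochastic_k <- function(prices, n = 14) {
  k <- rep(NA_real_, length(prices))
  for (i in seq_along(prices)) {
    if (i >= n) {
      window <- prices[(i - n + 1):i]
      lo <- min(window)
      hi <- max(window)
      # A flat window has no range, so %K is left undefined.
      if (hi > lo) k[i] <- 100 * (prices[i] - lo) / (hi - lo)
    }
  }
  k
}

# Hypothetical helper: map %K to the Buy / Sell / Hold rule.
signal_from_k <- function(k) {
  ifelse(is.na(k), NA_character_,
         ifelse(k < 20, "Buy", ifelse(k > 80, "Sell", "Hold")))
}

prices <- c(10, 12, 11, 15, 14, 9, 13)
k <- stochastic_k(prices, n = 5)
data.frame(price = prices, k = round(k, 1), signal = signal_from_k(k))
```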

Basic Feature Selection

The final list of features is

Slope
ROC
Ratios of ROC
Oscillator for a period of 14 – we use the oscillator function to determine Buy or Sell, which is stored in BuySellFlag

Gaussian Regression

Using the set of features selected above, the first algorithm is the generalized linear model (GLM) in R; the code can be found here. We pick the data from 2014 onwards. The technical indicators alone may not be enough, but for this example we use only those. We start with the Gaussian family, since the Gaussian distribution is the normal distribution, and we are predicting Buy, Sell and Hold on gold based on the daily price and indicator inputs. We could not use logistic regression as we have more than 2 outcomes here.

The Gaussian regression with GLM gives an accuracy of 15.78%, which is not acceptable.

So, digging further, we decided to use multinomial logistic regression, which is the regression analysis to conduct when the dependent variable is nominal with more than two levels. Thus it is an extension of logistic regression.

The multinomial logistic regression gives an accuracy of 90.131%, which is reasonably acceptable.

In conclusion, we can predict the daily gold buy, sell or hold signal at roughly 90% accuracy.

However, this may not be enough. If one decides to take the accuracy up further, we need to include more data.
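The post does not say which package was used for the multinomial model. One common choice in R is multinom() from the nnet package, sketched below on synthetic data. The feature names, the labelling rule and the train / test split are assumptions made for illustration, so the accuracy it prints says nothing about the gold model.

```r
library(nnet)  # recommended package bundled with most R installations

set.seed(1)
n <- 300

# Synthetic stand-ins for two indicator features.
roc <- rnorm(n)
k <- runif(n, 0, 100)

# Synthetic three-level label derived from the oscillator thresholds.
signal <- factor(ifelse(k < 20, "Buy", ifelse(k > 80, "Sell", "Hold")),
                 levels = c("Hold", "Buy", "Sell"))
gold_features <- data.frame(roc, k, signal)

# Train on the first 200 rows, evaluate on the remaining 100.
idx <- 1:200
fit <- multinom(signal ~ roc + k, data = gold_features[idx, ], trace = FALSE)

# Predicted class for each held-out row and the share predicted correctly.
pred <- predict(fit, newdata = gold_features[-idx, ])
accuracy <- mean(pred == gold_features$signal[-idx])
accuracy
```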

Intermarket Variables

Gold prices, and commodity prices in general, may also be related to other financial variables. For instance, gold prices are commonly thought to be related to stock prices; interest rates; the value of the dollar; and other factors. Therefore, I wanted to explore whether these other variables could be effective inputs to the prediction of gold prices. I collected the following data over the same time period as the gold price fix data, from early 2007 to late 2013:

USStockIndices: Dow Jones Industrial Average (DJI); S&P 500 (GSPC); NASDAQ Composite (IXIC)
WorldStockIndices: Ibovespa (BVSP); CAC 40 (FCHI); FTSE 100 (FTSE); DAX (GDAXI); S&P/TSX Composite (GSPTSE); Hang Seng Index (HSI); KOSPI Composite (KS11); Euronext 100 (N100); Nikkei 225 (N225); Shanghai Composite (SSEC); SMI (SSMI)
COMEX Futures: Gold futures; Silver futures; Copper futures; Oil futures
FOREX Rates: EUR-USD (Euro); GBP-USD (British Pound); USD-JPY (Japanese yen); USD-CNY (Chinese yuan)
Bond Rates: US 5-year bond yield; US 10-year bond yield; Eurobund futures
Dollar Index: Measures relative value of US dollar

The variables which are highly effective in predicting Buy / Sell of gold are (with correlation coefficients):

1. Gold Futures	0.72
2. Silver Futures	0.50
3. Copper Futures	0.27
4. EUR-USD	0.24

One can add these features to the dataset, and the accuracy should improve further.
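A sketch of how candidate intermarket variables could be ranked by correlation with the gold price, using base R's cor(). The series below are synthetic and the column names hypothetical; real inputs would be the downloaded index, futures and FOREX series aligned by date.

```r
set.seed(7)
n <- 250

# Synthetic gold price (random walk) and two candidate features.
gold <- 1300 + cumsum(rnorm(n))
candidates <- data.frame(
  gold_futures = 1.01 * gold + rnorm(n, sd = 0.5),  # tracks gold closely
  unrelated    = rnorm(n)                           # pure noise
)

# Pearson correlation of each candidate with the gold price.
cors <- sapply(candidates, function(col) cor(col, gold))

# Rank by absolute correlation, strongest first.
sort(abs(cors), decreasing = TRUE)
```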

Deploying this code to a fully functional Gold Buy/Sell prediction

FYI, this is fully functional code. One can deploy this code on R Server and write the integration code or bot that gets the real-time gold data and calls the model to get the buy / sell / hold signal. Preferably, deploy to Azure Bot Service and use Azure ML to host the algorithm.

Wednesday, 7 December 2016

Machine Learning Basics–Regression Analysis–Part 1

Data is all around us; it's happening. In its simplest form, data is a set of variables (of course with values). While working on a database in any domain, once we have a relatively okay understanding of the entities, we start to look for relationships within the entities or the variables. We do stumble upon cases where a gut feeling tells us that a variable y is dependent on a variable x or on a set of variables (x1, x2, …, xn).

To a certain degree we can predict the new value of y based on x, because there is a clear relationship between y and x. The process of estimating relationships among variables is termed Regression Analysis. Here the focus is on the relationship between a dependent variable and one or more independent variables (called the predictors).

There are various types of Regression Analysis, for example Linear, Logistic, Polynomial…. Each of them is used for a specific purpose.

Regression Analysis requires prior knowledge of the dataset in order to estimate the forecasted value of the dependent variable. There is a need for a dataset with defined outcomes of the dependent variable, which will be used to train the algorithm. The regression set of algorithms is a part of supervised learning.

Different Types of Regression

Linear Regression – The relationship between the dependent and independent variables is such that the nature of the regression line is linear.

Linear Regression establishes a relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (also known as the regression line). The best-fit straight line is generally found with the Least Squares Method, the most commonly used method for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
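For a single predictor, the least-squares line has a closed form: the slope is sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and the intercept is mean(y) - slope * mean(x). The short sketch below, on made-up numbers, checks that formula against lm().

```r
# Made-up data points for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for one predictor.
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# lm() minimizes the same sum of squared vertical deviations.
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```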

Example of Linear Regression

As an example for Linear Regression we are taking the Gold and Silver Price Correlation.

The basic plot can be seen below. The data and code can be found here. The data consists of the last 1 year of data. If one needs the complete data for the last 10 years, it can be downloaded from many of the free data sources, or refer to http://www.macrotrends.net/2517/gold-prices-vs-silver-prices-historical-chart.

For trying out linear regression, the code is pretty straightforward:

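# Install the Quandl client: the CRAN release, then the GitHub version via devtools
install.packages("Quandl")
install.packages("devtools")
library(devtools)
install_github("quandl/quandl-r")
library(Quandl)

# Download the LBMA daily price fixes for gold and silver
goldprices = Quandl("LBMA/GOLD")
silverprices = Quandl("LBMA/SILVER")

# Keep the first 730 rows of each series
goldpricesshortterm = goldprices[1:730,]
silverpricesshortterm = silverprices[1:730,]

# Combine column-wise; rows are paired by position, so both series must cover the same dates
# USD is the silver price, USD..AM. is the gold AM fix
shtsilver = data.frame(silverpricesshortterm,goldpricesshortterm)

# Regress the gold AM fix on the silver price and print the fit summary
sfit = lm(USD..AM. ~ USD, data = shtsilver)
summary(sfit)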

Quandl is used for the gold and silver price data. The Linear Regression Model is applied to the last 2 years of data.

Interpreting the results of lm

In the summary of the lm result, this is what we look for:

1. Residuals are essentially the difference between the actual observed response and the response predicted by the model. We look for a symmetrical distribution of these points around the mean value zero (0). In this case we see a symmetrical distribution.

2. The t-statistic is the coefficient estimate divided by its standard error. Here the t-values are relatively far away from zero, which could indicate that a relationship exists. In general, t-values are also used to compute p-values.

3. In our example, the R-squared we get is 0.8028, so roughly 80% of the variance found in the response variable (gold price) can be explained by the predictor variable (silver price).
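The three quantities discussed here can be pulled out of a fitted model programmatically. The sketch below uses synthetic silver and gold series (the numbers are invented, not Quandl data) to show where each item lives in the summary object.

```r
set.seed(123)

# Synthetic stand-ins for the silver and gold price columns.
silver <- runif(100, 14, 20)
gold <- 200 + 60 * silver + rnorm(100, sd = 40)

fit <- lm(gold ~ silver)
s <- summary(fit)

# 1. Residuals: observed minus fitted; the quartiles should look roughly symmetric around 0.
quantile(residuals(fit))

# 2. Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
s$coefficients

# 3. R-squared: share of the variance in gold explained by silver.
s$r.squared
```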

Friday, 18 July 2014

Azure Machine Learning–K Means Clustering…

Machine Learning (ML) has been around for almost 5 decades now. In the last couple of years, with cloud computing and big data being the dominant colours in the IT industry, ML has found a unique space in the Big Data problem.

A Brief on Azure ML

Azure ML is Microsoft's simpler Machine Learning offering, aimed at a quick and easy start with ML. It's definitely a good starting point for getting used to Machine Learning. As one starts using it more often, one will realize that Azure ML is limiting in terms of the choice of algorithms, data manipulation operations and the ability to run as part of a bigger scheme of things.

Most folks will start with Azure ML and figure out that there are multiple places where the constructs are limiting. So, as a good citizen, MSFT went and added ExecuteR: one can program in RStudio for test and development and eventually, for larger datasets, port or intelligently copy the code to ExecuteR. A good video on how to use ExecuteR in Azure ML can be found here http://channel9.msdn.com/Blogs/Windows-Azure/R-in-Azure-ML-Studio

K-Means Clustering in Azure ML Video-

Data analysis starts off with the initial task of having to classify (group) the data. There are various algorithms which one may employ to do this. The simplest and most widely used is K-Means Clustering. This session talks about K-Means Clustering and how to do it using Azure ML.

The session takeaway: Azure ML is great, but not without R.
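Outside Azure ML, the same algorithm is available in base R as kmeans(). A minimal sketch on two synthetic, well-separated groups of points follows; the data and the choice of two clusters are assumptions for illustration.

```r
set.seed(10)

# Two synthetic groups of 50 two-dimensional points each.
group_a <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
group_b <- matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
points <- rbind(group_a, group_b)

# K-means with k = 2; nstart reruns the algorithm from several random starts.
km <- kmeans(points, centers = 2, nstart = 10)

# Cross-tabulate the assigned cluster against the true group.
table(cluster = km$cluster, group = rep(c("a", "b"), each = 50))

# Cluster centres found by the algorithm.
km$centers
```

K-means needs the number of clusters up front, so on real data k is usually chosen by comparing the within-cluster sum of squares across several values.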

Files

Presentation Shared Here-https://drive.google.com/file/d/0B5lmX16jC3ZEcEprM2F0aW9FOVU/edit?usp=sharing

DataSets & Demo Here

RScripts - https://drive.google.com/file/d/0B5lmX16jC3ZES1c5MEdiamFCN0E/edit?usp=sharing

DataSets- https://drive.google.com/file/d/0B5lmX16jC3ZEY2R0T3R3WDU3NzQ/edit?usp=sharing