Ajay Solanki: Hadoop Azure

Showing posts with label Hadoop Azure. Show all posts

Saturday, 28 January 2017

Real-time Financial Stocks Analysis Architecture

In the prior 2 posts, the focus was more on using machine learning techniques like regression to predict gold buy / sell signals. While the models that we built ,give an idea on how to get to a final buy and sell signal for gold with the assumption data is clean and always available. Without relevant clean data, the model predictions would be of zero relevance to the business.

In every big data analysis project which heavily rely on real time data, a lot is dependent on the underlying software architecture which is responsible to deliver the data in an edible form for the models. In this post I have attempted to put together high level architecture of a real time stock analysis platform. This is a high level architecture of XTrade platform is current in production for one of the customers.

Additionally I have a starters kit available at https://github.com/ajayso/XeusTrade.git which comprises of templates for all the components.

A Brief on XTrade - Day trading can be risky business, human analysis of real time data without intelligent insights can be detrimental. XTrade is real time stock technical analysis platform which ingest real time stock feed , industry data and news and analyse to provide predictions, correlations (weak and strong) called quants. XTrade interfaces with trading systems to execute the actions or provide these actionable insights to an average trader or analysts.

The architectures for most real-time system are in line with the lambda architecture. This post will focus on the speed (real-time processing) area of the lambda architecture.

Getting the Data In ……

Data comes in from multiple sources and can be varying formats and segregating relevant data needs a specialized software. XTrade had multiple data sources below are some of the more relevant ones.

Stock feeds
Industry feeds
News data
Other relevant data

The requirement is to pull data feeds from the data sources at a specified frequency( in minutes). Stale data management (dont pull stale data) and transformation to standard format in this case json is something which Apache NiFi provides for absolute ease. Apache NiFi is basic architectural building block for data ingest , transformation. NiFi has many processors with the options of writing your own processors in java.

Apache NiFi is an enterprise integration and dataflow automation tool that allows a user to send, receive, route, transform, and sort data, as needed, in an automated and configurable way. Similar tools exist, but NiFi is different because of its user-friendly drag-and-drop graphical user interface and the ease with which it can be customized on the fly for specific needs. Think of creating a simple flow chart of what you want to do with your data; that is how easy it is to create a dataflow in NIFi. It is also highly scalable and can run on something as simple as a laptop or clustered across many high-performance servers.

Below is example of XTrade Nifi data flows,

In the starter kit the NiFi folder has 2 templates which can be reused, these are data sources for individual stocks and news.

Implementation Details: The data flow will pull the feeds and process them, post which these need to put into a messaging systems, In this case we have used Apache Kafka. JSON is the standard data format used within this architecture.One has the option of persisting these data feeds to hdfs or any other persisted data store. Apache NiFI runs on a cluster and highly fault tolerant.

Messaging …..

The requirement of having low latency reliable messaging system is really important. Apache Kafka -is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Apache Kafka is been used as basic architecture building block for messaging in the architecture. Below is the high level representation of Kafka implementation for XTrade

Implementation Details: Brokers are segregated based on the following

Individual Stocks: Since XTrade handles prediction and data insight for individual registered stocks from the customers, the decision to have separate broker for the same was taken to handle future scale out requirements.
Broker(Industry) messaging system for industry stock prices and news.
Broker(misc,) messaging exposure and risk management data coming in from customer systems and other public data sources.

Core Analysis and Data Decision Making…..

Fast processing of the data streams coming from the messaging layer can really help cut down latency of the overall lifecyle. Stream processing and calling analysis model in R, spark and send back the prediction and data insights in matter on minutes is key here. A platform which has the flexibility of supporting multiple programming languages was the need of the hour.

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Implementation Details: Stock and Industry data coming in from the Kafka is fetched by Storm with a Kafka Spout and further processes news and stock analysis calling R , Spark model which will emit the prediction back to Kafka topic (Broker Results), In the next post I will detail out he Spark, R model for stock technical analysis. The prediction are also written to Cassandra.

A ready skeletal code for Storm (eclipse project) written in java can be found here.

Storm architecture for XTrade

Putting it all together….

The entire solution can deployed to aws / azure , Managing the clusters across this distributed environment is a daunting task.

Apache Mesos looked to be a good option for the same. Apache Mesos is a centralised fault-tolerant cluster manager. It’s designed for distributed computing environments to provide resource isolation and management across a cluster of slave nodes. It schedules CPU and memory resources across the cluster in much the same way the Linux Kernel schedules local resources. Mesos support for Hadoop, NiFi, Kafka , Storm and Cassandra exists.

Monday, 26 November 2012

Narwhal- Big Data for US Election

Codename Narwhal is Obama secret data integration project which started 9 months has paid off. The team of data scientists , developers, and digital advertising experts, putting there heads together to really get big data to help the team make better decision.

4Gb/s, 10k requests per second, 2,000 nodes, 3 datacenters, 180TB and 8.5 billion requests. Design, deploy, dismantle in 583 days to elect the President…

Key Take Away

While the entire platform is built on Amazon, its a greatly proven architecture, a lot of the application were build around the start up strategy “the open source culture” and the “idea of core Platform Services in the form of Narwhal Services” has made the overall picture very simple.

Background

At a very high level Narwhal integrated data across multiple applications

Facebook,
National list of voters and lot more data from the swing state.
Swing state data : As a standard sales principle the 60% of customer who are on the fence are to be targeted well.
Public voting records
Responses coming directly from the voters
Tracking voters across the web
Serving ads to public with targeted messages on the campaign sites
Analysing what does a voter read online
Obama supporter on Facebook – cross sell sending emails to a supported about there friends in the swing states encouraging them to vote.

The starting of the data architecture was the database of registered voters from Democratic National Committee & keeping it up to date. Playing around with this data in terms of adding voter data mix seeing the trends.

Data collection has been highly private and running analysis this data to decide next probable strategies.

High Level Analytics

If we look into the Analytics -The Obama campaign had a list of every registered voter in the battleground states. The job of the campaign’s much-heralded data scientists was to use the information they had amassed to determine which voters the campaign should target— and what each voter needed to hear.

What did the data help them in

Deep Targeting of voters

Race wise targeting example Latino community using diversity

The data scientist really came down to the following Individual estimates for each swing state voter’s behaviour.

These four numbers were included in the campaign’s voter database, and each score, typically on a scale of 1 to 100, predicted a different element of how that voter was likely to behave.
Two of the numbers calculated voters’ likelihood of supporting Obama, and of actually showing up to the polls. These estimates had been used in 2008. But the analysts also used data about individual voters to make new, more complicated predictions.
If a voter supported Obama, but didn’t vote regularly, how likely was he or she to respond to the campaign’s reminders to get to the polls?

The final estimate was the one that had proved most elusive to earlier campaigns—and that may be most influential in the future.

Micro targeting another numerical scoring mechanism is been used widely so more data on the same is here.

The Complete Architecture

The central piece or key application block is Amazon's cloud computing services for computing and storage power. At its peak, the IT infrastructure for the Obama campaign took up "a significant amount of resources in AWS's Northern Virginia data center,".

Narwhal Services

The key architectural decision marking in an ambiguous architecture situation is to get the core perfect. The Obama team build the core Narwhal a set of services that acted as an interface to a shared data stores for all application. Moreover the service layer was REST based which allowed building applications in any development language and platform making it possible to quickly develop new applications and to integrate existing ones into the campaign's system. Those apps include sophisticated analytics programs like Dreamcatcher, a tool developed to "microtarget" voters based on sentiments within text. And there's Dashboard, the "virtual field office" application that helped volunteers communicate and collaborate.

With introduction of Narwhal Services Layer this gave the option decoupling all application and allow each application scale up individually and at the same time allow to share data across all application. Given the nature of the business and the need to build the application on the fly it was important to build something like a Narwhal Services Layers

Platform Agnostic Development

With all services exposed as REST Based the option for developers to build an application in any language and platform.

The team

The idea of recruiting people who already knew the territory, snapping up both local talent and people from out of town with Internet bona fides—veterans from companies like Google, Facebook, Twitter, and TripIt.

"All these guys have had experience working in startups and experience in scaling apps from nothing to huge in really tight situations like we were in the campaign,".

The need to hire engineers who understand APIs—engineers that spend a lot of time on the Internet building platforms.

The Technical Stack

Narwhal. Written in Python, the API side of Narwhal exposes data elements through standard HTTP requests. While it was designed to work on top of any data store, the Obama tech team relied on Amazon's MySQL-based Relational Database Service (RDS). The "snapshot" capability of RDS allowed images of databases to be dumped into Simple Storage Service (S3) instances without having to run backups.

Even with the rapidly growing sets of shared data, the Obama tech team was able to stick with RDS for the entire campaign—though it required some finesse.

They were some limitations with RDS but they were largely self-inflicted . They were able to work around those and stretch how far we were able to take RDS. If the campaign had been longer, it would have definitely had to migrate to big EC2 boxes with MySQL on them instead."

The team also tested Amazon's DynamoDB "NoSQL" database when it was introduced. While it didn't replace the SQL-based RDS service as Narwhal's data store, it was pressed into service for some of the other parts of the campaign's infrastructure. In particular, it was used in conjunction with the campaign's social networking "get-out-the-vote" efforts.

The integration element of Narwhal was built largely using programs that run off Amazon's Simple Queue Service (SQS). It pulled in streams of data from NGP VAN's and Blue State Digital's applications, polling data providers, and many more, and handed them off to worker applications—which in turn stuffed the data into SQS queues for processing and conversion from the vendors' APIs. Another element of Narwhal that used SQS was its e-mail infrastructure for applications, using worker applications to process e-mails, storing them in S3 to pass them in bulk from one stage of handling to another.

Initially, Narwhal development was shared across all the engineers. As the team grew near the beginning of 2012, however, Narwhal development was broken into two groups—an API team that developed the interfaces required for the applications being developed in-house by the campaign, and an integration team that handled connecting the data streams from vendors' applications.

The applications

As the team supporting Narwhal grew, the pace of application development accelerated as well, with more applications being put in the hands of the field force. Perhaps the most visible of those applications to the people on the front lines were Dashboard and Call Tool.

Written in Rails, Dashboard was launched in early 2012. "It's a little unconventional in that it never talks to a database directly—just to Narwhal through the API," Ecker said. "We set out to build this online field office so that it would let people organize into groups and teams in local neighbourhoods, and have message boards and join constituency groups."

An Obama campaign video demonstrating how to use Dashboard.

Enlarge / The Dashboard Web application, still live, helped automate the recruitment and outreach to would-be Obama campaign volunteers.

Dashboard didn't replace real-world field offices; rather, it was designed to overcome the problems posed by the absence of a common tool set in the 2008 election, making it easier for volunteers to be recruited and connected with people in their area. It also handled some of the metrics of running a field organization by tracking activities such as canvassing, voter registration, and phone calls to voters.

The Obama campaign couldn't mandate Dashboard's use. But the developer team evolved the program as it developed relationships with people in the field, and Dashboard use started to pick up steam. Part of what drove adoption of Dashboard was its heavy social networking element, which made it a sort of Facebook for Obama supporters.

Enlarge / Call Tool offered supporters a way to join in on specific affinity-group calling programs.

Call Tool was the Obama campaign's tool to drive its get-out-the-vote (GOTV) and other voter contact efforts. It allowed volunteers anywhere to join a call campaign, presenting a random person's phone number and a script with prompts to follow. Call Tool also allowed for users to enter notes about calls that could be processed by "collaborative filtering" on the back end—identifying if a number was bad, or if the person at that number spoke only Spanish, for instance—to ensure that future calls were handled properly.

Both Call Tool and Dashboard—as well as nearly all of the other volunteer-facing applications coded by the Obama campaign's IT team—integrated with another application called Identity. Identity was a single-sign-on application that tracked volunteer activity across various activities and allowed for all sorts of campaign metrics, such as tracking the number of calls made with Call Tool and displaying them in Dashboard as part of group "leaderboards." The leaderboards were developed to "gamify" activities like calling, allowing for what Ecker called "friendly competition" within groups or regions.

All of the data collected through various volunteer interactions and other outreach found its way into Narwhal's data store, where it could be mined for other purposes. Much of the data was streamed into Dreamcatcher and into a Vertica columnar database cluster used by the analytics team for deep dives into the data.

A good comparison http://communities-dominate.blogs.com/brands/2012/11/orca-meets-narwhal-how-the-obama-ground-game-crushed-romney-a-look-behind-the-math.html

Solving real business problems with Cloud………………. Just the beginning

Sunday, 23 September 2012

Hadoop, Sql Azure the BI and Analytics Dilemma

MSFT teams in Redmond haven’t made any official statement on Analysis Service (SSAS) for Sql Azure. With Azure making big commitment on Hadoop the dilemma becomes two fold building BI & Analytical capabilities for Sql Azure and Hadoop. I’m sure countless discussions have happened in those Redmond buildings deciding the same. Some guess work around this “MSFT will release Analysis Services for Sql Azure early next year”. The Analysis Service should involve support for both Sql and No Sql.
I started with a simple application in the financial industry vertical in cloud. I soon reached a point where I was pushed to corner to make some hard decision on cloud as the application besides the transactional features required tons of BI and Analytics spread across both structured and unstructured world.
Sql Azure unwillingly became the choice of structured database and Hadoop and Azure became willing choice for unstructured data. Hadoop provide very quick and optimal search across terra bytes of data and Sql Azure with its limited on cloud offering gave transactional. The BI and Analytical capability became a nightmare.
The MSFT teams on Sql Azure are tight lipped about an “Analysis framework in Sql Azure which can do both structured & un structured”. For right now I have a Sql 2012 Analysis Service running on Azure virtual machines which did pretty much what I wanted. But then “unwillingly I have to say SSAS of Sql 2012 is coming in Sql Azure” pure guesswork.
The application as of current looks like this

What below is my thought process pure guesswork? The shift to No Sql is evident MSFT – Azure platform has to embrace the No Sql platform completely this involves extending the BI and Analytics. What seems to be emergent is an Analysis Service architecture is expected to support both Sql and No Sql running out of Sql Azure platform. The Sql Server 2012Analysis Service - xvelocity an in memory analytics engine is step to move into the PaaS model for Sql Azure.This new engine is delivered within the following modules

xVelocity for Data Warehousing: is a memory optimized columnstore index for high speed data querying (relational queries).
xVelocity for Business Intelligence: is the in-memory analytics engine for Analysis Services (Tabular Model) and PowerPivot.

As the name in-memory engine implies all the data is stored in memory. Although todays computer systems are equipped with gigabytes of memory, memory still is an expensive resource. Therefore we need to be able to analyze the memory usage of the Analysis Services in-memory engine to understand how much memory is consumed by the different applications. The SQL Azure Analysis Service Framework should deliver a massively parallel processing infrastructure with a software solution that embeds both SQL and MapReduce analytic processing for deeper analytic insights on multi-structured data and new analytic capabilities driven by data science. Analysis service is most likely to uses an integrated MapReduce analytics engine for embedded analytic processing, simplifying enterprise access for big data analytics. Sql Server Analysis Service formally supported only SQL so that any business intelligence tool that generates standard SQL or any business analyst that knows SQL can immediately invoke the power of data science without having to learn programming languages or new interfaces. What is expected Sql Analysis Services Framework expected to look like is

While writing my applications I have found pushing data into Hadoop Azure cumbersome. I have resorted to writing data poll mechanism which will pull data from the Azure Blob storage push to Hadoop Head Node. This is not a documented way, it works. Find the code here. https://github.com/ajayso/AzureHeadNodeFileReceiver.git