In the prior 2 posts, the focus was more on using machine learning techniques like regression to predict gold buy / sell signals. While the models that we built ,give an idea on how to get to a final buy and sell signal for gold with the assumption data is clean and always available. Without relevant clean data, the model predictions would be of zero relevance to the business.
In every big data analysis project which heavily rely on real time data, a lot is dependent on the underlying software architecture which is responsible to deliver the data in an edible form for the models. In this post I have attempted to put together high level architecture of a real time stock analysis platform. This is a high level architecture of XTrade platform is current in production for one of the customers.
Additionally I have a starters kit available at https://github.com/ajayso/XeusTrade.git which comprises of templates for all the components.
A Brief on XTrade - Day trading can be risky business, human analysis of real time data without intelligent insights can be detrimental. XTrade is real time stock technical analysis platform which ingest real time stock feed , industry data and news and analyse to provide predictions, correlations (weak and strong) called quants. XTrade interfaces with trading systems to execute the actions or provide these actionable insights to an average trader or analysts.
The architectures for most real-time system are in line with the lambda architecture. This post will focus on the speed (real-time processing) area of the lambda architecture.
Getting the Data In ……
Data comes in from multiple sources and can be varying formats and segregating relevant data needs a specialized software. XTrade had multiple data sources below are some of the more relevant ones.
- Stock feeds
- Industry feeds
- News data
- Other relevant data
The requirement is to pull data feeds from the data sources at a specified frequency( in minutes). Stale data management (dont pull stale data) and transformation to standard format in this case json is something which Apache NiFi provides for absolute ease. Apache NiFi is basic architectural building block for data ingest , transformation. NiFi has many processors with the options of writing your own processors in java.
Apache NiFi is an enterprise integration and dataflow automation tool that allows a user to send, receive, route, transform, and sort data, as needed, in an automated and configurable way. Similar tools exist, but NiFi is different because of its user-friendly drag-and-drop graphical user interface and the ease with which it can be customized on the fly for specific needs. Think of creating a simple flow chart of what you want to do with your data; that is how easy it is to create a dataflow in NIFi. It is also highly scalable and can run on something as simple as a laptop or clustered across many high-performance servers.
Below is example of XTrade Nifi data flows,
In the starter kit the NiFi folder has 2 templates which can be reused, these are data sources for individual stocks and news.
Implementation Details: The data flow will pull the feeds and process them, post which these need to put into a messaging systems, In this case we have used Apache Kafka. JSON is the standard data format used within this architecture.One has the option of persisting these data feeds to hdfs or any other persisted data store. Apache NiFI runs on a cluster and highly fault tolerant.
Messaging …..
The requirement of having low latency reliable messaging system is really important. Apache Kafka -is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Apache Kafka is been used as basic architecture building block for messaging in the architecture. Below is the high level representation of Kafka implementation for XTrade
Implementation Details: Brokers are segregated based on the following
- Individual Stocks: Since XTrade handles prediction and data insight for individual registered stocks from the customers, the decision to have separate broker for the same was taken to handle future scale out requirements.
- Broker(Industry) messaging system for industry stock prices and news.
- Broker(misc,) messaging exposure and risk management data coming in from customer systems and other public data sources.
Core Analysis and Data Decision Making…..
Fast processing of the data streams coming from the messaging layer can really help cut down latency of the overall lifecyle. Stream processing and calling analysis model in R, spark and send back the prediction and data insights in matter on minutes is key here. A platform which has the flexibility of supporting multiple programming languages was the need of the hour.
Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
Implementation Details: Stock and Industry data coming in from the Kafka is fetched by Storm with a Kafka Spout and further processes news and stock analysis calling R , Spark model which will emit the prediction back to Kafka topic (Broker Results), In the next post I will detail out he Spark, R model for stock technical analysis. The prediction are also written to Cassandra.
A ready skeletal code for Storm (eclipse project) written in java can be found here.
Storm architecture for XTrade
Putting it all together….
The entire solution can deployed to aws / azure , Managing the clusters across this distributed environment is a daunting task.
Apache Mesos looked to be a good option for the same. Apache Mesos is a centralised fault-tolerant cluster manager. It’s designed for distributed computing environments to provide resource isolation and management across a cluster of slave nodes. It schedules CPU and memory resources across the cluster in much the same way the Linux Kernel schedules local resources. Mesos support for Hadoop, NiFi, Kafka , Storm and Cassandra exists.