Monday, 27 March 2017

Parallelization of R code using Azure Infrastructure


When working with large data sets, exploring which machine learning algorithm fits the bill is a daunting task. Moreover, these ML algorithms can run for hours or even days in certain cases, so there is always a need to have compute resources available on the fly. R is single-threaded by nature. To support parallel constructs such as parallel for loops and apply functions, R has the parallel package, which supports multi-core and cluster-based parallel execution. Clusters come in both PSOCK and FORK implementations.
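As a minimal sketch of the two flavours mentioned above, the snippet below runs the same toy workload once on a PSOCK cluster and once with forked workers. The function name slow_square and the worker count of 2 are illustrative assumptions, not something from a real workload.

```r
library(parallel)

# A deliberately slow function standing in for a model fit (illustrative only).
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

inputs <- 1:8

# PSOCK: starts fresh R worker processes. Works on every platform, but any
# data or packages the workers need must be shipped to them explicitly.
cl <- makeCluster(2, type = "PSOCK")
psock_result <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# FORK-style: mclapply() forks the current process, so the workers see the
# parent's memory without copying it up front. Forking is not available on
# Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
fork_result <- mclapply(inputs, slow_square, mc.cores = cores)

# Both approaches should return the same list of results.
stopifnot(identical(psock_result, fork_result))
```

PSOCK is the portable choice; FORK tends to be cheaper on Linux and macOS when the workers need read access to a large object that already sits in memory.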

doAzureParallel is a lightweight R package built on top of the Azure Batch service (a job scheduling service) that lets you use Azure compute resources from within an R session. doAzureParallel supports the foreach parallel construct.
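A rough sketch of what this looks like from the R session is shown below. The file names credentials.json and cluster.json are assumptions (any paths would do), and the snippet falls back to a sequential backend when those files or the package are missing, so that it can be tried locally without an Azure subscription.

```r
library(foreach)

# Assumed config file names; they must hold your Batch/storage credentials
# and the cluster definition respectively.
use_azure <- file.exists("credentials.json") &&
  file.exists("cluster.json") &&
  requireNamespace("doAzureParallel", quietly = TRUE)

if (use_azure) {
  doAzureParallel::setCredentials("credentials.json")
  cluster <- doAzureParallel::makeCluster("cluster.json")
  doAzureParallel::registerDoAzureParallel(cluster)
} else {
  # Local stand-in: %dopar% then simply runs the loop sequentially.
  registerDoSEQ()
}

# The loop body is ordinary foreach code; only the registered backend
# decides where each iteration runs.
results <- foreach(i = 1:4, .combine = c) %dopar% {
  sqrt(i)
}

if (use_azure) doAzureParallel::stopCluster(cluster)
```

The point of the design is that the foreach loop itself does not change between a laptop and an Azure pool.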

Getting started with doAzureParallel

The video below walks you through the basics of doAzureParallel.


doAzureParallel does not provide parallel versions of the apply functions. If you need them, you can use the parallel package on each node of the cluster and still get the benefit of the parallel apply functions. Parallelism brings added complexity around memory management and caching, so it helps to understand how FORK can help with that. The video below explains how to use the parallel package and take the parallel execution of the code down to the core level.
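The idea of combining the two levels can be sketched as follows: the outer foreach loop hands one chunk of work to each node, and inside the loop body mclapply() spreads that chunk across the node's cores. This is only an illustration under assumptions: the chunking scheme and the core count of 2 are made up, and a sequential backend stands in for the Azure one so the snippet runs locally.

```r
library(foreach)
library(parallel)

# Stand-in backend; on Azure, registerDoAzureParallel(cluster) would go here.
registerDoSEQ()

# Forking is unavailable on Windows, so use a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the work into one chunk per node (three hypothetical nodes).
chunks <- split(1:12, rep(1:3, each = 4))

chunk_sums <- foreach(chunk = chunks, .combine = c,
                      .packages = "parallel") %dopar% {
  # Core-level parallelism inside a single node.
  squares <- mclapply(chunk, function(x) x^2, mc.cores = cores)
  sum(unlist(squares))
}
```

In practice mc.cores would be matched to the core count of the VM size chosen for the pool.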

Running parallel constructs along with doAzureParallel


Parallelization of mlr algorithms

doAzureParallel in its current form supports foreach; it needs to graduate to supporting the parallel apply functions. Taking the discussion to the next level, it would be lovely if doAzureParallel supported running the mlr set of algorithms (classification, regression) in parallel. The current set of packages, such as parallelMap, BatchJobs and mlr, solves the problem of running the mlr algorithms in parallel. It's pretty easy to see how a larger model, more iterations or a different choice of methods could result in unacceptably long run times. One could use multi-core or socket-level parallelism, but ideally taking advantage of as much computing resource as possible is the better choice.

Apparently the BatchJobs package doesn't support the Azure Batch service.

parallelMap is now directly integrated into mlr, and this makes scaling to parallel back-ends seamless. Our choice of back-end is parameterized, so we can write algorithms once and choose the parallel back-end depending on the resources we have available when we run the model. To illustrate this, we re-run the same model, but instead of running it on a single node, we run it on a clustered environment running OpenLava, an open-source, Platform LSF-compatible workload manager now supported by BatchJobs.
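A small sketch of what "parameterized back-end" means in practice is given below. The learner, task and fold count are arbitrary choices for illustration (iris.task ships with mlr), and the socket mode is used only because it works on every platform; this is not the model referred to above.

```r
library(mlr)
library(parallelMap)

# The back-end is just a parameter: "local", "multicore" and "socket" are
# among the modes parallelMap accepts. "socket" is an assumed choice here.
backend <- "socket"

parallelStart(mode = backend, cpus = 2, show.info = FALSE)

# 3-fold cross-validation of a decision tree on the built-in iris task.
# mlr hands the resampling iterations to whatever back-end was started.
rdesc <- makeResampleDesc("CV", iters = 3)
res <- resample("classif.rpart", iris.task, rdesc, show.info = FALSE)

parallelStop()

# Aggregated mean misclassification error across the folds.
print(res$aggr)
```

Switching to a different back-end only means changing the value of backend; the modelling code stays untouched.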

The video below explains how to use parallelMap and mlr in a multi-core scenario along with doAzureParallel.



Demo codebase can be found here -


Saturday, 4 March 2017

Azure Data Factory–DE glossed



Having worked on the Apache stack for some time, I decided to look at the Azure big data stack. My starting point is data ingest. For most big data projects, the journey starts with ingesting data, then cleaning and transforming it so that it is ready for analysis. Azure Data Factory is Microsoft Azure's cloud-based data integration service that automates the movement and transformation of data. At a very basic level, below is a representation of the data lifecycle in big data projects.



Azure Data Factory has the following constructs:

- Linked Services – define where the data is sourced from and where it is written to.

- Pipelines and Activities – Pipelines are logical groups of activities that perform the job of moving data from a source to a destination.

- Datasets – Linked services interface the Data Factory with external data sources. Datasets are a representation of the data store.


Linked Services provide the interfaces to external sources. Currently, support is limited to Azure, databases, file-based stores, Salesforce and OData; a complete list can be found here.

From a customization standpoint, one can create custom activities. I found the linked services limited. In contrast, Apache NiFi seems to be better in multiple ways:

- Intuitive UI - NiFi designer. Dataflows can become quite complex. Being able to visualize those flows and express them visually can help greatly to reduce that complexity and to identify areas that need to be simplified. NiFi enables not only the visual establishment of dataflows but it does so in real time. Rather than being design-and-deploy, it is much more like molding clay.

- Better support for external sources - comparing linked services in Azure Data Factory with processors in NiFi, I have seen NiFi come out better; the list can be found here.

- NiFi is highly fault tolerant

- Superior Exception Handling – finer details here.