Sunday 7 April 2013

Large-scale Implementation on Azure Platform


I have been closely reading the material from the Azure CAT (Customer Advisory Team), which helps a lot of customers deliver large, complex projects. The Azure CAT site, which can be found here, offers guidance on how to use different architectural artefacts in Azure. After some digging, this is what I have come to understand about some of the largest implementations on the Azure platform. It does involve some amount of reverse engineering and research. I hope this is helpful to readers.

So what has Azure handled so far from a numbers standpoint?

There are various architectures which Azure addresses, but in a nutshell it's enterprise ready. From a numbers standpoint, this is what Azure has handled so far; each line item below is an individual application.

  • Largest sharded SQL database – 20 TB. SQL Azure has a maximum container size of 150 GB, so multiple containers have been used to get around the size limit. Of course, the querying strategy around this has to be well defined.
  • Largest number of databases – 11,000
  • Largest number of worker role instances – 24,000. This is an on-demand application which spins up 24k worker role instances to perform some complex calculation and then shuts down. This is more like HPC for complex algorithms.
  • Largest customer application – 50 PB
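As a rough check on the sharding figure above, the minimum number of 150 GB containers needed to hold 20 TB can be worked out directly. This is a back-of-the-envelope sketch in Python, not a description of the actual deployment: the function name is made up for illustration, and a real system would presumably use more shards than the minimum to leave headroom.

```python
import math

GB_PER_TB = 1024  # assumption: binary units; use 1000 for decimal units


def min_shards(total_gb: float, max_shard_gb: float) -> int:
    """Smallest number of shards such that no shard exceeds max_shard_gb."""
    return math.ceil(total_gb / max_shard_gb)


if __name__ == "__main__":
    # 20 TB spread over containers capped at 150 GB each
    print(min_shards(20 * GB_PER_TB, 150))
    # Same calculation with decimal units (1 TB = 1000 GB)
    print(min_shards(20 * 1000, 150))
```

Either way the answer lands in the mid-130s, which is why the querying strategy across shards matters so much.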

So what are the largest case studies in Azure?

Florida Presidential Election 2012:

This is not the largest, but it was highly mission critical. Find the complete write-up here.

The application was mission critical for a couple of days, with some very high volume numbers to manage.

The Florida Presidential Election 2012 application had the following metrics:

  • Peak of 40k page hits/second
  • Peak of 6 million hits in 1 hour
  • Caching in front of the DB was a big architectural success. Given the short-lived nature of the application, it made sense to push as much data as possible into the cache and have a separate update strategy:
    • 3 minute TTL
    • Separate Worker Role to refresh cache.

What does the architecture look like?



Enight Florida Presidential Election – below is a layer-by-layer decomposition of the architecture.

Presentation Layer:

Main URL – uses Azure Traffic Manager with availability and performance failover between the primary site (US East Coast) and US North Central.

The primary site of the application was hosted in US East Coast.
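The failover behaviour described above can be pictured as a priority list with health checks. Here is a minimal sketch in Python, assuming hypothetical endpoint names and a caller-supplied health probe. The real Traffic Manager makes this decision at the DNS level, and this is not its API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical endpoint labels, in priority order (primary first).
ENDPOINTS = ("enight-us-east", "enight-us-north-central")


def pick_endpoint(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    down = {"enight-us-east"}  # simulate an outage of the primary site
    print(pick_endpoint(ENDPOINTS, lambda e: e not in down))
```

With the primary marked as down, traffic goes to the secondary; once the probe reports the primary healthy again, it is picked first.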

Enight Public Site: This is a set of web roles hosting voter turnout, statewide Florida election results and other real-time election-related data. The application had a lot of real-time data which was expected to come in from various data sources. The web role reads the data from the cache if it is still valid; if not, it gets the data from the database.

Enight SOE: At a very high level, the Enight SOE acts as a data synchronizer between the primary and the secondary. Additionally, it reads the data from the blobs and pushes it to the secondary.

Caching Layer

Azure dedicated cache roles were used. Given the very high frequency of reads and writes in a short period of time, it was advisable to go with an in-memory store, i.e. a cache. All the entities remain in the cache until the 3-minute TTL expires, and a separate worker role refreshes the cache. The SQL Azure database reads and writes probably ended up using a CQRS-based pattern, though this is a guess.
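To make the caching idea concrete, here is a minimal in-process sketch in Python of a TTL cache with a read-through helper and a refresh routine of the kind a separate worker might run. It is only an illustration under stated assumptions: the class and function names are invented, the real application used Azure cache roles rather than a Python dictionary, and `load_from_db` stands in for whatever query the site actually ran.

```python
import time
from typing import Any, Callable, Dict, Iterable, Tuple


class TtlCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Any:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return value


def read_through(cache: TtlCache, key: str, load_from_db: Callable[[str], Any]) -> Any:
    """Web role path: serve from the cache if valid, otherwise load and cache."""
    value = cache.get(key)
    if value is None:  # note: a stored None is treated as a miss in this sketch
        value = load_from_db(key)
        cache.put(key, value)
    return value


def refresh_all(cache: TtlCache, keys: Iterable[str], load_from_db: Callable[[str], Any]) -> None:
    """Worker path: reload every key so readers rarely hit the database."""
    for key in keys:
        cache.put(key, load_from_db(key))


if __name__ == "__main__":
    cache = TtlCache(ttl_seconds=180)  # the 3-minute TTL mentioned above
    print(read_through(cache, "turnout", lambda k: f"fresh {k}"))
```

If the refresh worker runs more often than the TTL, the read path almost never falls through to the database, which is the point of the design.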

Database Layer

The storage layer comprises SQL Azure and Blob Storage (for data coming in from various counties).

Identity & Access Management

The IDM used here is the standard one provided by Azure Access Control Service. The out-of-the-box features of Azure ACS, such as integration with Facebook and other social media, were used in the typical way.

Azure Auto Scaling Application

Given that peak traffic could reach 40k page hits/second, a good auto scaling application must have been required to get the elastic advantage.

* The monitoring tools used were most probably from Cerebrata.

Bing Games

MSFT has finally got to a culture of eating its own dog food. Bing Games is one such example, where the scoring, tracking and ranking system ran entirely on Azure. Find the articles here.

For some reason the case study has been removed from the MSFT site. Thanks to Google, I managed to get the cached page here.

Below are the high level metrics

  • 1900 instances – app servers (various roles)
  • 398 SQL Azure databases – scale out
  • 30 million unique users/ month
  • 200k concurrent users
  • 3 months, 7 developers.
  • 99.975% uptime in past 12 months
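To put the uptime figure in perspective, it can be converted into allowed downtime over the 12-month period. A small Python sketch follows; the helper name is made up, and a non-leap year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes(uptime_percent: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    print(round(downtime_minutes(99.975), 1))
```

That works out to roughly 131 minutes, a little over two hours of downtime in a year.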

Why did they pick SQL Azure?

Given the database access patterns, it was better to pick SQL Azure over Table Storage. This is not a de facto standard, however; it depends on the requirement. The choice of SQL Azure was also based on the fact that they had an existing application already in SQL, and on testability, i.e. it was easy to pre-populate the database with millions of records quickly.

Partition Strategy

Each user's data remains in the same database, so scaling out based on users was easier and faster. The partitioning strategy is static by nature.
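A static, user-based partitioning scheme like this can be sketched as a fixed lookup from user id ranges to databases. The Python below is only an illustration: the range boundaries and database names are hypothetical, and the case study does not say how Bing Games actually mapped users to its databases.

```python
from bisect import bisect_right

# Hypothetical static partition map: (first user id in range, database name).
# Entries must be sorted by the first user id.
PARTITIONS = [
    (0, "games_db_000"),
    (100_000, "games_db_001"),
    (200_000, "games_db_002"),
]
_RANGE_STARTS = [start for start, _ in PARTITIONS]


def database_for_user(user_id: int) -> str:
    """Return the database holding all data for this user."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    # Find the last range whose start is <= user_id.
    index = bisect_right(_RANGE_STARTS, user_id) - 1
    return PARTITIONS[index][1]


if __name__ == "__main__":
    print(database_for_user(150_000))
```

Because the map never changes at runtime, every request for a given user goes to one database and no cross-database query is needed for that user's data. The trade-off is that rebalancing means editing the map and moving data by hand.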

Production Statistics

  • 1200 SQL Azure database requests/second spread across all partitions during peak loads.
  • 18k connections in the connection pool, which could grow with traffic
  • 90-10 reads vs. writes

Glassboard

Glassboard is similar to Facebook; it is a private social network for groups. This is a mobile-only application.

The Initial Architecture


The initial architecture is pretty straightforward: a REST-based API on top of table, queue or blob storage, with additional components such as social collaboration and analytics.

My personal opinion is that this architecture is not quite right. In the section below I explain why; the Glassboard folks have found and corrected the same.

The Wrong Architecture for Devices – Which Would Otherwise Be Correct in All Cases...

What started out as REST API based programming turned out to be a total disaster, as the underlying storage was table storage. It's not that table storage itself was a disaster. It's just that when we design in the very academic way, with an API for every call, we soon realize that the cost of each table read starts hitting the pocket. With a little Fiddler exercise one can see the same.

<excerpt from Glassboard site>

To demonstrate the problem I have set up my Glassboard API instance to do no caching whatsoever. I then ran a unit test which simulates a user getting his Newsfeed, then posting a status. I highlighted each set of repeated calls in a different color.

Repeated Storage Requests

</excerpt from Glassboard site> Find the complete commentary here

There is no need to be alarmed: the architectural change of going the feeds way, i.e. one feed-like call which returns pretty much all the data at startup, together with caching, helped.

Especially when getting into device-based programming, we need to bring in some of the client-server concepts; this helps a ton.
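The effect of repeated storage requests, and what a simple per-request cache does to them, can be shown with a toy example. This is a Python sketch with invented names: `CountingStore` stands in for a table storage client and simply counts reads, and it is not the Glassboard code.

```python
from typing import Any, Dict, List, Tuple


class CountingStore:
    """Stand-in for a table storage client that counts every read."""

    def __init__(self, rows: Dict[str, Any]):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, key: str) -> Any:
        self.reads += 1
        return self._rows[key]


class RequestCache:
    """Remembers values read during one request so each key is fetched once."""

    def __init__(self, store: CountingStore):
        self._store = store
        self._seen: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        if key not in self._seen:
            self._seen[key] = self._store.get(key)
        return self._seen[key]


def render_feed(reader: Any, posts: List[Tuple[str, str]]) -> List[str]:
    """Look up the author's display name for every post in the feed."""
    return [f"{reader.get(author_id)}: {text}" for author_id, text in posts]


if __name__ == "__main__":
    posts = [("u1", "hello"), ("u2", "hi"), ("u1", "anyone there?")]
    store = CountingStore({"u1": "Ann", "u2": "Bob"})
    render_feed(store, posts)
    print("reads without cache:", store.reads)

    store = CountingStore({"u1": "Ann", "u2": "Bob"})
    render_feed(RequestCache(store), posts)
    print("reads with cache:", store.reads)
```

The output is the same either way; only the number of billable reads changes, and the gap grows with the number of posts by the same author in a feed.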

The solution is here.

Samsung – Worldwide TV Management

Samsung SMART TV started as a concept and is now a reality. The device has base software which needs to be updated from time to time. The cloud was the perfect solution, assuming these TVs are connected to the internet. Azure or AWS became the obvious choice; not knowing which way sales would go for Samsung, betting the update and management solutions on Platform as a Service was a good decision. Some key features:

  • Frequent updates with new applications and software changes for better support and compatibility.
  • Due to high sales, a scalable and elastic system was required.
  • Utilized 20 large-size web roles with ASP.NET.
  • Has a good Web API layer for the same functionality; find the developer documentation here.
  • Caching seems to be the one candidate which all Azure applications need to come to the rescue.

Solution Architecture


The Architecture is fairly straightforward.

  • Firmware Download Website – A set of web roles which connect with the Smart TV via a set of REST-based APIs. Post authentication, they check for updates and push the updates to the device.
  • Administration Firmware Upload Website – A set of web roles which provide basic administration and reporting functionality.
  • Worker Role – Does task automation: firmware encryption and batch updates. Additionally, it pushes the logs to the SQL Azure database. The updates go to blob storage and use the Azure CDN functionality to push to edge servers.


MYOB (Mind Your Own Business) is a large Australia-based ISV developing accounting software for small businesses. The new release, AccountRight Live, lets users run their accounts on a PC or in the cloud, or both at once, depending on their preferences. The hybrid arrangement was not, however, a cunning innovation designed to catapult the company and its customers into a bold and cloudy future; MYOB decided on rolling out a hybrid strategy based on surveys run with its customers. The user count is about 150k. Find the link to the site here.

Based on the multi-tenancy requirement, each customer required a separate database. Caching is used as a standard feature, with reserved CPU per database.

MYOB Solution Architecture

  • Each user installs the client software via a box offering
  • Choice to use the business and data tier either on Azure or on premises
  • The application is developed in C#/.NET using LINQ to SQL and Entity Framework, which in my opinion is very bad. LINQ is a fairly single-threaded process which works very well on an Intel processor with a high clock rate. On a web or worker role running on an AMD processor with a very low clock rate, the performance on Azure will be slower. The workaround is to run LINQ on a small instance with a single core.
  • Databases on premises and on Azure are kept in sync via Sync Framework
  • Each customer has their own Azure SQL Database per business entity.
  • Databases are backed up nightly using the DAC Import/Export service, keeping 2 days of rolling backup files in blob storage.
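The rolling backup rule in the last bullet boils down to "keep the newest two files per database, delete the rest". Here is a small Python sketch of that rule, assuming hypothetical blob names that embed an ISO date so that sorting by name is also sorting by time. The naming scheme is an assumption, not something the MYOB material describes.

```python
from typing import Iterable, List, Tuple


def split_backups(blob_names: Iterable[str], keep: int = 2) -> Tuple[List[str], List[str]]:
    """Split backup blobs for one database into (kept, to_delete), newest first.

    Assumes all names share a prefix and embed an ISO date (YYYY-MM-DD),
    so that lexicographic order matches chronological order.
    """
    ordered = sorted(blob_names, reverse=True)
    return ordered[:keep], ordered[keep:]


if __name__ == "__main__":
    # Hypothetical backup file names for one customer database
    blobs = [
        "tenant42-2013-04-04.bacpac",
        "tenant42-2013-04-06.bacpac",
        "tenant42-2013-04-05.bacpac",
        "tenant42-2013-04-07.bacpac",
    ]
    kept, stale = split_backups(blobs)
    print("keep:", kept)
    print("delete:", stale)
```

With one database per customer this has to run per database, so the job scales with the number of tenants.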


The identity management used is Azure ACS (STS inbuilt), most probably with a custom identity provider and the option of Windows Live. With some guesswork, I see MYOB identity management using ADFS to integrate with corporate Active Directory as well.

A simple layered architecture.

  • User Interfaces – These can be a browser or a desktop thick client.
  • Services Layer – This includes the Collaboration Services, Authentication Service, Customer File Service and Huxley Services (transactional services). All services are exposed as REST APIs. After authenticating through Azure ACS and receiving the ACS token, the desktop client connects with the Collaboration Service. The Collaboration Service validates with the billing system and after that connects to the correct user database, depicted as User DB1… DBn.
  • Storage Services – This is basically a set of SQL Azure databases.

What I love about this architecture is its simplicity. Keeping tenancy at the database level may not be the most economical solution, but it's simple.

Caching has been used at every layer. Judging from the speeds, they seem to have a set of dedicated caching servers; how many is guesswork again.

The MYOB implementation had some key lessons to learn. What are they?

  • Cloud Platforms
    • Enable massive scalability
    • HA at lower costs
    • Expose rich cloud-based APIs
  • Identity Foundation
    • Well integrated with WCF and highly customisable
  • Scaling Database
    • Sharding is the foundation

Issues with WCF throttling were handled with different architectural solutions. Some of the solutions are here.

Key Takeaways 

A couple of key areas to watch out for in Azure applications:

  • If mobile is a part of the overall architecture, special considerations are required; we'll have a separate post on the same.
  • Study the storage choice properly (Tables vs. SQL Azure); there is no straightforward answer, it varies. Every SQL Azure database has been allocated a maximum of 180 concurrent threads.
  • Use caching wherever possible. This is an architectural decision, not a developer decision.
  • Code right – profile the code as much as possible.
  • SQL Azure is a relational database as a service. SQL Server is the core engine and SQL Azure is a logical abstraction over the same. SQL Azure supports a subset of SQL Server features. It provides tremendous scale-out features.

