Ajay Solanki

Sunday, 23 June 2013

IaaS Inherent Part of the PaaS Architecture

We wish for a perfect world, but honestly it exists only in utopian terms. An architect realizes the same while architecting a solution for a cloud PaaS: there is no perfect architecture, only an architecture that fits the bill for the moment given the shortcomings.

Normally, while migrating an on-premise application to a cloud PaaS, we run into multiple scenarios where PaaS is not a perfect fit. One soon starts to ponder the alternatives, and the easiest is “have that application or component built and deployed on IaaS”, as this would be very similar to what we already have on premise.

So IaaS is a solution for the short term. With Windows Azure correcting its mistake by bringing persistent IaaS in 2012, and providing better basic features like clustering and support early this year, IaaS does become an attractive choice.

So what does Persistent VM in Azure really have to offer?

Storage: Persistent storage; new storage is easy to add.
Deployment: Build the VHD in the cloud, or build it on premise and deploy it.
Networking: Internal endpoints are open by default. Access control is handled with a firewall or the guest OS. Input endpoints are controlled through the management portal, services and API.
Primary use: Applications that require persistent storage can easily run on Windows Azure.

What OS images does Azure IaaS come with?

Windows Server 2008 R2
Windows Server 2008 R2 with SQL Server 2012 Evaluation
Windows Server 2008 R2 with BizTalk Server 2012
Windows Server 2012
openSUSE 12.1
CentOS 6.2
Ubuntu 12.04
SUSE Linux Enterprise Server SP2

Which key Server Applications does Azure IaaS support?

SQL Server 2008, 2008 R2, 2012. Note: SQL Azure is a very stripped-down version of SQL Server, so if one is planning on using anything beyond transactional workloads (SSAS, SSRS, SSIS…), one has to look at SQL Server on IaaS.
SharePoint 2010, 2013 (assuming). Note: Given that SharePoint Online is a stripped-down version of SharePoint, one will have to look at SharePoint Server on IaaS for more functionality.
BizTalk Server 2010. The BizTalk PaaS is in its infancy, with very limited features for EDI/EAI integrations. A complete reference can be found here.
Windows Server 2008 R2, 2012

* The biggest workload on the cloud for any enterprise application is SQL Server on IaaS. This list is going to grow over time. There is also customer support for the above list.

What is the difference between Virtual Machines and Cloud Services?

The Virtual Machines that one creates are implicitly placed in Cloud Services. Cloud Services may appear to be segregated from VMs, but they are not.

To explain things better, let's take an example.

Let's assume we have a cloud service with a Web Role (3 instances) and a Worker Role (3 instances). The cloud service acts like a container consisting of the Web and Worker Roles:

It's a management container: when one deletes or updates the cloud service, the change applies to all the entities in it.
It's also a security boundary, i.e. roles in the same cloud service can interact with one another, which cannot be done across cloud services unless they explicitly allow it.
It's a network boundary: the roles are visible to each other on the network.

When one creates a Virtual Machine (which is a role with exactly one instance), it is placed in an implicit cloud service.

When one creates a VM, it appears in the VM section of the management portal and not under cloud services. The implicit cloud service is the DNS name which has been assigned to the virtual machine. For example, if one has created a first virtual machine with the name mymachinedemo and then creates a second virtual machine, choosing to connect it with an existing virtual machine, the portal gives a list of existing virtual machines to pick from, including mymachinedemo. So essentially the cloud service acts as a container for the virtual machines.

When one creates multiple virtual machines via the option “connect to an existing virtual machine”, the new virtual machine is placed under the same cloud service, and the DNS name then starts showing in the list of cloud services.

The hiding of the cloud service only happens in the portal.

Images and Disks, What are these?

Images are the base images provided by the create-from-gallery functionality, which offers a bunch of pre-existing operating system images. After creating a VM from an image, what you get is an OS disk, which is your specific operating system disk, and associated with it are data disks. By the way, these disks are writable disks for Virtual Machines. The VM sizes currently supported by MSFT are subject to change; additionally, 28 GB and 56 GB RAM sizes are available as well.

A data disk can go up to 1 TB in size. One can attach multiple data disks to one VM.

Images and disks are stored as Windows Azure Storage blobs. Data is triplicated, i.e. 3 copies are kept. Disk caching is also supported, in read and read-write modes.

The OS disk size is about 127 GB.

What is the availability story around virtual machines?

The service level agreement is 99.95% for multiple role instances (web and worker), which allows for 4.38 hours of downtime per year. Multiple role instances ideally means 2 VMs in the same role, so the idea is to have a minimum of 2 VMs in a role. What's included in the 99.95% is:

Compute hardware failure (disk, CPU, memory).
Data center failure: network and power failure.
Hardware upgrades and software maintenance, i.e. host OS updates.

What is not included? VM container crashes and guest OS updates.

What does this SLA mean for VMs?

It means that if one deploys 2 instances of the same virtual machine in the same cloud service (DNS name) with the same availability set, then one gets a 99.95% SLA.
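The 4.38 hours quoted for the 99.95% SLA is simple arithmetic, and it can be handy to redo it for other periods. The snippet below is a minimal sketch of that calculation; the function name is made up for illustration, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime (in hours) that still meets the given SLA percentage."""
    return (1 - sla_percent / 100) * period_hours


if __name__ == "__main__":
    # Yearly budget for the 99.95% SLA discussed above.
    print(round(allowed_downtime_hours(99.95), 2))  # 4.38
    # The same SLA expressed in minutes for a 30-day month.
    print(round(allowed_downtime_hours(99.95, 30 * 24) * 60, 1))  # 21.6
```

Seen per month, the budget is roughly 21.6 minutes, which is why a single VM with no second instance cannot be expected to meet it.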

What is the concept of availability set?

By default, for every role which has 2 instances, Windows Azure places the instances in 2 different fault domains and 2 different update domains, i.e. if you have defined fault and update domains. A fault domain is defined on the basis of a single point of failure; in this case it's the top-of-rack router.

A fault domain represents a group of resources which are anticipated to fail together, i.e. the same rack or the same server. The fabric spreads instances across at least 2 fault domains.

An update domain represents a group of resources that can be updated together.

The availability set comes with the same concept of fault and update domains. So for example, if you had 2 instances of the same VM defined in an availability set, each instance lands in a different fault domain and a different update domain, which is the bare minimum. No extra instances are created; the same 2 VMs are simply spread out so that a single failure or update does not take both down.
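To make the idea of spreading instances concrete, here is a small illustrative sketch. It is not the actual fabric controller algorithm; it simply assigns instances round-robin and shows what survives when one fault domain goes down. The function names and the default of 5 update domains are assumptions made for the example.

```python
def place_instances(instance_count: int, fault_domains: int = 2, update_domains: int = 5):
    """Assign each instance to a fault domain and an update domain, round-robin."""
    return [
        {
            "instance": i,
            "fault_domain": i % fault_domains,
            "update_domain": i % update_domains,
        }
        for i in range(instance_count)
    ]


def survivors(placement, failed_fault_domain: int):
    """Instances still running if one whole fault domain (e.g. a rack) fails."""
    return [p["instance"] for p in placement if p["fault_domain"] != failed_fault_domain]


if __name__ == "__main__":
    placement = place_instances(2)
    for entry in placement:
        print(entry)
    # Losing fault domain 0 still leaves one instance serving traffic.
    print(survivors(placement, failed_fault_domain=0))  # [1]
```

With a single instance there is nothing left after a fault domain failure, which is the reason the SLA asks for at least 2 instances in the same availability set.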

The story would be incomplete without the networking capabilities of Azure IaaS. What are the options?

So what has MSFT done for Azure IaaS networking? Some of the features include:

Full control over machine names.
Windows Azure provided DNS: resolves VMs by name within the same cloud service. Machine names are modelled and explicitly published in the DNS service.
Use of an on-premise DNS server.

Note: In PaaS, Web and Worker role communication happens via messaging; in the VM world it's a DNS lookup.

Protocols Supported

UDP traffic is supported: incoming traffic is load balanced and outbound traffic is allowed.
All IP-based protocols are supported for VM-to-VM (instance-to-instance) communication: TCP, UDP and ICMP.
Port forwarding: direct communication to multiple VMs in the same cloud service.
Custom load balancer health probes: health checks with probe timeouts. HTTP-based probing allows granular control of health checks.

Load Balanced Sets for IaaS

Similar to availability sets are load-balanced sets, which allow a set of VMs within the same cloud service to be load balanced.

Load Balance with Custom Probes

In IaaS there are no agents installed on the VM, so there was a requirement to define a point which could be probed, for example /health.aspx as the probe path. If the probe gets an HTTP 200, the load balancer assumes everything is healthy.
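The probing idea can be sketched in a few lines. The example below is only an illustration under assumptions: a tiny local HTTP server stands in for the application on the VM, and `is_healthy` plays the role of the load balancer probe with a timeout. The names and the 2-second timeout are made up; only the /health.aspx path and the "HTTP 200 means healthy" rule come from the description above.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in for the application: answers the probe path with HTTP 200."""

    def do_GET(self):
        status = 200 if self.path == "/health.aspx" else 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):  # keep the demo output quiet
        pass


# Ignore any system proxy settings so the probe hits the endpoint directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str, timeout_seconds: float = 2.0) -> bool:
    """Healthy only if the endpoint answers HTTP 200 within the timeout."""
    try:
        with _opener.open(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # error status, refused connection or timeout


def start_demo_server() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_demo_server()
    port = server.server_address[1]
    print(is_healthy(f"http://127.0.0.1:{port}/health.aspx"))
    print(is_healthy(f"http://127.0.0.1:{port}/missing"))
    server.shutdown()
```

In a real application the health page would typically check its own dependencies (database, cache) before answering 200, so that an unhealthy VM is taken out of rotation.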

Cross Premise Connectivity

For connecting with an on-premise Active Directory or an on-premise network, Windows Azure Connect has been around, but it has not been very well accepted. Windows Azure Connect uses an IPsec tunnelling concept, with an agent hosted on both of the machines which need to communicate. If one is doing a domain join with an Azure VM, the problem has been that the agent has to be installed on the DC, which has not gone down well with many.

Alternatively, for site-to-site connectivity, Windows Azure Network came into play. It provides a virtual network and a gateway. The gateway uses a standard VPN device.

What would one need to take care when migrating application to Azure IaaS?

SQL Server installed on IaaS needs to be clustered, so having 2 instances of SQL Server in the same cloud service will help. One may need to size up the data disks required.
Load balancing support is built in, so if you deploy a web application on IaaS you are relieved of setting up a load balancer.
Integrated management and monitoring for VMs is provided by Azure itself.
Fault and upgrade domains for all VMs are a must.
Windows Azure Network can be used to connect with an on-premise application or for a domain join.
Hourly Billing Support: In addition to making it easier and faster to get started, these SQL Server and BizTalk Server images also enable an hourly billing model which means you don’t have to pay for an upfront license of these server products – instead you can deploy the images and pay an additional hourly rate above the standard OS rate for the hours you run the software. This provides a very flexible way to get started with no upfront costs (instead you pay only for what you use). You can learn more about the hourly rates.
Workloads to be shifted to the cloud need to be looked at from a compute, storage and networking standpoint.

The philosophy for the cloud world is to lift and shift workloads.

As of 2013 we still need to move a lot of applications to the cloud as is, and IaaS is an inherent part of the overall PaaS architecture. Maybe in years to come it will be a complete PaaS architecture.

Wednesday, 22 May 2013

Mobile Devices Application Architecture

Mobile platforms are increasingly finding their rightful place in the enterprise. Mobile platforms vary from insignificant non-UI devices to full-fledged tablets and smart TVs. Compute and storage are fast-moving aspects of mobile device capabilities, and building applications for these platforms can be a daunting task for developers. Most mobile devices have connectivity in some form or other, and the applications built for devices connect with services which are hosted on premise or in the cloud. Concentrating on the mobile device architecture, it would be good to give developers some guidance on what constitutes a typical mobile architecture.

High Level Device Architecture

Since most mobile devices are currently UI intensive, the starting point is the UI architecture.

The high-level mobile/device application architecture includes the following components. It's typically a layered architecture.

Presentation Layer

The UI architecture is heavily driven by certain patterns.

MVC: The Model View Controller is an architectural pattern that divides an interactive application into 3 components. The model contains the core functionality and data. The view displays information to the user. The controller handles the user input. View and controller together comprise the user interface. A change propagation mechanism ensures consistency between the user interface and the model. This architectural pattern first came up with Smalltalk-80; since then a large number of UI frameworks have been built around MVC, and it has become a de facto standard for UI development. For more on MVC refer here.
Delegation: Most user interfaces on mobile devices are rich from a functionality standpoint and delegate, i.e. transfer information, data and processing, to another object, typically referred to as a background object. Delegation is a design pattern in object-oriented programming where an object, instead of performing one of its stated tasks, delegates that task to an associated helper object. There is an inversion of responsibility in which a helper object, known as a delegate, is given the responsibility to execute a task for the delegator. The delegation pattern is one of the fundamental abstraction patterns that underlie other software patterns such as composition (also referred to as aggregation), mixins and aspects. For more on delegation refer here.
Target-action: The user interface is divided up into view and controller objects, which dynamically establish relationships by telling each other which object they should target and what action or message to send to that target when an event occurs. This is especially useful when implementing graphical user interfaces, which are by nature event-driven. Most UIs on devices are event driven: user input raises the desired event, and the event handler associated with it is executed.
Block objects: Most device-based applications interact with services or other applications for data and more. Asynchronous callbacks help save compute. The services have to be modelled differently for devices; that's a whole different topic which will be covered separately.
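As a rough illustration of the first pattern in the list above, the sketch below shows MVC with a simple change propagation mechanism. It is platform neutral and deliberately tiny; the class and method names are made up for the example and do not belong to any particular UI framework.

```python
class Model:
    """Core data plus the change propagation mechanism."""

    def __init__(self):
        self._items = []
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    def add_item(self, item: str):
        self._items.append(item)
        for notify in self._observers:  # propagate the change to every view
            notify(list(self._items))


class View:
    """Displays information; here it only records what it would render."""

    def __init__(self):
        self.rendered = ""

    def refresh(self, items):
        self.rendered = ", ".join(items)


class Controller:
    """Turns user input into operations on the model."""

    def __init__(self, model: Model):
        self._model = model

    def on_user_input(self, text: str):
        cleaned = text.strip()
        if cleaned:  # ignore empty input
            self._model.add_item(cleaned)


if __name__ == "__main__":
    model, view = Model(), View()
    model.subscribe(view.refresh)
    controller = Controller(model)
    controller.on_user_input(" milk ")
    controller.on_user_input("eggs")
    print(view.rendered)  # milk, eggs
```

Note that the view never talks to the controller directly and the model knows nothing about rendering; the callback registered with `subscribe` is the only link, which is what keeps the user interface consistent with the model.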

Management Layer

Management is a crucial piece on devices, ranging from memory management and state management all the way to device management. Below are some of the key components.

Memory Management

Devices have less usable memory and storage in comparison to a desktop computer, so all applications need to be very aggressive about deleting unneeded objects and lazy about creating objects.

Foreground and Background Application Management

Applications on a device have to be managed differently in the foreground and in the background. The operating system limits what your application can do in the background in order to improve battery life and the user experience with the foreground application. The OS notifies your application when it moves from background to foreground, which requires special handling for data loading and UI refresh. So what does one really have to take care of when transitioning between these states?

Moving to foreground: Respond appropriately to the state transitions that occur. Not handling these transitions properly can lead to data loss and a bad user experience.
Moving to background: Make sure your app adjusts its behaviour appropriately. Devices which support multitasking have this option; in other cases the application is terminated. The elementary steps are to take a snapshot image of the current user interface, save user data and information, and free up as much memory as possible. A background application continues to stay in memory until a low-memory situation occurs and the OS decides to kill it. Practically speaking, your app should remove strong references to objects as soon as they are no longer needed. Removing strong references gives the compiler the ability to release the objects right away so that the corresponding memory can be reclaimed. However, if you want to cache some objects to improve performance, you can wait until the app transitions to the background before removing references to them.

Examples of objects that you should remove strong references to as soon as possible include:

Image objects
Large media or data files that you can load again from disk
Any other objects that your app does not need and can recreate easily later
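A minimal sketch of this advice follows: objects are created lazily on first use and all strong references are dropped when the app moves to the background. The class, the `on_enter_background` hook and the loader function are hypothetical names for illustration; a real platform would call its own lifecycle callback.

```python
class ThumbnailCache:
    """Holds objects that are cheap to recreate, such as decoded images."""

    def __init__(self, loader):
        self._loader = loader  # function that can recreate an object on demand
        self._objects = {}

    def get(self, key):
        # Lazy creation: nothing is loaded until it is first asked for.
        if key not in self._objects:
            self._objects[key] = self._loader(key)
        return self._objects[key]

    def on_enter_background(self):
        # Drop every strong reference so the memory can be reclaimed.
        self._objects.clear()

    def __len__(self):
        return len(self._objects)


if __name__ == "__main__":
    cache = ThumbnailCache(lambda name: f"decoded image for {name}")
    cache.get("photo-1.png")
    cache.get("photo-2.png")
    print(len(cache))  # 2
    cache.on_enter_background()
    print(len(cache))  # 0
    print(cache.get("photo-1.png"))  # recreated on demand after returning
```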

Handling Interrupts

Devices are more complicated than one might think. A classic scenario is handling interrupts. When an alert-based interruption occurs, such as an incoming phone call, the app moves temporarily to the inactive state so that the system can prompt the user about how to proceed. The app remains in this state until the user dismisses the alert. At this point, the app either returns to the active state or moves to the background state. On most devices, the application does not receive notifications and other types of events while in this state. There needs to be some form of application state management.

State Management

Irrespective of what state the application is in (foreground, background or suspended), the application's data has to be stored and restored. Even if your app supports background execution, it cannot run forever. At some point, the system might need to terminate your app to free up memory for the current foreground app. However, the user should never have to care if an app is already running or was terminated. From the user's perspective, quitting an app should just seem like a temporary interruption. When the user returns to an app, that app should always return the user to the last point of use, so that the user can continue with whatever task was in progress. Preserving and restoring the view controllers and views is something which has to be implemented per application. State preservation needs to be considered for the following:

Application delegate object, which manages the top-level state.
View controller objects, which manage the overall state of the app's user interface.
Custom views and custom data.

State preservation and restoration is an opt-in feature and requires help from your app to work. When thinking about state preservation and restoration, it helps to separate the two processes first. State preservation occurs when your app moves to the background. The restoration process uses the preserved data to reconstitute your interface. The creation of actual objects is handled by your state management.
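The two processes can be sketched separately as below. This is a simplified, platform-neutral illustration: the state is a couple of made-up fields saved as JSON to a file, whereas a real mobile platform provides its own preservation hooks and storage.

```python
import json
import os
import tempfile


class AppState:
    """Top-level UI state that should survive termination in the background."""

    def __init__(self, path: str):
        self.path = path
        self.screen = "home"
        self.draft_text = ""

    def preserve(self):
        """Called when the app moves to the background."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump({"screen": self.screen, "draft_text": self.draft_text}, f)

    def restore(self) -> bool:
        """Called at launch; returns True if preserved state was found."""
        if not os.path.exists(self.path):
            return False
        with open(self.path, encoding="utf-8") as f:
            saved = json.load(f)
        self.screen = saved.get("screen", "home")
        self.draft_text = saved.get("draft_text", "")
        return True


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "state.json")

    running = AppState(state_file)
    running.screen, running.draft_text = "compose", "Hello wor"
    running.preserve()  # app goes to background and is later terminated

    relaunched = AppState(state_file)
    relaunched.restore()
    print(relaunched.screen, relaunched.draft_text)  # compose Hello wor
```

From the user's point of view the relaunched app opens on the compose screen with the half-written text intact, which is the "temporary interruption" experience described above.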

Core Data Management

The Model in the MVC of the Presentation Layer holds references to business data which may be displayed in the views. Models can end up holding a lot of data, which can become a major performance issue. Models should ideally load only the data most relevant to the scenario, with support for lazy loading. The responsibility of loading core business data into the business entities is handled by Core Data Management. Core Data is generally a schema-driven object graph management and persistence framework. Fundamentally, Core Data helps save model objects and retrieve model objects from the business layer.

Core Data provides an infrastructure for managing all the changes to your model objects. This gives you automatic support for undo and redo, and for maintaining reciprocal relationships between objects.
It allows you to keep just a subset of your model objects in memory at any given time.
It uses a schema to describe the model objects.
It allows you to maintain disjoint sets of edits of your objects. This is useful if you want to, for example, allow the user to make edits in one view that may be discarded without affecting data displayed in another view.
It has an infrastructure for data store versioning and migration. This lets you easily upgrade an old version of the user’s file to the current version.
It interacts with services or business-level APIs to perform CRUD on the business entities.

Core Data Management runs on background thread(s).

Application Resource Management

Aside from images and media files, most devices have a lot more capabilities, ranging from accelerometer, camera, Bluetooth, GPS, gyroscope, location services, telemetry, magnetometer, microphone and telephony to Wi-Fi. Most of these have platform-based APIs for access. In certain cases there may be a requirement to manage these resources, for example for a video telephony module. The APIs provided by the platform give direct access to these resources.

Services Helper Layer

The Services Helper Layer on the device serves the purpose of providing caching, API, logging and notification services. It functions as an assistant to the device, hiding the complexity of the inner workings and thus simplifying the programming model for the developer, who does not have to worry about low-level functions.

Caching Services

An application running on a device has limited memory and compute. Most UI elements, pages and data may need to be in memory even while they are not displayed. An efficient disk- and memory-based caching strategy needs to be implemented for the application for better performance.
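One possible shape for such a strategy is a two-tier cache: a small memory tier that evicts its least recently used entries to disk, and a disk tier that is consulted on a memory miss. The sketch below is only an illustration under assumptions: the class name, the capacity of 2 and the use of `pickle` files are choices made for the example, and keys are assumed to be simple strings that are safe as file names.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TwoTierCache:
    """Small in-memory LRU tier backed by files on disk."""

    def __init__(self, directory: str, memory_capacity: int = 2):
        self._dir = directory
        self._capacity = memory_capacity
        self._memory = OrderedDict()  # oldest entry first

    def _path(self, key: str) -> str:
        return os.path.join(self._dir, f"{key}.cache")

    def in_memory(self, key: str) -> bool:
        return key in self._memory

    def put(self, key: str, value):
        self._memory[key] = value
        self._memory.move_to_end(key)  # mark as most recently used
        while len(self._memory) > self._capacity:
            old_key, old_value = self._memory.popitem(last=False)
            with open(self._path(old_key), "wb") as f:  # spill to disk
                pickle.dump(old_value, f)

    def get(self, key: str):
        if key in self._memory:
            self._memory.move_to_end(key)
            return self._memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.put(key, value)  # promote back into memory
            return value
        return None  # not cached anywhere


if __name__ == "__main__":
    cache = TwoTierCache(tempfile.mkdtemp(), memory_capacity=2)
    for name in ["home", "profile", "settings"]:
        cache.put(name, f"rendered page: {name}")
    print(cache.in_memory("home"))  # False, it was spilled to disk
    print(cache.get("home"))        # rendered page: home
```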

API Services

In the cloud world or the on-premise world, devices are likely to talk via services. These services are in fact well-defined APIs of the business. APIfying the services is a core concept on the cloud which will be covered later. A standard API is

The API Services will interact with the services API and manage the lifecycle of each service call asynchronously. It will manage aspects such as format conversion, lazy loading and much more.

Logging Services

This is a generic service which provides application-level and business-level logging. The data collected is sent back to the services, where it serves multiple purposes.

Notification Services

This is an asynchronous module which manages application-based notifications and events from the services and other applications.

Friday, 3 May 2013

Azure SDK 2.0 - Quick Snapshot

Azure SDK 2.0 was released 2 days ago with pretty good features: a lot of work done in the diagnostics area (this was long overdue), high-memory VMs, and the wonderful message pump in Service Bus.

Here is a quick snapshot of the important features - Blog to Azure 2.0 Features

Streaming diagnostic logging is a good feature, though it looks limited to Web Sites. This may change our logging capabilities.
Cloud Services support for high-memory VM instances: 4 core x 28 GB RAM (A6) and 8 core x 56 GB RAM (A7) VM sizes.
Faster deployment support with the simultaneous update option: this is a parallel update, so if your cloud package consists of multiple web and worker roles, they are updated in parallel as opposed to the earlier sequential update.
Cloud Services also has a separate diagnostics tab. The custom plan is pretty rich and enables fine-grained control over error levels, performance counters, infrastructure logs, collection intervals and more.
View diagnostics on a live service: the interesting part here is that one can dynamically turn on detailed diagnostic capturing without having to redeploy.
Storage 2.0 looks a little improved, with the capability to create and delete Windows Azure tables from the Visual Studio Server Explorer.
Service Bus enhancement: message browse support lets one view the messages available in a queue without locking a message or performing an explicit receive operation on it.

New message pump programming model: similar to an event-driven or push-based processing model, it supports concurrent message processing and enables processing messages at a variable rate. This replaces the polling-based mechanism with a pure event-driven approach.
// Example code for the message pump programming model
var eventDrivenMessagingOptions = new OnMessageOptions();
eventDrivenMessagingOptions.AutoComplete = true; // Complete (remove) the message from the queue once the handler returns successfully
eventDrivenMessagingOptions.ExceptionReceived += OnExceptionReceived; // Cleaner exception management handler
eventDrivenMessagingOptions.MaxConcurrentCalls = 5; // Multiple readers at the same time
// Subscribe for messages.
var queueClient = QueueClient.Create("customers");
queueClient.OnMessage(OnMessageArrived, eventDrivenMessagingOptions); // OnMessageArrived is the handler invoked when a message arrives

Find an example of the same here: Message Pump Programming Sample
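To show the idea behind the pump without the Service Bus SDK, here is a language-neutral sketch in Python. It is not the Service Bus API: an in-process `queue.Queue` stands in for the "customers" queue, and the `MessagePump` class with its `on_message` method is a made-up analogue of registering a handler with a concurrency limit and an exception callback.

```python
import queue
import threading


class MessagePump:
    """Pushes queued messages to a handler instead of making the caller poll."""

    def __init__(self, source: queue.Queue):
        self._source = source
        self._workers = []

    def on_message(self, handler, on_exception, max_concurrent_calls: int = 5):
        # One worker per allowed concurrent call; each blocks until a message arrives.
        for _ in range(max_concurrent_calls):
            worker = threading.Thread(
                target=self._run, args=(handler, on_exception), daemon=True
            )
            worker.start()
            self._workers.append(worker)

    def _run(self, handler, on_exception):
        while True:
            message = self._source.get()
            try:
                handler(message)
            except Exception as error:  # route failures to one central handler
                on_exception(message, error)
            finally:
                self._source.task_done()  # mark the message as dealt with


if __name__ == "__main__":
    customers = queue.Queue()
    handled, failures = [], []

    def on_message_arrived(message):
        if message == "bad":
            raise ValueError("cannot process")
        handled.append(message)

    pump = MessagePump(customers)
    pump.on_message(on_message_arrived, lambda m, e: failures.append(m))
    for name in ["alice", "bob", "bad"]:
        customers.put(name)
    customers.join()  # wait until every message has been dealt with
    print(sorted(handled), failures)
```

One simplification to keep in mind: here a failed message is simply dropped after the exception callback runs, whereas a real broker would normally make it available for another delivery attempt.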

PowerShell Advancements

Additionals

Sunday, 7 April 2013

Large-scale Implementation on Azure Platform

I have been closely reading the Azure CAT (Customer Advisory Team) material; the team helps a lot of customers deliver large, complex projects. The Azure CAT site can be found here, with guidance on how to use different architectural artefacts in Azure. After some digging, this is what I have come to understand about some of the largest implementations on the Azure platform. This is, however, based on some amount of reverse engineering and research. Hope this is helpful to the readers.

So what has Azure handled so far from a numbers standpoint?

There can be various architectures which Azure addresses, but in a nutshell it's enterprise ready. From a numbers standpoint, this is what Azure has handled, where each line item below is an individual application:

Largest sharded SQL database: 20 TB. SQL Azure has a maximum container size of 150 GB, so multiple containers have been used to resolve the size issue. Of course, the querying strategy around this has to be well defined.
Largest number of databases: 11,000.
Largest number of worker instances: 24,000. This is an on-demand application which spins up 24k worker role instances to perform some complex calculation and then shuts down. This is more like HPC to address complex algorithms.
Largest customer application: 50 PB.
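To get a feel for the sharded database figure, the sketch below works out the minimum number of 150 GB containers needed for 20 TB and shows one common way of routing a key to a shard. The hash-based routing is an assumption for illustration; the actual application's sharding and querying strategy is not described in the source. 1 TB is taken as 1024 GB.

```python
import hashlib
import math


def shards_needed(total_gb: float, max_shard_gb: float) -> int:
    """Minimum number of containers if every one is filled to its limit."""
    return math.ceil(total_gb / max_shard_gb)


def shard_for(key: str, shard_count: int) -> int:
    """Route a key to a shard using a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    count = shards_needed(20 * 1024, 150)
    print(count)  # 137
    print(shard_for("customer-42", count))
```

In practice one would plan for more shards than this minimum, since containers are rarely filled evenly or right up to the limit.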

So what are the largest case studies in Azure?

Florida Presidential Election 2012:

This is not the largest, but it was highly mission critical. Find the complete write-up here.

It was mission critical for a couple of days, with some very high volume numbers to manage.

The Florida Presidential Election 2012 application had the following metrics:

Max peak of 40k page hits/second.
6 million hits at peak in 1 hour.
Caching in front of the DB was a big architectural success. The application being short lived by nature, it made sense to push as much data as possible into the cache and have a separate update strategy:
- 3 minute TTL
- Separate Worker Role to refresh the cache.

What does the architecture look like?

Enight Florida Presidential Election: below is the layered architectural explanation, from an architectural decomposition standpoint.

Presentation Layer:

Main URL - https://enlight.elections.myflorida.com, using Azure Traffic Manager for availability and performance failover between the primary site (US East Coast) and US North Central.

The primary site of the application was hosted on US East Coast.

Enight Public Site: This is a set of web roles hosting voter turnout, Florida state-wide election results and other real-time election-related data. The application had a lot of real-time data which was expected to come in from various data sources. The web role reads the data from the cache if it is valid; if not, it gets the data from the database.

Enight SOE: At a very high level, the Enight SOE acts as a data synchronizer between the primary and the secondary. Additionally, it reads the data from the blobs and also pushes it to the secondary.

Caching Layer

Azure dedicated cache roles were used. Given that the frequency of reads and writes was very high in a short period of time, it was advisable to go with an in-memory database, i.e. a cache. All the entities would remain in the cache until the 3-minute TTL expired, with a separate worker role to refresh the cache. The SQL Azure database reads and writes ended up using a CQRS-based pattern to manage them; this is a guess.
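The read path and the refresh path described here can be sketched as follows. This is an illustration of the general cache-aside idea with a TTL, not the election application's code: the class and method names are made up, the cache is an in-process dictionary rather than a dedicated cache role, and only the 3-minute (180-second) TTL comes from the text.

```python
import time


class TtlCache:
    """Cache-aside reads with a time-to-live, plus an explicit refresh hook."""

    def __init__(self, ttl_seconds: float = 180, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be simulated
        self._entries = {}   # key -> (value, expiry time)

    def get(self, key, load_from_database):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now < entry[1]:
            return entry[0]  # still valid: serve from the cache
        value = load_from_database(key)  # expired or missing: go to the database
        self._entries[key] = (value, now + self._ttl)
        return value

    def refresh(self, key, value):
        """What a separate refresher worker would call with fresh data."""
        self._entries[key] = (value, self._clock() + self._ttl)


if __name__ == "__main__":
    database_reads = []

    def load(key):
        database_reads.append(key)
        return f"results for {key}"

    cache = TtlCache(ttl_seconds=180)
    cache.get("county-1", load)
    cache.get("county-1", load)  # served from the cache
    print(len(database_reads))   # 1
```

With a refresher keeping hot entries fresh, most page hits never reach the database at all, which is what makes a 40k hits/second peak manageable.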

Database Layer

The storage layer comprises SQL Azure and Blob Storage (data coming in from various counties).

Identity & Access Management

The IDM used here is the standard one provided by Azure Access Control Service. The out-of-the-box features of Azure ACS, like integration with Facebook and other social media, were very typically used.

Azure Auto Scaling Application

Given that peak traffic could get to 40k page hits/second, a good auto scaling application must have been required to get the elastic advantage.

* The monitoring tools used were most probably from Cerebrata.

Bing Games

Bing Games: MSFT has finally got to a culture of eating its own dog food. Bing Games is one such example, where the score tracking and ranking system was entirely on Azure. Find the articles here.

For some reason the case study has been removed from the MSFT site. Thanks to Google I managed to get the cached page here.

Below are the high level metrics

1900 instances of app servers (various roles)
398 SQL Azure databases (scale out)
30 million unique users/ month
200k concurrent users
3 months, 7 developers.
99.975% uptime in past 12 months

Why did they pick SQL Azure Database?

Given the database access patterns, it was better to pick SQL Azure over Table Storage. This is, however, not a de facto standard; it depends on the requirement. The choice of SQL Azure was based on the fact that they had an existing application already on SQL, and on testability, i.e. it was easy to pre-populate with millions of records faster.

Partition Strategy

For each user, the data would remain in the same database, so scaling out based on users was easier and faster. The partitioning strategy is static by nature.

Production Statistics

1200 Azure database requests/second spread across all partitions during peak loads.
18k connections in the connection pool, which could grow with traffic.

Database

90-10 reads vs. writes

Glassboard.com

Glassboard.com is similar to Facebook; it is a private social network for groups. This is a mobile-only application.

The Initial Architecture

The initial architecture was pretty straightforward: a REST-based API and table, queue or blob storage, with additional components such as social collaboration and analytics.

My personal opinion is that this architecture is not quite right. In the section below I explain why; the Glassboard folks have found and corrected the same.

The Wrong Architecture for Devices – Which otherwise would be correct in all cases……..

What started as REST API based programming turned out to be a total disaster, as the underlying storage was table storage. It's not that table storage itself was a disaster. It's just that when we design the very academic way, we have an API for every call, and soon realize that the table cost for each read starts hitting the pocket. With a little Fiddler exercise one can see this:

To demonstrate the problem I have set up my Glassboard API instance to do no caching whatsoever. I then ran a unit test which simulates a user getting his Newsfeed, then posting a status. I highlighted each set of repeated calls in a different color.

</excerpt from Glassboard site> Find the complete commentary here

Not to be alarmed: the architectural change of going the feeds way, i.e. one call, like a feed, which returns pretty much all the data at startup, together with incorporating caching, helped.

Especially when getting into device-based programming, we need to bring in some of the client-server concepts; this helps a ton.
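The difference between the two styles can be illustrated with a toy model that counts billed storage reads. Everything here is hypothetical (the store, the keys and the two functions are invented for the example and are not Glassboard's code); the point is only that a per-item API repeats reads that a single feed-style call with caching performs once.

```python
class TableStore:
    """Stand-in for a billed table store: every read is counted."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self._rows[key]


def chatty_newsfeed(store, user_id, post_ids):
    """API-per-call style: the user row is read again for every post."""
    feed = []
    for post_id in post_ids:
        user = store.read(("user", user_id))
        post = store.read(("post", post_id))
        feed.append({"user": user, "post": post})
    return feed


def feed_newsfeed(store, user_id, post_ids, cache):
    """Feed style: one call assembles everything, reading through a cache."""

    def cached(key):
        if key not in cache:
            cache[key] = store.read(key)
        return cache[key]

    user = cached(("user", user_id))
    return [{"user": user, "post": cached(("post", p))} for p in post_ids]


if __name__ == "__main__":
    rows = {("user", 1): "Ajay", ("post", 10): "a", ("post", 11): "b", ("post", 12): "c"}

    chatty = TableStore(rows)
    chatty_newsfeed(chatty, 1, [10, 11, 12])
    print(chatty.reads)  # 6

    batched, cache = TableStore(rows), {}
    feed_newsfeed(batched, 1, [10, 11, 12], cache)
    feed_newsfeed(batched, 1, [10, 11, 12], cache)  # second call is all cache hits
    print(batched.reads)  # 4
```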

The solution is here.

Samsung – Worldwide TV Management

Samsung SMART TV started as a concept and is now a reality. The device has base software which needs to be updated from time to time. The cloud was the perfect solution, assuming these TVs are connected to the internet. Azure or AWS became the obvious choice; not knowing which way the sales would go for Samsung, betting the update and management solution on Platform as a Service was a good decision. Some key features:

Frequent updates with new applications and software changes for better support and compatibility.
Due to high sales, a scalable and elastic system was required.
Utilized 20 large-size web roles with ASP.NET.
Has a good Web API layer for the same functionality; find the developer documentation here.
Caching seems to be the one candidate which all Azure applications need for rescue.

Solution Architecture

The Architecture is fairly straightforward.

Firmware Download Website: A set of web roles which connect with the Smart TV via a set of REST-based APIs; post authentication, they check for updates and push the updates to the device.
Administration Firmware Upload Website: A set of web roles which provide basic administration and reporting functionality.
Worker Role which does task automation: firmware encryption and batch updates. Additionally, it pushes the logs to the SQL Azure database. The updates go to blob storage and use the Azure CDN functionality to push to edge servers.

MYOB

MYOB (Mind Your Own Business) is a large Australia-based ISV developing accounting software for small businesses. The new release, AccountRight Live, lets users run their accounts on a PC or in the cloud, or both at once, depending on their preferences. The hybrid arrangement was not, however, a cunning innovation designed to catapult the company and its customers into a bold and cloudy future; MYOB decided on rolling out a hybrid strategy based on surveys run with its customers. The user count is about 150k. Find the link to the site here.

Based on the multi-tenancy requirement, each customer required a separate database. Caching is used as a standard feature, with reserved CPU per database.

MYOB Solution Architecture

Each user installs the client software from a boxed offering.
The user can choose to run the business and data tiers either on Azure or on premises.
The application is developed in C#/.NET using LINQ to SQL and Entity Framework, which is very bad. LINQ is a fairly single-threaded process that works very well on an Intel processor with a high clock rate. On a web or worker role running on an AMD processor with a much lower clock rate, performance on Azure will be slower. The workaround is to run LINQ on a small instance with a single core.
Databases on premises and on Azure are kept in sync via the Sync Framework.
Each customer has their own Azure SQL Database per business entity.
The database is backed up nightly using the DAC Import/Export service, keeping 2 days of rolling backup files in blob storage.

The identity management used is Azure ACS (with its built-in STS), most probably with a custom identity provider and the option of Windows Live. With some guesswork, I believe MYOB identity management also uses ADFS to integrate with corporate Active Directory.

A simple layered architecture.

User Interfaces – A browser or a thick desktop client.
Services Layer – This includes the Collaboration Services, Authentication Service, Customer File Service and Huxley Services (Transactional Services). All services are exposed as REST APIs. The Client Desktop connects with the Collaboration Service after authenticating against Azure ACS and receiving the ACS token. The Collaboration Service validates with the Billing System and then connects to the correct user database, depicted as User DB1… DBn.
Storage Services – Basically a set of SQL Azure databases.

What I love about this architecture is its simplicity. Keeping tenancy at the database level may not be the most economical solution, but it's simple.

Caching has been used at every layer. Judging from the speeds, they seem to have a set of dedicated caching servers; how many is guesswork again.

The MYOB implementation offered some key lessons. What are they?

Cloud Platforms
- Enable massive scalability
- HA at lower costs
- Expose rich cloud based API’s
Identity Foundation
- Well integrated with WCF and highly customisable
Scaling Database
- Sharding is the foundation

WCF throttling issues were handled with different architectural solutions. Some of the solutions are here.

Key Takeaways

A couple of key areas to watch out for in an Azure application:

If mobile is part of the overall architecture, special considerations are required; we'll have a separate post on that.
Study the storage choice properly (Tables, SQL Azure); there is no straightforward answer, it varies. Every SQL Azure database is allocated a maximum of 180 concurrent threads.
Use caching wherever possible. This is an architectural decision, not a developer decision.
Code right: profile the code as much as possible.
SQL Azure is a relational database as a service. SQL Server is the core engine and SQL Azure is a logical abstraction over it. SQL Azure offers a subset of SQL Server features and provides tremendous scale-out features.

Saturday, 16 March 2013

Failsafe Computing in Cloud

The origins of the cloud computing platform are closely linked to Service Oriented Architecture. In the cloud we think of everything as a service. These services have SLOs (Service Level Objectives), very similar to SLAs.

Why Failsafe Computing in Cloud?

With more and more organizations adopting the cloud platform, the perils are many. We have seen Amazon go down a couple of times in 2012 and 2013 (http://www.slashgear.com/amazon-com-is-down-its-not-just-you-update-back-in-business-31267671/) and an Azure outage (http://www.zdnet.com/microsofts-december-azure-outage-what-went-wrong-7000010021/). The need for Failsafe Computing in the Cloud is now.

Failsafe Computing in the Cloud is not an afterthought. Like any other architecture discipline, it is a non-functional requirement that has found its rightful place.

Any application built for the cloud is structured around services. These services have workloads associated with them. A service can be as generic as Sales Force Automation or Retail, an overarching service comprising many other services that make it happen. The workload is a broader concept; an example follows below.

What do Failsafe Services really mean?

We architect the cloud platform on the guidelines of SOA. We define a service as the basic unit of deployment; of course, it starts from the conceptual architecture. For example, a retail service is an independent functionality in the cloud going about its regular business. We'd expect this service to have a defined SLO. The high-level attributes of Failsafe Services are:

Software into Service: In the cloud platform everything is in terms of services. Cloud projects are delivered as services with defined SLOs (availability …..)
Services not Servers: In the cloud world we have our services deployed on logical VMs with the option to scale out. We no longer think in terms of servers.
Decomposition by Workload: Cloud computing provides a layer of abstraction where the physical infrastructure, and even the appearance of physical infrastructure, has less of an impact on the overall application architecture. So instead of an application being required to run on a server, it can be decomposed into a set of loosely coupled services that have the freedom to run in the most appropriate fashion. This is the foundation of the workload model because what may be considered an appropriate way to run for one part of an application may be wildly different for another, hence the need to separate out the different parts of an application so that they can be dealt with separately. An example of a workload: “Consider an example of an e-commerce application and the two distinct features of catalogue search and order placement. Even though these features are used by the same user (consumer) their underlying requirements differ. Ordering requires more data rigour and security, whereas search needs to optimally scale. A search being slightly incorrect is tolerable, whereas an incorrect order is not. In this example, the single use case of searching for and ordering a product can be decomposed into two different workloads, and a third if we count the back-end integration with the order fulfilment process.”
Utilize Scale Units: Designing by scale units is about defining a capacity block for the service. The capacity model of the block addresses the unit of scalability and availability for that service. One may argue that simply adding more VMs provides elasticity, but a scale unit is instead a set of VMs that can be added or removed on the fly.
Design for Operations: Every service that runs in the cloud has to satisfy some operational asks. For example, all services have to emit a basic level of telemetry, such as logging on health, issues and exceptions.

What does a service comprise of?

A service can comprise a number of web roles, worker roles or persistent VM roles plus storage (tables, queues, blobs or SQL Azure), and it can depend on other services as well. Services can have inter-service or intra-service dependencies.

What are the SLAs around service availability?

The 9s around services depend on the cloud platform, which offers 99.9%, but there is also a strong dependence on the code one writes in those services, for example a dependency on external services. Below is what the Azure platform provides as an SLA.
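The effect of dependencies on the 9s can be checked with simple arithmetic. The sketch below is not from the Azure SLA; it assumes failures are independent, and the percentages in the usage lines are hypothetical. A request path that needs every dependency to be up multiplies the availabilities, while redundant instances multiply the failure probabilities.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every dependency to be up.

    Assumes the dependencies fail independently of each other.
    """
    return prod(availabilities)


def redundant_availability(availability, instances):
    """Availability of N independent instances where any single one is enough."""
    return 1 - (1 - availability) ** instances


# Hypothetical figures: a 99.9% service calling a 99.9% database
# and a 99.5% external service.
path = serial_availability([0.999, 0.999, 0.995])
print(f"serial path: {path:.4%}")
print(f"two instances at 99.9%: {redundant_availability(0.999, 2):.4%}")
```

The serial result is always lower than the weakest dependency, which is why code that leans on external services cannot simply inherit the platform figure.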

What has throttling got to do with Services?

The services hosted in the cloud have to fulfil requests while running on the resources provided to them, and there are times when these resources become scarce or unavailable. One may end up using queues and setting a maximum number of messages beyond which the queue will not accept new messages. Throttling is a standard pattern on shared resources. Throttling needs to be dealt with at architecture time; for example, if Service A throttles after 5k requests/sec, use multiple accounts in your architecture. Another classic example: Facebook maintains 99.99% availability, but it has a lot more constraints in the fine print, such as throttling you if you pound the site with more than x requests/second.
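To make the idea of a request-rate limit concrete, here is a minimal token bucket sketch. The token bucket is one common way to implement throttling; nothing in this post says which algorithm Azure, Facebook or any other provider actually uses, and the class name and parameters are illustrative assumptions.

```python
import time


class TokenBucket:
    """Allow roughly `rate_per_sec` requests per second with bursts up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill tokens for the time elapsed, never beyond the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical limit: 5 requests per second with a burst of 5.
limiter = TokenBucket(rate_per_sec=5, capacity=5)
accepted = sum(limiter.allow() for _ in range(20))
print(f"accepted {accepted} of 20 back-to-back requests")
```

A client that knows the provider's limit can run the same check on its own side and spread calls across accounts or delay them before the provider starts rejecting requests.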

More on Decomposition by Workload?

Picking up where the earlier discussion of workloads left off.

Decomposition is essentially an Architectural Pattern [POSA].

When architecting for the cloud, we don't create all of these decomposed services just because the platform allows it. After all, it does increase the complexity and require more effort to build. In the context of cloud computing, this architectural pattern has, amongst others, the following benefits:

Availability — well-separated services create fault isolation zones, so that a failure of one part of the application will not bring everything down.
Increased scalability — where parts of the application that require high scalability are not held back by those that do not.
Continuous release and maintainability — different parts of the application can be upgraded without application-wide downtime.
Operational responsiveness — operators can monitor, plan and respond better to different events.

The workload model requires that features be decomposed into discrete workloads that are identifiable, well named, and can be referenced. These workloads form the basis of the services that will deliver the required functionality. The workloads are also used in other ALM models to establish the architectural boundaries of services as they apply to specific models.

Decomposing Workloads

There are no easy rules for decomposing workloads which is why it should only be tackled by an experienced architect. An architect with little cloud computing experience will probably err on the side of not enough decomposition. The challenge is identifying the workloads for your particular application. Some are obvious, while others less so, and too much decomposition can create unnecessary complexity. Workloads can be decomposed by use case, features, data models, releases, security, and so on.

As the architect works through the functionality, some key workloads may become clear early on. For example:

Front-end workloads (where an immediate request response is required) can be easily distinguished from back-end workloads (where processing can be offloaded to an asynchronous process).
Scheduled jobs, such as ETL, need to be kicked off at a particular time of day.
Integration with legacy systems.
Low load internal administrative features, such as account administration.

Indicators of differing workloads

Determining how to decompose workloads is the responsibility of the architect, and experienced architects should take to it quite easily. The following indicators of differing workloads are only a guide, as the particular application and environment may have differing indicators.

Feature roll-out

The separation of features into sets that are rolled out over time is often an indicator of separate workloads. For example, an e-commerce application may have the viewing of product browsing history in the first release, with viewing of product recommendations based on browsing history in a subsequent release. This indicates that product recommendations can be in a separate workload to simple browsing history.

Use case

A single user, in a single session, may access different features that appear seamless to the user but are separate use cases. The separate use cases may indicate separate workloads. For example, the primary use case on Twitter of viewing your timeline and tweeting is separate from the searching use case. Searching is a separate workload, which is implemented on Twitter as a completely separate service.

User wait times

Some features require that the service provides a quick response, while for others the user is prepared to wait longer. For example, a user expects that items can be removed from a shopping basket immediately, but is prepared to wait for order confirmations to be e-mailed through. This difference in wait time indicates that there are separate workloads for basket features and order confirmation.

Model differences

Workload decomposition is important in the design phase because all other models that need to be developed in design (such as the data model, security model, operational model, and so on) are influenced by the various workloads. Using our e-commerce example, without identifying search and ordering as separate workloads, we would get stuck when developing the security model as we would either end up with too much security for search (which is essentially public data, and has low security) by lumping it together with the higher security requirements for orders, or the reverse, where we are exposed to hacking because orders are insecure.

In the process of working through the models, a clue that workloads are incorrectly defined is when a model doesn't seem to fit cleanly with the workload. This may indicate that there are two workloads that need to be separated out. Whilst it is better to clearly define the workloads early on, it is possible that some will emerge later in the design, or indeed as requirements change during development. The problem, of course, is that when new workloads are identified they need to be reviewed against models that have already been developed, as at least one model would have changed.

Below are some examples where a difference in a model indicates the possibility that the feature is composed of two different workloads:

Availability model — When developing the availability model, if one feature has higher availability requirements than another, then it may indicate that there are separate workloads. For example, the Twitter API (as used by all Twitter clients) needs to be far more available than search.

Lifecycle model — The lifecycle model may show that a particular feature is subject to spiky traffic or high load. In order to be able to scale that feature, it should be in a separate workload to those that have flatter usage patterns. For example, hotel holiday bookings may be spiky because of promotions, seasons or other influences, but the reviewing of hotels by guests may be a lot flatter. So, hotel reviews may be in a separate workload.

Data model — The data model separates data into schemas that may be based on workloads, so getting the workload model and the data model aligned is important. Features that use different data stores indicate possible workload separation. For example, the product catalogue may be in a search optimised data store, such as SOLR, whereas the rest of the application stores data in SQL. This may indicate that search is a distinct and separate workload.
Security model — Features or data that have different security requirements can indicate separate workloads. For example, in question and answer applications the reading of questions may be public, but asking and answering questions requires a login. This may indicate that viewing and editing are separate workloads.
Integration model — Different integration points often require separate workloads. While some integration may require immediate data, such as a stock availability lookup and will be in the same workload as other functionality, the overnight updating of stock on-hand may be a separate workload.
Deployment model — Some functionality may be subject to frequent changes while others remain static, indicating the possibility of separate workloads. For example, the consumer-facing part of an application may update frequently as features are added and defects fixed, whereas the admin user interface stays unchanged for longer periods. The need to deploy one set of functionality without having to worry about others can be helped by separating the workloads.

Implementing workloads as services

Workloads are a logical construct, and the decision about what workloads to put into what services remains an implementation decision. Ultimately, many workloads will be grouped into single services, but this should not impact the logical separation of the workloads. For example, the web application service may contain many front-end workloads because they work better together as a single service. Another example is the common pattern to have a single worker role processing messages from multiple queues, resulting in a number of workloads being handled by a single role.
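The single-role, multiple-queue pattern can be sketched as follows. This is an illustration only: in-memory `deque` objects stand in for cloud queues, and the queue names and handlers are hypothetical. Each workload keeps its own queue and handler, so the logical separation survives even though one worker processes them all.

```python
from collections import deque


def drain_queues(queues, handlers):
    """Process one message per queue in turn until every queue is empty.

    `queues` maps a workload name to its queue; `handlers` maps the same
    name to the function that processes that workload's messages.
    """
    processed = []
    while any(queues.values()):
        for name, queue in queues.items():
            if queue:
                message = queue.popleft()
                processed.append(handlers[name](message))
    return processed


# Hypothetical workloads sharing one worker.
queues = {
    "order-confirmation": deque(["order-1", "order-2"]),
    "stock-update": deque(["sku-9"]),
}
handlers = {
    "order-confirmation": lambda m: f"emailed confirmation for {m}",
    "stock-update": lambda m: f"updated stock for {m}",
}
for line in drain_queues(queues, handlers):
    print(line)
```

Taking one message per queue per pass keeps a busy workload from starving a quiet one; splitting a workload out into its own role later only means moving its queue and handler.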

The decision to group workloads together should happen late in the development cycle, after most of the ALM models have been completed, as the differences across models may be significant enough to warrant separate implemented services.

Identified workloads

The primary output of the workload model is a list of workloads, with some of their characteristics, so that they can be used and referenced in other ALM models. For each identified workload:

Name the workload.
Provide a contextual description of the workload. Bias the description towards the business requirement so that all stakeholders can understand it.
Briefly highlight relevant technical aspects of the workload that may influence the model. For example, the workload may have special latency requirements, or need to interface with an external system. These aspects should be quick and easy to read through for all workloads when developing the models.
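The list of identified workloads can be kept as a small structured catalogue so that other models can reference workloads by name. The sketch below is one possible shape, not a prescribed format; the field names are assumptions, and the two entries reuse the e-commerce example from earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str                    # short, unique, referenceable name
    description: str             # business-facing description
    technical_notes: tuple = ()  # brief aspects that influence other models


WORKLOADS = [
    Workload(
        name="catalogue-search",
        description="Consumers search the product catalogue.",
        technical_notes=("must scale out", "slightly stale results tolerable"),
    ),
    Workload(
        name="order-placement",
        description="Consumers place an order for products.",
        technical_notes=("needs data rigour and security",),
    ),
]


def find_workload(workloads, name):
    """Return the workload with the given name, or None if it is not listed."""
    return next((w for w in workloads if w.name == name), None)


print(find_workload(WORKLOADS, "order-placement").technical_notes)
```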

How do we look at failure points?

Most applications today handle failure points very poorly by design. What does this mean? Referring back to the e-commerce example, one could go to an internet site, try to place an order and, bang, get an error message such as “failed to connect to x order service” with some cryptic stack trace. Most error messages written today are meant for the developer, not for operations folks. A failure point is a place in the code that has an external dependency, for example opening a database connection or accessing a configuration file. The typical error message the operations team would expect is “this <action> (open) of this <artefact> (database) failed due to this probable <reason> (a timeout)”. The other classic case is the try/catch block where an exception is thrown from the lowest level to the highest level, which is itself expensive, without proper messaging.
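An operations-friendly error can carry the action, artefact and reason described above as explicit fields. The following is a minimal sketch; the exception class, the function and the `connect` callable are hypothetical names introduced for illustration.

```python
class FailurePointError(Exception):
    """Error raised at a failure point, worded for the operations team."""

    def __init__(self, action, artefact, reason):
        self.action = action
        self.artefact = artefact
        self.reason = reason
        super().__init__(f"{action} of {artefact} failed: {reason}")


def open_order_database(connect):
    """Open the order database, translating a low-level timeout.

    `connect` is any callable that returns a connection (an assumption
    made here so the example has no real database dependency).
    """
    try:
        return connect()
    except TimeoutError as exc:
        # Keep the original exception as the cause for developers,
        # but surface a message operations can act on.
        raise FailurePointError("open", "order database", "connection timed out") from exc


def _unreachable_database():
    raise TimeoutError("no response after 30s")


try:
    open_order_database(_unreachable_database)
except FailurePointError as err:
    print(err)
```

The stack trace is still available through the chained cause, so the developer loses nothing while the operator gets a message that names what failed and why.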

Failure Mode Analysis(FMA)

A Failure Mode is a predictable root cause of the outage that occurs at a Failure Point. Failure Modes are the various conditions that can be experienced at a Failure Point.

A Failure Point is an external condition most of the time; failure modes identify the root cause of an outage at a failure point. The art the developer needs to be wary of here is deciding how much of the failure can be fixed by a simple retry and how much has to be reported out. Retries go far beyond database connections; they can apply to opening a service, connecting to the service bus, and so on.
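The retry-or-report decision can be sketched as a small helper. This is an illustrative sketch, not a library API: `TransientError` is a hypothetical marker for failure modes judged worth retrying, and the exponential backoff schedule is an assumed policy. Anything else propagates immediately so it can be reported.

```python
import time


class TransientError(Exception):
    """Failure mode that a simple retry may fix (hypothetical marker type)."""


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient failures with exponential backoff.

    Non-transient exceptions are not caught. After the last attempt the
    transient error is re-raised so it can be reported out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: an operation that fails twice before succeeding.
state = {"calls": 0}


def connect_to_service_bus():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("service bus busy")
    return "connected"


print(call_with_retry(connect_to_service_bus, sleep=lambda seconds: None))
```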

Failure Mode Example

Failure Mode Modelling is as important as Threat Modelling and should be part of the overall project lifecycle.

What is a Scale Unit in Cloud World?

A Unit of Scale is associated with a service or a workload and is the basic unit of deployment when scaling up or down. A Unit of Scale has the following:

Workloads: Messaging, Collaboration, Productivity.
Resources: 4 Web Roles (8 CPU)
Storage: 100 GB Database, 10 GB Blob Storage
Demands it can meet: 10k Active Users, 1K Concurrent Users, < 2 seconds response time.
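Once a scale unit is defined, capacity planning reduces to counting units. The sketch below encodes the example unit above and works out how many units a given demand needs; the class and function names are illustrative, and the sizing rule (take the larger of the two user-based requirements) is an assumption.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class ScaleUnit:
    web_roles: int
    database_gb: int
    blob_gb: int
    active_users: int       # active users one unit can serve
    concurrent_users: int   # concurrent users one unit can serve


# The example unit described above.
UNIT = ScaleUnit(web_roles=4, database_gb=100, blob_gb=10,
                 active_users=10_000, concurrent_users=1_000)


def units_needed(unit, active_users, concurrent_users):
    """Number of whole scale units needed to meet both demands (at least 1)."""
    return max(
        1,
        ceil(active_users / unit.active_users),
        ceil(concurrent_users / unit.concurrent_users),
    )


# Hypothetical demand: 25k active users, 1.5k of them concurrent.
count = units_needed(UNIT, active_users=25_000, concurrent_users=1_500)
print(f"{count} units -> {count * UNIT.web_roles} web roles, "
      f"{count * UNIT.database_gb} GB database")
```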

Fault and Upgrade Domains.

A cloud architecture or design is only as strong as its weakest component. A failed component must not take down the service. Make sure there are two fault domains, or a minimum of 2 instances. Upgrade domains are the other area to consider. Both are an inherent part of Windows Azure; more information can be found here.

What considerations does one need to give in applications?

Following are the recommendations

Default to asynchronous
Handle Transient Faults
Circuit Breaker Pattern: Services in a cloud architecture generally have an availability of 99.99%; with 2 instances the availability can be increased further, and by adding geo-redundancy we can achieve much more. Throttling, whether from overwhelming client calls or another failure condition, requires clients to be written so that retries to the service happen in a safe manner, i.e. when the service is up. When developing enterprise-level applications, we often need to call external services and resources. One method of attempting to overcome a service failure is to queue requests and retry periodically. This allows us to continue processing requests until the service becomes available again. However, if a service is experiencing problems, hammering it with retry attempts will not help the service to recover, especially if it is under increased load. Such a pounding can cause even more damage and interruption to services. If we know there could potentially be a problem with a service, we can help take some of the strain by implementing a Circuit Breaker pattern on the client application.
Automate All the Things
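A client-side circuit breaker can be sketched in a few lines. This is a minimal single-threaded illustration, not a production implementation: the class name, the thresholds and the injectable clock are assumptions. After a number of consecutive failures the breaker opens and rejects calls without touching the service; once the timeout has elapsed it lets one trial call through and closes again only if that call succeeds.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, allow a single trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)


def struggling_service():
    raise ConnectionError("service unavailable")


for _ in range(3):
    try:
        breaker.call(struggling_service)
    except CircuitOpenError as err:
        print(err)
    except ConnectionError:
        print("call failed")
```

While the circuit is open the struggling service receives no traffic from this client at all, which is the strain relief the pattern is meant to provide.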

Embrace Open Standards – This is a bit of prescriptive guidance which can help

OData – Use OData as the standard data protocol
OAuth – Identity standard
Open Graph – Social standard

These standards are discussed under Fail Safe because there is no need to reinvent the wheel in the data, identity and social arenas, and using them promotes easy interoperability.

Data Decomposition

In the cloud world it is key to understand that reading and writing from a single store has its limitations. There is no defined limit on the number of concurrent connections to SQL Azure, but there is a high chance that too many connections will lead to throttling. Most architects give too much importance to the application and service layers from a scale-unit standpoint and tend to forget that the database may also need some kind of partitioning, i.e. horizontal, vertical, etc.

Apply functional decomposition to the database layer too.

Don’t force partitioning for the sake of partitioning; this will impact manageability.
Partition where and when required to reduce dependency and to allow independent management and scale.
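Horizontal partitioning needs a stable rule that maps a key to a partition. The sketch below shows one simple rule, a hash of the tenant identifier modulo the number of databases. It is an assumption for illustration, not a description of how any product in this post routes data, and the tenant names and shard count are hypothetical.

```python
import hashlib


def shard_for(tenant_id, shard_count):
    """Map a tenant to one of `shard_count` databases, stable across processes."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical tenants spread over 4 databases.
for tenant in ("contoso", "fabrikam", "northwind"):
    print(tenant, "-> database", shard_for(tenant, 4))
```

A cryptographic hash is used instead of Python's built-in `hash` because the latter is randomised per process for strings. Note that a plain modulo remaps most tenants when the shard count changes, which is one of the manageability costs to weigh before partitioning.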

Reduce logic in SQL Database

CRUD is acceptable.

Latency Shifts

Latency cuts across 2 areas: internal server to server, or device to service. Latency has to be built into the design.

References

1.http://www.windowsazure.com/en-us/develop/net/architecture/