How-old.net: Augmenting Virality in Real Time with Batch Analytics using Data Factory
Microsoft launched the http://www.how-old.net demo as part of the keynote presentation at the Build 2015 developer conference. The demo lets anyone upload a picture and have the Azure ML Face Detection APIs predict the age and gender of any faces recognized in it. We wanted to showcase how developers can easily and quickly build smart applications using our newly released Face Detection Machine Learning APIs, analyze data in real time using Azure Stream Analytics, and visualize the data in PowerBI.
We started the demo as a test for 50 users, but it went viral, with more than 33 million users and 236 million images in less than a week of activity. Very soon this turned into a big data problem, and we decided to augment the real-time analytics with analytics over historical data. Together, real-time and batch analytics gave us a pulse on what is happening now as well as a view into what has already happened. This is a classic pattern for anyone who needs near-real-time insights (e.g. the number of successful face recognitions in the last 60 seconds) but also wants to analyze data over longer time periods (e.g. how usage changed from month to month).
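To make the contrast concrete, here is a minimal Python sketch of the two views over the same stream of event timestamps. It is an illustration only, not how the demo was built (the demo used Azure Stream Analytics for the real-time side and Azure Data Factory with HDInsight for the batch side). The names `SlidingWindowCounter` and `monthly_counts` are hypothetical, and the sketch assumes events arrive in time order.

```python
from collections import Counter, deque
from datetime import datetime, timedelta


class SlidingWindowCounter:
    """Real-time view: counts events inside a trailing time window."""

    def __init__(self, window=timedelta(seconds=60)):
        self.window = window
        self.events = deque()  # timestamps, oldest first

    def add(self, ts):
        # Assumption: timestamps are added in non-decreasing order.
        self.events.append(ts)

    def count(self, now):
        # Drop events that have aged out of the window, then count the rest.
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)


def monthly_counts(timestamps):
    """Batch view: total events per calendar month over the full history."""
    return Counter(ts.strftime("%Y-%m") for ts in timestamps)
```

The sliding window only ever holds the last minute of events, which keeps it cheap enough to update continuously, while the monthly rollup has to scan the full history and is therefore a better fit for a scheduled batch job.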
Batch Analytics: How did we do it?
We used Azure Data Factory and Azure HDInsight to quickly build pipelines that take the raw face demo data being dumped into Azure Blob storage and batch process it (using HDInsight/Hadoop) to get insights into the data. The Azure Data Factory pipelines (see the screenshot below) now run every day, processing the data captured as people upload photos and try out http://how-old.net. The processed data is written to an Azure SQL database and consumed by a PowerBI dashboard. The entire data factory and its pipelines were built and operationalized in less than 4 hours.
Note: The data being collected is not the actual photo itself, but rather metadata such as the number of requests per time period and the type of device that sent the request. No personally identifiable information is collected or analyzed by these pipelines.
The PowerBI dashboard (see the screenshot below) showcasing real-time analytics generated using Azure Stream Analytics (Number of Photos-Last 60 Seconds, Faces by Browser-Last 2 minutes, Photos Per Second-By Date) has now been augmented to showcase batch analytics generated using Azure Data Factory. Some of the batch analytics being captured include:
- Users By Day
- Photos By Day
- Users By Platform
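As a rough illustration of what these three rollups compute, the Python sketch below aggregates a list of request-metadata records. It is a stand-in for the idea, not the actual pipeline, which ran on HDInsight/Hadoop. The record fields (`timestamp`, `session_id`, `platform`) are assumptions made for this example; the real schema is not described here. The sketch treats each record as one photo request and uses an anonymous session identifier to approximate a "user".

```python
from collections import defaultdict


def batch_aggregates(records):
    """Compute daily and per-platform rollups from request metadata.

    Each record is assumed (hypothetically) to look like:
    {"timestamp": "2015-05-01T10:00:00Z", "session_id": "a", "platform": "iOS"}
    """
    users_by_day = defaultdict(set)
    photos_by_day = defaultdict(int)
    users_by_platform = defaultdict(set)

    for r in records:
        day = r["timestamp"][:10]  # date prefix of an ISO 8601 timestamp
        users_by_day[day].add(r["session_id"])
        photos_by_day[day] += 1  # one record per photo request
        users_by_platform[r["platform"]].add(r["session_id"])

    return {
        # Sets de-duplicate sessions, so these are distinct counts.
        "users_by_day": {d: len(s) for d, s in users_by_day.items()},
        "photos_by_day": dict(photos_by_day),
        "users_by_platform": {p: len(s) for p, s in users_by_platform.items()},
    }
```

The output is a handful of small summary tables, which is the shape of data that suits being loaded into a SQL database and read by a dashboard.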
This use case shows how customers can quickly and easily augment real-time analytics with batch analytics for big data processing. Go try out http://how-old.net for yourself (#HowOldRobot). We hope you have fun with it and are inspired to create your own solutions using Azure services and the APIs available in the ML Gallery.
If you want to try out Azure Data Factory, visit us here and get started building batch pipelines easily and quickly. If you have any feature requests or want to provide feedback on Data Factory, please visit the Azure Data Factory Forums.
Source: Microsoft Azure News