Thursday, 30 June 2016

Automating Analytic Workflows on AWS



Organizations are experiencing a proliferation of data. This data includes logs, sensor data, social media data, and transactional data, and resides in the cloud, on premises, or as high-volume, real-time data feeds. It is increasingly important to analyze this data: stakeholders want information that is timely, accurate, and reliable. This analysis ranges from simple batch processing to complex real-time event processing. Automating workflows can ensure that necessary activities take place when required to drive the analytic processes.
With Amazon Simple Workflow (Amazon SWF), AWS Data Pipeline, and AWS Lambda, you can build analytic solutions that are automated, repeatable, scalable, and reliable. In this post, I show you how to use these services to migrate and scale an on-premises data analytics workload.

Workflow basics

A business process can be represented as a workflow. Applications often incorporate a workflow as steps that must take place in a predefined order, with opportunities to adjust the flow of information based on certain decisions or special cases.
The following is an example of an ETL workflow:
A workflow decouples steps within a complex application. In the workflow above, bubbles represent steps or activities, diamonds represent control decisions, and arrows show the control flow through the process. This post shows you how to use Amazon SWF, AWS Data Pipeline, and AWS Lambda to automate this workflow.

Overview

SWF, Data Pipeline, and Lambda are designed for highly reliable execution of tasks, which can be event-driven, on-demand, or scheduled. The following table highlights the key characteristics of each service.
FEATURE | AMAZON SWF | AWS DATA PIPELINE | AWS LAMBDA
Runs in response to | Anything | Schedules | Events from AWS services or direct invocation
Execution order | Orders execution of application steps | Schedules data movement | Reacts to event triggers or direct calls
Scheduling | On-demand | Periodic | Event-driven, on-demand, or periodic
Hosting environment | Anywhere | AWS or on-premises | AWS
Execution design | Exactly once | Exactly once, configurable retry | At least once
Programming language | Any | JSON (pipeline definitions) | Node.js, Python, or Java
Let’s dive deeper into each of these services. If you are already familiar with them, skip to the section below titled "Scenario: An ecommerce reporting ETL workflow."

Amazon SWF

SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.
SWF key concepts:
  • Workflows are collections of actions.
  • Domains are collections of related workflows.
  • Actions are tasks or workflow steps.
  • Activity workers implement actions.
  • Deciders implement a workflow’s coordination logic.
SWF works as follows:
  1. A workflow starter kicks off your workflow execution. For example, this could be a web server frontend.
  2. SWF receives the start workflow execution request and then schedules a decision task.
  3. The decider receives the task from SWF, reviews the history, and applies the coordination logic to determine the activity that needs to be performed.
  4. SWF receives the decision, schedules the activity task, and waits for the activity task to complete or time out.
  5. SWF assigns the activity task to a worker, which performs the task and returns the results to Amazon SWF.
  6. SWF receives the results of the activity, adds them to the workflow history, and schedules a decision task.
  7. This process repeats itself for each activity in your workflow.
The graphic below is an overview of how SWF operates.
To facilitate a workflow, SWF uses a decider to coordinate the various tasks by assigning them to workers. The tasks are the logical units of computing work, and workers are the functional components of the underlying application. Workers and deciders can be written in the programming language of your choice, and they can run in the cloud (such as on an Amazon EC2 instance), in your data center, or even on your desktop.
In addition, SWF supports Lambda functions as workers. This means that you can use SWF to manage the execution of Lambda functions in the context of a broader workflow. SWF provides the AWS Flow Framework, a programming framework that lets you build distributed SWF-based applications quickly and easily.
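To make the starter, decider, and worker roles concrete, here is a minimal sketch using the AWS SDK for Python (boto3). SWF itself is language-agnostic, so treat this as illustrative only: the domain, workflow type, activity type, and task list names are placeholder assumptions, and the corresponding register_domain, register_workflow_type, and register_activity_type calls are presumed to have already been made.

import boto3

swf = boto3.client('swf')  # assumes AWS credentials and a region are configured

# The workflow starter kicks off an execution (step 1 above).
swf.start_workflow_execution(
    domain='analytics',                                  # placeholder domain
    workflowId='reporting-etl-run-001',
    workflowType={'name': 'ReportingETL', 'version': '1.0'},
    taskList={'name': 'reporting-etl-decisions'},
    executionStartToCloseTimeout='3600',
    taskStartToCloseTimeout='300',
    input='{"source": "on-prem-replica"}',
)

# A decider polls for decision tasks, reviews the history, and schedules work.
task = swf.poll_for_decision_task(
    domain='analytics',
    taskList={'name': 'reporting-etl-decisions'},
    identity='decider-1',
)
if task.get('taskToken'):
    swf.respond_decision_task_completed(
        taskToken=task['taskToken'],
        decisions=[{
            'decisionType': 'ScheduleActivityTask',
            'scheduleActivityTaskDecisionAttributes': {
                'activityType': {'name': 'ExtractData', 'version': '1.0'},
                'activityId': 'extract-1',
                'taskList': {'name': 'reporting-etl-activities'},
                'scheduleToStartTimeout': '300',
                'scheduleToCloseTimeout': '1200',
                'startToCloseTimeout': '900',
                'heartbeatTimeout': '120',
            },
        }],
    )

An activity worker would, in turn, poll the activity task list with poll_for_activity_task, do the work, and report back with respond_activity_task_completed.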

AWS Data Pipeline

Data Pipeline allows you to create automated, scheduled workflows to orchestrate data movement from multiple sources, both within AWS and on-premises. Data Pipeline can also run activities periodically: the minimum pipeline is actually just an activity. It is natively integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, and Amazon EC2, and can be easily connected to third-party and on-premises data sources. Data Pipeline’s inputs and outputs are specified as data nodes within a workflow.
Data Pipeline key concepts:
  • A pipeline contains the definition of the dependent chain of data sources, destinations, and predefined or custom data processing activities required to execute your business logic.
  • Activities:
    • Arbitrary Linux applications – anything that you can run from the shell
    • Copies between different data source combinations
    • SQL queries
    • User-defined Amazon EMR jobs
  • A data node is the location where a pipeline reads its input data or writes its output data; data nodes can be either Data Pipeline–managed or user-managed.
  • Resources provide compute for activities, such as an Amazon EC2 instance or an Amazon EMR cluster, that perform the work defined by a pipeline activity.
  • Schedules drive orchestration execution.
  • A parameterized template lets you provide values for specially marked parameters within the template so that you can launch a customized pipeline.
Data Pipeline works as follows:
  1. Define a task, business logic, and the schedule.
  2. Data Pipeline checks for any preconditions. 
  3. After preconditions are satisfied, the service executes your business logic. 
  4. When a pipeline completes, a message is sent to the Amazon SNS topic of your choice. Data Pipeline also provides failure handling, SNS notifications in case of error, and built-in retry logic.
Below is a high-level diagram.
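As a rough sketch of how this looks in code, the following boto3 calls create, define, and activate a simple daily pipeline with a single shell-command activity running on a Data Pipeline-managed EC2 instance. The pipeline name, schedule, S3 log location, and instance type are illustrative assumptions; the IAM role names shown are the Data Pipeline defaults.

import boto3

dp = boto3.client('datapipeline')

# Create an empty pipeline, then attach a definition and activate it.
pipeline_id = dp.create_pipeline(name='reporting-etl',
                                 uniqueId='reporting-etl-v1')['pipelineId']

objects = [
    {'id': 'Default', 'name': 'Default', 'fields': [
        {'key': 'scheduleType', 'stringValue': 'cron'},
        {'key': 'schedule', 'refValue': 'DailySchedule'},
        {'key': 'failureAndRerunMode', 'stringValue': 'CASCADE'},
        {'key': 'pipelineLogUri', 'stringValue': 's3://my-etl-bucket/logs/'},  # placeholder bucket
        {'key': 'role', 'stringValue': 'DataPipelineDefaultRole'},
        {'key': 'resourceRole', 'stringValue': 'DataPipelineDefaultResourceRole'},
    ]},
    {'id': 'DailySchedule', 'name': 'DailySchedule', 'fields': [
        {'key': 'type', 'stringValue': 'Schedule'},
        {'key': 'period', 'stringValue': '1 day'},
        {'key': 'startDateTime', 'stringValue': '2016-07-01T00:00:00'},
    ]},
    {'id': 'EtlInstance', 'name': 'EtlInstance', 'fields': [
        {'key': 'type', 'stringValue': 'Ec2Resource'},
        {'key': 'instanceType', 'stringValue': 'm3.medium'},
        {'key': 'terminateAfter', 'stringValue': '1 hour'},
    ]},
    {'id': 'ExtractActivity', 'name': 'ExtractActivity', 'fields': [
        {'key': 'type', 'stringValue': 'ShellCommandActivity'},
        {'key': 'command', 'stringValue': 'echo "run the extraction script here"'},
        {'key': 'runsOn', 'refValue': 'EtlInstance'},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)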

AWS Lambda

Lambda is an event-driven, zero-administration compute service. It runs your code in response to events from other AWS services or direct invocation from any web or mobile app and automatically manages compute resources for you. It allows you to build applications that respond quickly to new information, and automatically hosts and scales them for you.
Lambda key concepts:
  • Lambda function - You write a Lambda function, give it permission to access specific AWS resources, and then connect the function to your AWS or non-AWS resources.
  • Event sources publish events that cause your Lambda function to be invoked. Event sources can be:
    • An AWS service, such as Amazon S3 or Amazon SNS.
    • Other Amazon services, such as Amazon Echo.
    • An event source that you build, such as a mobile application.
    • Other Lambda functions that you invoke from within Lambda.
    • Amazon API Gateway, over HTTPS.
Lambda works as follows:
  1. Write a Lambda function.
  2. Upload your code to AWS Lambda.
  3. Specify requirements for the execution environment, including memory, a timeout period, an IAM role, and the function within your code that you want Lambda to invoke.
  4. Associate your function with your event source.
  6. When the event occurs, Lambda executes the associated function, either asynchronously or synchronously depending on the event source.
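For illustration, here is a minimal Python Lambda handler wired to an S3 object-created event source; the metadata logging is just a placeholder for your own processing logic.

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The event carries one record per newly created object.
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        head = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({'bucket': bucket, 'key': key,
                          'size': head['ContentLength']}))
    return 'ok'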

Scenario: An ecommerce reporting ETL workflow

Here's an example to illustrate the concepts that I have discussed so far. Transactional data from your company website is stored in an on-premises master database and replicated to a slave for reporting, ad hoc querying, and manual targeting through email campaigns. Your organization has become very successful, is experiencing significant growth, and needs to scale. Reporting is currently done using a read-only copy of the transactional database. Under these circumstances, this design does not scale well to meet the needs of high-volume business analytics.
The following diagram illustrates the current on-premises architecture.
You decide to migrate the transactional reporting data to a data warehouse in the cloud to take advantage of Amazon Redshift, a scalable data warehouse optimized for query processing and analytics.
The reporting ETL workflow for this scenario is similar to the one I introduced earlier. In this example, I move data from the on-premises data store to Amazon Redshift and focus on the following activities:
  1. Incremental data extraction from an on-premises database
  2. Data validation and transformation
  3. Data loading into Amazon Redshift
I am going to decompose this workflow using AWS Data Pipeline, Amazon SWF, and AWS Lambda, walking through three different approaches, one per service, and highlighting the key aspects of each. Please note that a solution using all three services together is possible, but is not covered in this post.

AWS Data Pipeline reporting ETL workflow

Data Pipeline lets you define a dependent chain of data sources and destinations with an option to create data processing activities in a pipeline. You can schedule the tasks within the pipeline to perform various activities of data movement and processing. In addition to scheduling, you can also include failure and retry options in the data pipeline workflows.
The following diagram is an example of a Data Pipeline reporting ETL pipeline that moves data from the replicated slave database on-premises to Amazon Redshift:
The above pipeline checks the data for readiness, then schedules and orchestrates the data movement while providing failure handling, SNS notifications in case of error, and built-in retry logic. You can also specify preconditions as decision logic. For example, a precondition can check whether data is present in an S3 bucket before a pipeline copy activity attempts to load it into Amazon Redshift.
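Continuing the boto3 sketch from earlier, the pipeline objects for such a precondition might look like the following; the object IDs, S3 key, and the data node and resource references are hypothetical placeholders defined elsewhere in the pipeline.

# Fragments of a pipeline definition (same key/stringValue/refValue form as before).
precondition_and_copy = [
    {'id': 'InputReady', 'name': 'InputReady', 'fields': [
        {'key': 'type', 'stringValue': 'S3KeyExists'},
        {'key': 's3Key', 'stringValue': 's3://my-etl-bucket/staging/extract.csv'},
    ]},
    {'id': 'LoadToRedshift', 'name': 'LoadToRedshift', 'fields': [
        {'key': 'type', 'stringValue': 'RedshiftCopyActivity'},
        {'key': 'precondition', 'refValue': 'InputReady'},  # run only if the key exists
        {'key': 'input', 'refValue': 'StagedS3Data'},       # an S3DataNode
        {'key': 'output', 'refValue': 'ReportingTable'},    # a RedshiftDataNode
        {'key': 'insertMode', 'stringValue': 'APPEND'},
        {'key': 'runsOn', 'refValue': 'EtlInstance'},
    ]},
]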
Data Pipeline is useful for creating and managing periodic, batch-processing, data-driven workflows. It optimizes the data processing experience, especially as it relates to data on AWS. For on-demand or real-time processing, Data Pipeline is not an ideal choice.

Amazon SWF on-demand reporting ETL workflow

You can use SWF as an alternative to Data Pipeline if you need fine-grained, programmatic control over the control flow and patterns of your workflow logic. SWF provides significant benefits, such as robust retry mechanisms upon failure, centralized application state tracking, and logical separation of application state and units of work.
The following diagram is an example SWF reporting ETL workflow.
The workflow’s decider controls the flow of execution from task to task. At a high level, the following activities take place in the above workflow:
  1. An admin application sends a request to start the reporting ETL workflow.
  2. The decider assigns the first task to on-premises data extraction workers to extract data from a transactional database.
  3. Upon completion, the decider assigns the next task to the EMR Starter to launch an EMR ETL cluster to validate and transform the extracted dataset.
  4. Upon completion, the decider assigns the last task to the Amazon Redshift Data Loader to load the transformed data into Amazon Redshift.
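A decider for this sequence could be sketched as follows, plugging into the poll/respond loop shown in the SWF section; the activity names, versions, task list, and timeouts are assumptions for illustration.

# Illustrative ordering of the three ETL activities.
ACTIVITY_SEQUENCE = ['ExtractOnPremData', 'RunEmrTransform', 'LoadRedshift']

def next_decision(history_events):
    # Count completed activities in the workflow history and either
    # schedule the next one or complete the workflow.
    completed = [e for e in history_events
                 if e['eventType'] == 'ActivityTaskCompleted']
    if len(completed) == len(ACTIVITY_SEQUENCE):
        return {'decisionType': 'CompleteWorkflowExecution',
                'completeWorkflowExecutionDecisionAttributes': {'result': 'done'}}
    next_name = ACTIVITY_SEQUENCE[len(completed)]
    return {'decisionType': 'ScheduleActivityTask',
            'scheduleActivityTaskDecisionAttributes': {
                'activityType': {'name': next_name, 'version': '1.0'},
                'activityId': '%s-1' % next_name,
                'taskList': {'name': 'reporting-etl-activities'},
                'startToCloseTimeout': '3600',
            }}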
This workflow can also use SWF as a distributed cron: if you want to run the recurring job on a pool of on-premises machines, SWF automates failure handling and scaling and eliminates the single point of failure inherent in a traditional operating system cron job.
Because the workers and deciders are both stateless, you can respond to increased traffic by simply adding more workers and deciders as needed. You can do this using the Auto Scaling service for applications that are running on EC2 instances in the AWS cloud.
To eliminate the need to manage infrastructure in your workflow, SWF now provides a Lambda task so that you can run Lambda functions in place of, or alongside, traditional SWF activities. SWF invokes Lambda functions directly, so you don’t need to implement a worker program to execute a Lambda function (as you must with traditional activities).
The following example reporting ETL workflow replaces traditional SWF activity workers with Lambda functions.
In the above workflow, SWF sequences Lambda functions to perform the same tasks described in the first example workflow. It uses the Lambda-based database loader to load data into Amazon Redshift. Implementing activities as Lambda tasks with the AWS Flow Framework simplifies the workflow’s execution model because there are no servers to maintain.
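As a sketch, handing one of these steps to Lambda from a decider amounts to emitting a ScheduleLambdaFunction decision through respond_decision_task_completed; the function name and input below are placeholders, and the workflow execution must have been started with a lambdaRole that permits SWF to invoke the function.

# One decision in the list passed to respond_decision_task_completed.
lambda_step = {
    'decisionType': 'ScheduleLambdaFunction',
    'scheduleLambdaFunctionDecisionAttributes': {
        'id': 'load-redshift-1',
        'name': 'redshift-database-loader',  # Lambda function name (placeholder)
        'input': '{"manifest": "s3://my-etl-bucket/staging/manifest.json"}',
        'startToCloseTimeout': '300',
    },
}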
SWF makes it easy to build and manage on-demand, scalable, distributed workflows. For event-driven reporting ETL, turn to Lambda.

AWS Lambda event-driven reporting ETL workflow

With Lambda, you can convert the reporting ETL pipeline from a traditional batch processing or on-demand workflow to a real-time, event processing workflow with zero administration. The following diagram is an example Lambda event-driven reporting ETL workflow.
The above workflow uses Lambda to perform event-driven processing without managing any infrastructure. 
At a high-level, the following activities take place:
  1. An on-premises application uploads data into an S3 bucket.
  2. S3 invokes a Lambda function to verify the data upon detecting an object-created event in that bucket.
  3. Verified data is staged in another S3 bucket. You can batch files in the staging bucket, then trigger a Lambda function to launch an EMR cluster to transform the batched input files.
  4. The Amazon Redshift Database Loader loads transformed data into the database.
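Steps 2 and 3 could be sketched as a Lambda function along these lines; the staging bucket name and the validation rule (newline-delimited JSON) are illustrative assumptions.

import json
import boto3

s3 = boto3.client('s3')
STAGING_BUCKET = 'reporting-etl-staging'  # placeholder bucket name

def lambda_handler(event, context):
    # Invoked by the object-created event on the landing bucket (step 2).
    # Objects that pass validation are copied to the staging bucket (step 3).
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        try:
            rows = [json.loads(line) for line in body.splitlines() if line.strip()]
        except ValueError:
            print('Rejected %s/%s: malformed input' % (bucket, key))
            continue
        if not rows:
            print('Rejected %s/%s: empty file' % (bucket, key))
            continue
        s3.copy_object(Bucket=STAGING_BUCKET, Key=key,
                       CopySource={'Bucket': bucket, 'Key': key})
        print('Staged %s to %s (%d records)' % (key, STAGING_BUCKET, len(rows)))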
In this workflow, Lambda functions perform specific activities in response to event triggers associated with AWS services. With no centralized control logic, workflow execution depends on the completion of an activity, type of event source, and more fine-grained programmatic flow logic within a Lambda function itself.

Summary

In this post, I have shown you how to migrate and scale an on-premises data analytics workload using AWS Data Pipeline, Amazon SWF, or AWS Lambda.  Specifically, you’ve learned how Data Pipeline can drive a reporting ETL pipeline to incrementally refresh an Amazon Redshift database as a batch process; how to use SWF in a hybrid environment for on-demand distributed processing; and finally, how to use Lambda to provide event-driven processing with zero administration. 
You can learn how customers are leveraging Lambda in unique ways to perform event-driven processing in the blog posts Building Scalable and Responsive Big Data Interfaces with AWS Lambda and How Expedia Implemented Near Real-time Analysis of Interdependent Datasets.
To get started quickly with Data Pipeline, you can use the built-in templates discussed in the blog post Using AWS Data Pipeline's Parameterized Templates to Build Your Own Library of ETL Use-case Definitions.
To get started with SWF, you can launch a sample workflow in the AWS Management Console, try a code sample in one of the supported programming languages, or use the AWS Flow Framework.
As noted earlier, you can choose to build a reporting ETL solution that uses all three services together to automate your analytic workflow. Data Pipeline, SWF, and Lambda provide capabilities that scale to meet your processing needs, and you can easily integrate them into an end-to-end solution. You can build on these concepts to automate not only your analytics workflows, but also business processes and other types of applications, wherever they run, in a scalable and reliable fashion.
If you have questions or suggestions, please leave a comment below.

Wednesday, 15 June 2016

Stupid Azure Trick #6 – A CORS Toggler Command-line Tool for Windows Azure Blobs

[Edit: I originally accidentally published an old draft. The draft went out to all email subscribers and was public for around 90 minutes. Fixed now.]

In the most recent Stupid Azure Trick installment, I explained how one could host a 1000 visitor-per-day web site for one penny per month. Since then I also explained my choice to use CORS in that same application. Here I will dig into specifically using CORS with Windows Azure.
I also show how the curl command line tool can be helpful to examine CORS properties in HTTP headers for a blob service.
I also will briefly describe a simple tool I built that could quickly turn CORS on or off for a specified Blob service – the CORS Toggler. The CORS Toggler (in its current simple form) was useful to me because of two constraints that were true for my scenario:
  • I was only reading files from the Windows Azure Blob Service. For simple reads, the CORS preflight request doesn’t come into play. Simplification #1.
  • I didn’t care that the blob resource would be readable by any site, rather than just by my application, so the CORS policy could be opened to any caller (‘*’). Simplification #2.
These two simplifications mean that the toggler knows exactly what it means to enable CORS (open up reading to all comers) and to disable it. (It is worth noting that opening CORS to any caller is probably a common scenario, and that the tool could easily be extended to support a whitelist of allowed domains or other features.)
First, here’s the code for the toggler – there are three files here:
  1. Driver program (Console app in C#) – handles command line params and such and then calls into the …
  2. Code to perform simple CORS manipulation (C# class)
  3. The above two are driven (in my fast toggler) through the third file, a command line batch file, which passes in the storage account name and key for the service I was working with
@echo off
D:\dev\path\to\ToggleCors.exe azuremap 123abcYourStorageKeyIsHere987zyx== %1
echo.
echo FOR COMPARISON QUERY THE nocors SERVICE (which never has CORS set)
echo.
D:\dev\path\to\ToggleCors.exe nocors 123abcYourStorageKeyIsHere987zyx== -q
echo.
PowerShell -Command Get-Date
using System;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Shared.Protocol;
using Newtonsoft.Json;

namespace BlobCorsToggle
{
    public class SimpleAzureBlobCorsSetter
    {
        public CloudBlobClient CloudBlobClient { get; private set; }

        public SimpleAzureBlobCorsSetter(Uri blobServiceUri, StorageCredentials storageCredentials)
        {
            this.CloudBlobClient = new CloudBlobClient(blobServiceUri, storageCredentials);
        }

        public SimpleAzureBlobCorsSetter(CloudBlobClient blobClient)
        {
            this.CloudBlobClient = blobClient;
        }

        /// <summary>
        /// Set Blob Service CORS settings for specified Windows Azure Storage Account.
        /// Either sets to a hard-coded set of values (see below) or clears of all CORS settings.
        ///
        /// Does not check for any existing CORS settings, but clobbers with the CORS settings
        /// to allow HTTP GET access from any origin. Non-CORS settings are left intact.
        ///
        /// Most useful for scenarios where a file is published in Blob Storage for read-access
        /// by any client.
        ///
        /// Can also be useful in conjunction with Valet Key Pattern-style limited access, as
        /// might be useful with a mobile application.
        /// </summary>
        /// <param name="clear">if true, clears all CORS setting, else allows GET from any origin</param>
        public void SetBlobCorsGetAnyOrigin(bool clear)
        {
            // http://msdn.microsoft.com/en-us/library/windowsazure/dn535601.aspx
            var corsGetAnyOriginRule = new CorsRule();
            corsGetAnyOriginRule.AllowedOrigins.Add("*"); // allow access to any client
            corsGetAnyOriginRule.AllowedMethods = CorsHttpMethods.Get; // only CORS-enable http GET
            corsGetAnyOriginRule.ExposedHeaders.Add("*"); // let client see any header we've configured
            corsGetAnyOriginRule.AllowedHeaders.Add("*"); // let clients request any header they can think of
            corsGetAnyOriginRule.MaxAgeInSeconds = (int)TimeSpan.FromHours(10).TotalSeconds; // clients are safe to cache CORS config for up to this long

            var blobServiceProperties = this.CloudBlobClient.GetServiceProperties();
            if (clear)
            {
                blobServiceProperties.Cors.CorsRules.Clear();
            }
            else
            {
                blobServiceProperties.Cors.CorsRules.Clear(); // replace current property set
                blobServiceProperties.Cors.CorsRules.Add(corsGetAnyOriginRule);
            }
            this.CloudBlobClient.SetServiceProperties(blobServiceProperties);
        }

        public void DumpCurrentProperties()
        {
            var blobServiceProperties = this.CloudBlobClient.GetServiceProperties();
            var blobPropertiesStringified = StringifyProperties(blobServiceProperties);
            Console.WriteLine("Current Properties:\n{0}", blobPropertiesStringified);
        }

        internal string StringifyProperties(ServiceProperties serviceProperties)
        {
            // JsonConvert.SerializeObject(serviceProperties) for whole object graph or
            // JsonConvert.SerializeObject(serviceProperties.Cors) for just CORS
            return Newtonsoft.Json.JsonConvert.SerializeObject(serviceProperties, Formatting.Indented);
        }
    }
}
using System;
using System.Diagnostics;
using System.IO;
using Microsoft.WindowsAzure.Storage.Auth;

namespace BlobCorsToggle
{
    class Program
    {
        static string clearFlag = "-clear";
        static string queryFlag = "-q";

        static void Main(string[] args)
        {
            if (args.Length == 0 || !ValidCommandArguments(args))
            {
                #region Show Correct Usage
                if (args.Length != 0 && !ValidCommandArguments(args))
                {
                    var saveForegroundColor = Console.ForegroundColor;
                    Console.ForegroundColor = ConsoleColor.Red;
                    Console.Write("\nINVALID COMMAND LINE OR UNKNOWN FLAGS: ");
                    foreach (var arg in args) Console.Write("{0} ", arg);
                    Console.WriteLine();
                    Console.ForegroundColor = saveForegroundColor;
                }
                Console.WriteLine("usage:\n{0} <storage acct name> <storage acct key> [{1}]",
                    Path.GetFileNameWithoutExtension(Environment.GetCommandLineArgs()[0]),
                    clearFlag);
                Console.WriteLine();
                Console.WriteLine("example setting:\n{0} {1} {2}",
                    Path.GetFileNameWithoutExtension(Environment.GetCommandLineArgs()[0]),
                    "mystorageacct",
                    "lala+aou812SomERAndOmStrINglOlLolFroMmYStORaG3akounT314159265358979323ilIkEpi==");
                Console.WriteLine();
                Console.WriteLine("example clearing:\n{0} {1} {2} {3}",
                    Path.GetFileNameWithoutExtension(Environment.GetCommandLineArgs()[0]),
                    "mystorageacct",
                    "lala+aou812SomERAndOmStrINglOlLolFroMmYStORaG3akounT314159265358979323ilIkEpi==",
                    clearFlag);
                if (Debugger.IsAttached)
                {
                    Console.WriteLine("\nPress any key to exit.");
                    Console.ReadKey();
                }
                #endregion
            }
            else
            {
                var storageAccountName = args[0];
                var storageKey = args[1];
                var blobServiceUri = new Uri(String.Format("https://{0}.blob.core.windows.net", storageAccountName));
                var storageCredentials = new StorageCredentials(storageAccountName, storageKey);
                var blobConfig = new SimpleAzureBlobCorsSetter(blobServiceUri, storageCredentials);

                bool query = (args.Length == 3 && args[2] == queryFlag);
                if (query)
                {
                    blobConfig.DumpCurrentProperties();
                    return;
                }

                blobConfig.DumpCurrentProperties();
                Console.WriteLine();

                bool clear = (args.Length == 3 && args[2] == clearFlag);
                blobConfig.SetBlobCorsGetAnyOrigin(clear);
                Console.WriteLine("CORS Blob Properties for Storage Account {0} have been {1}.",
                    storageAccountName, clear ? "cleared" : "set");
                Console.WriteLine();
                blobConfig.DumpCurrentProperties();
            }
        }

        private static bool ValidCommandArguments(string[] args)
        {
            if (args.Length == 2) return true;
            if (args.Length == 3 && (args[2] == clearFlag || args[2] == queryFlag)) return true;
            return false;
        }
    }
}
One simple point to highlight: CORS properties are simply available on the Blob service object (and the same applies to the Table and Queue services within Storage).
Yes, this is a very simple API.

Showing the Service Object Contents

For those interested in the contents of these objects, here are a few ways to show the content of the properties (in code) before turning on CORS and after. (The object views are created using the technique I described in my post on using JSON.NET as an object dumper that’s Good Enough™.)
DUMPING OBJECT BEFORE CORS ENABLED (just CORS properties):
{"Logging":{"Version":"1.0","LoggingOperations":0,"RetentionDays":null},"Metrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"HourMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"Cors":{"CorsRules":[]},"MinuteMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"DefaultServiceVersion":null}
DUMPING OBJECT AFTER CORS ENABLED:
{"Logging":{"Version":"1.0","LoggingOperations":0,"RetentionDays":null},"Metrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"HourMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"Cors":{"CorsRules":[{"AllowedOrigins":["*"],"ExposedHeaders":["*"],"AllowedHeaders":["*"],"AllowedMethods":1,"MaxAgeInSeconds":36000}]},"MinuteMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"DefaultServiceVersion":null}
DUMPING OBJECT BEFORE CORS ENABLED (but including ALL properties):
Current Properties:
{"Logging":{"Version":"1.0","LoggingOperations":0,"RetentionDays":null},"Metrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"HourMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"Cors":{"CorsRules":[]},"MinuteMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"DefaultServiceVersion":null}
DUMPING OBJECT AFTER CORS ENABLED (but including ALL properties):
Current Properties:
{"Logging":{"Version":"1.0","LoggingOperations":0,"RetentionDays":null},"Metrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"HourMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"Cors":{"CorsRules":[{"AllowedOrigins":["*"],"ExposedHeaders":["*"],"AllowedHeaders":["*"],"AllowedMethods":1,"MaxAgeInSeconds":36000}]},"MinuteMetrics":{"Version":"1.0","MetricsLevel":0,"RetentionDays":null},"DefaultServiceVersion":null}

Using ‘curl’ To Examine CORS Data:

curl -H "Origin: http://example.com" -H "Access-Control-Request-Method: GET" -H "Access-Control-Request-Headers: X-Requested-With" -X OPTIONS --verbose http://azuremap.blob.core.windows.net/maps/azuremap.geojson
CURL OUTPUT BEFORE CORS ENABLED:
D:\dev\github>curl -H "Origin: http://example.com" -H "Access-Control-Request-Method: GET" -H "Access-Control-Request-Headers: X-Requested-With" -X OPTIONS --verbose http://azuremap.blob.core.windows.net/maps/azuremap.geojson
* Adding handle: conn: 0x805fa8
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x805fa8) send_pipe: 1, recv_pipe: 0
* About to connect() to azuremap.blob.core.windows.net port 80 (#0)
*   Trying 168.62.32.206…
* Connected to azuremap.blob.core.windows.net (168.62.32.206) port 80 (#0)
> OPTIONS /maps/azuremap.geojson HTTP/1.1
> User-Agent: curl/7.31.0
> Host: azuremap.blob.core.windows.net
> Accept: */*
> Origin: http://example.com
> Access-Control-Request-Method: GET
> Access-Control-Request-Headers: X-Requested-With
>
< HTTP/1.1 403 CORS not enabled or no matching rule found for this request.
< Content-Length: 316
< Content-Type: application/xml
* Server Blob Service Version 1.0 Microsoft-HTTPAPI/2.0 is not blacklisted
< Server: Blob Service Version 1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: 04402242-d4a7-4d0c-bedc-ff553a1bc982
< Date: Sun, 26 Jan 2014 15:08:11 GMT
<
<?xml version="1.0" encoding="utf-8"?><Error><Code>CorsPreflightFailure</Code><Message>CORS not enabled or no matching rule found for this request.
RequestId:04402242-d4a7-4d0c-bedc-ff553a1bc982
Time:2014-01-26T15:08:12.0193649Z</Message><MessageDetails>No CORS rules matches this request</MessageDetails></Error>*
Connection #0 to host azuremap.blob.core.windows.net left intact
CURL OUTPUT AFTER CORS ENABLED:
D:\dev\github>curl -H "Origin: http://example.com" -H "Access-Control-Request-Method: GET" -H "Access-Control-Request-Headers: X-Requested-With" -X OPTIONS --verbose http://azuremap.blob.core.windows.net/maps/azuremap.geojson
* Adding handle: conn: 0x1f55fa8
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x1f55fa8) send_pipe: 1, recv_pipe: 0
* About to connect() to azuremap.blob.core.windows.net port 80 (#0)
*   Trying 168.62.32.206…
* Connected to azuremap.blob.core.windows.net (168.62.32.206) port 80 (#0)
> OPTIONS /maps/azuremap.geojson HTTP/1.1
> User-Agent: curl/7.31.0
> Host: azuremap.blob.core.windows.net
> Accept: */*
> Origin: http://example.com
> Access-Control-Request-Method: GET
> Access-Control-Request-Headers: X-Requested-With
>
< HTTP/1.1 200 OK
< Transfer-Encoding: chunked
* Server Blob Service Version 1.0 Microsoft-HTTPAPI/2.0 is not blacklisted
< Server: Blob Service Version 1.0 Microsoft-HTTPAPI/2.0
< x-ms-request-id: d4df8953-f8ae-441b-89fe-b69232579aa4
< Access-Control-Allow-Origin: http://example.com
< Access-Control-Allow-Methods: GET
< Access-Control-Allow-Headers: X-Requested-With
< Access-Control-Max-Age: 36000
< Access-Control-Allow-Credentials: true
< Date: Sun, 26 Jan 2014 16:02:25 GMT
<
* Connection #0 to host azuremap.blob.core.windows.net left intact

Resources

A new version of the Windows Azure Storage Emulator (v2.2.1) is now in Preview. This release has support for the "2013-08-15" version of Storage, which includes CORS (and JSON and other) support.
Overall description of Azure Storage’s CORS Support:
REST API doc (usually the canonical doc for any feature, though in code it is easily accessed with the Windows Azure SDK for .NET)
A couple of excellent posts from the community on CORS support in Windows Azure Storage:
