Data Movement in the Big Data Space through Azure Data Factory
Azure Data Factory provides a globally deployed service to support data movement across a variety of data stores. It also has built-in support for securely moving data between on-premises locations and the cloud. The intent is to meet the data ingestion, movement, and publishing needs of your big data and advanced analytics scenarios.
Azure Data Factory connects to the following data stores:
- Azure Blob
- Azure Table
- Azure SQL Database
- Azure DocumentDB
- SQL Server on IaaS
- On-premises SQL Server
- On-premises File System
- On-premises Oracle Database
- On-premises MySQL Database
- On-premises DB2 Database
- On-premises Teradata Database
- On-premises Sybase Database
- On-premises PostgreSQL Database
- On-premises HDFS
- Generic OData data store
- Generic ODBC data store
- Generic Web data store
We plan to keep adding connectivity to many more data stores at a rapid pace. In the interim, you can use the .NET Activity to run your own code in Azure Data Factory and connect to a data store of your choice.
Data movement in Azure Data Factory is surfaced through the Copy activity, which copies data from one data store to another. Copying runs in batches, at the frequency and on the schedule you define. The activity relies on the globally deployed service footprint to copy data efficiently. As a managed service, it safeguards against transient issues across a variety of data sources and ensures that data is moved securely.
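To make the idea of a scheduled, batch copy between two stores concrete, here is a minimal sketch that builds a Copy activity definition as a Python dictionary and prints it as JSON. The property names (`typeProperties`, `scheduler`, `BlobSource`, `SqlSink`) follow the JSON shape recalled from the Data Factory documentation and should be treated as assumptions to verify against the current schema. The activity and dataset names are hypothetical.

```python
import json


def build_copy_activity(name, input_dataset, output_dataset,
                        source_type, sink_type,
                        frequency="Hour", interval=1):
    """Return a dict describing one copy step.

    The layout is an assumed, illustrative shape, not an authoritative
    schema. All names passed in are placeholders chosen by the caller.
    """
    return {
        "name": name,
        "type": "Copy",
        # One source dataset and one sink dataset per copy step.
        "inputs": [{"name": input_dataset}],
        "outputs": [{"name": output_dataset}],
        "typeProperties": {
            "source": {"type": source_type},
            "sink": {"type": sink_type},
        },
        # Batch cadence: run once per `interval` units of `frequency`.
        "scheduler": {"frequency": frequency, "interval": interval},
    }


# Hypothetical names, used only for illustration.
activity = build_copy_activity(
    name="CopyBlobToSql",
    input_dataset="InputBlobDataset",
    output_dataset="OutputSqlDataset",
    source_type="BlobSource",
    sink_type="SqlSink",
)

print(json.dumps(activity, indent=2))
```

Keeping the definition as plain data makes it easy to review the source, the sink, and the schedule in one place before handing it to whatever deployment tooling you use.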
When data is moved to or from an on-premises data store, a data management gateway is used. The data management gateway is an agent that you install on-premises (behind a firewall) to enable hybrid data pipelines. It manages access to on-premises data securely and enables seamless data movement between on-premises and cloud data stores.
Other notable capabilities include:
- Data to be moved can be structured, semi-structured, or unstructured
- For file-based data stores:
  - A variety of file formats, such as binary, text (CSV/TSV), and Avro, are supported
  - An encoding such as UTF-8, UTF-16, or gb2312 can be selected for the text format
  - Three compression codecs (GZip, Deflate, and BZip2) can be used to compress data if needed; the source and sink can use different compression algorithms
- Source columns can be skipped or mapped to specific sink columns during data movement
- Type conversions: Different data stores have different native type systems. The Copy activity performs automatic type conversions from source types to sink types. First, it converts the native source type to the corresponding .NET type. Then, it converts the .NET type to the corresponding native sink type. The mapping from a given native type system to .NET is described in the connector article for each data store. You can use these mappings to choose appropriate types when creating your tables, so that the right conversions are performed during data movement.
- When populating select relational stores, stored procedures can be invoked to run custom logic during data movement, for example to insert data into multiple tables at once or to overwrite or upsert rows
- Repeatability mechanisms ensure that re-running a copy activity does not produce redundant or incorrect data
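The column mapping and two-step type conversion described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: Python types stand in for the intermediate .NET types, and the type names in both lookup tables, the `copy_row` helper, and the sample row are all hypothetical.

```python
from datetime import date
from decimal import Decimal

# Step 1 (assumed table): native source type name -> converter that
# produces an interim value. Python types stand in for .NET types here.
SOURCE_TO_INTERIM = {
    "csv_int": int,
    "csv_decimal": Decimal,
    "csv_date": date.fromisoformat,
    "csv_text": str,
}

# Step 2 (assumed table): interim type -> native sink type name.
INTERIM_TO_SINK = {
    int: "bigint",
    Decimal: "decimal",
    date: "date",
    str: "nvarchar",
}


def copy_row(row, source_types, column_mapping):
    """Map and convert one source row.

    row:            {source_column: raw string value}
    source_types:   {source_column: native source type name}
    column_mapping: {source_column: sink_column}; source columns that
                    are absent from the mapping are skipped
    Returns {sink_column: (sink_type_name, converted_value)}.
    """
    result = {}
    for source_col, sink_col in column_mapping.items():
        to_interim = SOURCE_TO_INTERIM[source_types[source_col]]
        interim_value = to_interim(row[source_col])          # hop 1
        sink_type = INTERIM_TO_SINK[type(interim_value)]     # hop 2
        result[sink_col] = (sink_type, interim_value)
    return result


# Hypothetical sample data; the "note" column is deliberately unmapped.
row = {"id": "42", "amount": "19.90", "sold_on": "2016-03-01", "note": "n/a"}
source_types = {
    "id": "csv_int",
    "amount": "csv_decimal",
    "sold_on": "csv_date",
    "note": "csv_text",
}
column_mapping = {"id": "OrderId", "amount": "Total", "sold_on": "OrderDate"}

converted = copy_row(row, source_types, column_mapping)
for sink_col, (sink_type, value) in converted.items():
    print(sink_col, sink_type, value)
```

Routing every conversion through one intermediate type system means each store needs only two lookup tables (to and from the intermediate types) rather than one table per pair of stores.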
The net result is reliable, efficient, feature-rich, and cost-effective data movement via Azure Data Factory.
You can use this to:
- Enable hybrid data movement between on-premises and cloud environments, in either direction
- Load a data lake
- Load a data warehouse
- Lift and shift an on-premises data analysis solution to the cloud
Source: Microsoft Azure News