Warehousing Your Data in the Cloud with ETL

The process of taking data from different systems and putting it into a data warehouse for business analysis can be a complicated affair. In this article, we look at what is involved and how the cloud has made matters potentially trickier.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources.

DWH on the Cloud

The global data warehouse market, according to Market Research Media, is predicted to grow by 8.3% in 2017-2022, surpassing a total market value of $20bn by 2022. This figure covers data warehouse products, technologies and services.

In 2017-2022, says Market Research Media, the data warehouse market will be shaped by the emergence of data warehouses in the cloud or Data Warehouse as a Service (DWaaS; increasing demands for low latency and high speed analytics; and the commoditization of data warehouse software and hardware.

Companies that need to interrogate historical data from disparate sources commonly have to put it within a single data repository - a data warehouse – but they need a reliable system to be able to do this when the data is unstructured and from a multitude of locations.

They require the data to be analyzed in a “clean” and unified format from within the data warehouse, using business intelligence tools and analytics engines to get the business answers they want.

 

ETL Choices

The secret sauce to arrive at this point is ETL, which stands for extraction, transformation and loading, but the process isn't as straightforward as many think. ETL is a separate system outside of the data warehouse to get the data into a consolidated structured format ready for business use through the warehouse.

For instance, Microsoft SQL Server Integration Services (SSIS), Oracle Warehouse Builder (OWB), IBM Infosphere Information Server, SAP Data Services, PowerCenter Informatica, SAS Data Management, Elixir Repertoire for Data ETL and Data Migrator (IBI) are all platforms for building data integration solutions through ETL.

With such a good range of established ETL tools, many of them of course from the same companies that make the leading data warehouse platforms, data specialists were eventually becoming used to the way things had to be done, as businesses became ever more reliant on data to run themselves and find new sales and markets.

However, over the last few years, the advent of the cloud has somewhat stirred things up a bit, and companies now have to get their heads round marrying the needs of on-premise data warehouse solutions and those in the cloud.

 

The likes of Amazon Web Services (AWS), Google and Microsoft Azure are all scooping up big chunks of the data warehouse market with their cloud-based solutions. Their allure is that companies don't have to buy so much analytics hardware and software when compared to an on-premise data warehouse system – the solution is mainly accessed on an on-demand basis in the cloud. And also, because the services - like most cloud applications - are highly scalable in response to rapidly changing business needs.

However, with on-premise warehouses still making up the bigger share of the data warehouse market, linking traditional data processing and data analytics from the confines of the business to third party clouds is imperative as a first step to cloud migration for many.

Wholesale enterprise cloud migrations are huge projects and migrating data from a massive, complicated and constantly patched Oracle data warehouse, for instance, to a cloud service provider is a major challenge.

 

Cloud Migration Checklist

  • Address factors around total cost of ownership, data variety and data velocity
  • Migrate the BI platform, including existing queries, dashboards and reports
  • Migrate, rewrite or replace all custom in-house data applications
  • Train all employees on using and maintaining new cloud infrastructure-as-a-service systems
  • Migrate remaining ETL processes and retire the legacy data warehouse

 

Synchronization between on-premise data warehouses and cloud ones therefore may be a better solution for many enterprises, before any wholesale move to the cloud. Focusing on Oracle - seeing it is the market leader in many different ways - to be able to synchronize an on-premise Oracle data warehouse to the cloud, customers can make use of Oracle LogMiner.

LogMiner is an internal replication mechanism, but it may have to be used in conjunction with a third party replication tool, such as that from Alooma, to make sure that what is seen in the on-premise warehouse is also seen in the cloud.

 

Alooma’s tool reads directly from the Oracle LogMiner, which provides a view into the database change log: inserted, updated and deleted records, as well as schema changes. Reading from the LogMiner enables Alooma to continuously replicate all changes to the Oracle data, providing organizations with an exact replica on a cloud data warehouse.

Such a synchronization system enables firms to migrate all of their data applications one by one, with minimal interruptions of service to staff. It can be used for reliable, real-time synchronization with a cloud data warehouse like Amazon Redshift, Google BigQuery and many others.