Aspects to evaluate when selecting an ETL tool

When addressing a business intelligence project, it is important to properly assess the ETL tool that we will use. It is the tool with which we will implement the loading processes of the data mart, data warehouse or whatever storage structure we will later exploit the data from, and it is a cornerstone for the design, construction and subsequent evolution of our BI system. We will analyze technical issues only, without going into economic or other aspects (licenses, agreements, technical support, tool changes, etc.). Note also that ETL processes are closely linked to data profiling and data quality processes, which we will not consider here.

Briefly, as a starting point, note that ETL is an acronym for Extract, Transform and Load. Based on these three basic steps of the design process, we will propose issues to consider in the selection of a tool of this type. Later we will look at issues relating to its operation.

Design processes

Extraction

The first step is extracting the data from the source systems, which may be numerous and in multiple formats. At this stage it is important to properly validate the source data and homogenize it. It is also important to consider the volume of data we will be handling and to perform this step without impacting the source systems (batch processes). We consider in this step both the extraction and the validation of the data read. With regard to this stage, an ETL tool should cover the following aspects:

  • Connectivity: the ability to connect to and map multiple sources and different DBMSs, different types of models (relational, non-relational), file types and formats (host, XML, Excel, etc.), types of source systems (ERP, SCM, proprietary systems), standard message formats (e.g. SWIFT), messaging queues, etc.

  • Facilities for analyzing and mapping the formats or data models of the sources.

  • Support for multiple refresh rates. Data sources have different origins and different update cycles, and the reception frequency may vary from one source to another according to updating needs. The ability to include checkpoints between sources to ensure synchronization is very important.

  • Ability to incorporate unstructured data formats (office documents, emails, websites, logs, etc.). The field of unstructured data is huge and constantly evolving, which makes it difficult to evaluate.

  • Management of technical validations (nulls, unique values, referential integrity, mandatory fields, ranges of values, etc.) and of business validations: rules that check the consistency of the data against business requirements. A sketch of such checks follows this list.

  • Error handling: detecting errors and triggering the appropriate actions.

  • If the tool generates code from the metadata, check whether reverse engineering is possible, from the code back to the metadata.
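
As an illustration of these technical validations, here is a minimal sketch in plain Python; the column names (customer_id, amount) and the rules are hypothetical, and in practice the checks would be configured in the tool itself. Valid rows continue through the flow, while rejected rows are routed to an error channel for review.

    # Minimal sketch of technical validations on extracted rows
    # (hypothetical column names and rules).
    def validate_rows(rows):
        seen_ids = set()
        valid, rejected = [], []
        for row in rows:
            errors = []
            # Mandatory field / null check
            if not row.get("customer_id"):
                errors.append("customer_id is mandatory")
            # Uniqueness check on the business key
            elif row["customer_id"] in seen_ids:
                errors.append("duplicate customer_id")
            # Range check on a numeric column
            if not (0 <= row.get("amount", -1) <= 1_000_000):
                errors.append("amount out of range")
            if errors:
                rejected.append({"row": row, "errors": errors})
            else:
                seen_ids.add(row["customer_id"])
                valid.append(row)
        return valid, rejected

    valid, rejected = validate_rows([
        {"customer_id": "C1", "amount": 120.5},
        {"customer_id": "C1", "amount": 80.0},   # duplicate key
        {"customer_id": None, "amount": 50.0},   # missing mandatory field
    ])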

 

Transformation

The transformation phase consists of applying transformation steps or deriving new data from the data received, applying the business rules of each system. There are steps common to most transformation processes, such as selecting the columns to process, recoding (columns coded differently in the source system than in the target system), calculating new columns from the received ones, aggregations, crossing data between different sources, etc.; a minimal sketch of some of these steps appears after the list below. Aspects to consider at this stage:

  • Predefined components provided by the tool for the most widespread kinds of transformations (filters, crosses, mappings, etc.).

  • Support for testing and debugging, to facilitate the tracking of changes in the processes.

  • Possibility of incorporating external code (not generated by the tool).

  • Possibility of creating our own procedures and reusing them in other processes (code or process reuse).

  • Options for optimizing the processing of large or costly data sets in the face of performance problems.

  • Data traceability: keeping track of all the transformations a value undergoes throughout the process. Ideally this metadata should be searchable by the user, so that a value can be traced when data is detected that loaded correctly but is inconsistent from the business point of view.
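
As mentioned above, the following is a minimal sketch of some of these common steps (recoding, a calculated column, a cross between two sources and an aggregation) in plain Python with hypothetical data and codes; in a real project these steps would normally be built with the tool's predefined components.

    # Minimal sketch of common transformation steps (hypothetical data):
    # recoding, a calculated column, a cross between sources, an aggregation.
    from collections import defaultdict

    sales = [  # source 1: sales extracted from the operational system
        {"country": "ES", "product_id": "P1", "units": 3, "unit_price": 10.0},
        {"country": "UK", "product_id": "P2", "units": 1, "unit_price": 25.0},
    ]
    products = {"P1": "Widget", "P2": "Gadget"}              # source 2: product master
    country_codes = {"ES": "SPAIN", "UK": "UNITED KINGDOM"}  # target system coding

    totals = defaultdict(float)
    for row in sales:
        row["country"] = country_codes[row["country"]]     # recoding
        row["amount"] = row["units"] * row["unit_price"]   # calculated column
        row["product_name"] = products[row["product_id"]]  # cross with source 2
        totals[row["country"]] += row["amount"]            # aggregation by country

    print(dict(totals))  # {'SPAIN': 30.0, 'UNITED KINGDOM': 25.0}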

 

Load

This last step consists of dumping the data, already read, validated and transformed, into the analytical environment for further exploitation.

  • Consider the different types of target tables (usually database storage objects).

  • Consider the different load types: delete and full dump (full reload), incremental loads, etc. A sketch of both strategies follows this list.

  • Creation of aggregation processes, considering different levels of aggregation.

  • Dumps into target tables with different levels of aggregation and data organization (cubes, multidimensional databases).

  • At this stage performance is equally important: parallel dump processing to optimize load times.
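
As mentioned above, here is a minimal sketch of the two basic load strategies (delete and full dump versus incremental load), using SQLite purely as a stand-in for the target database; the table and column names are hypothetical, and the upsert syntax assumes a reasonably recent SQLite version.

    # Minimal sketch of two load strategies into a target table
    # (hypothetical table and columns; SQLite stands in for the target DB).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

    def full_load(rows):
        # Delete-and-full-dump: wipe the target table and reload everything.
        conn.execute("DELETE FROM fact_sales")
        conn.executemany("INSERT INTO fact_sales (sale_id, amount) VALUES (?, ?)", rows)
        conn.commit()

    def incremental_load(rows):
        # Incremental load: insert new rows, update those already present (upsert).
        conn.executemany(
            "INSERT INTO fact_sales (sale_id, amount) VALUES (?, ?) "
            "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
            rows,
        )
        conn.commit()

    full_load([(1, 100.0), (2, 200.0)])
    incremental_load([(2, 250.0), (3, 300.0)])   # updates sale 2, inserts sale 3
    print(conn.execute("SELECT * FROM fact_sales ORDER BY sale_id").fetchall())
    # [(1, 100.0), (2, 250.0), (3, 300.0)]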

Having analyzed the relevant aspects of the design process, we now review the aspects to consider in the operation, maintenance and evolution of the tool.

Operation and maintenance of the tool

  • Adaptation to different hardware platforms and operating systems.

  • Ease of use. Intuitive interface. Simple graphical representation of the repository objects, data models and data flows.

  • Possibility of performing impact analysis: seeing the implications that a change in the data model or in a process has on the overall project.

  • Code versioning. Change control. Management of team work.

  • Documentation capabilities.

  • Speed of querying and updating the metadata, and options for optimizing it (indexes).

  • Options for scheduling jobs: batch planning, event-based execution, triggers, etc.

  • Management of execution chains. Dependencies between processes. Job restarting. A sketch of this, together with execution logging, follows this list.

  • Logging systems. Detailed execution and error logs, and collection of execution statistics (load summary logs: load time, records loaded, incorrect records, etc.).

  • Tools to monitor and evaluate the performance of jobs (resource analyzer).

  • Compatibility with CWM (Common Warehouse Metamodel). Ability to export or import metadata to and from other ETL tools.

  • Continuous updating: growing data volumes can mean that batches of data processed once a day have to be processed in micro-batches (several per day), or even require integration with message queues or change data capture (CDC) in real time for continuous processing and updating.

  • Integration with the other components of the analytical platform: with the remaining data integration components (data profiling, data quality) and with the components oriented to exploitation, analysis and presentation (reporting layer, dashboards, data mining, etc.).

  • Ability to adapt to standards that allow easy exchange of data (e.g. SOA).

  • Interoperability with other tools (e.g. via an API).

  • Security issues inherent in any tool.
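
As announced above, here is a minimal sketch of chained execution with dependency handling and an execution log, in plain Python with hypothetical step names and record counts; a real tool would provide this through its scheduler and log repository. Each step runs only if the steps it depends on succeeded, and a per-step summary (status, duration, processed records) is collected as the basis of a load log.

    # Minimal sketch of an execution chain with dependencies and a load log
    # (hypothetical steps and record counts).
    import time

    def extract():   return 1000   # pretend record counts
    def transform(): return 990
    def load():      return 990

    chain = [("extract", extract, []),
             ("transform", transform, ["extract"]),
             ("load", load, ["transform"])]

    log, status = [], {}
    for name, step, deps in chain:
        if any(status.get(d) != "ok" for d in deps):
            status[name] = "skipped"              # a dependency failed: do not run
            log.append({"step": name, "status": "skipped"})
            continue
        start = time.time()
        try:
            records = step()
            status[name] = "ok"
            log.append({"step": name, "status": "ok", "records": records,
                        "seconds": round(time.time() - start, 3)})
        except Exception as exc:                  # detailed error logging
            status[name] = "error"
            log.append({"step": name, "status": "error", "error": str(exc)})

    for entry in log:
        print(entry)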

Of course, this brief summary covers general aspects and should be complemented by the particular problems of each system, its technological infrastructure, business requirements and information needs.