Happy New Year, and back to work.

The Data Warehouse is the foundation of any analytics initiative. You take data from various data sources in the organization, clean and pre-process it to fit business needs, and then load it into the data warehouse for everyone to use. This process is called ETL which stands for ‘Extract, transform, and load’.

While the extraction (from files and/or databases) and load (into the DW) are pretty straightforward, the real juice of ETL is in the data transformations where the data is converted into a more business usable format. Typical Transformation tasks involve:
* Choosing the specific columns out of the available data
* Relating data from multiple sources
* Translating and/or encoding values (e.g. storing a CityID in the DW instead of the names to conserve storage and ensure integrity)
* Encoding free-form values from legacy systems (there can be a lot of valuable information in some of those memo fields)
* Deriving a new calculated values and Summarizing multiple rows of data

You can either write a specific processor (Python is a good choice as it aids rapid prototyping, is quite readable, and is easier to maintain) or can use an ETL Tool, a software application designed to aid the process. While ETL tools provide ease of use and rapid deployment, they often assume that the source data is cleaner, better organized, and less quirkier than it usually is. If there are a lot of quirks in the data, you might be better off writing one on your own. Some ETL tools provide a plugin architecture to handle this better.

Some of the ETL tools from the open source domain are listed below:
1) Kettle
2) Octopus
3) Clover ETL

And a good blog post listing numerous Java based options is http://www.manageability.org/blog/stuff/open-source-etl/view.

Please do let me know if I have missed some significant ones out. If you have used any of these, we’ll all appreciate if you can comment upon your experiences.