The data lake may be a fashionable concept, but it is also one born of necessity. The traditional data warehouse approach cannot handle the vast quantities of data now being produced. Even within a single organization, data from web server traffic and visitor logs, ERP transactions, and other enterprise systems often already exceeds the capacity of any existing Extract, Transform, Load (ETL) solution.
So, rather than trying to transform all data before loading it (as a data warehouse does), the principle of the data lake is to reverse these two steps. After extraction, data is loaded into the data lake, where it stays, awaiting transformation – perhaps. A data lake is therefore ELT (Extract, Load, Transform) rather than ETL.
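The ETL-versus-ELT inversion can be sketched in a few lines of Python. The extract, transform, and load functions below are illustrative stand-ins, not any particular tool's API:

```python
# Sketch contrasting ETL (data warehouse) with ELT (data lake).
# All functions and sample records here are hypothetical.

def extract(source):
    """Pull raw records from a source system (illustrative data)."""
    return [{"user": "alice", "page": "/home", "ts": "2024-01-01T10:00:00"},
            {"user": "bob", "page": "/cart", "ts": "2024-01-01T10:05:00"}]

def transform(records):
    """Apply a schema: keep only the fields the warehouse model needs."""
    return [{"user": r["user"], "page": r["page"]} for r in records]

def load(store, records):
    store.extend(records)

# ETL: transform first, so only modeled data ever reaches the warehouse.
warehouse = []
load(warehouse, transform(extract("weblogs")))

# ELT: load raw data into the lake as-is; transform later, on demand.
lake = []
load(lake, extract("weblogs"))
modeled = transform(lake)  # the deferred "T", applied only when needed
```

Note that in the ELT path the lake still holds the original `ts` field that the warehouse model dropped – nothing is lost by deferring the transformation.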
The Good and the Bad of the Data Lake
Inverting the transform and load steps has its pros and cons, rooted in the big data thinking that gave rise to the data lake. In this thinking, all data has potential value and should be kept, whereas data warehousing often discards any data that is not immediately useful. Additional metadata may also be applied to stored data, for example to record how individual data elements may be shared outside the organization, or to automate changes to the metadata after it has been collected.
Advantages of a data lake include:
- Coping with the 3Vs of big data generation – velocity, variety and volume
- Storage of data in its native format, with metatags
- Schemas and transformations are only applied when queries are made by other users or systems (“schema on read”)
- Users and apps can interpret the data as they choose
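The "schema on read" advantage above – raw records sit in the lake untyped, and each consumer applies its own schema only at query time – can be sketched as follows. The record layout and schema format are illustrative assumptions:

```python
import json

# Schema-on-read sketch: the lake stores raw JSON lines; a schema is
# applied only when a consumer reads the data. Illustrative only.

raw_lake = [
    '{"event": "view", "amount": null, "ts": "2024-01-01"}',
    '{"event": "purchase", "amount": "19.99", "ts": "2024-01-02"}',
]

def read_with_schema(raw, schema):
    """Parse each record, coercing fields per the caller's own schema."""
    out = []
    for line in raw:
        rec = json.loads(line)
        out.append({field: cast(rec.get(field)) for field, cast in schema.items()})
    return out

# A billing app chooses to read amounts as floats, mapping nulls to 0.0.
billing = read_with_schema(
    raw_lake,
    {"event": str, "amount": lambda v: float(v) if v else 0.0},
)
```

A different consumer could read the same raw lines with a different schema – which is both the flexibility promised above and, as the next list notes, a source of conflicting interpretations.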
Potential disadvantages may be:
- Indiscriminate data hoarding, leading to stale data
- Different user/app interpretations of data may conflict
- If metatags are missing or inaccurate, it will be more difficult to find specific data
- Without initial checks, corrupt data may be ingested and used, before the problem is recognized
The problems of stale and corrupt data can turn a data lake into a “data swamp.” Proper curation of data can prevent this from happening, although this means additional effort.
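A minimal form of the curation described above is to validate records at ingestion and quarantine anything corrupt, rather than letting it poison downstream queries. The required-field check below is an assumed example of such a rule:

```python
import json

# Curation sketch: validate on ingestion, quarantine corrupt records.
# The "ts" required-field rule is an illustrative assumption.

def ingest(lines, lake, quarantine):
    for line in lines:
        try:
            rec = json.loads(line)
            if "ts" not in rec:
                raise ValueError("missing timestamp")
            lake.append(rec)
        except (json.JSONDecodeError, ValueError):
            quarantine.append(line)  # kept for inspection, not for queries

lake, quarantine = [], []
ingest(['{"ts": "2024-01-01", "v": 1}', '{broken', '{"v": 2}'],
       lake, quarantine)
```

Only the valid record lands in the lake; the malformed and incomplete lines go to quarantine, which is the extra effort that keeps the lake from becoming a swamp.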
Data Lakes and Data Warehouses Working Together
The data lake operates on a bottom-up basis, ingesting all data regardless of source or requirement, and storing the data without any schema definition. By comparison, the data warehouse is designed top-down, starting with business requirements, defining data models and setting up the data cleansing and transformation mechanisms to load only qualified datasets.
Yet data lakes will not replace data warehouses or data marts, the subsets of data warehouses dedicated to individual project or departmental needs. In fact, the best situation will often be one in which a data lake and a data warehouse work together. The data warehouse then gets its raw data from the data lake, before transforming and loading it to run analytics and/or produce data marts.
The idea put forward by James Dixon, Pentaho CTO, is that the data lake is like a body of water in its natural state, into which streams (of data from source systems) flow. The data mart is then like a bottle of water that is “cleansed, packaged, and structured for easy consumption”, having been produced by a bottling plant (the data warehouse). Other users can take samples from the data lake too or dive in to see what lies below the surface, which are in turn metaphors for interactive queries, real-time analytics, machine learning and more.
The Cloud as the Data Lake Location
To manage the 3Vs of big data, organizations want storage resources that are suitably extensible. However, they often want to avoid massive upfront investment. The cloud is a natural solution to accomplish this. Amazon (AWS) and Microsoft (Azure) both offer a platform and components to make data lakes feasible and secure with identity and access management (IAM), together with facilities for analytics on or alongside the cloud platform.
An organization with a federated IT architecture can then make its cloud data lake available to its different IT entities and lines of business. Depending on the cloud platform, chargeback functionality is also available to distribute costs per usage and consumption. The entities would run their own data warehouse/mart operations, either locally or in the cloud, extracting data from the data lake as required.
There are two conditions for this to work well. The first is curation, to ensure that data stored in the data lake is not corrupt. The second is to make data discoverable and usable (using metatags, for example). Responsibility for meeting these conditions may be assigned to a chief data officer with oversight of all the data resources of the organization. In this way, a cloud-based data lake can offer the advantages of cloud computing to the entities in the federated architecture, while preserving their autonomy, flexibility, and agility.
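The discoverability condition amounts to keeping a metadata catalog over the lake: each stored object carries tags (source, owner, sharing policy) that consumers can search. The catalog structure and paths below are illustrative assumptions, not any vendor's API:

```python
# Hedged sketch of a metadata catalog for a data lake. The paths and
# tag names are hypothetical examples.

catalog = {}

def register(path, tags):
    """Record metatags for an object stored in the lake."""
    catalog[path] = tags

def find(**criteria):
    """Return the paths whose tags match every given criterion."""
    return [p for p, t in catalog.items()
            if all(t.get(k) == v for k, v in criteria.items())]

register("s3://lake/weblogs/2024/01.json",
         {"source": "webserver", "owner": "marketing", "shareable": False})
register("s3://lake/erp/orders.parquet",
         {"source": "erp", "owner": "finance", "shareable": True})

# Only datasets tagged as shareable may leave the organization.
shareable = find(shareable=True)
```

Without such tags, the earlier warning applies: data in the lake becomes hard to find and the lake drifts toward a swamp.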