Data lakes are fast becoming valuable tools for businesses that need to organize large volumes of highly diverse data from multiple sources. However, if you are not a data scientist, a data lake may seem more like an ocean that you are bound to drown in. Making a data lake manageable for everyone requires mindful designs that empower users with the appropriate tools.
A recent webcast conducted by TDWI and Oracle, entitled “How to Design a Data Lake with Business Impact in Mind,” identified the best use cases for a data lake and then described how to design one for an enterprise-level business. The presentation recommended keeping data-driven use cases at the forefront, making the data lake a central IT-managed function, blending old and new data, empowering self-service, and establishing a sponsor group to manage the company’s data lake plan with enough staffing and skills to keep it relevant.
“Businesses want to make more fact-based data but they also want to go deeper into the data they have with analytics,” says Philip Russom, a Senior Research Director for Data Management at TDWI. “We see data lakes as a good advantage for companies that want to do this as the data can be repurposed repeatedly for new analytics and use cases.”
Data lake usage is on the rise, according to TDWI surveys. A 2017 survey found that nearly a quarter of the businesses questioned (23 percent) already have a data lake in production, and another quarter (24 percent) expect to launch one within 12 months. Only 7 percent said they would not adopt a data lake. A significant number (21 percent) said they would establish a data lake within three years.
In the same survey, respondents were asked about the business benefit of deploying a Hadoop-based data lake. Nearly half (49 percent) rated advanced analytics, including data mining, statistics, and machine learning, as the primary use case, followed by data exploration and discovery. The third most common response was using the lake as a big data source for analytics.
Use cases for data lakes include investigating new data coming from sensors and machines, streaming, and human language text. More complex uses for data lakes include multiplatform data warehouse environments, omnichannel marketing, and digital supply chain.
The best argument for deploying and using a data lake is the ability to blend old and new data. This is especially helpful for departments like marketing, finance, and governance, which require insight from multiple sources, old and new. Russom noted that multi-module enterprise resource planning, Internet of Things (IoT), insurance claim workflow, and digital healthcare are all areas that could benefit from data lake deployments.
When it comes to design, Russom suggests the following:
- Create a plan, prioritize use cases, and update it as the business evolves
- Choose data platform(s) that support business requirements
- Get tools that work with platform and satisfy user requirements
- Augment your staff with consultants experienced with data lakes
- Train staff for Hadoop, analytics, lakes, and clouds
- Start with a business use case that a lake can address with ROI
Bruce Edwards, a Cloud Luminary and Information Management Specialist with Oracle, added that the convergence of cloud, big data, and data science has enabled the explosion of data lake deployments. Having a central vendor that not only understands large-scale data management but can also integrate existing infrastructures into core data lake components is essential.
“What data lake users need is an open, integrated, self-healing, high-performance tool,” Edwards said. “These elements are all needed to allow businesses to begin their data lake journey.”