How a Unified Approach Supports Your Data Strategy

Are you finding it difficult to explore and analyze data spread across on-premises systems and the cloud? You're not alone, and there is a solution.

It's rare for a company to store 100 percent of its data in one place, or to keep 100 percent of it in the cloud. Most companies must combine datasets. By establishing a unified data tier, it becomes easier to perform certain types of analytics, especially when the data is widely distributed.

Never miss an update about big data! Subscribe to the Big Data Blog to receive the latest posts straight to your inbox!

Take, for example, a bike-share system that looked at its publicly available ridership data, then added weather data to predict ridership and made changes to ensure bikes were available when and where riders needed them. If that data were stored in different geographic regions and in different storage systems, it would be much harder to combine the information and make an informed decision.

So how can companies take advantage of data, whether it's located in Oracle Autonomous Data Warehouse, Oracle Database, object store, or Hadoop? A recent Oracle webcast titled "Explore, Access, and Integrate Any Data, Anywhere" explored this issue. Host Peter Jeffcock outlined four new services Oracle released in February 2020 that let companies dive right in to solve these real-world problems, manage data, and enable augmented analytics.

The idea is that there needs to be a unified data tier, starting with workload portability: your data and the data environment can be managed in the public cloud, on a local cloud, or in your on-premises data store.

Unified Data Tier

The next step is to develop a converged database, ideally with an autonomous component so that repeatable processes free up administrative time and reduce human error. Oracle Database allows for multiple data models, multiple workloads, and multiple tenants, making it easier to operate because all these processes are managed within a single database.

You can take it one step further if you add the cloud to the configuration. Oracle can manage the data and apply different processes and machine learning so that you can run your database autonomously in the cloud.

Unified Data Tier

The unified data tier also means taking advantage of multiple data stores such as data lakes and other databases. And finally, it means expanding that ecosystem with partners, as with our recent agreement with Microsoft that allows for a unified data tier between Oracle Cloud and Microsoft Azure.

“If you want to run an application in the Microsoft Cloud and you want to connect to the Oracle Cloud where the data is stored, that’s now supported. It’s a unique relationship and it’s something to look into if you want to run a multi-cloud strategy,” Jeffcock says.

You can experience the full presentation if you register for the on-demand webcast.

To learn more about how to get started with data lakes, check out Oracle Big Data Service—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox. Also, follow us on Twitter @OracleBigData.

Related:

Build Your Data Lake with Oracle Big Data Service

In today's world, there's an ever-growing deluge of highly diverse data coming from an equally diverse set of sources. In the struggle to manage and organize that data, practitioners are finding it harder when their only options are the traditional relational database or data warehouse.

That’s why the data lake has become increasingly popular as a complement to traditional data management. Think of the traditional data warehouse as a reservoir—it’s cleansed, drinkable.

The data lake, on the other hand, holds data of potentially unknown value. That data isn't necessarily cleansed, which is why it's more of an adventure. The data lake can be voluminous, brimming with data and possibilities. Users can easily load even more data and start experimenting to find new insights that the organization couldn't discover before.

Organizations must be able to:

• Store their data in a way that is less complicated

• Reduce management even though the data is more complex

• Use data in a way that makes sense for them

And that’s exactly why Oracle has created Oracle Big Data Service as a way to help build data lakes.

Oracle Big Data Service is an automated service based on Cloudera Enterprise that provides a cost-effective Hadoop data lake environment—a secure place to store and analyze data of different types from any source. It can be used as a data lake or a machine learning platform.

It comes with a fully integrated stack that includes both open-source and Oracle value-added tools, and it’s designed for enterprises that need flexible deployment options, scalability, and the ability to add tools of their choosing.

Oracle Big Data Service also provides:

  • An easy way to expand from on premises to Oracle Cloud
  • Secure, reliable, and elastic Hadoop clusters in minutes
  • Native integration with Oracle Cloud platform services

Oracle + Hadoop = A Better Data Lake Together

We wanted to make the power of Hadoop and the entire Hadoop ecosystem available to you. But Hadoop can be complicated, which is why we’ve combined the best of what Oracle and Cloudera have to offer and made it into something easier to handle—which makes building and managing your data lake easier than ever.

With Cloudera Enterprise Deployment, our service is vertically integrated for Hadoop, Kafka, and Spark with a best-practices, high-availability deployment.

With Big Data Service, you get:

  • Highly secure, highly available clusters provisioned in minutes
  • Ability to expand on-premises Hadoop, which enables you to deploy, test, develop, and/or move data lakes to the cloud
  • Flexibility to scale as you wish using high-performance bare metal or cost-effective virtual machine shapes
  • Automatically deployed security and management features

You also can choose your Cloudera version, giving you the ability to:

  • Match your current deployment—which is important for test and dev environments
  • Deploy new versions—allowing you to take advantage of the distribution’s latest features

Oracle Big Data Service Features

We built Oracle Big Data Service to be your go-to big data and data lake solution, one that’s specifically designed for a diverse set of big data use cases and workloads. From short-lived clusters used to tackle specific tasks to long-lived clusters that manage large data lakes, Oracle Big Data Service scales to meet an organization’s requirements at a low cost and with the highest levels of security.

Let’s explore just how Oracle Big Data Service does this.

  1. Oracle Big Data Service and Oracle Cloud SQL

Use Oracle SQL to query across big data sources with Oracle Cloud SQL, including the Hadoop Distributed File System (HDFS), Hive, object stores, Kafka, and NoSQL.

You can accomplish all of this with simple administration, because Oracle Cloud SQL uses existing Hive metadata and security, and offers fast, scale-out processing using Oracle Cloud SQL compute.
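
To make that concrete, here's a minimal sketch of what such a cross-source query might look like from Python, using the python-oracledb driver. The external table over object storage (bike_trips_ext), the warehouse table (weather), and the connection details are hypothetical placeholders, not objects Cloud SQL creates for you.

```python
# Minimal sketch: joining an external table that Oracle Cloud SQL exposes over
# object-store data with a regular warehouse table, using the python-oracledb
# driver. Table names, columns, and connection details are hypothetical.
import oracledb

conn = oracledb.connect(user="analytics", password="example-password",
                        dsn="dbhost.example.com/ORCLPDB1")

sql = """
    SELECT w.conditions, COUNT(*) AS trips
    FROM   bike_trips_ext t            -- external table over object storage
    JOIN   weather w                   -- ordinary warehouse table
      ON   TRUNC(t.start_time) = w.obs_date
    GROUP  BY w.conditions
    ORDER  BY trips DESC
"""

with conn.cursor() as cur:
    cur.execute(sql)
    for conditions, trips in cur:
        print(conditions, trips)

conn.close()
```

Once the external sources are exposed as tables, they join with warehouse data using ordinary Oracle SQL.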

  2. Oracle Big Data Service and Big Data Analytics

What use is managing and accessing your data if you can’t run analytics to find real results? We offer support in the areas of machine learning, spatial analysis, and graph analysis to help you get the information your organization needs to gain better business results and improved metrics. Oracle Big Data Service customers are licensed for these options and can deploy at no extra cost.

It’s also easy to connect to Oracle Cloud services such as Oracle Analytics Cloud, Oracle Cloud Infrastructure Data Science, or Oracle Autonomous Database. Or you can use any Cloudera-certified application for a wide range of analytic tools and applications.

  3. Oracle Big Data Service and Workload Portability

Cloud may be the future of enterprise computing, which is why we’ve built the newest, best cloud infrastructure out there with Oracle Cloud Infrastructure. But it’s not everything—at least, not yet. You still need to maintain a mix of public cloud, local cloud, and traditional on-premises computing for the foreseeable future.

With Oracle Big Data Service, deploy where it makes sense. With Oracle, if you develop something on premises, it’s easy to move that to the cloud and vice versa.

  4. Oracle Big Data Service and Secure, High-Availability Clusters

With Oracle Big Data Service, expect easy deployment when creating your clusters. Specify minimal settings to create the cluster, then use just one click to create a cluster with highly available Hadoop services.

You also get a choice of Cloudera versions, enabling "Cloud Also" deployments that match your on-premises versions for compatibility, or newer versions that take advantage of the latest features.

  5. Oracle Big Data Service Offers Security

With off-box virtualization, Oracle can't see customer data and customers can't see Oracle management code. In most first-generation clouds, the network and tenant environments are coupled and abstracted only by the hypervisor.

Oracle follows a Least Trust Design principle. We don’t trust the hardware, the customer (think rogue employees), or the hypervisor. That’s why we’ve separated our network and tenant environments. Isolating that network virtualization helps prevent the spread and lateral movement of attacks.

In addition, with Oracle Big Data Service, all Cloudera security features are enabled with strong authentication, role-based authorization, auditing, and encryption.

  6. Oracle Big Data Service and the Compute and Storage You Want

Whether you're using Oracle Big Data Service for development, test, data science, or data lakes, we offer the compute options your use case needs: the flexibility of virtual machines (VMs) with block storage, or the unparalleled performance of bare metal with direct-attached NVMe (non-volatile memory express) storage.

  7. Oracle Big Data Service and Superior Networking

With Oracle Big Data Service, you can expect high-fidelity virtual networks and connectivity. Our networking is:

Customizable

  • Fully configurable IP addresses, subnets, routing, and firewalls to support new or existing private networks

High performance and consistent

  • High bandwidth, microsecond latency network
  • Private access without traversing the internet

Capable of connecting to corporate networks

  • FastConnect—dedicated, private connectivity
  • VPN Connect—simple and secure internet connectivity

  8. Oracle Big Data Service and Oracle's Data Management Platform

Your organization spends time and effort creating, acquiring, and storing data, and you want to be able to use it. With Oracle, you can reduce the time, cost, and effort of getting data from wherever it originates to all the places it's needed across the enterprise.

Oracle has spent decades building and expanding its data management platform.

With Oracle’s end-to-end data management, you get an easy connection to:

  • Oracle Autonomous Database
  • Oracle Analytics Cloud
  • Oracle Cloud Infrastructure Streaming
  • Oracle Cloud Infrastructure Data Catalog
  • Oracle Cloud Infrastructure Data Science
  • Oracle Cloud Infrastructure Data Flow
  • The list goes on …

And with unified query through Oracle Cloud SQL, you'll be able to correlate information from a variety of sources using Oracle SQL. In addition, you gain a host of Oracle analytic and connectivity options, including:

  • Oracle Machine Learning
  • Oracle Big Data Spatial and Graph
  • Oracle Big Data Connectors
  • Oracle Data Integrator Enterprise Edition

Oracle Big Data Service for All Your Data Lake Needs

From enabling machine learning to storing and analyzing data, Oracle Big Data Service is a scalable, secure data lake service that meets your requirements at low cost and with the highest levels of security.

It allows you to worry less about managing and storing data. And it empowers you to start analyzing your data in a way that makes the future of your organization more successful than ever before.

To learn more about how to get started with data lakes, check out Oracle Big Data Service—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Data Lakes: Examining the End to End Process

A good way to think of a data lake is as the ultimate hub for your organization. On the most basic level, it takes data in from various sources and makes it available for users to query. But much more goes on during the entire end-to-end process involving a data lake. To get a clearer understanding of how it all comes together, and a bird's-eye view of what it can do for your organization, let's look at each step in depth.

Step 1: Identify and connect sources

Unlike data warehouses, data lakes can take inputs from nearly any type of source. Structured, unstructured, and semi-structured data can all coexist in a data lake. The primary goal is to allow all of the data to exist in a single repository in its raw format. A data warehouse specializes in housing processed and prepared data for use, and while that is certainly helpful in many instances, it still leaves many types of data out of the equation. By unifying these disparate data sources into a single repository, a data lake gives users access to all types of data without the logistical legwork of connecting to individual data warehouses.

Step 2: Ingest data into zones

If a data lake is set up per best practices, then incoming data will not just get dumped into a single data swamp. Instead, since the data sources are known quantities, it is possible to establish landing zones for datasets from particular sources. For example, if you know that a dataset contains sensitive financial information, it can immediately go into a zone that limits access by user role and applies additional security measures. If the data comes in a set format ready for use by a certain user group (for example, the data scientists in HR), then it can immediately go into a zone defined for that group. And if another dataset delivers raw data with too little metadata to easily identify it at the database level (like a stream of images), then it can go into its own zone of raw data, essentially setting that group aside for further processing.

In general, it’s recommended that the following zones be used for incoming data. Establishing this zone sorting right away allows for the first broad strokes of organization to be completed without any manual intervention. There are still more steps to go to optimize discoverability and readiness, but this automates the first big step. Per our blog post 6 Ways To Improve Data Lake Security, these are the recommended zones to establish in a data lake:

  • Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.
  • Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.
  • Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.
  • Refined: Where enriched and manipulated data lives, often as final outputs from tools.
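
Here is a minimal sketch of that first, automated routing step in Python. The dataset descriptor fields and the zone paths are illustrative assumptions, not part of any particular Oracle API.

```python
# Minimal sketch of routing incoming datasets into landing zones based on
# what is known about their source. Descriptor fields and zone paths are
# illustrative assumptions, not a specific product's API.
ZONES = {
    "temporal": "lake/temporal/",
    "raw": "lake/raw/",
    "trusted": "lake/trusted/",
    "refined": "lake/refined/",
}

def landing_zone(descriptor: dict) -> str:
    """Pick a landing zone from what we know about an incoming dataset."""
    if descriptor.get("ephemeral"):      # streaming spools, temporary copies
        return ZONES["temporal"]
    if descriptor.get("validated"):      # vetted, ready-to-use sources
        return ZONES["trusted"]
    return ZONES["raw"]                  # everything else lands raw

# Example: a stream of images with minimal metadata lands in the raw zone.
print(landing_zone({"source": "door-camera", "validated": False}))  # lake/raw/
```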

Step 3: Apply security measures

Data arrives in a data lake completely raw. That means any inherent security risk in the source data comes along for the ride when it lands in the data lake. If there's a CSV file with fields containing sensitive data, it will remain that way until security steps have been applied. If step 2 has been established as an automated process, then the initial sorting will get you halfway to a secure configuration.

Other measures to consider include:

  • Clear user-based access defined by roles, needs, and organization.
  • Encryption based on a big-picture assessment of compatibility within your existing infrastructure.
  • Scrubbing the data for red flags, such as known malware or suspicious file names and formats (for example, an executable file living in a dataset that otherwise contains media files). Machine learning can significantly speed up this process.

Running all incoming data through a standardized security process ensures consistency among protocols and execution; if automation is involved, this also helps to maximize efficiency. The result? The highest levels of confidence that your data will go only to the users that should see it.
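
As a rough illustration of the scrubbing step above, here is a minimal sketch in Python that flags files whose format doesn't match what the dataset is expected to contain. The extension lists and the "expected type" notion are illustrative assumptions.

```python
# Minimal sketch of the scrubbing step: flag files whose format doesn't match
# what the dataset is expected to contain (for example, an executable hiding
# in a media dataset). Extension lists and the "expected" label are assumptions.
from pathlib import Path

MEDIA_EXTS = {".mp3", ".mp4", ".jpg", ".png", ".wav"}
SUSPECT_EXTS = {".exe", ".dll", ".bat", ".sh", ".js"}

def flag_suspicious(file_names, expected="media"):
    """Return files that don't belong in a dataset of the expected type."""
    flagged = []
    for name in file_names:
        ext = Path(name).suffix.lower()
        if expected == "media" and (ext in SUSPECT_EXTS or ext not in MEDIA_EXTS):
            flagged.append(name)   # hold for review before it reaches users
    return flagged

print(flag_suspicious(["ride01.mp3", "update.exe", "cover.jpg"]))  # ['update.exe']
```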

Step 4: Apply metadata

Once the data is secure, that means that it’s safe for users to access it—but how will they find it? Discoverability is only enabled when the data is properly organized and tagged with metadata. Unfortunately, since data lakes take in raw data, data can arrive with nothing but a filename, format, and time stamp. So what can you do with this?

A data catalog is a tool that can work with data lakes in a way that optimizes discovery. By enabling more metadata application, data can be organized and labeled in an accurate and effective way. In addition, if machine learning is utilized, the data catalog can begin recognizing patterns and habits to automatically label things. For example, let’s assume a data source is consistently sending MP3 files of various lengths—but the ones over twenty minutes are always given the metatag “podcast” after arriving in the data lake. Machine learning will pick up on that pattern and then start auto-tagging that group with “podcast” upon arrival.
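
To make the MP3 example concrete, here is a minimal sketch of the kind of rule a catalog could end up applying automatically. The metadata field names are assumptions; in practice the rule would be learned from past tagging behavior rather than hand-coded.

```python
# Minimal sketch of the auto-tagging rule described above: MP3 files longer
# than twenty minutes get the "podcast" tag on arrival. The metadata fields
# are assumptions; a real catalog would learn this rule from past labels.
def auto_tags(item):
    tags = list(item.get("tags", []))
    if item.get("format") == "mp3" and item.get("duration_min", 0) > 20:
        tags.append("podcast")
    return tags

print(auto_tags({"name": "episode_42.mp3", "format": "mp3", "duration_min": 47}))
# ['podcast']
print(auto_tags({"name": "jingle.mp3", "format": "mp3", "duration_min": 1}))
# []
```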

Given that the volume of big data keeps growing, and that more and more sources of unstructured data are entering data lakes, that type of pattern learning and automation can make a huge difference in efficiency.

Step 5: User discovery

Once data is sorted, it’s ready for users to discover. With all of those data sources consolidated into a single data lake, discovery is easier than ever before. If tools like analytics exist outside of the data lake’s infrastructure, then there’s only one export/import step that needs to take place for the data to be used. In a best-case scenario, those tools are integrated into the data lake, allowing for real-time queries against the absolute latest data, all without any manual intervention.

Why is this so important? A recent survey showed that, on average, five data sources are consulted before making a decision. Consider the inefficiency if each source has to be queried and called manually. Putting it all in a single accessible data lake and integrating tools for real-time data querying removes numerous steps so that discovery can be as easy as a few clicks.

The Hidden Benefits of a Data Lake

The above details break down the end-to-end process of a data lake—and the resulting benefits go beyond saving time and money. By opening up more data to users and removing numerous access and workflow hurdles, users have the flexibility to try new perspectives, experiment with data, and look for other results. All of this leads to previously impossible insights, which can drive an organization’s innovation in new and unpredictable ways.

To learn more about how to get started with data lakes, check out Oracle Big Data Service—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Four Tools to Integrate into Your Data Lake

A data lake is an absolutely vital piece of today’s big data business environment. A single company may have incoming data from a huge variety of sources, and having a means to handle all of that is essential. For example, your business might be compiling data from places as diverse as your social media feed, your app’s metrics, your internal HR tracking, your website analytics, and your marketing campaigns. A data lake can help you get your arms around all of that, funneling those sources into a single consolidated repository of raw data.

But what can you do with that data once it’s all been brought into a data lake? The truth is that putting everything into a large repository is only part of the equation. While it’s possible to pull data from there for further analysis, a data lake without any integrated tools remains functional but cumbersome, even clunky.

On the other hand, when a data lake integrates with the right tools, the entire user experience opens up. The result is streamlined access to data while minimizing errors during export and ingestion. In fact, integrated tools do more than just make things faster and easier. By expediting automation, the door opens to exciting new insights, allowing for new perspectives and new discoveries that can maximize the potential of your business.

To get there, you’ll need to put the right pieces in place. Here are four essential tools to integrate into your data lake experience.

Machine Learning

Even if your data sources are vetted, secured, and organized, the sheer volume of data makes it unruly. As a data lake tends to be a repository for raw data—which includes unstructured items such as MP3 files, video files, and emails, in addition to structured items such as form data—much of the incoming data across various sources can only be natively organized so far. While it can be easy to set up a known data source for, say, form data into a repository dedicated to the fields related to that format, other data (such as images) arrives with limited discoverability.

Machine learning can help accelerate the processing of this data. With machine learning, data is organized and made more accessible through various processes, including:

  • In processed datasets, machine learning can use historical data and results to identify patterns and insights ahead of time, flagging them for further examination and analysis.
  • With raw data, machine learning can analyze usage patterns and historical metadata assignments to begin implementing metadata automatically for faster discovery.

The latter point requires the use of a data catalog tool, which leads us to the next point.

Data Catalog

Simply put, a data catalog is a tool that integrates into any data repository for metadata management and assignment. Products like Oracle Cloud Infrastructure Data Catalog are a critical element of data processing. With a data catalog, raw data can be assigned technical, operational, and business metadata. These are defined as:

  • Technical metadata: Used in the storage and structure of the data in a database or system
  • Business metadata: Contributed by users as annotations or business context
  • Operational metadata: Created from the processing and accessing of data, which indicates data freshness and data usage, and connects everything together in a meaningful way

By implementing metadata, raw data can be made much more accessible. This accelerates organization, preparation, and discoverability for all users without any need to dig into the technical details of raw data within the data lake.
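
As a rough sketch, a single catalog entry carrying all three metadata layers might look like the following. The field names are illustrative assumptions, not the actual schema of Oracle Cloud Infrastructure Data Catalog.

```python
# Minimal sketch of a catalog entry carrying the three metadata layers above.
# Field names are illustrative, not the actual schema of Oracle Cloud
# Infrastructure Data Catalog.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    # Technical metadata: storage and structure
    path: str
    fmt: str
    columns: dict
    # Business metadata: user-contributed context
    description: str = ""
    business_tags: list = field(default_factory=list)
    # Operational metadata: processing and usage
    last_refreshed: str = ""
    access_count: int = 0

entry = CatalogEntry(
    path="lake/raw/door-camera/2020-02/",
    fmt="jpeg",
    columns={"capture_time": "timestamp"},
    description="Doorbell camera snapshots",
    business_tags=["iot", "security"],
    last_refreshed="2020-02-14T08:00:00Z",
)
print(entry.business_tags)  # ['iot', 'security']
```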

Integrated Analytics

A data lake acts as a middleman between data sources and tools, storing the data until it is called for by data scientists and business users. When analytics and other tools exist separate from the data lake, that adds further steps for additional preparation and formatting, exporting to CSV or other standardized formats, and then importing into the analytics platform. Sometimes, this also includes additional configuration once inside the analytics platform for usability. The cumulative effect of all these steps creates a drag on the overall analysis process, and while having all the data within the data lake is certainly a help, this lack of connectivity creates significant hurdles within a workflow.

Thus, the ideal way to allow all users within an organization to swiftly access data is to use analytics tools that seamlessly integrate with your data lake. Doing so removes unnecessary manual steps for data preparation and ingestion. This really comes into play when experimenting with variability in datasets; rather than having to pull a new dataset every time you experiment with different variables, integrated tools allow this to be done in real time (or near-real time). Not only does this make things easier, this flexibility opens the door to new levels of insight as it allows for previously unavailable experimentation.

Integrated Graph Analytics

In recent years, data analysts have started to take advantage of graph analytics, that is, a newer form of data analysis that creates insights based on relationships between data points. For those new to the concept, graph analytics considers individual data points similar to dots in a bubble—each data point is a dot, and graph analytics allows you to examine the relationship between data by identifying volume of related connections, proximity, strength of connection, and other factors.

This is a powerful tool that can be used for new types of analysis in datasets with the need to examine relationships between data points. Graph analytics often works with a graph database itself or through a separate graph analytics tool. As with traditional analytics, any sort of extra data exporting/ingesting can slow down the process or create data inaccuracies depending on the level of manual involvement. To get the most out of your data lake, integrating cutting-edge tools such as graph analytics means giving data scientists the means to produce insights as they see fit.
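
For a feel of what relationship-based analysis looks like in code, here is a minimal sketch using the open-source networkx library as an illustrative stand-in (not a specific Oracle graph tool): data points become nodes, shared attributes become edges, and centrality scores surface the most connected points.

```python
# Minimal sketch of relationship-based analysis with the open-source networkx
# library (an illustrative stand-in, not a specific Oracle graph tool).
# Customers become nodes, shared attributes become edges, and a centrality
# score surfaces the most connected point.
import networkx as nx

G = nx.Graph()
G.add_edge("customer_a", "customer_b", reason="same referral code")
G.add_edge("customer_b", "customer_c", reason="same shipping address")
G.add_edge("customer_b", "customer_d", reason="same device id")

# Degree centrality: which node sits at the center of the most relationships?
scores = nx.degree_centrality(G)
print(max(scores, key=scores.get))  # customer_b
```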

Why Oracle Big Data Service?

Oracle Big Data Service is a powerful Hadoop-based data lake solution that delivers all of the needs and capabilities required in a big data world:

  • Integration: Oracle Big Data Service is built on Oracle Cloud Infrastructure and integrates seamlessly into related services and features such as Oracle Analytics Cloud and Oracle Cloud Infrastructure Data Catalog.
  • Comprehensive software stack: Oracle Big Data Service comes with key big data software: Oracle Machine Learning for Spark, Oracle Spatial Analysis, Oracle Graph Analysis, and much more.
  • Provisioning: Oracle Big Data Service deploys a fully configured version of Cloudera Enterprise and scales up easily as needed.
  • Secure and highly available: Built-in high availability and security measures can be enabled in a single click.

To learn more about Oracle Big Data Service, click here—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

6 Ways to Improve Data Lake Security

Data lakes, such as Oracle Big Data Service, represent an efficient and secure way to store all of your incoming data. Worldwide big data is projected to rise from 2.7 zettabytes to 175 zettabytes by 2025, and this means an exponentially growing number of ones and zeroes, all pouring in from an increasing number of data sources. Unlike data warehouses, which require structured and processed data, data lakes act as a single repository for raw data across numerous sources.

What do you get when you establish a single source of truth for all your data? Having all that data in one place creates a cascading effect of benefits, starting with simplifying IT infrastructure and processes and rippling outward to workflows with end users and analysts. Streamlined and efficient, a single data lake basket makes everything from analysis to reporting faster and easier.

There’s just one issue: all of your proverbial digital eggs are in one “data lake” basket.

For all of the benefits of consolidation, a data lake also comes with the inherent risk of a single point of failure. Of course, in today's IT world, it's rare for IT departments to set anything up with a true single point of failure—backups, redundancies, and other standard failsafe techniques tend to protect enterprise data from true catastrophic failure. This is doubly so when enterprise data lives in the cloud, such as with Oracle Cloud Infrastructure, because data entrusted to the cloud rather than stored locally has the added benefit of vendors who build their entire business around keeping it safe.

Does that mean that your data lake comes protected from all threats out of the box? Not necessarily; as with any technology, a true assessment of security risks requires a 360-degree view of the situation. Before you jump into a data lake, consider the following six ways to secure your configuration and safeguard your data.

Establish Governance: A data lake is built for all data. As a repository for raw and unstructured data, it can ingest just about anything from any source. But that doesn’t necessarily mean that it should. The sources you select for your data lake should be vetted for how that data will be managed, processed, and consumed. The perils of a data swamp are very real, and avoiding them depends on the quality of several things: the sources, the data from the sources, and the rules for treating that data when it is ingested. By establishing governance, it’s possible to identify things such as ownership, security rules for sensitive data, data history, source history, and more.

Access: One of the biggest security risks involved with data lakes is related to data quality. Rather than a macro-scale problem such as an entire dataset coming from a single source, a risk can stem from individual files within the dataset, either during ingestion or after due to hacker infiltration. For example, malware can hide within a seemingly benign raw file, waiting to execute. Another possible vulnerability stems from user access—if sensitive data is not properly protected, it’s possible for unscrupulous users to access those records, possibly even modify them. These examples demonstrate the importance of establishing various levels of user access across the entire data lake. By creating strategic and strict rules for role-based access, it’s possible to minimize the risks to data, particularly sensitive data or raw data that has yet to be vetted and processed. In general, the widest access should be for data that has been confirmed to be clean, accurate, and ready for use, thus limiting the possibility of accessing a potentially damaging file or gaining inappropriate access to sensitive data.

Use Machine Learning: Some data lake platforms come with built-in machine learning (ML) capabilities. The use of ML can significantly minimize security risks by accelerating raw data processing and categorization, particularly if used in conjunction with a data cataloging tool. By implementing this level of automation, large amounts of data can be processed for general use while also identifying red flags in raw data for further security investigation.

Partitions and Hierarchy: When data gets ingested into a data lake, it’s important to store it in a proper partition. The general consensus is that data lakes require several standard zones to house data based on how trusted it is and how ready-to-use it is. These zones are:

  • Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.
  • Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.
  • Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.
  • Refined: Where enriched and manipulated data lives, often as final outputs from tools.

Using zones like these creates a hierarchy that, when coupled with role-based access, can help minimize the possibility of the wrong people accessing potentially sensitive or malicious data.
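
A minimal sketch of how that pairing of zones and role-based access could look is below; the role names and the policy table are illustrative assumptions, not a specific product's access model.

```python
# Minimal sketch of pairing the zones above with role-based access: each role
# sees only the zones it needs. Role names and the policy table are
# illustrative assumptions, not a specific product's access model.
ZONE_ACCESS = {
    "data_engineer": {"temporal", "raw", "trusted", "refined"},
    "data_scientist": {"trusted", "refined"},
    "business_analyst": {"refined"},
}

def can_read(role, zone):
    return zone in ZONE_ACCESS.get(role, set())

print(can_read("business_analyst", "raw"))    # False: raw data stays restricted
print(can_read("data_scientist", "trusted"))  # True
```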

Data Lifecycle Management: Which data is constantly used by your organization? Which data hasn't been touched in years? Data lifecycle management is the process of identifying and phasing out stale data. In a data lake environment, older stale data can be moved to a specific tier designed for efficient storage, ensuring that it is still available should it ever be needed but not taking up needed resources. A data lake powered by ML can even use automation to identify and process stale data to maximize overall efficiency. While this may not touch directly on security concerns, an efficient and well-managed data lake functions like a well-oiled machine rather than collapsing under the weight of its own data.

Data Encryption: The idea of encryption being vital to data security is nothing new, and most data lake platforms come with their own methodology for data encryption. How your organization executes it, of course, is critical. Regardless of which platform you use or whether you deploy on premises or in the cloud, a sound data encryption strategy that works with your existing infrastructure is absolutely vital to protecting all of your data, whether in motion or at rest—in particular, your sensitive data.

Create Your Secure Data Lake

What’s the best way to create a secure data lake? With Oracle’s family of products, a powerful data lake is just steps away. Built upon the foundation of Oracle Cloud Infrastructure, Oracle Big Data Service delivers cutting-edge data lake capabilities while integrating into premiere analytics tools and one-touch Hadoop security functions. Learn more about Oracle Big Data Service to see how easy it is to deploy a powerful cloud-based data lake in your organization—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Why Data Lakes Need a Data Catalog

It’s no secret that big data is getting much bigger with each passing year—in fact, the world is seeing exponential growth in the amount of data generated, as plenty of research shows. That creates the issue of storage. If all those bits and bytes are being transmitted and you need access to them in order to analyze and derive insights via business intelligence, then the next logical step is a data lake.

But what happens when all of that data is sitting in the data lake? Finding anything specific within such a repository can be unwieldy by today’s standards. With the growing volume of data generated by all the world’s devices, the data lake will only grow wider and deeper with each passing day. Thus, while collecting it into a repository is key to using it, information needs to be cataloged and accessible in order for it to actually be usable. The sensible solution, then, is to implement a data catalog.

What Is a Data Lake?

Before understanding why a data catalog can be so useful in this situation, it's important to grasp the concept of a data lake. In layman's terms, a data lake acts as a repository that stores data exactly the way it comes in. If it's a structured dataset, it maintains that structure without adding any further indexing or metadata. If it's unstructured data (for example, social media posts, images, MP3 files, etc.), it lands in the data lake as is, whatever its native format might be. Data lakes can take input from multiple sources, making them a functional single repository for an organization to use as a collection point. To extend the lake metaphor, consider each data source a stream or river that feeds the data lake, where raw and unfiltered datasets sit next to curated, enterprise-certified datasets.

Collecting data is only half of the equation, however. A repository only works well if data can be called up and used for analysis. In a data lake, data remains in its raw format until this step happens. At that point, a schema is applied to it for processing (schema on read), allowing analysts and data scientists to pick and choose what they work with and how they work with it.
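
Here is a minimal sketch of schema on read using PySpark, a common engine for querying data lakes. The object-store path and field names are illustrative assumptions; the point is that the raw files stay untouched and the schema is applied only at query time.

```python
# Minimal sketch of schema on read with PySpark: the JSON files sit raw in the
# lake, and this schema exists only for the duration of the analysis. The
# object-store path and field names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("event", StringType()),
    StructField("event_time", TimestampType()),
])

# The raw files are untouched; the schema is applied only as the data is read.
events = spark.read.schema(schema).json("oci://lake-bucket@tenancy/raw/doorbell/")
events.filter(events["event"] == "ring").groupBy("device_id").count().show()
```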

This is a very simple call-and-response action, but one element is missing: the search process. A data lake requires data governance. Without organization, searching for data is a chaotic, inefficient, and time-consuming process. And if too much time passes without clear organization and governance, the value of a data lake may collapse under its own accumulated data.

Enter the data catalog.

What Is a Data Catalog?

A data catalog is exactly as it sounds: it is a catalog for all the big data in a data lake. By applying metadata to everything within the data lake, data discovery and governance become much easier tasks. By applying metadata and a hierarchical logic to incoming data, datasets receive the necessary context and trackable lineage to be used efficiently in workflows.

Let’s use the analogy of notes in a researcher’s library. In this library, a researcher gets structured data in the form of books that feature chapters, indices, and glossaries. The researcher also gets unstructured data in the form of notebooks that feature no real organization or delineation at all. A data catalog would take each of these items without changing their native format and apply a logical catalog to them using metadata such as date received, sender, general topic, and other such items that could accelerate data discovery.

Given that most data lake situations lack a universal organizational tool, a true data catalog is an essential add-on. Without the level of organization of a data catalog, a data lake becomes a data swamp—and trying to pull data from a data swamp creates a process that is inefficient at best and a bottleneck at worst.

How Data Lakes Work with Data Catalogs

Let's take a look at a data scientist's workflow from two different perspectives: without a data catalog and with a data catalog. Our hypothetical case study involves a smart doorbell that provides a stream of device data. At the same time, the company tracks mentions on social media by users who've had packages stolen, to better predict when thieves strike.

Without a data catalog: In this example, a data lake has datasets streaming in from Internet of Things (IoT) devices along with social media posts collected by the marketing team. A data analyst wants to examine the impact of a specific feature's usage on social media sharing. Remember, the data in a data lake remains raw and unprocessed. In this case, data scientists will have to pull device datasets from the time period of the feature's launch, then examine the individual data tables. To cross-reference against social media, they will have to pull all social media posts from this time period, then filter by keyword to drill down to mentions of the feature. While all this can be achieved using the data lake as a single source, it requires quite a bit of manual preparation work.

With a data catalog: As datasets come into the data lake, a data catalog’s machine learning capabilities recognize the IoT data and create a universal schema based on those elements. Users still have the ability to apply their own metadata to enhance discoverability. Thus, when data scientists want to pull their data, a search within the data catalog brings up relevant results associated with the feature and other targeted keywords, allowing for much quicker preparation and processing.

This example illustrates the stark difference created by a data catalog. Without it, data scientists are essentially searching through folders without context—the information sought has to be already identified through some means such as data source, time range, and file type. In a small, controlled data environment with limited sources, this is workable. However, in a large repository featuring many sources and heavy collaboration, it quickly devolves into murky chaos.

A data catalog doesn’t completely automate everything, though its ability to intake structured data does feature significant automated processing. However, even with unstructured data, inherent machine learning and artificial intelligence capabilities mean that if a data scientist manually processes data with set patterns, then the catalog can begin to learn and provide first-cut recommendations to speed things up.

Position Your Data Lake for Success

The volume of data flowing into repositories is only getting bigger with each passing day. To ensure efficiency and accuracy, a form of governance is necessary for creating order among the chaos. Otherwise, a data lake quickly becomes a proverbial data swamp. Fortunately, data catalogs are a simple tool to achieve this, and by integrating such a thing into a repository, organizations are set up for success now—and prepared to scale up as needed towards a bigger-than-big data future.

Need to know more about data lakes and data catalogs? Check out Oracle's big data management products and don't forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Data Lake, Data Warehouse and Database…What’s the Difference?

There are so many buzzwords these days regarding data management. Data lakes, data warehouses, and databases – what are they? In this article, we’ll walk through them and cover the definitions, the key differences, and what we see for the future.

Start building your own data lake with a free trial

Data Lake Definition

If you want full, in-depth information, you can read our article called, “What’s a Data Lake?” But here we can tell you, “A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.”

The data lake tends to ingest data very quickly and prepare it later, on the fly, as people access it.

Data Warehouse Definition

A data warehouse collects data from various sources, whether internal or external, and optimizes the data for retrieval for business purposes. The data is usually structured, often from relational databases, but it can be unstructured too.

Primarily, the data warehouse is designed to gather business insights and allows businesses to integrate their data, manage it, and analyze it at many levels.

Database Definition

Essentially, a database is an organized collection of data. Databases are classified by the way they store this data. Early databases were flat and limited to simple rows and columns. Today, the popular databases are:

  • Relational databases, which store their data in tables
  • Object-oriented databases, which store their data in object classes and subclasses

Data Mart, Data Swamp and Other Terms

And, of course, there are other terms such as data mart and data swamp, which we’ll cover very quickly so you can sound like a data expert.

Enterprise Data Warehouse (EDW): This is a data warehouse that serves the entire enterprise.

Data Mart: A data mart is used by individual departments or groups and is intentionally limited in scope because it looks at what users need right now versus the data that already exists.

Data Swamp: When your data lake gets messy and is unmanageable, it becomes a data swamp.

The Differences Between Data Lakes, Data Warehouses, and Databases

Data lakes, data warehouses and databases are all designed to store data. So why are there different ways to store data, and what’s significant about them? In this section, we’ll cover the significant differences, with each definition building on the last.

The Database

Databases came about first, rising in the 1960s, with the relational database becoming popular in the 1980s.

Databases are really set up to monitor and update real-time structured data, and they usually have only the most recent data available.

The Data Warehouse

But the data warehouse is a model to support the flow of data from operational systems to decision systems. What this means, essentially, is that businesses were finding that their data was coming in from multiple places—and they needed a different place to analyze it all. Hence the growth of the data warehouse.

For example, let’s say you have a rewards card with a grocery chain. The database might hold your most recent purchases, with a goal to analyze current shopper trends. The data warehouse might hold a record of all of the items you’ve ever bought and it would be optimized so that data scientists could more easily analyze all of that data.

The Data Lake

Now let’s throw the data lake into the mix. And because it’s the newest, we’ll talk about this one more in depth. The data lake really started to rise around the 2000s, as a way to store unstructured data in a more cost-effective way. The key phrase here is cost effective.

Although databases and data warehouses can handle unstructured data, they don’t do so in the most efficient manner. With so much data out there, it can get expensive to store all of your data in a database or a data warehouse.

In addition, there’s the time-and-effort constraint. Data that goes into databases and data warehouses needs to be cleansed and prepared before it gets stored. And with today’s unstructured data, that can be a long and arduous process when you’re not even completely sure that the data is going to be used.

That’s why data lakes have risen to the forefront. The data lake is mainly designed to handle unstructured data in the most cost-effective manner possible. As a reminder, unstructured data can be anything from text to social media data to machine data such as log files and sensor data from IoT devices.

Data Lake Example

Going back to the grocery example that we used with the data warehouse, you might consider adding a data lake into the mix when you want a way to store your big data. Think about the social sentiment you’re collecting, or advertising results. Anything that is unstructured but still valuable can be stored in a data lake and work with both your data warehouse and your database.

Note 1: Having a data lake doesn’t mean you can just load your data willy-nilly. That’s what leads to a data swamp. But it does make the process easier, and new technologies such as having a data catalog will steadily make it simpler to find and use the data in your data lake.

Note 2: If you want more information on the ideal data lake architecture, you can read the full article we wrote on the topic. It describes why you want your data lake built on object storage and Apache Spark, versus Hadoop.

What’s the Future of Data Lakes, Data Warehouses, and Databases?

Will one of these technologies rise to overtake the others?

We don’t think so.

Here’s what we see. As the value and amount of unstructured data rises, the data lake will become increasingly popular. But there will always be an essential place for databases and data warehouses.

You’ll probably continue to keep your structured data in the database or data warehouse. But these days, more companies are moving their unstructured data to data lakes on the cloud, where it’s more cost effective to store it and easier to move it when it’s needed.

This workload that involves the database, data warehouse, and data lake in different ways is one that works, and works well. We’ll continue to see more of this for the foreseeable future.

If you're interested in the data lake and want to try to build one yourself, we're offering a free data lake trial with a step-by-step tutorial. Get started today, and don't forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

4 Reasons Why Businesses Need Data Lakes

It's common knowledge in the modern business world that big data (that is, the large volumes of data collected for analysis by organizations) is a significant element of any strategy. Operations, sales, marketing, finance, HR, and every other department rely on big data solutions to stay ahead of the competition. But how that big data is handled remains another story.

Enter the world of data lakes. Data lakes are repositories that can take in data from multiple sources. Rather than processing data for immediate analysis, they store all received data in its native format. This model allows data lakes to hold massive amounts of data while using minimal resources. Data is processed only when it is called for use (compared to a data warehouse, which processes all incoming data). This ultimately makes data lakes an efficient option for storage, resource management, and data preparation.

But do you actually need a data lake, especially if your big data solution already has a data warehouse? The answer is a resounding yes. In a world where the volume of data transmitted across countless devices continues to increase, a resource-efficient means of accessing data is critical to a successful organization. In fact, here are four specific reasons why the need for a data lake is only going to get more urgent as time goes on.

90% of data has been generated since 2016

90% of all data ever is a lot—or is it? Consider what has become available to people as Wi-Fi, smartphones, and high-speed data networks have entered everyday life over the past twenty years. In the early 2000s, streaming was limited to audio, while broadband internet was used mostly for web surfing, emailing, and downloads. In that paradigm, device data was at a minimum and the actual data consumed was mostly about interpersonal communication, especially because videos and TV hadn’t hit a level of compression that supported high-quality streaming. Towards the end of the decade, smartphones became common and Netflix had shifted its business priority to streaming.

That means between 2010 and 2020, the internet has seen the growth of smartphones (and their apps), social media, streaming services for both audio and video, streaming video game platforms, software delivered through downloads rather than physical media, and so on, all creating exponential consumption of data. As for the part that is the most relevant to business? Consider how many businesses have associated apps constantly transmitting data to and from devices, whether to control appliances, provide instructions and specifications, or quietly transmit user metrics in the background.

With 5G data networks widely starting to deploy in 2019, bandwidths and speeds are only going to get better. This means as massive—and significant—as big data has already been in the past few years, it’s only going to get bigger as technology allows the world to become even more connected. Is your data repository ready?

95% of businesses handle unstructured data

In a digital world, businesses collect data from all types of sources, and most of that is unstructured. Consider the data collected by a company that sells services and makes appointments via an app. While some of that data comes structured—that is, in predefined formats and fields such as phone numbers, dates, transaction prices, time stamps, etc.—a company like that still has to archive and store a lot of unstructured data. Unstructured data is any type of data that doesn’t contain an inherent structure or predefined model, which makes it difficult to search, sort, and analyze without further preparation.

For the example above, unstructured data comes in a wide range of formats. For a user making an appointment, any text fields filled out to make that appointment count as unstructured data. Within the company itself, emails and documents are another form of unstructured data. The posts from a company’s social media channel are also unstructured data. Any photos or videos used by employees as notes while performing services are unstructured data. Similarly, any instructional videos or podcasts created by the company as marketing assets are also unstructured.

Unstructured data is everywhere, and as more devices connect to deliver a greater range of information, it becomes clear that organizations need a way to get their proverbial arms around all of it.

4.4 million GB of data are used by Americans every minute

More than 325 million people live in the US. Nearly 70% of them have smartphones. And even if you don’t count the people currently streaming media, consider what is happening on an average smartphone in a minute. It’s receiving an update on the weather. It’s checking for any new emails in the user’s inbox. It’s pushing data to social media, delivering voicemail over Wi-Fi, delivering strategic marketing notifications from apps, such as when a real estate app pushes a new housing listing. It’s sending text and images via chat apps, and downloading app/OS updates in the background.

Data is everywhere now, which means that in the minute that passed while you read the above paragraph, millions of gigabytes of data were transmitted across the country—4.4 million GB every minute, according to Domo's Data Never Sleeps report. And that's just the United States; when combined with the rest of the world, the total volume of data grows exponentially. For businesses, collecting this kind of data is vital to all aspects of operations, from marketing to sales to communication. Thus, every organization must put a premium on safe, available, and accessible storage.

50% of businesses say that big data has changed their sales and marketing

Most people think of big data in terms of the technical aspects. Clearly, a company that works through a phone app or provides a form of streaming uses big data and is delivering a service that simply wasn’t feasible twenty years ago. However, big data is much more than delivery of streaming content. It can create significant improvements in sales and marketing—so much so that according to a McKinsey report, 50% of businesses say that big data is driving them to change their approach in these departments.

What's the reason for this? With big data, organizations have a much more efficient path to understanding customers than in-person focus groups. Data allows for gathering a mass sample of actions from existing and potential customers. Everything from their website browsing prior to conversion to how long they engaged with certain features of a product or service is available at high volume, which creates a large enough sample size for a reliable customer model. To be in the cutting-edge 50%, an organization needs the data infrastructure to receive, store, and retrieve massive amounts of structured and unstructured data for processing.

Basically, you need a data lake

The above statistics all point to one thing—your organization needs a data lake. And if you don’t get ahead of the curve now in terms of managing data, it’s clear that the world will pass you by in all areas: operations, sales, marketing, communications, and other departments. Data is simply a way of life now, enabling precise insight-driven decisions and unparalleled discovery into root causes. When combined with machine learning and artificial intelligence, this data also allows for predictive modeling for future actions.

Learn more about why data lakes are the future of big data and discover Oracle’s big data solutions—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Structured vs. Unstructured Data

What is the difference between structured and unstructured data—and should you care? For many businesses and organizations, such distinctions may feel like they belong solely to the IT department dealing with big data. And while there is some truth to that, it’s worthwhile for everyone to understand the difference, because once you grasp the definition of structured data and unstructured data (along with where that data lives and how to process it), it’s possible to see how this can be used to improve any data-driven process.

And these days, nearly any workflow in any department is data-driven.

Sales, marketing, communications, operations, human resources: all of these produce data. Even the smallest of small businesses—say, a brick-and-mortar store with physical inventory and a local customer base—produces structured and unstructured data from things like email, credit card transactions, inventory purchases, and social media. Taking advantage of that data starts with understanding the two types and how they work together.

What Is Structured Data?

Structured data is data that uses a predefined and expected format. This can come from many different sources, but the common factor is that the fields are fixed, as is the way that it is stored (hence, structured). This predetermined data model enables easy entry, querying, and analysis. Here are two examples to illustrate this point.

First, consider transactional data from an online purchase. Each record has a timestamp, purchase amount, associated account information (or a guest account), the item(s) purchased, payment information, and a confirmation number. Because each field has a defined purpose, the data is easy to query manually (the equivalent of hitting Ctrl+F in an Excel spreadsheet) and easy for machine learning algorithms to mine for patterns and, in many cases, for anomalies that fall outside those patterns.
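
To make that concrete, here is a minimal sketch of such a record as a small relational table, using Python’s built-in sqlite3 module. The column names and sample values are illustrative placeholders, not any particular retailer’s schema.

```python
# A structured transaction record: every field is predefined and typed.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        confirmation_no TEXT PRIMARY KEY,
        purchased_at    TEXT NOT NULL,   -- ISO-8601 timestamp
        account_id      TEXT,            -- NULL for guest checkout
        items           TEXT NOT NULL,   -- e.g. comma-separated SKUs
        amount_usd      REAL NOT NULL,
        payment_method  TEXT NOT NULL
    )
""")
conn.execute(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)",
    ("C-10021", "2020-02-18T14:05:33Z", "acct-884", "SKU-1,SKU-7", 59.98, "visa"),
)

# Because every field is predefined, querying is trivial.
for row in conn.execute(
    "SELECT confirmation_no, amount_usd FROM transactions WHERE amount_usd > ?", (50,)
):
    print(row)  # -> ('C-10021', 59.98)
```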

Another example is data coming from a medical device. Something as simple as a hospital EKG monitor produces structured data that boils down to two key fields: the electrical activity of a person’s heart and the associated timestamp. Those two predefined fields fit easily into a relational or tabular database, and machine learning algorithms could identify patterns and anomalies with just a few minutes’ worth of records.
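
And here is a toy sketch of the kind of anomaly detection that such a predefined format makes easy. The readings are made-up sample values, and the three-standard-deviation rule is a simplistic stand-in for a real model.

```python
# Flag readings that deviate sharply from the rest of the series.
from statistics import mean, stdev

readings = [(t, 1.0 + 0.05 * (t % 3)) for t in range(60)]  # (second, millivolts)
readings.append((60, 4.2))  # an obvious outlier

values = [v for _, v in readings]
mu, sigma = mean(values), stdev(values)

anomalies = [(t, v) for t, v in readings if abs(v - mu) > 3 * sigma]
print(anomalies)  # -> [(60, 4.2)]
```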

Despite the vast difference in technical complexity between these examples, both show that structured data comes down to established, expected elements. Timestamps arrive in a defined format; the source won’t (or can’t) transmit a timestamp described in words, because that falls outside the structure. A predefined format allows for easy scaling and processing, even when handled manually.

Structured data can be used for anything as long as the source defines the structure. Some of the most common uses in business include CRM forms, online transactions, stock data, corporate network monitoring data, and website forms.

What Is Unstructured Data?

Structured data comes with definition; unstructured data is the opposite. Rather than predefined fields in a purposeful format, unstructured data comes in all shapes and sizes. Though it is typically text (like an open text field in a form), unstructured data can take many forms stored as objects: images, audio, video, documents, and other file formats. What all types of unstructured data have in common is that lack of definition. Unstructured data is also far more common (more on that below), and its fields may not have the character or space limits of structured data. Given the wide range of formats involved, it’s not surprising that unstructured data typically makes up about 80% of an organization’s data.

Let’s look at some examples of unstructured data.

First, consider a company’s social media posts, a familiar example of unstructured data. The metrics behind each post (likes, shares, views, hashtags, and so on) are structured, in that they are predefined and purposeful for every post. The posts themselves, though, are unstructured. They are archived in a repository, but searching them or relating them to metrics and other insights takes effort. There is no way of knowing what a given post contains, whether customer service, promotion, or an organizational news update, without actually examining it. Compare that to structured data, where the purpose of each field (e.g., dates, names, geospatial coordinates) is clear.

A second example comes from media files. Something like a podcast has no structure to its content. Searching for the podcast’s MP3 file is not easy by default; metadata such as file name, timestamp, and manually assigned tags may help the search, but the audio file itself lacks context without further analysis or relationships.

Another example comes from video files. Video assets are everywhere these days, from short clips on social media to larger files containing full webinars or discussions. As with podcast MP3 files, the content itself isn’t searchable beyond its metadata; you can’t query a database for a specific video based on what the footage actually shows.
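
A minimal sketch of that limitation, assuming a hypothetical media catalog: the tags below are hand-assigned metadata, and the search never looks inside the files themselves.

```python
# A toy catalog of media files; the tags are hand-assigned metadata.
media_catalog = [
    {"file": "webinar_q1.mp4", "tags": ["webinar", "product"], "duration_s": 3600},
    {"file": "teaser_clip.mp4", "tags": ["social", "promo"], "duration_s": 15},
    {"file": "podcast_ep12.mp3", "tags": ["interview"], "duration_s": 2700},
]

# Metadata fields are easy to filter on...
promos = [m["file"] for m in media_catalog if "promo" in m["tags"]]
print(promos)  # -> ['teaser_clip.mp4']

# ...but a question like "which video mentions pricing?" can't be answered
# without first transcribing or otherwise analyzing the content itself.
```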

How Do They Work Together?

In today’s data-driven business world, structured and unstructured data tend to go hand in hand. For most instances, using both is a good way to develop insight. Let’s go back to the example of a company’s social media posts, specifically posts with some form of media attachment. How can an organization develop insights on marketing engagement?

First, use the structured data to sort social media posts by engagement, then filter out hashtags that aren’t related to marketing (for example, removing any high-engagement posts tagged for customer service). From there, examine the related unstructured data, the actual post content, looking at messaging, media type, tone, and other elements that may explain why a post generated engagement. A rough sketch of that workflow appears below.
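
Here is a minimal sketch of that workflow, assuming hypothetical field names ("engagement", "hashtags", "text") rather than any specific platform’s API.

```python
# Structured step: rank posts by engagement and drop customer-service posts.
posts = [
    {"id": 1, "engagement": 842, "hashtags": ["#launch"], "text": "Meet our new feature..."},
    {"id": 2, "engagement": 910, "hashtags": ["#support"], "text": "We're aware of the outage..."},
    {"id": 3, "engagement": 515, "hashtags": ["#sale"], "text": "48 hours only..."},
]

SERVICE_TAGS = {"#support", "#help"}

marketing_posts = [
    p for p in sorted(posts, key=lambda p: p["engagement"], reverse=True)
    if not SERVICE_TAGS.intersection(p["hashtags"])
]

# Unstructured step: the remaining post text is what an analyst (or an
# NLP model) would examine for messaging, media type, and tone.
for p in marketing_posts:
    print(p["id"], p["engagement"], p["text"][:40])
```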

This may sound like a lot of manual labor, and several years ago it was. Advances in machine learning and artificial intelligence, however, are automating much of it. For example, if audio files are run through speech-to-text processing, the resulting transcript can be analyzed with natural-language processing for keyword patterns or positive and negative messaging. These cutting-edge tools are becoming increasingly important as big data gets bigger and the majority of it remains unstructured.
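
As a toy illustration of that last step, the sketch below assumes a transcript string has already been produced by some speech-to-text service and simply counts keyword and sentiment hits; the word lists are placeholders.

```python
# Count positive and negative keyword mentions in a transcript.
from collections import Counter
import re

transcript = "thanks for joining ... the new dashboard is great, though setup was slow"

POSITIVE = {"great", "love", "easy", "fast"}
NEGATIVE = {"slow", "broken", "confusing", "expensive"}

words = re.findall(r"[a-z']+", transcript.lower())
counts = Counter(words)

pos_hits = sum(counts[w] for w in POSITIVE)
neg_hits = sum(counts[w] for w in NEGATIVE)
print(f"positive mentions: {pos_hits}, negative mentions: {neg_hits}")
```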

Where Data Comes From and Where It Goes

In today’s business world, data comes in from multiple sources. Let’s look at a mid-size company with a standard ecommerce setup. In this case, data likely comes from the following areas:

  • Customer transactions
  • Customer account data
  • Customer feedback forms
  • Inventory purchasing
  • Logistical tracking
  • Social media engagement
  • Marketing outreach engagement
  • Internal HR data
  • Search engine crawling for keywords
  • And much more

In fact, the amount of data pulled by any company these days is staggering. You don’t have to be one of the world’s biggest corporations to be part of the big data revolution. But how you handle that data is key to being able to utilize it. The best solution in many cases is a data lake.

Data lakes are repositories that receive structured and unstructured data alike. The ability to consolidate multiple data inputs into a single source makes a data lake an essential part of any big data infrastructure. Data lands in the lake in its raw, native form, with no schema imposed up front, which keeps storage scalable and flexible. Structure and schema are applied only when the data is read and processed, balancing volume against efficiency.
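
Here is a toy sketch of that read-time schema idea (often called schema-on-read), using a local folder as a stand-in for a data lake; the event fields are invented for illustration.

```python
# Raw events land as-is; a schema is imposed only when reading.
import json, tempfile
from pathlib import Path

lake = Path(tempfile.mkdtemp())

# Ingest: store events in their raw form, no schema enforced.
raw_events = [
    '{"ts": "2020-02-18T14:05:33Z", "type": "purchase", "amount": "59.98"}',
    '{"ts": "2020-02-18T14:06:10Z", "type": "pageview", "url": "/pricing"}',
]
(lake / "events.jsonl").write_text("\n".join(raw_events))

# Read: apply a schema suited to the question being asked (here, purchases).
purchases = []
for line in (lake / "events.jsonl").read_text().splitlines():
    event = json.loads(line)
    if event.get("type") == "purchase":
        purchases.append({"ts": event["ts"], "amount": float(event["amount"])})

print(purchases)  # -> [{'ts': '2020-02-18T14:05:33Z', 'amount': 59.98}]
```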

Storage efficiency is key because scalability and flexibility make it possible to add more data sources and apply more cutting-edge tools, such as machine learning. That means the foundation for receiving structured and unstructured data needs to be built for both the present and the future, and the industry consensus points to moving data to the cloud.

Want to dig deeper? The following links might help:

What is big data?

What is machine learning?

What is a data lake?

And for more about how you can benefit from Oracle Big Data, visit Oracle’s Big Data page, and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

Powering New Insights with a High Performing Data Lake

To discover new insights with analytics, organizations are looking for correlations across different, combined data sets. For those insights to surface, organizations need to provide concurrent access to the data for multiple workgroups and stakeholders.

Most organizations use a data lake to store their data in its raw, native format. However, building data lakes and managing data storage can be challenging. The areas I most often see organizations struggle with across all types of environments running Hadoop are:

Set Up and Configuration

  • Hadoop services failing due to lack of proper configuration
  • Maintenance of multiple Hadoop environments is challenging and requires more resources

Security and Compliance

  • Lack of consistency and strong security controls in securing Hadoop and the data lake
  • Inability to integrate the data lake with LDAP or Active Directory

Storage and Compute

  • Low cluster utilization efficiency with varied workloads
  • Difficulty in scaling when the increase in data volumes is faster than anticipated
  • Server sprawl and challenges migrating multi-petabyte namespaces from direct attached storage (DAS) to network attached storage (NAS)
  • Lack of Hadoop tiered storage, meaning hot and cold data sit together and cause performance issues

Multi-tenancy

  • Difficulty getting a single Hadoop cluster to support the differing requirements of Hadoop Distributed File System (HDFS) workflows
  • Challenges in moving data between environments (e.g., dev to prod and prod to dev) so that data scientists can use production data in a secure environment

Hadoop on Isilon

Dell EMC’s Isilon scale-out network attached storage (NAS) makes building a data lake much easier. It offers many features that help organizations reduce maintenance and storage costs by keeping all of their data, whether structured, semi-structured, or unstructured, in one place on a single file system.

Organizations can then extend the data lake to the cloud and to enterprise edge locations to consolidate and manage data more effectively, easily gain cloud-scale storage capacity, reduce overall storage costs, increase data protection and security, and simplify management of unstructured data.

Data Engineering Makes the Magic Happen

Hadoop is a consumer of Isilon, the data lake where all the data resides. To fully enable the capabilities of Isilon using Hadoop, and integrate the clusters securely and consistently, you need knowledgeable data engineers to set up and configure the environment. To illustrate the point, let’s look back at the common challenge areas and how you can mitigate them with proper data engineering and Hadoop on Isilon.

For more information on Multi-tenancy, refer to this whitepaper.

Implementation Process

To reap the benefits of Hadoop on Isilon, data engineers need to plan, prior to implementation, how they will secure and protect your critical enterprise data and simplify the management of storage capacity and performance.

From there, the process of installing a Hadoop distribution and integrating it with an Isilon cluster varies by distribution, requirements, objectives, network topology, security policies, and many other factors, but it follows a specific sequence. For example, a supported Hadoop distribution is installed and configured with Isilon before Hadoop is configured for authentication, and then both Hadoop and Isilon are secured with Kerberos.

For more information on setting up and managing Hadoop on Isilon, refer to this white paper.

Engaging a Trusted Partner

The good news is that you and your teams don’t need to be experts in data engineering or navigate the implementation and configuration process on your own. Dell EMC Consulting Services can help you optimize your data lake and storage and maximize your investment, whether you’re just getting started with Hadoop on Isilon or have an existing environment that isn’t performing optimally. Our services are delivered by a global team of deeply skilled data engineers and include implementations, migrations, third-party software integrations, ETL offloads, health checks, and Hadoop performance optimizations.

Hadoop on Isilon, supported by data engineering services, offers a compelling business proposition for organizations looking to better manage their data to drive new insights and support advanced analytics techniques, such as artificial intelligence. If you are interested in learning more about our Hadoop on Isilon services or other Big Data and Analytics Consulting services, please contact your Dell EMC representative.

The post Powering New Insights with a High Performing Data Lake appeared first on InFocus Blog | Dell EMC Services.

Related: