The Evolution of the Data Warehouse in the Big Data Era

About 20 years ago, I started my journey into data warehousing and business analytics. Over all these years, it’s been interesting to watch the evolution of big data and data warehousing, driven by the rise of artificial intelligence and the widespread adoption of Hadoop. When I started in this work, the main business challenge was how to handle the explosion of data from ever-growing data sets and, most importantly, how to gain business intelligence in as close to real time as possible. The effort to solve these business challenges paved the way for a ground-breaking architecture called …

Related:

Data Lakes: Examining the End to End Process

A good way to think of a data lake is as the ultimate data hub for your organization. At the most basic level, it takes in data from various sources and makes it available for users to query. But much more goes on during the entire end-to-end process involving a data lake. To get a clearer understanding of how it all comes together—and a bird’s-eye view of what it can do for your organization—let’s look at each step in depth.


Step 1: Identify and connect sources

Unlike data warehouses, data lakes can take inputs from nearly any type of source: structured, unstructured, and semi-structured data can all coexist in a data lake. The goal is to let all of the data exist in a single repository in its raw format. A data warehouse specializes in housing processed and prepared data for use, and while that is certainly helpful in many instances, it still leaves many types of data out of the equation. By unifying these disparate sources into a single repository, a data lake gives users access to all types of data without the logistical legwork of connecting to individual data warehouses.

Step 2: Ingest data into zones

If a data lake is set up per best practices, incoming data will not just get dumped into a single data swamp. Instead, because the data sources are known quantities, it is possible to establish landing zones for datasets from particular sources. For example, if you know that a dataset contains sensitive financial information, it can immediately go into a zone that limits access by user role and applies additional security measures. If the data arrives in a set format ready for use by a certain user group (for example, the data scientists in HR), it can immediately go into a zone defined for that group. And if another dataset delivers raw data with too little metadata to identify it easily at the database level (like a stream of images), it can go into its own raw-data zone, set aside for further processing.

In general, it’s recommended that the following zones be used for incoming data. Establishing this zone sorting right away completes the first broad strokes of organization without any manual intervention. There are still more steps to optimize discoverability and readiness, but this automates the first big one. Per our blog post 6 Ways To Improve Data Lake Security, these are the recommended zones to establish in a data lake (a minimal routing sketch follows the list):

Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.

Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.

Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.

Refined: Where enriched and manipulated data lives, often as final outputs from tools.
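To make this zone sorting concrete, here is a minimal Python sketch of source-based routing. The zone names mirror the list above; the source profiles, field names, and routing rules are illustrative assumptions rather than any product's API.

```python
# Minimal sketch: route incoming datasets to landing zones based on what is
# known about their source. Zone names follow the list above; the source
# profiles and routing rules are illustrative placeholders.

SOURCE_PROFILES = {
    "finance_erp":   {"sensitive": True,  "structured": True},
    "hr_exports":    {"sensitive": True,  "structured": True},
    "clickstream":   {"sensitive": False, "structured": False, "streaming": True},
    "image_uploads": {"sensitive": False, "structured": False},
}

def pick_zone(source: str) -> str:
    """Return the landing zone for a dataset from the given source."""
    profile = SOURCE_PROFILES.get(source, {})
    if profile.get("streaming"):
        return "temporal"        # ephemeral copies and streaming spools
    if profile.get("sensitive"):
        return "raw/restricted"  # raw, with extra access controls applied
    if profile.get("structured"):
        return "trusted"         # arrives in a known, ready-to-use format
    return "raw"                 # everything else waits for processing

if __name__ == "__main__":
    for src in ("finance_erp", "clickstream", "image_uploads", "unknown_feed"):
        print(f"{src:15s} -> {pick_zone(src)}")
```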

Step 3: Apply security measures

Data arrives in a data lake completely raw. That means any inherent security risk in the source data comes along for the ride when it lands in the data lake. If a CSV file has fields containing sensitive data, it will stay that way until security steps have been applied. If step 2 has been established as an automated process, the initial sorting will get you halfway to a secure configuration.

Other measures to consider include:

  • Clear user-based access defined by roles, needs, and organization.
  • Encryption based on a big-picture assessment of compatibility within your existing infrastructure.
  • Scrubbing the data for red flags, such as known malware signatures or suspicious file names and formats (for example, an executable file living in a dataset that otherwise contains media files). Machine learning can significantly speed up this process.

Running all incoming data through a standardized security process ensures consistency among protocols and execution; if automation is involved, this also helps to maximize efficiency. The result? The highest levels of confidence that your data will go only to the users that should see it.
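As one illustration of what a standardized, automated pass might look like, here is a hedged Python sketch of the red-flag scan described in the last bullet. The extension lists and the quarantine decision are assumptions made for the example, not a prescribed rule set.

```python
import os

# Sketch of a standardized pre-admission scan: flag files whose type looks out
# of place for the dataset (e.g., an executable in a batch of media files).
# The extension lists and the quarantine decision are illustrative assumptions.

MEDIA_EXTENSIONS = {".jpg", ".png", ".mp3", ".mp4", ".wav"}
SUSPICIOUS_EXTENSIONS = {".exe", ".dll", ".bat", ".sh", ".js"}

def scan_batch(filenames, expected="media"):
    """Return (clean, quarantined) lists for a batch of incoming files."""
    clean, quarantined = [], []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext in SUSPICIOUS_EXTENSIONS:
            quarantined.append(name)   # hold for manual or ML-assisted review
        elif expected == "media" and ext not in MEDIA_EXTENSIONS:
            quarantined.append(name)   # unexpected type for this dataset
        else:
            clean.append(name)
    return clean, quarantined

if __name__ == "__main__":
    batch = ["cover.jpg", "episode01.mp3", "installer.exe", "notes.txt"]
    ok, held = scan_batch(batch)
    print("clean:", ok)
    print("quarantined:", held)
```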

Step 4: Apply metadata

Once the data is secure, it’s safe for users to access—but how will they find it? Discoverability is only possible when the data is properly organized and tagged with metadata. Unfortunately, since data lakes take in raw data, a dataset can arrive with nothing but a filename, format, and time stamp. So what can you do with this?

A data catalog is a tool that works with a data lake to optimize discovery. By enabling richer metadata application, data can be organized and labeled accurately and effectively. In addition, if machine learning is utilized, the data catalog can begin recognizing patterns and habits and labeling things automatically. For example, let’s assume a data source consistently sends MP3 files of various lengths—but the ones over twenty minutes are always given the metatag “podcast” after arriving in the data lake. Machine learning will pick up on that pattern and start auto-tagging that group with “podcast” upon arrival.
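Here is a simplified Python sketch of that podcast example: once enough previously tagged files fit a pattern, new arrivals matching it are tagged automatically. A real catalog would learn such rules with ML rather than hard-code them; the twenty-minute threshold and the "podcast" tag come straight from the example above.

```python
# Sketch of the auto-tagging behavior described above: if historical tagging
# shows that long MP3 files are consistently labeled "podcast", apply the same
# tag to new arrivals. A real catalog would learn such rules with ML rather
# than hard-code them.

HISTORY = [  # (format, duration_minutes, human_applied_tag)
    ("mp3", 45, "podcast"), ("mp3", 32, "podcast"),
    ("mp3", 25, "podcast"), ("mp3", 3, "jingle"),
]

def learn_rule(history, min_support=3):
    """Tiny 'pattern learner': long MP3s -> their most common historical tag."""
    long_mp3_tags = [tag for fmt, mins, tag in history if fmt == "mp3" and mins > 20]
    if len(long_mp3_tags) >= min_support:
        return max(set(long_mp3_tags), key=long_mp3_tags.count)
    return None

def auto_tag(fmt, duration_minutes, learned_tag):
    if learned_tag and fmt == "mp3" and duration_minutes > 20:
        return learned_tag
    return None

if __name__ == "__main__":
    tag = learn_rule(HISTORY)
    print(auto_tag("mp3", 41, tag))   # -> podcast
    print(auto_tag("mp3", 2, tag))    # -> None (left for manual tagging)
```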

Given that the volume of big data keeps growing, and that more and more sources of unstructured data are entering data lakes, that type of pattern learning and automation can make a huge difference in efficiency.

Step 5: User discovery

Once data is sorted, it’s ready for users to discover. With all of those data sources consolidated into a single data lake, discovery is easier than ever before. If tools like analytics exist outside of the data lake’s infrastructure, then there’s only one export/import step that needs to take place for the data to be used. In a best-case scenario, those tools are integrated into the data lake, allowing for real-time queries against the absolute latest data, all without any manual intervention.

Why is this so important? A recent survey showed that, on average, five data sources are consulted before making a decision. Consider the inefficiency if each source has to be queried and called manually. Putting it all in a single accessible data lake and integrating tools for real-time data querying removes numerous steps so that discovery can be as easy as a few clicks.

The Hidden Benefits of a Data Lake

The above details break down the end-to-end process of a data lake—and the resulting benefits go beyond saving time and money. By opening up more data to users and removing numerous access and workflow hurdles, users have the flexibility to try new perspectives, experiment with data, and look for other results. All of this leads to previously impossible insights, which can drive an organization’s innovation in new and unpredictable ways.

To learn more about how to get started with data lakes, check out Oracle Big Data Service—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

6 Ways to Improve Data Lake Security

Data lakes, such as Oracle Big Data Service, represent an efficient and secure way to store all of your incoming data. Worldwide big data is projected to rise from 2.7 zettabytes to 175 zettabytes by 2025, and this means an exponentially growing number of ones and zeroes, all pouring in from an increasing number of data sources. Unlike data warehouses, which require structured and processed data, data lakes act as a single repository for raw data across numerous sources.

What do you get when you establish a single source of truth for all your data? Having all that data in one place creates a cascading effect of benefits, starting with simplifying IT infrastructure and processes and rippling outward to workflows for end users and analysts. Streamlined and efficient, a single data lake makes everything from analysis to reporting faster and easier.

There’s just one issue: all of your proverbial digital eggs are in one “data lake” basket.

For all of the benefits of consolidation, a data lake also comes with the inherent risk of a single point of failure. Of course, in today’s IT world, it’s rare for IT departments to set anything up with a true single point of failure—backups, redundancies, and other standard failsafe techniques tend to protect enterprise data from catastrophic failure. This is doubly so when enterprise data lives in the cloud, such as on Oracle Cloud Infrastructure, where trusted vendors build their entire business around keeping your data safe.

Does that mean that your data lake comes protected from all threats out of the box? Not necessarily; as with any technology, a true assessment of security risks requires a 360-degree view of the situation. Before you jump into a data lake, consider the following six ways to secure your configuration and safeguard your data.


Establish Governance: A data lake is built for all data. As a repository for raw and unstructured data, it can ingest just about anything from any source. But that doesn’t necessarily mean that it should. The sources you select for your data lake should be vetted for how that data will be managed, processed, and consumed. The perils of a data swamp are very real, and avoiding them depends on the quality of several things: the sources, the data from the sources, and the rules for treating that data when it is ingested. By establishing governance, it’s possible to identify things such as ownership, security rules for sensitive data, data history, source history, and more.

Access: One of the biggest security risks involved with data lakes relates to data quality. Rather than a macro-scale problem affecting an entire dataset from a single source, the risk can stem from individual files within the dataset, introduced either during ingestion or afterward through infiltration. For example, malware can hide within a seemingly benign raw file, waiting to execute. Another possible vulnerability stems from user access: if sensitive data is not properly protected, unscrupulous users can access those records, and possibly even modify them. These examples demonstrate the importance of establishing levels of user access across the entire data lake. By creating strategic, strict rules for role-based access, it’s possible to minimize the risks to data, particularly sensitive data or raw data that has yet to be vetted and processed. In general, the widest access should go to data that has been confirmed to be clean, accurate, and ready for use, which limits the chances of opening a potentially damaging file or gaining inappropriate access to sensitive data.

Use Machine Learning: Some data lake platforms come with built-in machine learning (ML) capabilities. The use of ML can significantly minimize security risks by accelerating raw data processing and categorization, particularly if used in conjunction with a data cataloging tool. By implementing this level of automation, large amounts of data can be processed for general use while red flags in raw data are identified for further security investigation.

Partitions and Hierarchy: When data gets ingested into a data lake, it’s important to store it in a proper partition. The general consensus is that data lakes require several standard zones to house data based on how trusted it is and how ready-to-use it is. These zones are:

  • Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.
  • Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.
  • Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.
  • Refined: Where enriched and manipulated data lives, often as final outputs from tools.

Using zones like these creates a hierarchy that, when coupled with role-based access, can help minimize the possibility of the wrong people accessing potentially sensitive or malicious data.
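One way to picture how zones and role-based access work together is as a simple permission matrix, sketched below in Python. The role names and grants are illustrative assumptions, not a recommended policy.

```python
# Sketch: couple the zone hierarchy above with role-based access. The roles
# and the grants in this matrix are illustrative assumptions only.

ZONE_ACCESS = {
    "temporal": {"platform_admin"},
    "raw":      {"platform_admin", "data_engineer"},
    "trusted":  {"platform_admin", "data_engineer", "data_scientist", "analyst"},
    "refined":  {"platform_admin", "data_engineer", "data_scientist", "analyst"},
}

def can_read(role: str, zone: str) -> bool:
    return role in ZONE_ACCESS.get(zone, set())

if __name__ == "__main__":
    print(can_read("analyst", "trusted"))  # True: validated data is widely readable
    print(can_read("analyst", "raw"))      # False: unvetted data stays restricted
```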

Data Lifecycle Management: Which data is constantly used by your organization? Which data hasn’t been touched in years? Data lifecycle management is the process of identifying and phasing out stale data. In a data lake environment, older, stale data can be moved to a tier designed for efficient storage, ensuring that it is still available should it ever be needed without taking up needed resources. A data lake powered by ML can even use automation to identify and process stale data to maximize overall efficiency. While this may not touch directly on security concerns, an efficient, well-managed data lake functions like a well-oiled machine rather than collapsing under the weight of its own data.

Data Encryption: The idea that encryption is vital to data security is nothing new, and most data lake platforms come with their own methodology for data encryption. How your organization executes it, of course, is critical. Regardless of which platform you use or whether you deploy on premises or in the cloud, a sound data encryption strategy that works with your existing infrastructure is absolutely vital to protecting all of your data, whether in motion or at rest—in particular, your sensitive data.
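For a sense of what application-level encryption of a sensitive field can look like before data lands in the raw zone, here is a minimal sketch using the third-party Python cryptography package. It assumes a key held in memory purely for illustration; a real deployment would lean on the platform's built-in at-rest encryption plus a managed key service.

```python
# Minimal sketch of application-level encryption for a sensitive field before
# it lands in the raw zone, using the third-party 'cryptography' package
# (pip install cryptography). Real deployments would rely on the platform's
# at-rest encryption plus a managed key service rather than a key in memory.

from cryptography.fernet import Fernet

key = Fernet.generate_key()   # in practice: fetch from a key-management service
cipher = Fernet(key)

record = {"customer_id": "C-1042", "ssn": "123-45-6789"}
record["ssn"] = cipher.encrypt(record["ssn"].encode()).decode()  # protect at rest
print(record)

# Authorized consumers decrypt on read:
print(cipher.decrypt(record["ssn"].encode()).decode())
```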

Create Your Secure Data Lake

What’s the best way to create a secure data lake? With Oracle’s family of products, a powerful data lake is just steps away. Built on the foundation of Oracle Cloud Infrastructure, Oracle Big Data Service delivers cutting-edge data lake capabilities while integrating with premier analytics tools and one-touch Hadoop security functions. Learn more about Oracle Big Data Service to see how easy it is to deploy a powerful cloud-based data lake in your organization—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

What Is Oracle Cloud Infrastructure Data Catalog?

And What Can You Do with It?

Simply put, Oracle Cloud Infrastructure Data Catalog helps organizations manage their data by creating an organized inventory of data assets. It uses metadata to create a single, all-encompassing and searchable view to provide deeper visibility into your data assets across Oracle Cloud and beyond. This video provides a quick overview of the service.

This helps data professionals such as analysts, data scientists, and data stewards discover and assess data for analytics and data science projects. It also supports data governance by helping users find, understand, and track their cloud data assets and on-premises data as well—and it’s included with your Oracle Cloud Infrastructure subscription.


Why Does Oracle Cloud Infrastructure Data Catalog Matter?

Hint: It has to do with self-service data discovery and governance.

Oracle Cloud Infrastructure Data Catalog matters because it’s a foundational part of the modern data platform—a platform where all of your data stores can act as one, and you can view and access that data easily, no matter whether it resides in Oracle Cloud, object storage, an on-premises database, big data system, or a self-driving database.

This means that data users—data scientists, data analysts, data engineers, and data stewards—can all find data across systems and the enterprise more easily because a data catalog provides a centralized, collaborative environment to encourage exploration. Now these key players can trust their data because they gain technical as well as business context around it. It means they don’t have to have SQL access, or understand what object storage is, or figure out the complexities of Hadoop—they can get started faster with their single unified view through their data catalog. It’s no longer necessary to have five different people with five different skillsets just to find where the right data resides.

Easy data discovery is now possible.

And of course, it’s not just data discovery that’s easier. Governance is also easier—and that is a key benefit with GDPR and ever more complex compliance requirements in today’s world of multiple enterprise systems, with on-premises, cloud, and multi-cloud environments.

With Oracle Cloud Infrastructure Data Catalog, you have better visibility into all of your assets, and business context is available in the form of a business glossary and user annotations. And of course, understanding the data you have is essential for governance.

How Does Oracle Cloud Infrastructure Data Catalog Work?

Oracle Cloud Infrastructure Data Catalog harvests metadata—technical, business, and operational—from various data sources, users, and assets and turns it into a data catalog: a single collaborative solution for data professionals to collect, organize, find, access, enrich, and activate metadata to support self-service data discovery and governance for trusted data assets across Oracle Cloud.

And what’s so important about this metadata? Metadata is the key to Oracle Cloud Infrastructure Data Catalog. There are three types of metadata that are relevant and key to how our data catalog works:

  • Technical metadata: Describes the storage and structure of the data in a database or system
  • Business metadata: Contributed by users as annotations or business context
  • Operational metadata: Created from the processing and accessing of data, which indicates data freshness and data usage, and connects everything together in a meaningful way

You can harvest this metadata from a variety of sources, including:

    • Oracle Cloud Infrastructure Object Storage
    • Oracle Database
    • Oracle Autonomous Transaction Processing
    • Oracle Autonomous Data Warehouse
    • Oracle MySQL Cloud Service
    • Hive
    • Kafka

And the supported file types for Oracle Cloud Infrastructure Object Storage include:

    • CSV, Excel
    • ORC, Avro, Parquet
    • JSON

Once the technical metadata is harvested, subject matter experts and data users can contribute business metadata in the form of annotations to the technical metadata. By organizing all this metadata and providing a holistic view into it, Oracle Cloud Infrastructure Data Catalog helps data users find the data they need, discover information on available data, and gain information about the trustworthiness of data for different uses.
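To make the three metadata types concrete, the sketch below shows what a single catalog entry might carry once technical harvesting, business enrichment, and operational tracking come together. The field names are assumptions for illustration, not the actual Data Catalog object model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative shape of a catalog entry combining the three metadata types
# described above. Field names are assumptions for this sketch, not the actual
# Oracle Cloud Infrastructure Data Catalog object model.

@dataclass
class CatalogEntry:
    # Technical metadata: harvested from the source system
    asset: str
    entity: str
    columns: List[str] = field(default_factory=list)
    # Business metadata: contributed by users
    glossary_terms: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)
    # Operational metadata: derived from processing and access
    last_refreshed: Optional[str] = None
    access_count_30d: int = 0

entry = CatalogEntry(
    asset="Object Storage: sales-bucket",
    entity="orders_2020.parquet",
    columns=["order_id", "customer_id", "total"],
    glossary_terms=["Order", "Customer"],
    annotations=["Validated by finance, Q3 2020"],
    last_refreshed="2020-09-30",
    access_count_30d=57,
)
print(entry)
```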

How Can You Use a Data Catalog?

Metadata Enrichment

Oracle Cloud Infrastructure Data Catalog enables users to collaboratively enrich technical information with business context to capture and share tribal knowledge. You can tag or link data entities and attributes to business terms to provide a more all-inclusive view as you begin to gather data assets for analysis and data science projects. These enrichments also help with classification, search, and data discovery.

Business Glossaries

One of the first steps towards effective data governance is establishing a common understanding of business concepts across the organization, and establishing their relationships to the data assets in the organization. Oracle Cloud Infrastructure Data Catalog makes it possible to see associations and linkages between glossary terms and other technical terms, assets, and artifacts. This helps increase user trust because users understand the relationships and what they’re looking at.

Oracle Cloud Infrastructure Data Catalog makes this possible by including capabilities to collaboratively define business terms in rich text form, categorize them appropriately, and build a hierarchy to organize this vocabulary. You can also create parent-child relationships between various terms to build a taxonomy, or set business term owners and approval status so that users know who can answer their questions regarding specific terms. Once created, users can then link these terms to technical assets to provide business meaning and use them for searching as well.
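As a rough illustration of such a taxonomy, the Python sketch below models terms with owners, approval status, parent-child relationships, and links to technical assets. Every name in it is invented for the example.

```python
# Sketch of a business glossary taxonomy: terms with owners, approval status,
# parent-child relationships, and links to technical assets. All names here
# are made-up examples, not a predefined glossary.

glossary = {
    "Revenue":       {"owner": "finance_steward", "status": "approved", "parent": None,
                      "linked_assets": []},
    "Net Revenue":   {"owner": "finance_steward", "status": "approved", "parent": "Revenue",
                      "linked_assets": ["ADW.SALES.NET_REVENUE_FACT"]},
    "Gross Revenue": {"owner": "finance_steward", "status": "draft",    "parent": "Revenue",
                      "linked_assets": ["ADW.SALES.ORDERS"]},
}

def children(term):
    return [name for name, t in glossary.items() if t["parent"] == term]

print(children("Revenue"))                       # ['Net Revenue', 'Gross Revenue']
print(glossary["Net Revenue"]["linked_assets"])  # assets a search on the term can surface
```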

Searchable Data Asset Inventory


Being able to search across data stores makes finding the right data so much easier. With Oracle Cloud Infrastructure Data Catalog, you have a powerful, searchable, standardized inventory of the available data sources, entities, and attributes. You can enter technical information, defined tags, or business terms to easily pull up the right data entities and assets. You can also use filtering options to discover relevant datasets, or browse metadata based on the technical hierarchy of data assets, entities, and attributes. These features make it easier to get started with data science, analytics, and data engineering projects.
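At its core, that search experience is a filter over one unified inventory by technical name, business term, or tag; the toy sketch below shows the idea with an invented inventory.

```python
# Toy sketch of searching a unified inventory by technical name, business
# term, or tag. The inventory contents are invented for illustration.

INVENTORY = [
    {"name": "orders_2020.parquet", "asset": "object-storage", "terms": ["Order"],    "tags": ["finance"]},
    {"name": "HR.EMPLOYEES",        "asset": "oracle-db",      "terms": ["Employee"], "tags": ["hr", "pii"]},
    {"name": "clickstream_topic",   "asset": "kafka",          "terms": [],           "tags": ["web"]},
]

def search(query):
    q = query.lower()
    return [e["name"] for e in INVENTORY
            if q in e["name"].lower()
            or any(q == t.lower() for t in e["terms"])
            or any(q == t.lower() for t in e["tags"])]

print(search("order"))  # matches on technical name or the business term "Order"
print(search("pii"))    # matches on tag
```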

Data Catalog API and SDK

Many of Oracle Cloud Infrastructure Data Catalog’s capabilities are also available as public REST APIs to enable integrations such as:

  • Searching and displaying results in applications that use the data assets
  • Looking up definitions of defined business terms in the business glossary and displaying them in reporting applications
  • Invoking job execution to harvest metadata as needed

Available search capabilities include:

  • Search data based on technical names, business terms, or tags
  • View details of various objects
  • Browse Oracle Cloud Infrastructure Data Catalog based on data assets

The single collaborative environment includes:

  • Homepage with helpful shortcuts and operational stats
  • Search and browse
  • Quick actions to manage data assets, glossaries, jobs, and schedules
  • Popular tags and recently updated objects

Conclusion

Oracle Cloud Infrastructure Data Catalog is the underlying foundation to data management that you’ve been waiting for—and it’s included with your Oracle Cloud Infrastructure subscription. Now, data professionals can use technical, business, and operational metadata to support self-service data discovery and governance for data assets in Oracle Cloud and beyond.

Leverage your data in new ways, and more easily than you ever could before. Try Oracle Cloud Infrastructure Data Catalog today and start discovering the value of your data. And don’t forget to subscribe to the Big Data Blog for the latest on Big Data straight to your inbox!

Related:

Data Lake, Data Warehouse and Database…What’s the Difference?

There are so many buzzwords these days regarding data management. Data lakes, data warehouses, and databases – what are they? In this article, we’ll walk through them and cover the definitions, the key differences, and what we see for the future.

Start building your own data lake with a free trial

Data Lake Definition

If you want full, in-depth information, you can read our article called, “What’s a Data Lake?” But here we can tell you, “A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.”

The data lake tends to ingest data very quickly and prepare it later, on the fly, as people access it.


Data Warehouse Definition

A data warehouse collects data from various sources, whether internal or external, and optimizes the data for retrieval for business purposes. The data is usually structured, often from relational databases, but it can be unstructured too.

Primarily, the data warehouse is designed to gather business insights and allows businesses to integrate their data, manage it, and analyze it at many levels.

Database Definition

Essentially, a database is an organized collection of data. Databases are classified by the way they store this data. Early databases were flat and limited to simple rows and columns. Today, the popular databases are:

  • Relational databases, which store their data in tables
  • Object-oriented databases, which store their data in object classes and subclasses

Data Mart, Data Swamp and Other Terms

And, of course, there are other terms such as data mart and data swamp, which we’ll cover very quickly so you can sound like a data expert.

Enterprise Data Warehouse (EDW): This is a data warehouse that serves the entire enterprise.

Data Mart: A data mart is used by individual departments or groups and is intentionally limited in scope because it looks at what users need right now versus the data that already exists.

Data Swamp: When your data lake gets messy and is unmanageable, it becomes a data swamp.

The Differences Between Data Lakes, Data Warehouses, and Databases

Data lakes, data warehouses and databases are all designed to store data. So why are there different ways to store data, and what’s significant about them? In this section, we’ll cover the significant differences, with each definition building on the last.

The Database

Databases came about first, rising in the 1950s with the relational database becoming popular in the 1980s.

Databases are really set up to monitor and update real-time structured data, and they usually have only the most recent data available.

The Data Warehouse

But the data warehouse is a model to support the flow of data from operational systems to decision systems. What this means, essentially, is that businesses were finding that their data was coming in from multiple places—and they needed a different place to analyze it all. Hence the growth of the data warehouse.

For example, let’s say you have a rewards card with a grocery chain. The database might hold your most recent purchases, with a goal to analyze current shopper trends. The data warehouse might hold a record of all of the items you’ve ever bought and it would be optimized so that data scientists could more easily analyze all of that data.

The Data Lake

Now let’s throw the data lake into the mix. And because it’s the newest, we’ll talk about this one more in depth. The data lake really started to rise around the 2000s, as a way to store unstructured data in a more cost-effective way. The key phrase here is cost effective.

Although databases and data warehouses can handle unstructured data, they don’t do so in the most efficient manner. With so much data out there, it can get expensive to store all of your data in a database or a data warehouse.

In addition, there’s the time-and-effort constraint. Data that goes into databases and data warehouses needs to be cleansed and prepared before it gets stored. And with today’s unstructured data, that can be a long and arduous process when you’re not even completely sure that the data is going to be used.

That’s why data lakes have risen to the forefront. The data lake is mainly designed to handle unstructured data in the most cost-effective manner possible. As a reminder, unstructured data can be anything from text to social media data to machine data such as log files and sensor data from IoT devices.

Data Lake Example

Going back to the grocery example that we used with the data warehouse, you might consider adding a data lake into the mix when you want a way to store your big data. Think about the social sentiment you’re collecting, or advertising results. Anything that is unstructured but still valuable can be stored in a data lake and work with both your data warehouse and your database.

Note 1: Having a data lake doesn’t mean you can just load your data willy-nilly. That’s what leads to a data swamp. But it does make the process easier, and new technologies such as having a data catalog will steadily make it simpler to find and use the data in your data lake.

Note 2: If you want more information on the ideal data lake architecture, you can read the full article we wrote on the topic. It describes why you want your data lake built on object storage and Apache Spark, versus Hadoop.

What’s the Future of Data Lakes, Data Warehouses, and Databases?

Will one of these technologies rise to overtake the others?

We don’t think so.

Here’s what we see. As the value and amount of unstructured data rises, the data lake will become increasingly popular. But there will always be an essential place for databases and data warehouses.

You’ll probably continue to keep your structured data in the database or data warehouse. But these days, more companies are moving their unstructured data to data lakes on the cloud, where it’s more cost effective to store it and easier to move it when it’s needed.

This workload that involves the database, data warehouse, and data lake in different ways is one that works, and works well. We’ll continue to see more of this for the foreseeable future.

If you’re interested in the data lake and want to try to build one yourself, we’re offering a free data lake trial with a step-by-step tutorial. Get started today, and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.

Related:

4 Reasons Why Businesses Need Data Lakes

It’s common knowledge in the modern business world that big data—that is, the large volumes of data collected for analysis by organizations—is a significant element of any strategy: operations, sales, marketing, finance, HR, and every other department rely on big data solutions to stay ahead of the competition. But how that big data is handled remains another story.

Enter the world of data lakes. Data lakes are repositories that can take in data from multiple sources. Rather than processing data for immediate analysis, a data lake stores all received data in its native format. This model allows data lakes to hold massive amounts of data while using minimal resources; data is only processed when it is called for use (compared to a data warehouse, which processes all incoming data). This ultimately makes the data lake an efficient option for storage, resource management, and data preparation.

But do you actually need a data lake, especially if your big data solution already has a data warehouse? The answer is a resounding yes. In a world where the volume of data transmitted across countless devices continues to increase, a resource-efficient means of accessing data is critical to a successful organization. In fact, here are four specific reasons why the need for a data lake is only going to get more urgent as time goes on.


90% of data has been generated since 2016

90% of all data ever is a lot—or is it? Consider what has become available to people as Wi-Fi, smartphones, and high-speed data networks have entered everyday life over the past twenty years. In the early 2000s, streaming was limited to audio, while broadband internet was used mostly for web surfing, emailing, and downloads. In that paradigm, device data was at a minimum and the actual data consumed was mostly about interpersonal communication, especially because videos and TV hadn’t hit a level of compression that supported high-quality streaming. Towards the end of the decade, smartphones became common and Netflix had shifted its business priority to streaming.

That means between 2010 and 2020, the internet has seen the growth of smartphones (and their apps), social media, streaming services for both audio and video, streaming video game platforms, software delivered through downloads rather than physical media, and so on, all creating exponential consumption of data. As for the part that is the most relevant to business? Consider how many businesses have associated apps constantly transmitting data to and from devices, whether to control appliances, provide instructions and specifications, or quietly transmit user metrics in the background.

With 5G data networks widely starting to deploy in 2019, bandwidths and speeds are only going to get better. This means as massive—and significant—as big data has already been in the past few years, it’s only going to get bigger as technology allows the world to become even more connected. Is your data repository ready?

95% of businesses handle unstructured data

In a digital world, businesses collect data from all types of sources, and most of that is unstructured. Consider the data collected by a company that sells services and makes appointments via an app. While some of that data comes structured—that is, in predefined formats and fields such as phone numbers, dates, transaction prices, time stamps, etc.—a company like that still has to archive and store a lot of unstructured data. Unstructured data is any type of data that doesn’t contain an inherent structure or predefined model, which makes it difficult to search, sort, and analyze without further preparation.

For the example above, unstructured data comes in a wide range of formats. For a user making an appointment, any text fields filled out to make that appointment count as unstructured data. Within the company itself, emails and documents are another form of unstructured data. The posts from a company’s social media channel are also unstructured data. Any photos or videos used by employees as notes while performing services are unstructured data. Similarly, any instructional videos or podcasts created by the company as marketing assets are also unstructured.

Unstructured data is everywhere, and as more devices connect to deliver a greater range of information, it becomes clear that organizations need a way to get their proverbial arms around all of it.

4.4 million GB of data are used by Americans every minute

More than 325 million people live in the US. Nearly 70% of them have smartphones. And even if you don’t count the people currently streaming media, consider what is happening on an average smartphone in a minute. It’s receiving an update on the weather. It’s checking for any new emails in the user’s inbox. It’s pushing data to social media, delivering voicemail over Wi-Fi, delivering strategic marketing notifications from apps, such as when a real estate app pushes a new housing listing. It’s sending text and images via chat apps, and downloading app/OS updates in the background.

Data is everywhere now, which means that in the minute that just passed while you read the paragraph above, millions of gigabytes of data were transmitted across the country—4.4 million GB every minute, according to Domo’s Data Never Sleeps report. And that’s just the United States; when combined with the rest of the world, the total volume of data grows exponentially. For businesses, collecting this kind of data is vital to all aspects of operations, from marketing to sales to communication. Thus, every organization must put a premium on safe, available, and accessible storage.

50% of businesses say that big data has changed their sales and marketing

Most people think of big data in terms of the technical aspects. Clearly, a company that works through a phone app or provides a form of streaming uses big data and is delivering a service that simply wasn’t feasible twenty years ago. However, big data is much more than delivery of streaming content. It can create significant improvements in sales and marketing—so much so that according to a McKinsey report, 50% of businesses say that big data is driving them to change their approach in these departments.

What’s the reason for this? With big data, organizations have a much more efficient path to understanding customers than in-person focus groups. Data allows for gathering a mass sample of actions from existing and potential customers. Everything from their website browsing prior to conversion to how long they engaged with certain features of a product or service is available at high volume, which creates a large enough sample size for a reliable customer model. To be in the cutting-edge 50%, an organization needs the data infrastructure to receive, store, and retrieve massive amounts of structured and unstructured data for processing.

Basically, you need a data lake

The above statistics all point to one thing—your organization needs a data lake. And if you don’t get ahead of the curve now in terms of managing data, it’s clear that the world will pass you by in all areas: operations, sales, marketing, communications, and other departments. Data is simply a way of life now, enabling precise insight-driven decisions and unparalleled discovery into root causes. When combined with machine learning and artificial intelligence, this data also allows for predictive modeling for future actions.

Learn more about why data lakes are the future of big data and discover Oracle’s big data solutions—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.


Related:

The Oracle Autonomous Data Warehouse: a three-part series

Part 1: Injecting intelligence and autonomy into data management

Today’s connected society is driven, if not given life, by data. Don’t believe me? Consider that by 2025, each connected person will have at least one data interaction every 18 seconds. That’s 4,800 times per day! Furthermore, by that same year, more than a quarter of data created will be real time in nature, with real-time IoT data making up more than 95% of this.

All these devices and interaction points with new, connected systems are slated to grow the global data footprint from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025. To put that into perspective, 90 percent of the world’s data was generated in the past two years. The figures become even more staggering when you imagine the exponential amount of data we’ll create over the next five years.

If you look at the data-driven market in general, solutions around data management are growing and show no signs of slowing. To better manage the vast growth around data creation, IDC pointed out that big data and business analytics (BDA) related software are quickly becoming the two largest software categories.

“Digital transformation is a key driver of big data and analytics spending with executive-level initiatives resulting in deep assessments of current business practices and demands for better, faster, and more comprehensive access to data and related analytics and insights,” said Dan Vesset, group vice president, Analytics and Information Management at IDC. “Enterprises are rearchitecting to meet these demands and investing in modern technology that will enable them to innovate and remain competitive.”

The net? As sure as the sun will rise tomorrow, the volume of data will continue to grow and compound. It’s critical for businesses to learn the value of data and how best to leverage it. That means learning new ways to ingest different data types, how to process data intelligently, how to secure and protect it, and how it can enable better market decisions. Because without data-driven solutions, businesses will stall, overcome by their inability to gain real value from their own hard-earned natural resource – data.

So, what’s the reality if businesses don’t actually invest in data-driven solutions? It’s actually pretty grim. Reminder: whether you like it or not, your business will generate data. And, you’ll likely generate a lot of it over the course of a year.

That data contains information about your products, services, solutions, customers, and so much more. Without good data engines that help correlate and understand the information, your data sits idle and will never reach the working potential that you need.

Without a data-driven solution, you’ll quickly see both your competitors and the market pass you by. You won’t be able to react to industry trends as quickly, nor will you have a good vision into customer requirements, and most of all, you’ll lose serious dollars because you won’t have reliable insights to make better decisions.

So how do you know if you’re actually using data properly — or using proper data? This is where a good partner can help. But, from an even more straightforward perspective, you can ask yourself some pretty basic questions:

1. Do I have the ability to aggregate various data points to make better business decisions?

2. Is my data sitting idle in silos doing nothing?

3. Do I have a means to leverage data analytics engines to see relationships between the market, my customers, and my data?

4. Am I able to impact business direction with clear data points?

5. Do I have good visibility into my data streams, or are they sitting unguarded and unused?

The first step to gaining control of your data platform is to understand that you might have a data management challenge. But with so much data being created every single day, designing an intelligent data processing solution isn’t always easy.

In this three-part series, we’ll examine the world of data warehousing, how autonomous data warehouses are built, along with details about how to create an effective security strategy with data and autonomous data warehouses.

The Difference Between Simple Data Storage and Intelligent Data Warehousing

Our reliance on data has also created new ways to process, store, and utilize information. In fact, there are now ways to store data where it sits idle (simple storage), as well as intelligent data warehousing – where intelligence is directly integrated into the storage and data ingestion process.

  • Traditional Simple Storage: Think of this as a storage array that sits in your data center. Similarly, it can be a storage repository that’s in a colocation or the cloud. These ‘simple’ storage mechanisms literally store only data. They’re highly resilient, capable of archival storage or even all-flash performance. They’ll even have some stats and analytics around storage utilization, file types, and ways to optimize performance and experience. However, the data-driven analytics portion of these solutions will usually be minimal at best. That means that although you can leverage some of the best systems out there to store your data, you’ll still need a data engine that can help you find ways to apply and use the data.
  • Intelligence-Driven Data Storage and Warehousing: Unlike traditional storage and file repositories, think of these systems as powerful data ingestion engines. They can store your data. But they also connect to various data streams like ERP systems, cloud applications, backend databases, and more to ingest, quantify, correlate, and process vast amounts of data. This data can come from all sources and can be different in structure. Unlike the many simple storage solutions out there, intelligence-driven data warehousing also lets you work with data that’s structured, unstructured, semi-structured, and so on. From there, you can leverage powerful data-driven tools to make even better business decisions. These tools are applications and services that are designed to find patterns and trends impossible to spot using traditional storage methods.

The New Wave of Database Automation Is Self-Driving

Here’s a specific example: Coupled with solutions around data analytics and big data processing, data warehousing allows you to take valuable information to a new level. Powerful data warehouse solutions help you create data visualization to make better decisions around your business and the market. A data warehouse helps with data ingestion and is a decision support system which stores historical data from across the organization, processes it, and makes it possible to use the data for critical business analysis, reports, and dashboards.

Now, let’s take this a step further. Let’s say you’re a large organization dealing with massive streams of data. Or, maybe you’re a company that needs to respond to certain trends and data analysis points. How do you take your data warehouse solution and create levels of autonomous behavior? Most of all, how do you work with cognitive solutions like AI, machine learning, and even advanced data analytics to get even more value out of your data?

IDC recently estimated that the amount of the global datasphere subject to data analysis will grow by a factor of 50, to 5.2 ZB, in 2025, and that the amount of analyzed data that is “touched” by cognitive systems will grow by a factor of 100, to 1.4 ZB, in 2025!

From that perspective, to leverage data, you need to not only support new data center initiatives, but also enable the business to become a part of a data-driven world.

This is where data warehouses that are autonomous and capable of cognitive data processing can revolutionize the way you leverage data, work with intelligent systems, visualize information, and deliver powerful new competitive advantages to your business.

A dashboard offering a personalized data view from a range of data sources, aggregated within the Oracle Autonomous Data Warehouse.

As pictured in the image above, you can see how new solutions – in this case, specifically, Oracle Autonomous Data Warehouse – are changing the way organizations can leverage data to create competitive advantages.

Basically, the autonomous solutions leverage cognitive systems to process data far more effectively, helping your business react to major market trends and customer demands. Beyond that, you can leverage powerful dashboards that make data analytics far easier and more proactive. And here’s the cool part: so much of this is automated for you when you leverage an autonomous data warehouse! But before we go much further, it’s important to note what we actually mean by a data warehouse that is autonomous:

The Oracle Autonomous Data Warehouse provides an easy-to-use, fully autonomous data warehouse that scales elastically, delivers fast query performance and requires no database administration. It is designed to support all standard SQL and business intelligence (BI) tools, and provides all the performance of the market-leading Oracle Database in an environment that is tuned and optimized for data warehouse workloads.

As a service, the Oracle Autonomous Data Warehouse does not require database administration. With Oracle Autonomous Data Warehouses, you do not need to configure or manage any hardware or install any software. The Autonomous Data Warehouse creates the data warehouse, backs up the database, patches and upgrades the database, and grows or shrinks the database as needed, automatically.

Oracle Autonomous Data Warehouse

“Oracle has thought a great deal about how to give businesses the ability to take advantage of that potential, without adding a tremendous amount of strain on their IT resources which is why we developed Oracle Autonomous Database,” Oracle Product Marketing Manager, Christopher McCarthy says.

Remember, the entire idea here is to simplify the way you work with and manage vast amounts of different types of data. This is a big reason why Oracle’s Autonomous Data Warehouse makes working with data so much easier: it does not require any tuning. By design, Autonomous Data Warehouse is a “load-and-go” service. You start the service, define tables, load data, run queries, and then get value out of the data. That’s it. Seriously.
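As a hedged sketch of what "load-and-go" looks like from a client's point of view, the Python example below connects, defines a table, loads a few rows, and queries them, using the python-oracledb driver as one possible client. The credentials, DSN alias, and wallet paths are placeholders you would replace with your own service details.

```python
# Hedged sketch of the "load-and-go" flow from a client's point of view:
# connect, define a table, load a few rows, query. Uses the python-oracledb
# driver as one example client; credentials, DSN, and wallet paths below are
# placeholders, not real service details.

import oracledb

conn = oracledb.connect(
    user="ADMIN",
    password="<your_password>",
    dsn="adw_low",                      # TNS alias from the downloaded wallet
    config_dir="/path/to/wallet",
    wallet_location="/path/to/wallet",
    wallet_password="<wallet_password>",
)

with conn.cursor() as cur:
    cur.execute("CREATE TABLE sales_demo (region VARCHAR2(30), amount NUMBER)")
    cur.executemany("INSERT INTO sales_demo VALUES (:1, :2)",
                    [("EMEA", 120000), ("APAC", 98000), ("AMER", 150500)])
    conn.commit()
    cur.execute("SELECT region, SUM(amount) FROM sales_demo GROUP BY region")
    for region, total in cur:
        print(region, total)            # no indexing, partitioning, or tuning steps
```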

A good example is the car rental company Hertz, which is using Autonomous Data Warehouse to reduce IT administration, increase data security, and analyze its data in Oracle Autonomous Analytics Cloud. Before Autonomous Data Warehouse, it took two weeks, end to end, for Hertz to get the approvals necessary to provision a new database, provision it, tune it, and be ready to load data. With Oracle Autonomous Data Warehouse, they were able to provision and load their data in just 8 minutes!

Again, this is specifically designed for ease of use and getting results from data, fast. When you use Autonomous Data Warehouse, no tuning is necessary. You do not need to consider any details about parallelism, partitioning, indexing, or compression. The service automatically configures the database for high performance queries.

Furthermore, Autonomous Data Warehouse includes a cloud-based service console for managing the service (for tasks like creating or scaling the service), and monitoring the service (for tasks such as viewing the recent levels of activity on the data warehouse).

Autonomous Data Warehouse also includes a cloud-based notebook application that provides simple querying, data-visualization, and collaboration capabilities. The notebook is designed for use alongside other business intelligence applications. These Oracle Machine Learning notebooks, based on Apache Zeppelin technology, enable teams to collaborate to build, evaluate, and deploy predictive models and analytical methodologies in the Oracle Autonomous Data Warehouse.

Multi-user collaboration lets multiple users have the same notebook open at the same time; changes made by one user are immediately visible to other team members. Plus, these data-driven solutions are also really smart when it comes to security. Supporting enterprise requirements for security, authentication, and auditing, Oracle Machine Learning notebooks adhere to all Oracle standards and support privilege-based access to data, models, and notebooks.

Putting It All Together

There’s absolutely no question that we’re creating more data. And, we know that this data is very valuable to every business that’s trying to compete. The Oracle Autonomous Data Warehouse isn’t just another data processing engine. It’s designed to ‘think’ while making data ingestion and processing much easier. You’re working with a technology that can understand various types of data inputs and actually adjust, automatically, based on the data set and the type of queries you’re creating. What does this mean? You can focus more on the results of the data and less on trying to set it all up.

To that end, it’s really important to grasp the architecture behind the Oracle Autonomous Data Warehouse. That’s why I lifted the hood on the product and took a look inside.

Conclusion

In part two of our three-part blog series, we’ll review the Oracle Autonomous Data Warehouse architecture, its key benefits, and even how data visualization is impacted by new and advanced autonomous solutions.

In the meantime, register for this webcast to learn more about Oracle Autonomous Data Warehouse, its key benefits and advantages (including how it’s 8-14x faster than AWS), and how easy it is to get started with the world’s first self-driving database.

For more information on how Oracle Cloud outperforms the competition, follow #LetsProveIt on Twitter and LinkedIn. And if you haven’t yet tried Oracle Cloud, sign up for a free trial.

Related:

Migrate from Amazon Redshift to Oracle Autonomous Data Warehouse in 7 easy steps.

In this blog, I’ll give you a quick overview of how you can use the SQL Developer Amazon Redshift Migration Assistant to migrate your existing Amazon Redshift to Oracle Autonomous Data Warehouse (ADW).

But first, why the need to migrate to Autonomous Data Warehouse?

Data-driven organizations differentiate themselves through analytics, furthering their competitive advantage by extracting value from all their data sources. Today’s digital world is already creating data at such an explosive rate that the physical data warehouses that were once great for collecting data from across the enterprise for analysis can no longer keep pace with the storage and compute resources needed to support them. In addition, the manual, cumbersome work of patching, upgrading, and securing these environments and their data poses significant risks to businesses.

A few cloud vendors serve this niche market; one of them is Amazon Redshift, a fully managed data warehouse cloud service built on technology licensed from ParAccel. Though it was an early entrant, its query processing architecture severely limits concurrency, making it unsuitable for large data warehouses or web-scale data analytics. Redshift is only available in fixed blocks of hardware configurations, so compute cannot be scaled independently of storage. This leads to excess capacity, making customers pay for more than they use. Additionally, resizing puts the cluster in a read-only state and may require downtime, which can take hours while data is redistributed.

Oracle Autonomous Data Warehouse is a fully managed database tuned and optimized for data warehouse workloads that supports both structured and unstructured data. It automatically and continuously patches, tunes, backs up, and upgrades itself with virtually no downtime. Integrated machine-learning algorithms drive automatic caching, adaptive indexing, advanced compression, and optimized cloud data loading, delivering unrivaled performance that lets you quickly extract data insights and make critical decisions in real time. With little human intervention, the product virtually eliminates human error, with dramatic implications not only for minimizing security breaches and outages but also for cost. Autonomous Data Warehouse is built on the latest Oracle Database software and technology that runs your existing on-premises marts, data warehouses, and applications, making it compatible with all your existing data warehouse, data integration, and BI tools.

Strategize your Data Warehouse Migration

Here is a proposed workflow for either an on-demand migration from Amazon Redshift or the generation of scripts for a scheduled manual migration that can be run at a later time.

1. Connect: Establish connections to both Amazon Redshift (source) and Oracle Autonomous Data Warehouse (target) using the SQL Developer Migration Assistant.

Download SQL Developer 18.3 or a later version. It is a client application that can be installed on a workstation or laptop running either Windows or macOS; for the purposes of this blog, we will run it on Microsoft Windows. Also download the Amazon Redshift JDBC driver to access the Amazon Redshift environment.

Open the SQL Developer application and add the Redshift JDBC driver as a third-party driver (Tools > Preferences > Database > Third Party JDBC Drivers).

Add a connection to the Amazon Redshift database: in the Connections panel, create a new connection, select the Amazon Redshift tab, and enter the connection information for Amazon Redshift.

Tip:

  • If you are planning to migrate multiple schemas it is recommended to connect with the master username to your Amazon Redshift instance.
  • If you deployed your Amazon Redshift environment within a Virtual Private Cloud (VPC), you have to ensure that your cluster is accessible from the internet; here are the details on how to enable public internet access.
  • If your Amazon Redshift client connection to the database appears to hang or times out when running long queries, here are the details with possible solutions to address this issue.

Add a connection to Oracle Autonomous Data Warehouse: in the Connections panel, create a new connection, select the Oracle tab, and enter the connection information along with the wallet details. If you haven’t provisioned Autonomous Data Warehouse yet, please do so now; here are quick, easy steps to get you started. You can even start with a free trial account.

Test connections for both Redshift and Autonomous Data Warehouse before you save them.

2. Capture / Map Schema: From the tools menu of SQL Developer, start the Cloud Migration Wizard to capture metadata schemas and tables from the source database (Amazon Redshift).

First, connect to AWS Redshift from the connection profile and identify the schemas that need to be migrated. All objects in the schema, mainly tables, will be migrated. You have the option to migrate data as well. Migration to Autonomous Data Warehouse is done on a per-schema basis, and schemas cannot be renamed as part of the migration.

Note: When you migrate data, you have to provide the AWS Access Key ID, the AWS Secret Access Key, and an existing S3 bucket URI where the Redshift data will be unloaded and staged. The security credentials require privileges to store data in S3; if possible, create new, separate access keys for the migration. The same access keys will be used later to load the data into Autonomous Data Warehouse using secure REST requests.

For example, if you provide the URI as https://s3-us-west-2.amazonaws.com/my_bucket, the migration assistant will create oracle_schema_name/oracle_table_name folders inside the bucket my_bucket, so the staged files end up at:

  https://s3-us-west-2.amazonaws.com/my_bucket/oracle_schema_name/oracle_table_name/*.gz

Redshift data types are mapped to Oracle data types, and Redshift object names are converted to Oracle names based on Oracle naming conventions. Column defaults that use Redshift functions are replaced with their Oracle equivalents.
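As a purely illustrative example of this mapping (the table is hypothetical, and the exact data types the wizard chooses may differ), a Redshift table such as the first statement below might be generated on the Oracle side roughly as the second:

  -- Hypothetical source table as defined in Amazon Redshift
  CREATE TABLE sales (
    sale_id  INTEGER,
    amount   DECIMAL(10,2),
    sold_at  TIMESTAMP,
    notes    VARCHAR(256)
  );

  -- Roughly equivalent DDL generated for Oracle Autonomous Data Warehouse
  CREATE TABLE SALES (
    SALE_ID  NUMBER(10),
    AMOUNT   NUMBER(10,2),
    SOLD_AT  TIMESTAMP,
    NOTES    VARCHAR2(256)
  );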

3. Generate Schema: Connect to Autonomous Data Warehouse from the connection profile. Ensure the user has administrative privileges, as this connection is used throughout the migration to create schemas and objects. Provide a password for the migration repository that will be created in Autonomous Data Warehouse; you can choose to remove this repository after the migration. Specify a directory on the local system to store the generated scripts necessary for the migration. To start the migration right away, choose ‘Migrate Now’.

Use ‘Advanced Settings’ to control the formatting options, the number of parallel threads to enable when loading data, and the reject limit (the number of rows to reject before erroring out) during the migration.

Review the summary and click ‘Finish’. If you have chosen an immediate migration, the wizard stays open until the migration is finished. Otherwise, the migration process generates the necessary scripts in the specified local directory and does not run them.

If you chose to only generate the migration scripts in the local directory, then continue with the following steps.

  1. Stage Data: Connect to the Amazon Redshift environment and run redshift_s3unload.sql to unload data from the Redshift tables and store it in Amazon S3 (staging), using the access credentials and the S3 bucket specified in the migration wizard workflow (see the illustrative sketch after this list).
  2. Deploy Target Schema: Connect to Autonomous Data Warehouse as a privileged user (for example, ADMIN) and run adwc_ddl.sql to deploy the generated schemas and DDL converted from Amazon Redshift.
  3. Copy Data: While still connected to Autonomous Data Warehouse, run adwc_dataload.sql, which contains all the load commands necessary to load the data straight from S3 into your Autonomous Data Warehouse.
  4. Review Migration Results: The migration task creates three files in the local directory: MigrationResults.log, readme.txt, and redshift_migration_reportxxx.txt. Each of them contains information on the status of the migration.
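To make these steps concrete, here is a rough sketch of the kind of statements the generated scripts contain. The table, bucket, credential, and format details are placeholders, and the actual scripts produced by the wizard will differ:

  -- redshift_s3unload.sql (run against Amazon Redshift): unload a table to S3 as gzipped files
  UNLOAD ('SELECT * FROM my_schema.sales')
  TO 's3://my_bucket/oracle_schema_name/SALES/'
  CREDENTIALS 'aws_access_key_id=AKIAxxxxxxxxxxxxxxxx;aws_secret_access_key=<secret>'
  DELIMITER ','
  GZIP;

  -- adwc_dataload.sql (run against Autonomous Data Warehouse): copy the staged files into the target table
  BEGIN
    DBMS_CLOUD.COPY_DATA(
      table_name      => 'SALES',
      credential_name => 'REDSHIFT_MIGRATION_CRED',
      file_uri_list   => 'https://s3-us-west-2.amazonaws.com/my_bucket/oracle_schema_name/SALES/*.gz',
      format          => JSON_OBJECT('delimiter' VALUE ',', 'compression' VALUE 'gzip')
    );
  END;
  /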

Test a few queries to make sure all your data from Amazon Redshift has been migrated. Oracle Autonomous Data Warehouse supports connections from various client applications; connect and test them as well.
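One simple, hedged way to spot-check the result is to compare row counts table by table on both sides; the schema and table names below are placeholders:

  -- Run against Amazon Redshift
  SELECT COUNT(*) AS redshift_rows FROM my_schema.sales;

  -- Run against Oracle Autonomous Data Warehouse
  SELECT COUNT(*) AS adw_rows FROM MY_SCHEMA.SALES;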

Conclusion

With greater flexibility, lower infrastructure cost, and lower operations overhead, there’s a lot to love about Oracle Autonomous Data Warehouse. The unique value of Oracle comes from its complete cloud portfolio with intelligence infused at every layer, spanning infrastructure services, platform services, and applications. For Oracle, the autonomous enterprise goes beyond simple automation, in which machines respond to an action with an automated reaction. Instead, it is based on applied machine learning, making the service completely autonomous, eliminating human error, and delivering unprecedented performance, security, and reliability in the cloud.

Related:

See How Easily You Can Make a Database Clone

There are lots of times when, as a DBA, you are going to get asked to make a copy of an existing database. Until now, creating these non-production environments has been a challenging and time-consuming process for technical teams, especially data warehouse DBAs! What everyone has been waiting for is a way to just “right-click” and deploy an exact copy of an existing data warehouse instance (obviously you can do this programmatically too, using the cloud command-line APIs, but that’s not the focus of this post). Well, the wait is over…

The new cloning feature of Autonomous Data Warehouse comes to the rescue… in a few mouse clicks it is possible to make an exact copy of your data warehouse, either including or excluding the actual data depending on your precise needs. Let’s see how this works…

The first step is to log in to your cloud console and go to the Autonomous Database overview screen. In the screen below we have set the “Workload” type filter to show only our Autonomous Data Warehouse instances…

ADB Landing Page with list of ADB instances

As you can see, we have an existing application database called “nodeappDB”. Let’s assume we have received a request to create a new training instance of this environment so we can run a training event for some key business users. To do this, we will use the new cloning feature to make a copy of our existing “nodeappDB” instance. Here we go…

Click on the little three vertical dots on the right-hand side of the instance row. In the pop-up menu there is a new menu option, “Create Clone”, as shown here:

Selecting the pop-up menu from the list of ADB instances

which gives us access to the “Create Clone” feature on the pop-up menu…

Selecting the create clone menu option

Step 1) Now up pops our familiar “Create Autonomous Database” form, except this time it says “Create Autonomous Database Clone”, and the first decision we have to make is to determine the type of clone that we want to create! Fortunately, there are two simple options:

Select the type of clone to create

Full Clone – this creates a new data warehouse instance complete with all our data and metadata (i.e. the definition of all the database objects such as tables, views etc).

Metadata Clone – this creates a new data warehouse that contains only our source data warehouse’s metadata without the data (i.e. the new autonomous database instance will only contain the definitions of our existing database objects such as tables, views etc).

Since we need to create a training environment, the obvious choice is a “Full Clone”, because the business users will get more from their training workshop if our instance contains a realistic data set. If we were creating a new development or testing instance, then a metadata-only clone would probably be sufficient. So with that done, let’s move to…

Step 2) If we need to, we can change the compartment (and if you have no idea what a compartment is, there is more information about compartments and how to use them here). For example, you could have a specific compartment set up for “training” which contains all your Autonomous Database training instances, so you can think of compartments as a way of organising and grouping your autonomous database instances. In this example, let’s put the new “clone” in the same compartment (LABS) as our existing “nodeappDB” instance:

Setting the compartment for the new cloned instance

…with that done, now let’s move on to….

Step 3) Now we can set the Display Name and Database Name for our new instance, as shown below. As of today, a clone does not keep any relationship to its source instance, so it might be a good idea to adopt a naming convention that identifies development vs. testing vs. training instances, just to make your life easier in the long run!

Setting the display and database names

Step 4) The next step is to set the CPU and storage resources for our new “clone”, i.e. the number of cores and the amount of storage. Note that if you specified a “Full Clone” in Step 1, then obviously the minimum storage you can specify here is the actual space used by your “source” database instance, rounded up to the next TB (for example, if the source currently uses 2.3 TB, the minimum you could select for a Full Clone would be 3 TB). However, the great thing here is that you can set the resources you need specifically for your clone. In this case our source instance, “nodeappDB”, was configured with 16 OCPUs, but as we are creating a “training” instance we can allocate fewer resources, so let’s just go with 4 OCPUs…

Set the CPU and storage requirements

Step 5) Next we need to set a new administrator password for our cloned database. All the usual password requirements apply, to ensure our new instance remains safe and secure. Here is a quick refresher if you are unsure of the rules (there is an illustrative example after the list):

The database checks for the following requirements when you create or modify passwords:

  • The password must be between 12 and 30 characters long and must include at least one uppercase letter, one lowercase letter, and one numeric character.
  • Note: the password limit is shown as 60 characters in some help tooltip popups; limit passwords to a maximum of 30 characters.
  • The password cannot contain the username.
  • The password cannot be one of the last four passwords used for the same username.
  • The password cannot contain the double quote (") character.
  • The password cannot be the same as a password set less than 24 hours ago.
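Purely as an illustration, a value like “Training2019#Clone” satisfies these rules (please pick your own). It is also worth knowing that once the clone is up you can change the ADMIN password again with standard SQL while connected as ADMIN; the statement below is a generic sketch with an example password, not a required step:

  -- Change the ADMIN password on the cloned instance (password shown is only an example)
  ALTER USER admin IDENTIFIED BY "Training2019#Clone";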

All of which brings us to the final step…

Step 6) The final step is to set the type of license we want to use with our new autonomous database. The options are the same as when you create a completely new autonomous database:

  • Bring your existing database software licenses (see here for more details).
  • Subscribe to new database software licenses and the database cloud service.

Select the type of license

That’s it, we are all done! All we have to do is click the big blue “Create Autonomous Database Clone” button at the bottom of the form to start the provisioning process, at which point the create form will disappear and the following page will be displayed…

New clone in provisioning state

….and in a couple of minutes our new autonomous database will be ready for use.

New autonomous database is now ready for use

So with a few mouse clicks we have deployed an exact copy of our production nodeappDB autonomous database, ready for our training workshop with our business users.

Don’t forget that you can keep this newly created training instance stopped until you are ready to run the training workshop with your business users. When the time arrives, it’s a quick click on the “Start” button in the management console and you are up and running in a couple of minutes.

If the above screenshots are a little too fuzzy then you can download a PDF containing all the steps here.

But we are not quite finished if you are a DBA reading this blog post because…

If you are a technical user, a data warehouse DBA, or a cloud DBA, then there are a couple of additional areas you will want to consider after creating your newly cloned data warehouse:

  • What about all the optimizer statistics from my original data warehouse instance?
  • What about the resource rules within my newly cloned instance?

What about the Optimizer Statistics for your cloned data warehouse?

Essentially, it doesn’t matter which type of clone you decide to create (Full Clone or Metadata Clone): the optimizer statistics are copied from the source data warehouse to your newly cloned data warehouse. For a Full Clone, where all the data from your source data warehouse is copied to the newly cloned instance, you are ready to roll straight away! With a Metadata Clone, the first data load into a table will cause the optimizer to update the statistics based on the newly loaded data.
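If you want to verify or refresh statistics on a cloned instance yourself, the standard data dictionary views and the DBMS_STATS package apply as usual; the snippet below is a generic sketch with placeholder schema and table names, not something the cloning feature requires you to run:

  -- When were the statistics on my tables last gathered?
  SELECT table_name, num_rows, last_analyzed
  FROM   all_tab_statistics
  WHERE  owner = 'MY_SCHEMA'
  ORDER  BY table_name;

  -- Manually refresh statistics for a single table if needed
  BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(ownname => 'MY_SCHEMA', tabname => 'SALES');
  END;
  /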

What about our resource management rules?

During the cloning process for your new data warehouse instance (Full Clone or Metadata Clone), any resource management rules in the source data warehouse that have been changed by the cloud DBA/administrator will be carried over to the newly cloned data warehouse. For more information on setting resource management rules, see Manage Runaway SQL Statements on Autonomous Data Warehouse.

So, what happens next?

Now that your newly cloned data warehouse instance is available, you are ready to start connecting your data warehouse and business intelligence tools. If you are new to connecting different tools to Autonomous Data Warehouse, take a look at our guide to Connecting to Autonomous Database.

Click here for more information on Autonomous Database.

Related:

Which OpenWorld Europe Sessions Should You Attend?

Line of business leaders – don’t let your valuable data go stale. Learn about new ways you can manage it and gain value during OpenWorld Europe, happening January 16 and 17 this year.

Oracle has entered a truly exciting time with the development of the Autonomous Database. We’ve continually added new products and new capabilities, like Autonomous Transaction Processing and Autonomous Data Warehouse. These new products will help you in your digital transformation as the world revolutionizes the way data is utilized.

Here are the data management and Autonomous Database sessions you don’t want to miss.

Oracle Code Keynote: Cloud-Native Data Management [SOL1843-LON]

The rise of the cloud has brought many changes to the way developers build applications today. Containers, Serverless, and Microservices are just a few of the technologies and methodologies now considered indispensable parts of a modern cloud-native architecture. Yet the next big step is still to come. Developers know how to build scalable and distributed cloud-native applications but still have to rely on traditional and fragmented data stores to serve their applications with the data they need to process. Penny Avril will unveil the next evolution of a cloud-native data management platform that can not only store and analyze all your data but is also capable of tuning, securing, and healing itself, so that developers can continue to focus on building the next revolutionary application.

Speaker: Penny Avril, Vice President, Server Technology Division, Oracle

Wednesday, January 16, 09:00 AM – 10:20 AM | Arena 2 (Level 3) – ExCeL London

Oracle Autonomous Database [SOL1682-LON]

Oracle Chairman and Chief Technology Officer Larry Ellison describes the Oracle Autonomous Database as “probably the most important thing Oracle has ever done.” In his annual Oracle OpenWorld address, Oracle Executive Vice President Andy Mendelsohn shares the latest updates from the Database Development team along with customer reaction to Oracle Autonomous Database.

The definition of Digital Transformation continues to evolve. Many people think of Digital Transformation, to be simplistic, as the integration of digital technology into all areas of a business. But Digital Transformation has the potential to be so much more: it’s a necessary disruptor. Digital Transformation isn’t just about technology; it’s part vision, perspective, strategy, and precision. Companies will experience digital transformation across the enterprise, from customers to employees to partners alike. Learn how successful companies evolve to not only respond to customer, employee, and partner needs but also focus on strategies and technologies that straddle digital transformation. Hear from June Manley of Data Intensity on digital transformation.

Speakers: Andrew Mendelsohn, Executive Vice President, Database Server Technologies, Oracle; June Manley, CMO, Data Intensity, LLC; Eric Grancher, Head of Database Department, CERN; Manuel Martin Marquez, Data Analytics Scientist, CERN

Wednesday, January 16, 12:55 PM – 02:15 PM | Arena 1 (Level 3) – ExCeL London

The Changing Role of the DBA [SES1683-LON]

The advent of the cloud and the introduction of Oracle Autonomous Database Cloud presents opportunities for every organization, but what’s the future role for the DBA? In this session explore how the role of the DBA will continue to evolve, and get advice on key skills required to be a successful DBA in the world of the cloud.

Speaker: Penny Avril, Vice President, Server Technology Division, Oracle

Wednesday, January 16, 02:25 PM – 03:00 PM | Arena 1 (Level 3) – ExCeL London

Unleash the Potential of Data to Drive a Smarter Business [SES1221-LON]

Organizations are under tremendous pressure to lower cost, reduce risk, and accelerate innovation. In this session, learn how Oracle Autonomous Database Cloud is helping customers achieve these objectives by leveraging the most valuable currency of the company: data. With its self-driving, self-repairing, and self-securing capabilities using machine learning, all stakeholders including executives, business users, and data analysts can gain insights for smarter business decisions, and IT can deploy applications in minutes for faster innovation. Learn why Larry Ellison calls Oracle Autonomous Database Cloud “the most important thing we have done in a long time.” See how it is revolutionizing data management and empowering lines of business, DBAs, and data scientists to do more with data.

Speaker: Monica Kumar, Vice President, Product Marketing Database and Big Data, Oracle

Thursday, January 17, 12:10 PM – 12:45 PM | Arena 2 (Level 3) – ExCeL London
