Four Tools to Integrate into Your Data Lake

A data lake is an absolutely vital piece of today’s big data business environment. A single company may have incoming data from a huge variety of sources, and having a means to handle all of that is essential. For example, your business might be compiling data from places as diverse as your social media feed, your app’s metrics, your internal HR tracking, your website analytics, and your marketing campaigns. A data lake can help you get your arms around all of that, funneling those sources into a single consolidated repository of raw data.

But what can you do with that data once it’s all been brought into a data lake? The truth is that putting everything into a large repository is only part of the equation. While it’s possible to pull data from there for further analysis, a data lake without any integrated tools remains functional but cumbersome, even clunky.

On the other hand, when a data lake integrates with the right tools, the entire user experience opens up. The result is streamlined access to data while minimizing errors during export and ingestion. In fact, integrated tools do more than just make things faster and easier. By expediting automation, the door opens to exciting new insights, allowing for new perspectives and new discoveries that can maximize the potential of your business.

To get there, you’ll need to put the right pieces in place. Here are four essential tools to integrate into your data lake experience.


Machine Learning

Even if your data sources are vetted, secured, and organized, the sheer volume of data makes it unruly. As a data lake tends to be a repository for raw data—which includes unstructured items such as MP3 files, video files, and emails, in addition to structured items such as form data—much of the incoming data across various sources can only be natively organized so far. While it can be easy to set up a known data source for, say, form data into a repository dedicated to the fields related to that format, other data (such as images) arrives with limited discoverability.

Machine learning can help accelerate the processing of this data. With machine learning, data is organized and made more accessible through various processes, including:

In processed datasets, machine learning can use historical data and results to identify patterns and insights ahead of time, flagging them for further examination and analysis.

With raw data, machine learning can analyze usage patterns and historical metadata assignments to begin implementing metadata automatically for faster discovery.

The latter point requires the use of a data catalog tool, which leads us to the next point.
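To make the latter point concrete, here is a minimal, rule-based sketch of automatic metadata tagging. The file extensions, tag names, and `suggest_metadata` helper are all illustrative assumptions, not any product's API; a real pipeline would train a model on historical metadata assignments rather than use hand-written rules.

```python
# Minimal sketch of automatic metadata assignment for raw files.
# Hand-written extension rules stand in for a trained model here,
# and the extensions/tags are illustrative.

EXTENSION_TAGS = {
    ".mp3": {"type": "audio", "structured": False},
    ".mp4": {"type": "video", "structured": False},
    ".eml": {"type": "email", "structured": False},
    ".csv": {"type": "form_data", "structured": True},
}

def suggest_metadata(filename: str) -> dict:
    """Suggest discovery metadata for a raw file based on its extension."""
    for ext, tags in EXTENSION_TAGS.items():
        if filename.lower().endswith(ext):
            return {"filename": filename, **tags}
    return {"filename": filename, "type": "unknown", "structured": False}

print(suggest_metadata("q3_survey.csv"))
```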

Data Catalog

Simply put, a data catalog is a tool that integrates into any data repository for metadata management and assignment. Products like Oracle Cloud Infrastructure Data Catalog are a critical element of data processing. With a data catalog, raw data can be assigned technical, operational, and business metadata. These are defined as:

  • Technical metadata: Used in the storage and structure of the data in a database or system
  • Business metadata: Contributed by users as annotations or business context
  • Operational metadata: Created from the processing and accessing of data, which indicates data freshness and data usage, and connects everything together in a meaningful way

By implementing metadata, raw data can be made much more accessible. This accelerates organization, preparation, and discoverability for all users without any need to dig into the technical details of raw data within the data lake.
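As a rough illustration of how the three metadata classes might travel together with an asset, here is a sketch of a catalog entry. The field and asset names are illustrative assumptions, not Oracle Cloud Infrastructure Data Catalog's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of a catalog entry carrying the three metadata classes above.
# Field and asset names are illustrative, not a real catalog schema.

@dataclass
class CatalogEntry:
    asset_name: str
    technical: dict = field(default_factory=dict)    # storage and structure
    business: dict = field(default_factory=dict)     # user annotations/context
    operational: dict = field(default_factory=dict)  # freshness and usage

entry = CatalogEntry(
    asset_name="web_clickstream_raw",
    technical={"format": "json", "location": "object-store://landing/clicks"},
    business={"owner": "marketing", "description": "Raw site click events"},
    operational={"last_refreshed": "2020-01-15", "reads_last_30d": 42},
)
print(entry.business["owner"])
```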

Integrated Analytics

A data lake acts as a middleman between data sources and tools, storing the data until it is called for by data scientists and business users. When analytics and other tools exist separately from the data lake, each analysis requires extra steps: additional preparation and formatting, exporting to CSV or other standardized formats, and then importing into the analytics platform. Sometimes, this also includes additional configuration once inside the analytics platform for usability. The cumulative effect of all these steps creates a drag on the overall analysis process, and while having all the data within the data lake is certainly a help, this lack of connectivity creates significant hurdles within a workflow.

Thus, the ideal way to allow all users within an organization to swiftly access data is to use analytics tools that seamlessly integrate with your data lake. Doing so removes unnecessary manual steps for data preparation and ingestion. This really comes into play when experimenting with variability in datasets; rather than having to pull a new dataset every time you experiment with different variables, integrated tools allow this to be done in real time (or near-real time). Not only does this make things easier, this flexibility opens the door to new levels of insight as it allows for previously unavailable experimentation.

Integrated Graph Analytics

In recent years, data analysts have started to take advantage of graph analytics: a newer form of data analysis that creates insights based on relationships between data points. For those new to the concept, graph analytics considers individual data points similar to dots in a bubble—each data point is a dot, and graph analytics allows you to examine the relationship between data by identifying volume of related connections, proximity, strength of connection, and other factors.
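Those relationship measures can be sketched in a few lines. Below, each node is a data point and weighted edges are connections; the names and weights are made up for illustration, and a real workload would use a graph database or analytics engine rather than plain Python.

```python
from collections import defaultdict

# Toy sketch of graph analytics on data points: each node is a dot,
# weighted edges are relationships. Names and weights are made up.

edges = [
    ("alice", "bob", 5), ("alice", "carol", 2),
    ("bob", "carol", 1), ("carol", "dave", 6),
]

degree = defaultdict(int)    # volume of related connections
strength = defaultdict(int)  # summed weight of those connections
for a, b, w in edges:
    for node in (a, b):
        degree[node] += 1
        strength[node] += w

# The best-connected node by total connection strength:
hub = max(strength, key=strength.get)
print(hub, degree[hub], strength[hub])  # -> carol 3 9
```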

This is a powerful tool that can be used for new types of analysis in datasets with the need to examine relationships between data points. Graph analytics often works with a graph database itself or through a separate graph analytics tool. As with traditional analytics, any sort of extra data exporting/ingesting can slow down the process or create data inaccuracies depending on the level of manual involvement. To get the most out of your data lake, integrating cutting-edge tools such as graph analytics means giving data scientists the means to produce insights as they see fit.

Why Oracle Big Data Service?

Oracle Big Data Service is a powerful Hadoop-based data lake solution that delivers the capabilities required in a big data world:

  • Integration: Oracle Big Data Service is built on Oracle Cloud Infrastructure and integrates seamlessly into related services and features such as Oracle Analytics Cloud and Oracle Cloud Infrastructure Data Catalog.
  • Comprehensive software stack: Oracle Big Data Service comes with key big data software: Oracle Machine Learning for Spark, Oracle Spatial Analysis, Oracle Graph Analysis, and much more.
  • Provisioning: Oracle Big Data Service deploys a fully configured version of Cloudera Enterprise and scales up easily as needed.
  • Secure and highly available: Oracle Big Data Service provides built-in high availability and security, both enabled with a single click.

To learn more about Oracle Big Data Service, click here—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.



Podcast #377: Oracle Autonomous Database — An Interview with Maria Colgan

This episode brings you an interview with Maria Colgan, Master Product Manager for Oracle Database. Maria joined Oracle in 1996, and since then has held positions as product manager for Oracle Database In-Memory and the Oracle Database query optimizer. Maria is the primary author of the SQLMaria blog and a contributing author to the Oracle Optimizer blog.

“With the Autonomous Database, data scientists and developers can basically help themselves, provision the database and get running and utilize it.” – Maria Colgan

Handling the host duties for this program is Alexa Weber Morales. An award-winning musician and writer, Alexa is director of developer content at Oracle. She is the former editor in chief of Software Development magazine, and has more than 15 years of experience as a technology content strategist and journalist.

In this program Maria talks about Oracle Autonomous Database and the new features that allow data scientists, developers, and others beyond traditional database users to help themselves.

This program is Groundbreakers Podcast #377. The interview was originally recorded on September 17, 2019 at Oracle OpenWorld. Listen!

On the Mic

Maria Colgan

Master Product Manager, Oracle Database, Oracle

San Francisco, California

Alexa Weber Morales

Editor/Content Strategist, Oracle

San Francisco, California

Additional Resources

Stay Tuned!

The Oracle Groundbreakers Podcast is available via:



Six Retail Dashboards for Data Visualizations

Retail is rapidly transforming. Consumers expect an omni-channel experience that empowers them to shop from anywhere in the world. To capture this global demand, retailers have developed ecommerce platforms to complement their traditional brick-and-mortar stores. Ecommerce is truly revolutionizing the way retailers can collect data about the customer journey and identify key buying behaviors.

As described in Beyond Marketing by Deloitte, analysts are taking advantage of “greater volumes of diverse data—in an environment that a company controls—mak[ing] it possible to develop a deeper understanding of customers and individual preferences and behaviors.” Savvy retailers will leverage the influx of data to generate insights on how to further innovate products to the tastes and preferences of their target audience. But how can they do this?

Try a Data Warehouse to Improve Your Analytics Capabilities

Retailers are looking to leverage data to find innovative answers to their questions. While analysts are looking for those data insights, leadership wants immediate insights delivered in a clear and concise dashboard to understand the business. Oracle Autonomous Database delivers analytical insights, allowing retail analysts to immediately visualize their global operations on the fly with no specialized skills. In this blog, we’ll be creating sales dashboards for a global retailer to help them better understand and guide the business with data.

Analyzing Retail Dashboards

Retail analysts can create dashboards to track KPIs including: Sales, Inventory turnover, Return rates, Cost of customer acquisition, and Retention. These dashboards can help with monitoring KPIs with easily understandable graphics that can be shared with executive management to drive business decisions.

In this blog, we focus on retail dashboards that break down sales and revenue by:

  • Product
  • Region
  • Customer segment

In each dashboard, we will identify and isolate areas of interest. With the introduction of more data, you can continuously update the dashboard and create data-driven insights to guide the business.

Understanding the Retail Data Set

We used modified sales and profit data from a global retailer to simulate the data that retail analysts can incorporate into their own sales dashboards. In our Sample Order Lines Excel file (shown below), we track data elements such as: Order ID, Customer ID, Customer Segment, Product Category, Order Discount, Profit, and Order Tracking.

Data Visualization Desktop, a tool that comes free with ADW, allows users to continuously update their dashboards by easily uploading each month’s sales data. By introducing more data, we can understand how the business is changing over time and adapt accordingly.

For more on how to continuously update your dashboard please see: Loading Data into Autonomous Data Warehouse Using Oracle Data Visualization Desktop

We looked at the following questions:

  1. What is the current overview of sales and profit broken down by product?
  2. Which regional offices have the best sales performance?
  3. Which geographic regions are hotbeds of activity?
  4. How are different products and regions linked together by sales?
  5. Which market segments are the most profitable?
  6. Which specific products are driving profitability?

Here is the view of the data in Excel:

Here’s a quick snapshot of the data loaded into Data Visualization Desktop:

What is the current overview of sales and profit broken down by product?

This is a sales dashboard summary which shows the overall revenue and profit from different product segments. Here are some quick insights:

  • Using tiles (top left), we see: $1.3M profit out of $8.5M total sales making a 15.3% profit margin.
  • We use pie charts to break out total sales and profit by product category (top right). Technology products not only contribute the most to sales of any single product category (40.88%) but also make up the most profit of any single category (56.15%), meaning that technology products are the highest grossing product line. Under the pie charts is a pivot table showing the actual figures.
  • A line graph shows that every product category has been growing, with technology products growing the fastest (bottom left). There was a spike in technology sales that started in August and peaked in November 2018.
  • Sales are broken down by both product and customer segment so that we can understand more about the buying habits for different customers (bottom right).

For an even more detailed segment analysis, we also broke out the corporate customer segment (below) to compare with the overall business (above).

Which regional offices have the best sales performance?

In this visualization, we’re looking at the performance of different regional offices and how they’re collectively trending. We overlaid a horizontal stacked graph and a donut chart on a scatterplot. Using a dot to represent each city, the scatterplot analysis compares profit (x-axis) vs sales (y-axis) using larger dots to represent larger customer populations in each city.

For example, the dot in yellow (far right) represents São Paulo, Brazil with 127 customers generating $200,193 in sales and $44,169 in profit. As the most profitable city, São Paulo has a profit margin of 22%, averaging total purchases of $1,576 per customer. On the scatterplot, cities that make at least $10,000 of profit are indicated left of the dotted line.

The horizontal stacked graph (top left) breaks down sales by continent so you can see which regions are leading in sales. The donut chart (bottom right) shows the total amount of sales from all the regions ($9M) and shows each region as a percent. Here are the leading regions by sales:

  • America (38.64%)
  • Europe (28.81%)
  • Asia (18.05%)

To learn more, we use the “keep selected” option to dynamically look at a specific region like Europe (shown below). We can see that Europe accounts for just under $2.5M in sales with the largest portion coming from Northern Europe. The scatterplot also dynamically changes to only show cities in Europe. Now you can identify that the most profitable European city is Belfast ($27,729) and the city with the most sales is St. Petersburg, Russia ($127,521). This allows us to identify and replicate the success of offices like Belfast and St. Petersburg in the other regions as well.

Which geographic regions are hotbeds of activity?

Analysts need to identify which markets to immediately focus on. Using a heat map, we can see which regions have the most sales (shown in color) and regions without sales (gray). This particular global retailer’s sales are primarily in developed markets:

1. America ($1.5M+)

2. United Kingdom ($887K)

3. Australia ($695K)

We can investigate further to pinpoint the exact cities (below) in the UK. We can see that the sales are originating from multiple cities including:
  • Belfast
  • Leeds
  • Manchester
  • Sheffield
Using a heat map can not only help identify how easily customers access storefront locations but also show where to expand operations based on demand.

How are different products and regions linked together by sales?

It’s often hard to see how different factors like sales, product, and geography are interrelated. Using a network map, we see how product categories (technology, furniture, office supplies) are linked to continents that are sublinked to countries. The thickness of the connecting line from one node on the network to another is based on sales, and deeper shades of green represent more profit. We hover over the line connecting Africa to Southern Africa (above) to see the total sales ($242K) and profit ($34K) from Southern Africa.

Another way to focus on a specific region is to hover over a specific node and use the “keep selected” option (below). In this example, we identify only the nodes linked to Europe. By doing this, we can see that a majority of the sales and profits from Europe are coming from technology products ($1,030K sales, $213K profit) and originating from Northern Europe ($974K sales, $162K profit), specifically the UK ($880K sales, $162K profit). Analysts can identify the regional sources of sales/profit while seeing a macroview of how products and regions are linked.

Which market segments are the most profitable?

It’s critical to understand which customer groups are growing the fastest and generating the most sales and profit. We use a stacked bar (left) and a scatterplot (right) to break down profitability by market segment in FY18. We categorize buyer types into:

  • Consumer
  • Corporate
  • Home office
  • Small business

In the stacked bar, we can see that sales have been growing from Q2 to Q4, but the primary market segments driving sales growth are corporate (61% growth since Q1) and small business (53% growth since Q1). The combined growth of the corporate and small business segments led to a $191K increase in sales since Q1. Although these two segments made up over 63% of total sales in FY18Q4, we can also see that sales from the home office segment more than doubled from FY18Q3 to FY18Q4.

In a scatterplot (right), we can see the changes in profit ratio of each market segment over time. The profit ratio formula divides net profits for a reporting period by net sales for the same period. The fastest growing market segments and the most profitable market segments in FY18 (top right quadrant) are:

  • Corporate
  • Small business
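The profit ratio formula described above is simple to express in code. This sketch uses the dashboard's overall totals ($1.3M profit on $8.5M sales) as an example; the `profit_ratio` helper name is my own, not part of any tool.

```python
# Profit ratio: net profit for a reporting period divided by net
# sales for the same period, expressed as a percentage.

def profit_ratio(net_profit: float, net_sales: float) -> float:
    """Return the profit ratio as a percentage, rounded to one decimal."""
    return round(net_profit / net_sales * 100, 1)

# The dashboard's overall figures: $1.3M profit on $8.5M total sales.
print(profit_ratio(1_300_000, 8_500_000))  # -> 15.3
```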

We can also isolate the profitability of the corporate customer segment (below). By generating insights about the target market segments, companies are able to focus their product development and marketing efforts.

Which specific products are driving profitability?

Retailers are often managing a portfolio of hundreds, if not thousands, of products. This complexity makes it challenging to track and identify the profitability of individual products. However, we can easily visualize how profitability has changed over time and compare it to specific products. We use a combo graph (top left) to indicate changes to sales and profit ratios over time.

Generally, we can see that every year sales (and profits) increase from Q1 to Q4 then drop off with the start of the next Q1. We use a waterfall graph to track how profits have gradually changed over time (bottom left). From 2013 to the end of 2018, there was a net gain of $167K in profit.

Analysts identify high-performing products to expand and unprofitable products to cut. On the right, we track sales and profit ratios by individual products. We can see that the products that generate the most sales are:

  1. Telephones/communication tools ($1,380K)
  2. Office machines ($1,077K)
  3. Chairs ($1,046K)

The products with the highest profit ratio are:

  1. Binders (35 percent)
  2. Envelopes (32.4 percent)
  3. Labels (31.6 percent)
This means that for every binder sold, 35 percent of the sale was pure profit. We also found that products such as bookcases (-5.2 percent), tablets (-5.3 percent), and scissors/rulers (-8.3 percent) had negative profit ratios, which means there was a loss on each sale. We can also isolate the sales performance of the top five products (below).


Data visualization dashboards powered by Autonomous Data Warehouse allow major global retailers to easily understand the state of their business and make judgments on how to adapt to dynamic market environments.

Oracle Autonomous Database allows users to easily create secure data marts in the cloud to generate powerful business insights – without specialized skills. It took us fewer than five minutes to provision a database and upload data for analysis.

Now you can also leverage the Autonomous Data Warehouse through a cloud trial:

Sign up for your free Autonomous Data Warehouse trial today

Please visit the blogs below for a step-by-step guide on how to start your free cloud trial: upload your data into OCI Object Store, create an Object Store Authentication Token, create a Database Credential for the user, and load data using the Data Import Wizard in SQL Developer:

Feedback and questions are welcome. Tell us about the dashboards you’ve created!



6 Ways to Improve Data Lake Security

Data lakes, such as Oracle Big Data Service, represent an efficient and secure way to store all of your incoming data. Worldwide big data is projected to rise from 2.7 zettabytes to 175 zettabytes by 2025, and this means an exponentially growing number of ones and zeroes, all pouring in from an increasing number of data sources. Unlike data warehouses, which require structured and processed data, data lakes act as a single repository for raw data across numerous sources.

What do you get when you establish a single source of truth for all your data? Having all that data in one place creates a cascading effect of benefits, starting with simplifying IT infrastructure and processes and rippling outward to workflows with end users and analysts. Streamlined and efficient, a single data lake basket makes everything from analysis to reporting faster and easier.

There’s just one issue: all of your proverbial digital eggs are in one “data lake” basket.

For all of the benefits of consolidation, a data lake also comes with the inherent risk of a single point of failure. Of course, in today’s IT world, it’s rare for IT departments to set anything up with a true single point of failure—backups, redundancies, and other standard failsafe techniques tend to protect enterprise data from true catastrophic failure. This is doubly so when enterprise data lives in the cloud, such as with Oracle Cloud Infrastructure, as data entrusted to the cloud rather than stored locally has the added benefit of vendors who build their entire business around keeping your data safe.

Does that mean that your data lake comes protected from all threats out of the box? Not necessarily; as with any technology, a true assessment of security risks requires a 360-degree view of the situation. Before you jump into a data lake, consider the following six ways to secure your configuration and safeguard your data.


Establish Governance: A data lake is built for all data. As a repository for raw and unstructured data, it can ingest just about anything from any source. But that doesn’t necessarily mean that it should. The sources you select for your data lake should be vetted for how that data will be managed, processed, and consumed. The perils of a data swamp are very real, and avoiding them depends on the quality of several things: the sources, the data from the sources, and the rules for treating that data when it is ingested. By establishing governance, it’s possible to identify things such as ownership, security rules for sensitive data, data history, source history, and more.

Access: One of the biggest security risks involved with data lakes is related to data quality. Rather than a macro-scale problem such as an entire dataset coming from a single source, a risk can stem from individual files within the dataset, either during ingestion or after due to hacker infiltration. For example, malware can hide within a seemingly benign raw file, waiting to execute. Another possible vulnerability stems from user access—if sensitive data is not properly protected, it’s possible for unscrupulous users to access those records, possibly even modify them. These examples demonstrate the importance of establishing various levels of user access across the entire data lake. By creating strategic and strict rules for role-based access, it’s possible to minimize the risks to data, particularly sensitive data or raw data that has yet to be vetted and processed. In general, the widest access should be for data that has been confirmed to be clean, accurate, and ready for use, thus limiting the possibility of accessing a potentially damaging file or gaining inappropriate access to sensitive data.

Use Machine Learning: Some data lake platforms come with built-in machine learning (ML) capabilities. The use of ML can significantly minimize security risks by accelerating raw data processing and categorization, particularly if used in conjunction with a data cataloging tool. By implementing this level of automation, large amounts of data can be processed for general use while also identifying red flags in raw data for further security investigation.

Partitions and Hierarchy: When data gets ingested into a data lake, it’s important to store it in a proper partition. The general consensus is that data lakes require several standard zones to house data based on how trusted it is and how ready-to-use it is. These zones are:

  • Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.
  • Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.
  • Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.
  • Refined: Where enriched and manipulated data lives, often as final outputs from tools.

Using zones like these creates a hierarchy that, when coupled with role-based access, can help minimize the possibility of the wrong people accessing potentially sensitive or malicious data.
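One way to picture that coupling of zones and role-based access is a simple policy table. The roles and the mapping below are assumptions made for the example, not a feature of any particular platform.

```python
# Illustrative sketch of role-based access coupled with the zones
# described above. Roles and the policy mapping are assumptions.

ZONE_ACCESS = {
    "temporal": {"data_engineer"},
    "raw":      {"data_engineer", "security_auditor"},
    "trusted":  {"data_engineer", "data_scientist", "analyst"},
    "refined":  {"data_engineer", "data_scientist", "analyst", "business_user"},
}

def can_access(role: str, zone: str) -> bool:
    """Widest access only for vetted zones; unknown zones deny by default."""
    return role in ZONE_ACCESS.get(zone, set())

print(can_access("analyst", "trusted"))    # -> True
print(can_access("business_user", "raw"))  # -> False
```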

Data Lifecycle Management: Which data is constantly used by your organization? Which data hasn’t been touched in years? Data lifecycle management is the process of identifying and phasing out stale data. In a data lake environment, older stale data can be moved to a specific tier designed for efficient storage, ensuring that it is still available should it ever be needed without taking up needed resources. A data lake powered by ML can even use automation to identify and process stale data to maximize overall efficiency. While this may not touch directly on security concerns, an efficient and well-managed data lake functions like a well-oiled machine rather than collapsing under the weight of its own data.
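The stale-data check at the heart of lifecycle management can be sketched as follows. The asset records and the one-year retention window are illustrative assumptions; a real lake would read last-access times from its catalog or storage layer.

```python
import datetime as dt

# Sketch of lifecycle management: flag assets untouched beyond a
# retention window for migration to a cheaper storage tier.

STALE_AFTER_DAYS = 365  # illustrative one-year window

def stale_assets(assets, today):
    """Return names of assets whose last access exceeds the window."""
    cutoff = today - dt.timedelta(days=STALE_AFTER_DAYS)
    return [a["name"] for a in assets if a["last_access"] < cutoff]

assets = [
    {"name": "clickstream_2019", "last_access": dt.date(2019, 12, 1)},
    {"name": "hr_archive_2016",  "last_access": dt.date(2017, 3, 9)},
]
print(stale_assets(assets, today=dt.date(2020, 2, 1)))  # -> ['hr_archive_2016']
```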

Data Encryption: The idea of encryption being vital to data security is nothing new, and most data lake platforms come with their own methodology for data encryption. How your organization executes, of course, is critical. Regardless of which platform you use or whether you deploy on-premises or in the cloud, a sound data encryption strategy that works with your existing infrastructure is absolutely vital to protecting all of your data whether in motion or at rest—in particular, your sensitive data.

Create Your Secure Data Lake

What’s the best way to create a secure data lake? With Oracle’s family of products, a powerful data lake is just steps away. Built upon the foundation of Oracle Cloud Infrastructure, Oracle Big Data Service delivers cutting-edge data lake capabilities while integrating into premiere analytics tools and one-touch Hadoop security functions. Learn more about Oracle Big Data Service to see how easy it is to deploy a powerful cloud-based data lake in your organization—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.



Oracle Database client libraries for Java now on Maven Central

Oracle has published its Oracle Database JDBC client libraries on Maven Central. From now on you can find Oracle Database related jar files under the group id. You will find all libraries from version (e.g. ojdbc6) to 19.3.0 (e.g. ojdbc10).

Going forward, Oracle will use Maven Central as one of the primary distribution mechanisms for Oracle Database Java client libraries, meaning that you will also be able to find new versions of these libraries on Maven Central in the future.

To get the latest Oracle Database JDBC driver, use the following dependency GAV in your Maven POM file:

<dependency>
    <groupId></groupId>
    <artifactId>ojdbc10</artifactId>
    <version></version>
</dependency>

The group id has the following subgroups:

  • oracle.database.jdbc: this group contains all JDBC libraries
    • ojdbc[N].jar:

      The Oracle Database JDBC driver compiled with Java [N]
    • ucp.jar:

      The Universal Connection Pool for JDBC
    • ojdbc[N]dms.jar:

      The Oracle Database JDBC driver compiled with Java [N] including the Dynamic Monitoring System (DMS)

      Note: ojdbc8dms.jar and ojdbc10dms.jar contain the instrumentation to support the Dynamic Monitoring System (DMS) and limited support for java.util.logging.

  • oracle.database.jdbc.debug: this group contains all JDBC debug libraries
  • this group contains all Security libraries for Oracle wallet and more
    • oraclepki.jar:

      The Oracle PKI provider used for Oracle wallets
    • osdt_cert.jar:

      Certificate management components used for Oracle wallets
    • osdt_core.jar:

      Core components between oraclepki.jar and osdt_cert.jar
  • oracle.database.ha: this group contains all High Availability libraries
    • ons.jar:

      Oracle Notification System library
    • simplefan.jar:

      Simple Fast Application Notification library
  • oracle.database.nls: this group contains the Internationalization library
    • orai18n.jar:

      A mnemonic for the name: orainternationalization.jar becomes orai18n.jar, keeping the i, the 18 letters in between, and the n.
  • oracle.database.xml: this group contains all XML and XML DB related libraries
    • xdb.jar, xdb6.jar

      Support for the JDBC 4.x standard java.sql.SQLXML interface
    • xmlparserv2.jar

      The Oracle Database XML Parser library, including APIs for:

      • DOM and Simple API for XML (SAX) parsers
      • XML Schema processor
      • Extensible Stylesheet Language Transformation (XSLT) processor
      • XML compression
      • Java API for XML Processing (JAXP)
      • Utility functionality such as XMLSAXSerializer and asynchronous DOM Builder

        Note: xdb6.jar is a legacy name, xdb.jar is the new name.

  • oracle.database.observability: this group contains the Observability library
    • dms.jar:

      The Oracle Database Dynamic Monitoring System (DMS)
  • oracle.database.soda (coming soon!): this group contains the Simple Oracle Document Access driver for Oracle Database
  • oracle.database.messaging (coming soon!): this group contains the Advanced Queuing Java Messaging Service driver for Oracle Database

With this change, Oracle made it easier than ever before for developers and users alike to consume the Oracle Database Java client libraries.



Oracle Database 18c XE now under the Oracle Free Use Terms and Conditions license

Today we announce the availability of Oracle Database 18c XE for Linux under the Oracle Free Use Terms and Conditions license. This new license is part of the XE RPM installer file and will be installed alongside Oracle Database 18c XE. Downloading the RPM file no longer requires a click-through on the website! Users can now install Oracle Database 18c XE for Linux directly from the web via:

yum -y localinstall

The Docker build files and Vagrant boxes files for 18c XE have also been updated to take advantage of this change and no longer require the user to download the software first either!

This change has been requested by the community. Oracle continues its commitment to the community and will release future versions of XE, for Windows and Linux alike, under the Free Use Terms and Conditions license as well.



5 Reasons Why Oracle Cloud Infrastructure Data Flow Optimizes Apache Spark

Apache Spark is the dominant force in big data, seamlessly combining scalable data processing, reporting, and machine learning in a convenient package. How widespread is Apache Spark usage? Consider this: since launching a decade ago as an open-source project at UC Berkeley, Apache Spark has powered AI and big data for some of the world’s largest tech companies, and in doing so, has changed the way the world consumes data. From driving user recommendations to understanding customer data, Apache Spark’s ability to unify complex data workflows and provide analytics at scale makes it the ideal foundation for big data projects. Its presence in the industry can’t be overstated – there’s a reason why, as an open-source project, it has a significant list of ongoing contributors.

Why is this so important? Simply put, big data has powered the shift to the digital era; today, everyone understands that they need big data, and the numbers back it up – big data is getting bigger and there’s no way around it. Apache Spark powers big data because of its ability to use cluster computing for data preparation and computational functions, allowing for powerful parallel processes that maximize the ability to handle big data projects. Yet that significant capability – along with the ability to scale by simply increasing the number of processors – is still dependent on the infrastructure it’s built upon.

And because of that, industry research still shows that about 85 percent of big data projects fail. Why is that? The broad-stroke answer is that the enormous complexity of current big data solutions often causes projects to implode under their own weight. A finer examination shows that some of this stems from the user side (e.g., not having a clear scope or goal), but technology is just as often to blame. Big data usually involves juggling data from many sources, which can also include different security protocols and requirements. Bringing all of this together creates all sorts of headaches, especially when you consider the logistical difficulties involved in unifying legacy systems.

However, there’s a solution on the horizon for that.


Oracle is excited to launch Oracle Cloud Infrastructure Data Flow, a fully-managed Apache Spark service that makes running Spark applications easier than ever before. With Oracle Cloud Infrastructure Data Flow, everything becomes simplified and streamlined. How does Oracle achieve such a bold claim? Simple – Oracle Cloud Infrastructure Data Flow offers:

  • Spark without the infrastructure – everything exists in the cloud.
  • Always-on cloud-native security that is continuously updated to meet the latest protocols and regulations.
  • Minimal resources from IT thanks to zero need to configure, install, manage, or upgrade.

Five Reasons (And More) To Use Oracle Cloud Infrastructure Data Flow

All of this comes with an Oracle Cloud Infrastructure account. But let’s dive even deeper with five reasons Data Flow is better than your current Spark solution (and a bonus reason to boot!):

Reason 1: Managed infrastructure

Running multiple Spark jobs is hard enough to manage without having to worry about IT operations. However, as mentioned above, the underlying infrastructure is often a critical factor in whether or not a project reaches its conclusion – and even if it does finish, inefficient resource allocation can be a drain on the system as a whole. Oracle Cloud Infrastructure Data Flow handles this infrastructure provisioning, setting up networks, storage, and security, and tearing everything down when Spark jobs complete. This enables teams to focus solely on their Spark projects – the engine under the hood is all handled by Oracle.

Reason 2: Out-of-the-box security

Security concerns derail a lot of big data projects. Data Flow checks all the requisite security boxes – authentication, authorization, encryption, and isolation, along with other critical points. The platform is built on the foundation of Oracle Cloud Infrastructure’s cloud-native identity and access management (IAM) security system. Data stored in Oracle Cloud Infrastructure’s object store is encrypted at rest and in transit, and is protected by IAM authorization policies. With Oracle Cloud Infrastructure Data Flow, security is automatic, not an extra step.

Reason 3: Consolidated operational insight

Big data often creates big problems for IT operations, no pun intended. Whether it’s making sense of thousands of jobs or figuring out which jobs are consuming the most resources, getting a handle on utilization is a complicated task. Existing Spark solutions make it hard to get a complete and thorough picture of what all users are doing. Oracle Cloud Infrastructure Data Flow makes it easy by consolidating all this information into a single searchable, sortable interface. Want to know which job from last week cost the most? With a few clicks, this information can be requested and displayed.
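Conceptually, the kind of question the consolidated interface answers ("which job from last week cost the most?") is a simple filter-and-sort over job metadata. Here is a minimal Python sketch of that idea, using hypothetical job records in place of the real service data:

```python
from datetime import date

# Hypothetical job records, standing in for metadata the service collects.
jobs = [
    {"name": "etl_orders",  "cost_usd": 4.20, "ended": date(2020, 3, 2)},
    {"name": "ml_training", "cost_usd": 9.75, "ended": date(2020, 3, 4)},
    {"name": "report_gen",  "cost_usd": 1.10, "ended": date(2020, 2, 10)},
]

def most_expensive_since(jobs, cutoff):
    """Return the costliest job that ended on or after `cutoff`, or None."""
    recent = [j for j in jobs if j["ended"] >= cutoff]
    return max(recent, key=lambda j: j["cost_usd"], default=None)

top = most_expensive_since(jobs, date(2020, 3, 1))
print(top["name"])  # -> ml_training
```

The real console does this across thousands of jobs with search and sort built in; the sketch only illustrates the underlying query.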

Reason 4: Simple troubleshooting

Tracking down the logs and tools necessary to troubleshoot a Spark job can take hours. Oracle Cloud Infrastructure Data Flow consolidates everything required into a single place, from Spark UI to Spark History Server and log output – all just a click away. In addition, administrators can easily load other user jobs when a persistent issue needs an expert eye for troubleshooting.

Reason 5: Fully managed job output

Getting the code to work is the first step. However, a project isn’t complete until the job output makes it to the target business users. Oracle Cloud Infrastructure Data Flow makes it easy to get analytics to the people who need it through an automated process that securely captures and stores a job’s output. This output is available either via the web interface or by calling REST APIs. This means that historic outputs are easily obtainable for any purpose. Need to know the output of a SQL job you ran last week? It’s just one click or API call away.
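Because captured output is reachable over REST, downstream tools can consume it programmatically. A small Python sketch of that consumption step, using a hypothetical JSON payload shape (the actual Data Flow response format may differ):

```python
import json

# Hypothetical payload for a captured SQL job's output; illustrative only.
payload = json.loads("""
{
  "run_id": "example-run-123",
  "columns": ["region", "total_sales"],
  "rows": [["EMEA", 120000], ["APAC", 95000]]
}
""")

def rows_as_dicts(payload):
    """Pair each row with the column names for easier downstream use."""
    return [dict(zip(payload["columns"], row)) for row in payload["rows"]]

for record in rows_as_dicts(payload):
    print(record)
```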

Bonus Reason: The best value in big data

The five reasons above show why Oracle Cloud Infrastructure Data Flow is the easiest way to run Apache Spark. But there’s one more very practical benefit designed to help convince all of the decision makers and stakeholders along the way: with Oracle Cloud Infrastructure Data Flow, you only pay for the IaaS resources used while they’re being used. In short, there’s no additional charge despite the many features built into this platform. Combined with Oracle Cloud Infrastructure’s already industry-leading price for performance, it may be the best value in big data.

The easiest way to see for yourself is to simply dive in with a test drive. All you need is an Oracle Cloud account. Follow along with our Data Flow Tutorial, which takes you through it step by step and shows you just how simple big data can be with Data Flow. Check out Oracle’s big data management products and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.



GA of Oracle Database 20c Preview Release

The latest annual release of the world’s most popular database, Oracle Database 20c, is now available for preview on Oracle Cloud (Database Cloud Service Virtual Machine).

As with every new release, Oracle Database 20c introduces key new features and enhancements that further extend Oracle’s multi-model converged architecture, with the introduction of Native Blockchain Tables and more performance enhancements such as Automatic In-Memory (AIM) and a binary JSON datatype. For a quick introduction, watch Oracle EVP Andy Mendelsohn discuss Oracle Database 20c during his last Oracle OpenWorld keynote.

For the complete list of new features in Oracle Database 20c, please refer to the new features guide in the latest documentation set. To learn more about some of the key new features and enhancements in Oracle Database 20c, check out the following blog posts:

For availability of Oracle Database 20c on all other platforms on-premises (including Exadata) and in Oracle Cloud, please refer to My Oracle Support (MOS) note 742060.1.



What Is Oracle Cloud Infrastructure Data Catalog?

And What Can You Do with It?

Simply put, Oracle Cloud Infrastructure Data Catalog helps organizations manage their data by creating an organized inventory of data assets. It uses metadata to create a single, all-encompassing and searchable view to provide deeper visibility into your data assets across Oracle Cloud and beyond. This video provides a quick overview of the service.

This helps data professionals such as analysts, data scientists, and data stewards discover and assess data for analytics and data science projects. It also supports data governance by helping users find, understand, and track their cloud data assets and on-premises data as well—and it’s included with your Oracle Cloud Infrastructure subscription.


Why Does Oracle Cloud Infrastructure Data Catalog Matter?

Hint: It has to do with self-service data discovery and governance.

Oracle Cloud Infrastructure Data Catalog matters because it’s a foundational part of the modern data platform—a platform where all of your data stores can act as one, and you can view and access that data easily, no matter whether it resides in Oracle Cloud, object storage, an on-premises database, a big data system, or a self-driving database.

This means that data users—data scientists, data analysts, data engineers, and data stewards—can all find data across systems and the enterprise more easily because a data catalog provides a centralized, collaborative environment to encourage exploration. Now these key players can trust their data because they gain technical as well as business context around it. It means they don’t have to have SQL access, or understand what object storage is, or figure out the complexities of Hadoop—they can get started faster with their single unified view through their data catalog. It’s no longer necessary to have five different people with five different skillsets just to find where the right data resides.

Easy data discovery is now possible.

And of course, it’s not just data discovery that’s easier. Governance is also easier—and that is a key benefit with GDPR and ever more complex compliance requirements in today’s world of multiple enterprise systems, with on-premises, cloud, and multi-cloud environments.

With Oracle Cloud Infrastructure Data Catalog, you have better visibility into all of your assets, and business context is available in the form of a business glossary and user annotations. And of course, understanding the data you have is essential for governance.

How Does Oracle Cloud Infrastructure Data Catalog Work?

Oracle Cloud Infrastructure Data Catalog harvests metadata—technical, business, and operational—from various data sources, users, and assets, and turns it into a data catalog: a single collaborative solution for data professionals to collect, organize, find, access, enrich, and activate metadata, supporting self-service data discovery and governance for trusted data assets across Oracle Cloud.

And what’s so important about this metadata? Metadata is the key to Oracle Cloud Infrastructure Data Catalog. There are three types of metadata that are relevant and key to how our data catalog works:

  • Technical metadata: Describes the storage and structure of the data in a database or system
  • Business metadata: Contributed by users as annotations or business context
  • Operational metadata: Created from the processing and accessing of data; it indicates data freshness and data usage, and connects everything together in a meaningful way
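One way to picture the three metadata types is as fields on a single catalog entry. The Python sketch below is purely illustrative—the field names are invented, not Data Catalog’s actual data model:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    # Technical metadata: storage and structure of the data
    name: str
    source: str
    schema: List[str]
    # Business metadata: user-contributed annotations and context
    annotations: List[str] = field(default_factory=list)
    # Operational metadata: freshness and usage, derived from processing
    last_refreshed: str = "unknown"
    access_count: int = 0

entry = CatalogEntry(
    name="customers",
    source="Oracle Autonomous Data Warehouse",
    schema=["id", "name", "region"],
)
entry.annotations.append("Master customer list, owned by Sales Ops")
entry.access_count += 1
print(entry.name, entry.access_count)  # -> customers 1
```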

You can harvest this metadata from a variety of sources, including:

    • Oracle Cloud Infrastructure Object Storage
    • Oracle Database
    • Oracle Autonomous Transaction Processing
    • Oracle Autonomous Data Warehouse
    • Oracle MySQL Cloud Service
    • Hive
    • Kafka

And the supported file types for Oracle Cloud Infrastructure Object Storage include:

    • CSV, Excel
    • ORC, Avro, Parquet
    • JSON

Once the technical metadata is harvested, subject matter experts and data users can contribute business metadata in the form of annotations to the technical metadata. By organizing all this metadata and providing a holistic view into it, Oracle Cloud Infrastructure Data Catalog helps data users find the data they need, discover information on available data, and gain information about the trustworthiness of data for different uses.

How Can You Use a Data Catalog?

Metadata Enrichment

Oracle Cloud Infrastructure Data Catalog enables users to collaboratively enrich technical information with business context to capture and share tribal knowledge. You can tag or link data entities and attributes to business terms to provide a more all-inclusive view as you begin to gather data assets for analysis and data science projects. These enrichments also help with classification, search, and data discovery.

Business Glossaries

One of the first steps towards effective data governance is establishing a common understanding of business concepts across the organization, and establishing their relationships to the data assets in the organization. Oracle Cloud Infrastructure Data Catalog makes it possible to see associations and linkages between glossary terms and other technical terms, assets, and artifacts. This helps increase user trust because users understand the relationships and what they’re looking at.

Oracle Cloud Infrastructure Data Catalog makes this possible by including capabilities to collaboratively define business terms in rich text form, categorize them appropriately, and build a hierarchy to organize this vocabulary. You can also create parent-child relationships between various terms to build a taxonomy, or set business term owners and approval status so that users know who can answer their questions regarding specific terms. Once created, users can then link these terms to technical assets to provide business meaning and use them for searching as well.
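The parent-child taxonomy described above can be sketched in a few lines of Python. The term names, owner, and linking model here are hypothetical illustrations, not Data Catalog’s actual API:

```python
# A minimal sketch of a business glossary taxonomy with term owners
# and links from terms to technical assets.
class Term:
    def __init__(self, name, owner=None, parent=None):
        self.name = name
        self.owner = owner          # who can answer questions about this term
        self.parent = parent
        self.children = []
        self.linked_assets = []     # technical assets given business meaning
        if parent:
            parent.children.append(self)

    def path(self):
        """Full taxonomy path from the root term down to this term."""
        parts, node = [], self
        while node:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))

finance = Term("Finance", owner="data-steward@example.com")
revenue = Term("Revenue", parent=finance)
revenue.linked_assets.append("sales_fact_table")

print(revenue.path())  # -> Finance > Revenue
```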

Searchable Data Asset Inventory

Being able to search across data stores makes finding the right data so much easier. With Oracle Cloud Infrastructure Data Catalog, you have a powerful, searchable, standardized inventory of the available data sources, entities, and attributes. You can enter technical information, defined tags, or business terms to easily pull up the right data entities and assets. You can also use filtering options to discover relevant datasets, or browse metadata based on the technical hierarchy of data assets, entities, and attributes. These features make it easier to get started with data science, analytics, and data engineering projects.
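At its core, searching by tags or business terms is a filter over the inventory’s metadata. A toy Python sketch of that idea (the asset names and filter model are invented for illustration; the real search is far richer):

```python
# Toy searchable inventory of data assets with tags and linked terms.
inventory = [
    {"name": "customers", "tags": {"pii", "sales"}, "terms": {"Customer"}},
    {"name": "web_logs",  "tags": {"clickstream"},  "terms": set()},
    {"name": "orders",    "tags": {"sales"},        "terms": {"Revenue"}},
]

def search(inventory, tag=None, term=None):
    """Filter assets by defined tag and/or linked business term."""
    hits = inventory
    if tag is not None:
        hits = [a for a in hits if tag in a["tags"]]
    if term is not None:
        hits = [a for a in hits if term in a["terms"]]
    return [a["name"] for a in hits]

print(search(inventory, tag="sales"))  # -> ['customers', 'orders']
```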

Data Catalog API and SDK

Many of Oracle Cloud Infrastructure Data Catalog’s capabilities are also available as public REST APIs to enable integrations such as:

  • Searching and displaying results in applications that use the data assets
  • Looking up definitions of defined business terms in the business glossary and displaying them in reporting applications
  • Invoking job execution to harvest metadata as needed

Available search capabilities include:

  • Search data based on technical names, business terms, or tags
  • View details of various objects
  • Browse Oracle Cloud Infrastructure Data Catalog based on data assets

The single collaborative environment includes:

  • Homepage with helpful shortcuts and operational stats
  • Search and browse
  • Quick actions to manage data assets, glossaries, jobs, and schedules
  • Popular tags and recently updated objects


Oracle Cloud Infrastructure Data Catalog is the underlying foundation for data management that you’ve been waiting for—and it’s included with your Oracle Cloud Infrastructure subscription. Now, data professionals can use technical, business, and operational metadata to support self-service data discovery and governance for data assets in Oracle Cloud and beyond.

Leverage your data in new ways, and more easily than you ever could before. Try Oracle Cloud Infrastructure Data Catalog today and start discovering the value of your data. And don’t forget to subscribe to the Big Data Blog for the latest on Big Data straight to your inbox!

