Interactive Data Lake Queries At Scale

By: Peter Jeffcock

Big Data Product Marketing

Data lakes have been a key aspect of the big data landscape for years now. They provide somewhere to capture and manage all kinds of new data that potentially enable new and exciting use cases. You can read more here or listen to this webcast.

But maybe the key word in that first paragraph is “potentially”, because to realize that value you need to understand the new data you have, then explore and query it interactively so you can form and test hypotheses. Interactive data lake query at scale is not easy. In this blog we’re going to take a look at some of the problems you need to overcome to make full, productive use of all your data. Those problems are also why Oracle acquired SparklineData to help address interactive data lake query at scale; more on that at the end of this article.

Hadoop has been the default platform for data lakes for a while, but it was originally designed for batch rather than interactive work. The development of Apache Spark™ offered a new approach to interactive queries because Spark’s modern distributed compute platform is one or two orders of magnitude faster than Hadoop with MapReduce. Replace HDFS with Oracle’s object storage https://blogs.oracle.com/cloud-infrastructure/bare-metal-cloud-object-storage-overview (Amazon’s equivalent is S3, while Microsoft’s is Blob Storage) and you’ve got the foundation for a modern data lake that can potentially deliver interactive query at scale.

Try building a fully functioning data lake – free

Interactive Query At Scale Is Hard

OK, I said “potentially” again. Because even though you’ve now got a modern data lake, there are some other issues that make interactive query of multi-dimensional data at scale very hard:

  • Performance
  • Pre-aggregation
  • Scale-out
  • Elasticity
  • Tool choice

Let’s look at each one of these in turn.

Interactive queries need fast response times. Users need “think speed analysis” as they navigate worksheets and dashboards. However, performance gets worse when many users try to access datasets in the data lake at the same time. Further, joins between fact and dimension tables can cause additional performance bottlenecks. Many tools have resorted to building an in-memory layer but this approach alone is insufficient. Which leads to the second problem.

Another way to address performance is to extract data from the lake and pre-aggregate it. OLAP cubes, extracts and materialized pre-aggregated tables have been used for a while to facilitate the analysis of multi-dimensional data. But there’s a tradeoff here. This kind of pre-aggregation supports dashboards or reporting, but it is not what you want for more ad-hoc querying. Key information behind the higher-level summaries is not available. It’s like zooming into a digital photograph and getting a pixelated view that obscures the details. What you want is access to all the original data so you can zoom in and look around at whatever you need. Take a look at this more detailed explanation.
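To make that tradeoff concrete, here is a minimal sketch in Python using pandas; the click-stream columns and numbers are invented purely for illustration. A summary built for one dashboard question cannot answer a new ad-hoc question, while the raw events can:

    import pandas as pd

    # Hypothetical raw click-stream events (the "original data" in the lake).
    events = pd.DataFrame({
        "day":     ["2018-04-01", "2018-04-01", "2018-04-01", "2018-04-02"],
        "country": ["US", "US", "DE", "US"],
        "device":  ["mobile", "desktop", "mobile", "mobile"],
        "clicks":  [120, 80, 45, 200],
    })

    # A pre-aggregated extract built for a dashboard: total clicks per day.
    daily_summary = events.groupby("day", as_index=False)["clicks"].sum()

    # New ad-hoc question: mobile clicks per country. The raw events answer it.
    adhoc = (events[events["device"] == "mobile"]
             .groupby("country", as_index=False)["clicks"].sum())
    print(adhoc)

    # The summary cannot: country and device were aggregated away.
    print("device" in daily_summary.columns)  # False, so the detail is gone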

Data lakes can grow quite large. And sooner or later you’re going to need to do analysis on terabytes, rather than gigabytes, of data at a time. Scaling out to this magnitude is a stress test that plenty of tools fail, because they don’t have the kind of distributed compute engine architecture that a framework like Spark brings natively to operate at this scale.

Scaling out successfully is part of the problem. But you also need to scale back down again. In other words, you need an elastic environment, because your workload is going to vary over time in response to anything from the sudden availability of a new data set to the need to analyze a recently-completed campaign or the requirement to support a monthly dashboard update. Elasticity is partly a function of having a modern data lake where compute and storage can scale independently. But elasticity also requires that tools using the data lake have the kind of distributed architecture needed to address scale out.

Finally, getting the most out of your data is not a job for one person or even one role. You need input from data scientists as well as business analysts, and they will each bring their requirements for different tools. You want all the tools to be able to operate on the same data and not have to do unique preparations for each different tool.

Addressing Data Lake Query Problems

Oracle acquired SparklineData last week, and we’re excited because Sparkline SNAP has some innovative solutions to these problems:

  • It runs scale-out in-memory using Apache Spark for performance, scalability and more.
  • It can deliver sub-second queries on terabyte data sets.
  • It doesn’t need to pre-aggregate data or create extracts because it uses in-memory indexes with a fully distributed compute engine architecture.
  • It’s fully elastic when running in a modern data lake based on object storage and Spark clusters.
  • Different users can access data with their tools of choice, including Zeppelin or Jupyter notebooks running Python or R, and BI tools like Tableau. This means that people can use their tool of choice but connect to the Sparkline SNAP in-memory layer.

We’re looking forward to integrating Sparkline SNAP into Oracle’s own data lake and analytics solutions and making it available to our customers as soon as possible.

Interactive Query Use Cases

So when would you want to use this technology? There are lots of use cases, but here are three to think about:

Data from machines and click streams is event/time-series data that can quickly grow in size and complexity. Providing ad-hoc interactive query performance on multi-terabyte data to BI tools connecting live to such data is impossible with current data lake infrastructures. Sparkline SNAP is designed to operate on and analyze such large datasets in place on the data lake, without the need to move and summarize them for performance.

Perhaps all the data you want to work with isn’t currently in a data lake at all. If you have ERP data in multiple different applications and data stores, doing an integrated analysis is a nigh-on impossible task. But if you move it all into object storage and make it accessible to Sparkline SNAP, you can do ad hoc queries as you need, whether the original data came from a single source or from 60 different ones.

Finally, maybe you’re already struggling with all the extracts and pre-aggregation needed to support your current in-memory BI tool. With Sparkline SNAP you can dispense with all that and work on live data at any level of granularity. So not only can you save the time and effort of preparing the data, you can do a better analysis anyway. There’s more information here.

If you’d like to get started with a data lake, then check out this guided trial. In just a few hours you’ll have a functioning data lake, populated with data and incorporating visualizations and machine learning.


Oracle Collaborate 18 Key Sessions for Data Management Cloud Services

By: Ilona Gabinsky

Principal Product Marketing Manager

Oracle Collaborate 18 – Technology and Application Forum – is coming to Las Vegas, Nevada on April 22-26. It offers 1,200 educational sessions and events, 5,000 peer professionals, 200 exhibitors, and unlimited ideas and opportunities. To attend this event, please register here.

Below are the links to Oracle’s Data Management Cloud Sessions:

Steve Daheb, Senior Vice President, Oracle Cloud

Keynote: “Oracle Cloud — How to Build Your Own Personalized Path to Cloud”

Monday, April 23rd, 2:30 p.m. – 3:30 p.m. – Mandalay Bay Ballroom F

Monica Kumar, Vice President, Product Marketing, Oracle

“Revolutionize Your Data Management with the World’s 1st Autonomous Database”

Monday, April 23rd, 4:15 p.m. – 5:15 p.m. – Jasmine G

“The Future of Autonomous Cloud”

Wednesday, April 25th, 11:00 a.m. – 12:00 p.m. – Jasmine G

Sachin Sathaye, Sr. Director, Cloud Platform Services, Oracle

“Your journey to cloud with Choice and Control”

Monday, April 23rd, 4:15 p.m. – 5:15 p.m. – Lagoon L


The Forgotten Link To The Cloud

By: Ilona Gabinsky

Principal Product Marketing Manager

Today we have a guest blogger, Francisco Munoz Alvarez, an author and popular speaker at many Oracle conferences around the world.

Last year, while presenting a session at Collaborate ’17 in Las Vegas on tips and best practices for DBAs, I went through the evolution of the DBA profession and gave a few tips on how a DBA can improve and be successful in his or her career.

After my session, many people approached me with questions about what will happen to the DBA profession now that the cloud is part of our lives. Will the DBA’s workload be fully automated and the DBA profession disappear? Should I be afraid of the cloud? Should I start looking for a new career? Questions like these made me aware of an unexpected situation: DBAs are blocking many possible cloud endeavors for their organizations because they are scared of what the cloud could mean for their future in the industry.

So, after discovering this unexpected situation, I decided to write this post to share my perspective on this very important topic!

Automation vs. Autonomous

Let’s start this post by clearing up some common confusion and misunderstanding. For the past 10 years of my career I have been recommending that DBAs automate most (if not all) business-as-usual (BAU) work and concentrate on becoming as proactive as possible (if you cannot automate a BAU process, delegate it), because you have more important things to do! I keep making this recommendation because I constantly watch DBAs lose too much time on BAU work, which leaves them unable to spend time on career development (training and learning about new technologies), to work on important projects such as optimization, security, performance tuning, high availability, migrations, and upgrades, or to work with new technologies that could seriously benefit the business and deliver the best ROI on company resources.

I would like to use the automotive industry as an example to clarify the difference between automation and autonomy. I love to drive my car; I love to be behind the wheel and enjoy the experience. Recently I bought a new car that includes many driver-assistance technologies (automation). It has autonomous cruise control (which automatically adjusts the vehicle’s speed to maintain a safe distance from vehicles ahead, and even stops the car if necessary), lane-change alerts, collision alerts (which warn you if a collision is likely due to speed or proximity, and brake for you if required), driving-behavior alerts (which check for fatigue and dangerous driving), automatic headlights and windscreen wipers, parking assistance, and much more. Many people might think that all these features would detract from my driving experience, but remember: as the person in control (the driver), you can choose which options are used and when, and they can be adjusted to your driving requirements. So they did not affect me at all; on the contrary, they made my driving easier and safer, and allowed me to enjoy it even more.

We have also been talking about fully autonomous cars (cars that drive themselves; you just tell the car where you are going) for a long time, and many companies are investing resources in this type of technology (Toyota, Tesla, Google, and Uber are only a few) and constantly testing it in public. We know this is the future, and we know it is coming, but not any time soon (not this year or the next), and when the time does come, the global population will adopt it gradually.

The Oracle Database world is very similar to the example above. Automation within the database is a reality and badly needed; autonomous databases are coming, and we cannot stop them. So let’s evolve and be prepared in time; we still have time before everyone starts adopting them gradually.

Cloud, the inevitable next step in the DBA DNA evolution!

The constant evolution of IT has, among other things, affected the role of the DBA. Today the DBA is not merely a database administrator anymore, but is morphing into the database architect role. If you want to become a successful DBA (database architect) and be more competitive in the market, you need a different skill set than was normally required in the past. Nowadays, you need a wide-ranging understanding of architectural design, cloud, networking, storage, licensing, versioning, automation, and much more; the more knowledge you have, the better opportunities you will find.

We know without doubt that our future involves automation and cloud technologies, so why fight it? Why keep wasting our time and energy against it?

Let’s take advantage of it now!

So, what is next?

First, learn to change yourself

If you want to become a successful professional, first you need to educate yourself to be successful! Your future success depends only on your attitude today. You control your career, nobody else!

Becoming a successful DBA is a combination of:

  • Your professional attitude: always think positively and always look for solutions instead of drowning in a glass of water (making a mountain out of a molehill).
  • Learning how to research: before doing something, investigate, search the internet, and read the manuals. You need to show that you know how to do proper research and find solutions to your problems yourself.
  • Being innovative: don’t wait for others to do your job, and don’t stop caring about the business just because other DBAs do. Learn to innovate, learn to become a leader, and make everyone follow your example through results. Think different!
  • Learning to communicate properly: the best way to learn how to communicate effectively is to learn to listen first. Listen, then analyze the context, and only then give an answer to your peers in a professional and honest way. Always treat everyone the same way you would like to be treated.

Albert Einstein once said:

“If I had one hour to save the world, I would spend fifty-five minutes defining the problem and only five minutes finding the solution”

Second, Learn to be Proactive

Why check for problems only when they are critical, when it is too late and the database is down, or when the users are screaming?

Being proactive is the best approach to keep your database healthy and to show your company or your clients that you really care about them.

Many DBAs spend most of their time as firefighters, fixing problems and working on user requests all the time. They don’t do any proactive work, and this mentality only causes an overload of work for them, thousands of dollars of overtime, hours during which users have no access to their data, poor application performance, and, worst of all, unhappy users who think you don’t have the knowledge needed to take care of their data.

Let’s look at a small example: you have the archive log area alert set to fire when the area is 95% full, and it fires in the middle of the night. Some DBAs will take the alert seriously and solve the problem quickly; others will wait until the next day to take care of it because they are tired, asleep, or somewhere without internet access when the alert arrives. It would be a lot easier to set a proactive alert to fire at 75% or 85%, or, even better, to check the general health status of the database before leaving their shift, to detect and solve any possible problem before it becomes a real one that wakes them up in the middle of the night or during the weekend (remember how important your personal and family time is). I always recommend that DBAs run two checklists daily: one at the start of their shift and another before they leave it.

I know several DBAs who complain all the time about how many calls they get when they are on call, but they don’t do anything to solve the root problem; they only spend their time treating the symptoms.

So, let’s change our mentality: let’s stop being firefighters and start being real heroes!

Third, Educate and prepare yourself for the future

Finally, here are some things you should concentrate on learning and some skills you should improve, for example:

  • How to manage different RDBMS technologies (for example: MySQL, SQL Server, DB2, etc.).
  • How to manage NoSQL technologies (for example: Cassandra, Druid, HBase, and MongoDB).
  • How to resolve unavailability issues.
  • Execute recovery tests from current and old backups and document the process for your company’s disaster recovery plan (DRP).
  • Ensure your company’s RPO and RTO SLAs are being met by your high availability plan and backup and recovery strategy.
  • Gain deep knowledge of performance tuning.
  • Learn how your applications work and how they interact with the database and middle layers.
  • Learn how to review and implement security.
  • Keep up with database trends and technologies.
  • Use new technologies when applicable (for example Kafka, microservices, containers, and virtualization).
  • Know how to perform storage and physical design.
  • Diagnose, troubleshoot, and resolve any database-related problems.
  • Ensure that Oracle networking software is configured and running properly.
  • Mentor and train new DBAs (this allows you to review and learn new things).
  • Learn about XML, Java, Python, PHP, HTML, and Linux, Unix, and Windows scripting.
  • Automate all BAU work or delegate it.
  • Implement capacity planning and hardware planning.
  • Architect, deploy, and maintain cloud environments.
  • Improve your SQL and PL/SQL skills and review the SQL and PL/SQL code in your environment.
  • Control and execute code promotions to production environments.
  • Master cloud technologies (IaaS, DBaaS, PaaS, and SaaS).

As you can easily see, we DBAs have a lot of things to do and learn about, so stop losing time on BAU work, because you have far more important things to do and learn.

Embrace the future, the cloud wave, the change, and the evolution. Don’t stay stuck in the past; it will only hurt you and your career in the future!

Francisco’s Bio:

Francisco Munoz Alvarez is an author and popular speaker at many Oracle conferences around the world. He is also the President of CLOUG (Chilean Oracle Users Group), APACOUC (APAC Oracle Users Group Community, which is the umbrella organization for all of APAC), IAOUG (Independent Australia Oracle Users Group) and NZOUG (New Zealand Oracle Users Group). He also worked on the first team to introduce Oracle to South America (Oracle 6 and the beta version of Oracle 7). He was also the first Master Oracle 7 Database Administrator in South America, as well as the first Latin American Oracle professional to be awarded a double ACE (ACE in 2008 and ACE Director in 2009) by Oracle HQ. In 2010, he had the privilege to receive a prestigious Oracle Magazine Editor’s Choice Award as the Oracle Evangelist of the Year–a huge recognition for his outstanding achievements in the Oracle world that includes the creation and organization of the already famous OTN Tours that are the biggest Oracle evangelist events in the world.

Currently, Francisco works as the Director of Innovation for Data Intensity, a global leader in data management consulting and services. He also maintains an Oracle blog (http://www.oraclenz.org), and you can always contact him through the blog or Twitter (@fcomunoz) with any questions about Oracle.


How does Oracle’s Data Lake Enable Big Data Solutions?

By: Wes Prichard

Senior Director Industry Solution Architecture

When I took wood shop back in eighth grade, my shop teacher taught us to create a design for our project before we started building it. The way we captured the design was in what was called a working drawing. In those days it was neatly hand sketched showing shapes and dimensions from different perspectives and it provided enough information to cut and assemble the wood project.

The big data solutions we work with today are much more complex and built with layers of technology and collections of services, but we still need something like working drawings to see how the pieces fit together.

Solution patterns (sometimes called architecture patterns) are a form of working drawing that help us see the components of a system and where they integrate but without some of the detail that can keep us from seeing the forest for the trees. That detail is still important, but it can be captured in other architecture diagrams.

In this blog I want to introduce some solution patterns for data lakes. (If you want to learn more about what data lakes are, read “What Is a Data Lake?”) Data lakes have many uses and play a key role in providing solutions to many different business problems.

Register for a guided trial to build your own data lake

The solution patterns described here show some of the different ways data lakes are used in combination with other technologies to address some of the most common big data use cases. I’m going to focus on cloud-based solutions using Oracle’s platform (PaaS) cloud services.

These are the patterns:

  • Data Science Lab
  • ETL Offload for Data Warehouse
  • Big Data Advanced Analytics
  • Streaming Analytics

Data Science Lab Solution Pattern

Let’s start with the Data Science Lab. We call it a lab because it’s a place for discovery and experimentation using the tools of data science. Data Science Labs are important for working with new data, for working with existing data in new ways, and for combining data from different sources that are in different formats. The lab is the place to try out machine learning and determine the value in data.

Before describing the pattern, let me provide a few tips on how to interpret the diagrams. Each blue box represents an Oracle cloud service. A smaller box attached under a larger box represents a required supporting service that is usually transparent to the user. Arrows show the direction of data flow but don’t necessarily indicate how the data flow is initiated.

The data science lab contains a data lake and a data visualization platform. The data lake is a combination of object storage plus the Apache Spark™ execution engine and related tools contained in Oracle Big Data Cloud. Oracle Analytics Cloud provides data visualization and other valuable capabilities like data flows for data preparation and blending relational data with data in the data lake. It also uses an instance of the Oracle Database Cloud Service to manage metadata.

The data lake object store can be populated by the data scientist using an OpenStack Swift client or the Oracle Software Appliance. If automated bulk upload of data is required, Oracle has data integration capabilities for any need, as described in other solution patterns. The object storage used by the lab could be dedicated to the lab, or it could be shared with other services, depending on your data governance practices.
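As a rough illustration of how a data scientist might push files into the lab’s object store with an OpenStack Swift client, here is a minimal Python sketch using the python-swiftclient library. The endpoint, credentials, container, and file names are placeholders, and the authentication details for your tenancy will differ:

    from swiftclient.client import Connection

    # Placeholder endpoint and credentials; substitute your own tenancy values.
    conn = Connection(
        authurl="https://storage.example.oraclecloud.com/auth/v1.0",  # hypothetical
        user="lab-storage-user",
        key="lab-storage-password",
    )

    # Create a container for the lab (a no-op if it already exists) and upload a file.
    conn.put_container("datalake-lab")
    with open("sensor_readings.csv", "rb") as f:
        conn.put_object("datalake-lab", "raw/sensor_readings.csv",
                        contents=f, content_type="text/csv")

    # List what landed in the container.
    _, objects = conn.get_container("datalake-lab")
    for obj in objects:
        print(obj["name"], obj["bytes"])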

ETL Offload for Data Warehouse Solution Pattern

Data warehouses are an important tool for enterprises to manage their most important business data as a source for business intelligence. Data warehouses, being built on relational databases, are highly structured. Data therefore must often be transformed into the desired structure before it is loaded into the data warehouse.

This transformation processing in some cases can become a significant load on the data warehouse driving up the cost of operation. Depending on the level of transformation needed, offloading that transformation processing to other platforms can both reduce the operational costs and free up data warehouse resources to focus on its primary role of serving data.

Oracle’s Data Integration Platform Cloud (DIPC) is the primary tool for extracting, loading, and transforming data for the data warehouse. Oracle Database Cloud Service provides required metadata management for DIPC. Using Extract-Load-Transform (E-LT) processing, data transformations are performed where the data resides.

For cases where additional transformation processing is required before loading (Extract-Transform-Load, or ETL), or where new data products are going to be generated, data can be temporarily staged in object storage and processed in the data lake using Apache Spark™. This also provides an opportunity to extend the data warehouse with technology to query the data lake directly, a capability of Oracle Autonomous Data Warehouse Cloud.
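As an illustration of that staging step, here is a minimal PySpark sketch of an ETL-offload job: it reads raw files staged in object storage, applies a transformation, and writes the result back for loading into the warehouse. The bucket paths, URI scheme (which depends on the storage connector configured in your Spark cluster), and column names are all assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-offload-sketch").getOrCreate()

    # Hypothetical staging and output locations in object storage.
    staged = "swift://staging.lake/orders/2018-04/*.csv"
    curated = "swift://curated.lake/orders_by_region/"

    orders = (spark.read
              .option("header", "true")
              .option("inferSchema", "true")
              .csv(staged))

    # Example transformation: drop bad rows, then aggregate into the shape
    # the data warehouse expects.
    orders_by_region = (orders
                        .filter(F.col("order_total") > 0)
                        .groupBy("region", "order_date")
                        .agg(F.sum("order_total").alias("total_sales"),
                             F.count("*").alias("order_count")))

    # Write the transformed result back to object storage, ready to load.
    orders_by_region.write.mode("overwrite").parquet(curated)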

Big Data Advanced Analytics Solution Pattern

Advanced analytics is one of the most common use cases for a data lake to operationalize the analysis of data using machine learning, geospatial, and/or graph analytics techniques. Big data advanced analytics extends the Data Science Lab pattern with enterprise grade data integration.

Also, whereas a lab may use a smaller number of processors and storage, the advanced analytics pattern supports a system scaled-up to the demands of the workload.


Oracle Data Integration Platform Cloud provides a remote agent to capture data at the source and deliver it to the data lake either directly to Spark in Oracle Big Data Cloud or to object storage. The processing of data here tends to be more automated through jobs that run periodically.

Results are made available to Oracle Analytics Cloud for visualization and consumption by business users and analysts. Results like machine learning predictions can also be delivered to other business applications to drive innovative services and applications.

Stream Analytics Solution Pattern

The Stream Analytics pattern is a variation of the Big Data Advanced Analytics pattern that is focused on streaming data. Streaming data brings with it additional demands because the data arrives as it is produced and often the objective is to process it just as quickly.

Stream Analytics is used to detect patterns in transactions, like detecting fraud, or to make predictions about customer behavior like propensity to buy or churn. It can be used for geo-fencing to detect when someone or something crosses a geographical boundary.


Business transactions are captured at the source using the Oracle Data Integration Platform Cloud remote agent and published to an Apache Kafka® topic in Oracle Event Hub Cloud Service. The Stream Analytics Continuous Query Language (CQL) engine running on Spark subscribes to the Kafka topic and performs the desired processing like looking for specific events, responding to patterns over time, or other work that requires immediate action.

Other data sources that can be fed directly to Kafka, like public data feeds or mobile application data, can be processed by business-specific Spark jobs. Results like detected events and machine learning predictions are published to other Kafka topics for consumption by downstream applications and business processes.
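To show the shape of such a pipeline, here is a minimal Spark Structured Streaming sketch in Python. Note that this is plain Spark rather than the Oracle Stream Analytics CQL engine itself, and the broker address, topic names, and message format are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-analytics-sketch").getOrCreate()

    # Subscribe to the topic that transactions are published to (placeholder names).
    txns = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "eventhub-broker:9092")
            .option("subscribe", "transactions")
            .load()
            .selectExpr("CAST(value AS STRING) AS value", "timestamp"))

    # Toy "pattern": parse comma-separated events and flag unusually large amounts.
    parsed = txns.select(
        F.split("value", ",").getItem(0).alias("account"),
        F.split("value", ",").getItem(1).cast("double").alias("amount"),
        "timestamp")
    suspects = parsed.filter(F.col("amount") > 10000)

    # Publish detected events to a downstream topic for other applications.
    query = (suspects
             .select(F.col("account").alias("key"),
                     F.to_json(F.struct("account", "amount", "timestamp")).alias("value"))
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "eventhub-broker:9092")
             .option("topic", "suspect-transactions")
             .option("checkpointLocation", "/tmp/checkpoints/suspects")
             .start())
    query.awaitTermination()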

Conclusion

The four different solution patterns shown here support many different data lake use cases, but what happens if you want a solution that includes capabilities from more than one pattern? You can have it. Patterns can be combined, but the cloud also makes it easy to have multiple Oracle Big Data Cloud instances for different purposes with all accessing data from a common object store.

Now you’ve seen some examples of how Oracle Platform Cloud Services can be combined in different ways to address different classes of business problem. Use these patterns as a starting point for your own solutions. And even though it’s been a few years since eighth grade, I still enjoy woodworking and I always start my projects with a working drawing.

If you’re ready to test these data lake solution patterns, try Oracle Cloud for free with a guided trial, and build your own data lake.

The Documents contained within this site may include statements about Oracle’s product development plans. Many factors can materially affect Oracle’s product development plans and the nature and timing of future product releases. Accordingly, this Information is provided to you solely for information only, is not a commitment to deliver any material, code, or functionality, and SHOULD NOT BE RELIED UPON IN MAKING PURCHASING DECISIONS. The development, release, and timing of any features or functionality described remains at the sole discretion of Oracle. THIS INFORMATION MAY NOT BE INCORPORATED INTO ANY CONTRACTUAL AGREEMENT WITH ORACLE OR ITS SUBSIDIARIES OR AFFILIATES. ORACLE SPECIFICALLY DISCLAIMS ANY LIABILITY WITH RESPECT TO THIS INFORMATION. Refer to the LEGAL NOTICES AND TERMS OF USE (http://www.oracle.com/html/terms.html) for further information.



Oracle Exadata Cloud Service Certified for SAP Applications

By: Ilona Gabinsky

Principal Product Marketing Manager

Today we have a guest blogger, Bertrand Matthelie, Senior Principal Product Marketing Director.

In an earlier blog, I described how moving SAP applications and Oracle databases to Oracle Cloud enables customers to preserve existing investments while accelerating innovation, relying on the only cloud architected for enterprise workloads and optimized for Oracle Database.

Further enhancing the unique value of Oracle Cloud, SAP Applications based on NetWeaver 7.x have now been certified on Oracle Exadata Cloud Service.

Oracle Exadata is the best-performing, most available, and most secure architecture for running Oracle Database. Oracle Exadata Cloud Service enables you to:

  • Get the full performance of Oracle Exadata in the cloud
  • Combine the benefits of on-premises and cloud
  • Increase business agility and operational flexibility with zero CapEx
  • Scale out quickly and easily

The Oracle Exadata Database Machine has proven to be a very popular solution to power SAP deployments. Customers running SAP applications with Exadata on-premises can easily move their SAP workloads to Exadata Cloud Service and benefit from Oracle’s BYOL to PaaS program.

All features and options of Oracle Database 12c Release 1 (12.1.0.2) and Release 2 (12.2.0.1) that are supported for on-premises deployments of SAP NetWeaver, including Real Application Clusters (RAC), Automatic Storage Management (ASM), and Oracle Database In-Memory, are also supported and certified for Exadata Cloud Service.

For more information, read SAP Note 2614028 “SAP NetWeaver Application Server ABAP/Java on Oracle Database Exadata Cloud Service” and our White Paper “SAP NetWeaver Application Server ABAP/Java on Oracle Database Exadata Cloud Service”.

Let us know if you have any questions or comments.


Integrating Autonomous Data Warehouse and Big Data Using Object Storage

By: Peter Jeffcock

Big Data Product Marketing

While you can run your business on the data stored in Oracle Autonomous Data Warehouse, there’s lots of other data out there that is potentially valuable. Using Oracle Big Data Cloud, it’s possible to store and process that data, making it ready to be loaded into or queried by the Autonomous Data Warehouse. The point of integration for these two services is object storage, which I will explore below. Of course, you need more than this for a complete big data solution. If that’s what you’re looking for, you should read about data lake solution patterns.

Sign up for a free trial to build and populate a data lake in the cloud

Use Cases for the Data Lake and Data Warehouse

Autonomous Data Warehouse and Big Data

Almost all big data use cases involve data that resides in both a data lake and data warehouse. With predictive maintenance, for example, we would want to combine sensor data (stored in the data lake) with official maintenance and purchase records (stored in the data warehouse).

When trying to determine the next best action for a given customer, we would want to work with both customer purchase records (in the data warehouse) and customer web browsing or social media usage (details of which would most likely be stored in the data lake). In use cases from manufacturing to healthcare, having a complete view of all available data means working with data in both the data warehouse and the data lake.

The Data Lake and Data Warehouse for Predictive Maintenance

Take predictive maintenance as an example. Official maintenance records and purchase or warranty information are all important to the business. They may be needed for regulators to check that proper processes are being followed, or for purchasing departments to manage budgets or order new components.

On the other hand, sensor information from machines, weather stations, thermometers, seismometers, and similar devices all produce data that is potentially useful to help understand and predict the behavior of some piece of equipment. If you asked your data warehouse administrator to store many terabytes of this raw, less well-understood, multi-structured data, they would not be very enthusiastic. This kind of data is much better suited for a data lake, where it can be transformed or used as the input for machine learning algorithms. But ultimately, you want to combine both data sets to predict failures or a component moving out of tolerance.

Examples: How Object Storage Works with the Data Warehouse

We talked previously about how object storage is the foundation for a modern data lake. But it’s much more than that. Object storage is used, amongst other things, for backup and archive, to stage data for a data warehouse, or to offload data that is no longer stored there. And these use cases require that the data warehouse can also work easily with object storage, including data in the data lake.

Let’s go back to that predictive maintenance use case. After being loaded into the data lake (in object storage), the sensor data can be processed in a Spark cluster spun up by Oracle Big Data Cloud. “Processing” in this context could be anything from a simple filter or aggregation of results to running a complex machine learning algorithm to uncover hidden patterns.

Once that work is done, a table of results will be written back to object storage. At that point, it could be loaded into the Autonomous Data Warehouse or queried in place. Which approach is best? Depends on the use case. In general, if that data is accessed more frequently, or performance of the query is more important, then loading into the Autonomous Data Warehouse is probably optimal. Here you can think of object storage as another tier in your storage hierarchy (note that Autonomous Data Warehouse already has RAM, flash, and disk as storage tiers).
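As a sketch of what those two options can look like from Python, the snippet below uses the python-oracledb driver to call the DBMS_CLOUD package in Autonomous Data Warehouse, once to define an external table over the results file for query-in-place, and once to copy the same file into a regular table. The connection details, credential name, object URI, and column definitions are placeholders, and it assumes the credential and the SENSOR_RESULTS target table already exist:

    import oracledb

    # Placeholder connection details for an Autonomous Data Warehouse instance.
    conn = oracledb.connect(user="admin", password="***", dsn="adw_high")
    cur = conn.cursor()

    results_uri = "https://objectstorage.example.com/n/lake/b/results/o/sensor_results.csv"

    # Option 1: query in place via an external table over the object-storage file.
    cur.execute("""
        BEGIN
          DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
            table_name      => 'SENSOR_RESULTS_EXT',
            credential_name => 'OBJ_STORE_CRED',          -- created beforehand
            file_uri_list   => :uri,
            column_list     => 'device_id VARCHAR2(32), risk_score NUMBER',
            format          => '{"type":"csv", "skipheaders":1}'
          );
        END;""", uri=results_uri)

    # Option 2: load into a regular warehouse table (SENSOR_RESULTS must exist)
    # for more frequent, performance-sensitive access.
    cur.execute("""
        BEGIN
          DBMS_CLOUD.COPY_DATA(
            table_name      => 'SENSOR_RESULTS',
            credential_name => 'OBJ_STORE_CRED',
            file_uri_list   => :uri,
            format          => '{"type":"csv", "skipheaders":1}'
          );
        END;""", uri=results_uri)
    conn.commit()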

We can also see a similar approach in an ETL offload use case. Raw data is staged into object storage. Transformation processes then run in one or more Big Data Cloud Spark clusters, with the results written back to object storage. This transformed data is then available to load into Autonomous Data Warehouse.

Autonomous Data Warehouse and Big Data Cloud: Working Together

Don’t think of Oracle Autonomous Data Warehouse and Oracle Big Data Cloud as two totally separate services. They have complementary strengths and can interoperate via object storage. And when they do, it will make it easier to take advantage of all your data, to the benefit of your business as a whole.

If you’re interested in learning more, sign up for an Oracle free trial to build and populate your own data lake. We have tutorials and guides to help you along.


Autonomous Capabilities Will Make Data Warehouses—and DBAs—More Valuable

By: Ilona Gabinsky

Principal Product Marketing Manager

Today we have guest blogger – Alan Zeichick – principal analyst at Camden Associates

As the old saying goes, you can’t manage what you don’t measure. In a data-driven organization, the best tools for measuring performance are business intelligence (BI) and analytics engines, which require data. And that explains why data warehouses continue to play such a crucial role in business. Data warehouses often provide the source of that data by rolling up and summarizing key information from a variety of sources.

Data warehouses, which are themselves relational databases, can be complex to set up and manage on a daily basis, so they typically require significant human involvement from database administrators (DBAs). In a large enterprise, a team of DBAs ensures that the data warehouse is extracting data from those disparate data sources, as well as accommodating new and changed data sources. They’re also making sure the extracted data is summarized properly and stored in a structured manner that can be handled by other applications, including those BI and analytics tools.

On top of that, DBAs are managing the data warehouse’s infrastructure: everything from server processor utilization and storage efficiency to data security, backups, and more.

However, the labor-intensive nature of data warehouses is about to change, with the advent of Oracle Autonomous Data Warehouse Cloud, announced in October 2017. The self-driving, self-repairing, self-tuning functionality of Oracle’s Data Warehouse Cloud is good for the organization—and good for the DBAs.

No Performance-Tuning Knobs

Data-driven organizations need timely, up-to-date business intelligence, which can feed instant decision-making, short-term predictions and business adjustments, and long-term strategy. If the data warehouse goes down, slows down, or lacks some information feeds, the impact can be significant. No data warehouse may mean no daily operational dashboards and reports, or inaccurate dashboards or reports.

Oracle Autonomous Data Warehouse Cloud is a powerful platform, because the customer doesn’t have to worry about the system itself, explains Penny Avril, vice president of product management for Oracle Databases.

“Customers don’t have to worry about the operational management of the underlying database—provisioning, scaling, patching, backing up, failover, all of that is fully automated,” she says. “Customers also don’t have to worry about performance. There are no performance knobs for the customer: DBAs don’t have to tweak anything themselves.”

For example, one technique used to drive Autonomous Data Warehouse’s performance is automating the process of creating storage indexes, which Avril describes as the top challenge faced by database administrators. Those indexes allow applications to quickly extract the data required to handle routine reports or ad-hoc queries.

“DBAs manually create custom indexes when they manage their own data warehouse. Now, the autonomous data warehouse transparently, and continually, generates indexes automatically based on the queries coming in,” she says. Those automatically created indexes keep performance high, without any manual tuning or intervention required by DBAs.

  • Related: Join Oracle CEO Mark Hurd for the release of the world’s first autonomous database cloud service. Register now.

The organization also can benefit by the automatic scaling features of Autonomous Data Warehouse. When the business requires more horsepower in the data warehouse to maintain performance during times of high utilization, the customer can add more processing power by adding more CPUs to the cloud service, for which there is an additional cost. However, Avril says, “Customers can scale back down again when the peak demand is over”—eliminating that extra cost until the next time the CPUs are needed.

Customers can even turn off the processing entirely if needed. “When a customer suspends the service, they pay for storage, but not CPU,” she says. “That’s great for developers and test beds. It’s great for ad-hoc analytics for people running queries. When you don’t need a particular data warehouse, you can just suspend it.”

Freedom for the Database Administrator

Performance optimization, self-repairing, self-securing, scalability up and down—those benefits serve the organization. What about the poor DBA? Is he or she out of work? Not at all, says Avril, laughing at the question. “They can finally tackle the task backlog,” adding more value to the business, she says.

Avril explains that DBAs do two types of day-to-day work. “There are generic tasks, common to all databases, including data warehouses. And there are tasks that are specific to the business. With Oracle’s Autonomous Data Warehouse, the generic tasks go away. Configuring, tuning, provisioning, backup, optimization—gone.”

That leaves the good stuff, she explains: “If they aren’t overloaded with generic tasks, DBAs can do business-specific tasks, like data modeling, integrating new data sources, application tuning, and end-to-end service level management.”

For example, DBAs will have to manage how applications connect to the data warehouse—and what happens if things go wrong. “If the database survives a failure through failover, does the application know to failover instantly and transparently? The DBA still needs to manage that,” Avril says.

In addition, data security still must be managed. “Oracle will take care of patching the data warehouse itself, but Oracle doesn’t see the customer’s data,” she says, “DBAs still need to understand where the data lives, what the data represents, and which people and applications should get to see which data.”

No need for a resume writer: DBAs will still have plenty of work to do.

For C-level executives, Autonomous Data Warehouse can improve the value of the data warehouse—and the responsiveness of business intelligence and other important applications—by improving availability and performance. “The value of the business is driven by data, and by the usage of the data,” says Avril. “For many companies, the data is the only real capital they have. Oracle is making it easier for the C-level to manage and use that data. That should help the bottom line.”

For the DBA, Autonomous Data Warehouse means the end of generic tasks that, on their own, don’t add significant value to the business. Stop worrying about uptime. Forget about disk-drive failures. Move beyond performance tuning. DBAs, you have a business to optimize.

Alan Zeichick is principal analyst at Camden Associates, a tech consultancy in Phoenix, Arizona, specializing in software development, enterprise networking, and cybersecurity. Follow him @zeichick.


Machine Learning Use Case: Real-Time Support for Engineered Systems

When customers have a fully supported engineered system, they know they have a single point of contact for when things go wrong. So when an Oracle banking customer experienced motherboard failures in engineered systems located in business-critical data centers, they knew who to turn to.

Conventional telemetry monitoring for IT systems yielded no clues.

And frustratingly, when these multi-CPU motherboards costing $100,000 each were swapped and brought to service centers for extensive testing, none of the failure triggers could be reproduced.

How Machine Learning Discovered the Problem

We talked to Oracle Architect Kenny Gross to find out how they discovered the answer at Oracle Labs. Oracle applied machine-learning prognostics with AI-based pattern recognition to all systems in the data center, using an algorithm called the Multivariate State Estimation Technique (MSET).

Register for a free trial to build a data lake and try machine learning techniques

It almost immediately discovered the root cause of these motherboard trips.

In this case, there was an issue with the third-party bulk power supplies, where a tiny component called a voltage regulator would issue a short sequence of voltage wiggles. These wiggles weren’t large enough to trigger any warning alerts for the power supplies, but they would cause the “downstream” motherboards to experience random (and unreproducible) fatal errors.

System-wide big-data pattern recognition identified the root problem, and the solution was actually very inexpensive. It involved more machine learning to find the small number of power supplies that had the defective voltage regulators.

Machine learning-based global pattern-recognition surveillance would proactively identify the incipience of tiny voltage irregularities in power supplies, and the power supplies were inexpensive and easy to swap with no interruption in service.

This was especially beneficial because of the ability to proactively swap a small number of power supplies exhibiting elevated risk, as opposed to having to swap all power supplies in all the data centers, which is what they would have done without machine learning surveillance.

Use Case Outcome After Applying Machine Learning

After applying machine-learning techniques and MSET, the customer went to “5-nines” availability for their business-critical IT assets.

The bank’s CIO and his staff of IT experts were so impressed by the evidence and solution derived from machine learning, they immediately ordered three more engineered systems and requested to leave the machine learning prognostics on all of the IT systems.

In addition, the evidence from the spurious fault mode was made available to the power-supply manufacturing company, which enabled them to implement a more robust voltage regulator component in their power supplies going forward.

That’s the benefit of an engineered system, where the customer is supported at every step.

What Is MSET?

We talked a little about what MSET can do. But what is it?

MSET is an advanced prognostic pattern recognition method that was originally developed by Argonne National Laboratory for high-sensitivity prognostic fault monitoring applications in commercial nuclear power applications.

It has since been spun off and met with commercial success for prognostic machine learning applications in a broad range of applications, including NASA space shuttles, Lufthansa air fleets, and Disney theme park structural and system safety instrumentation, just to name a few examples.

In the last few years, Oracle has pioneered the use of real-time MSET prognostics for sensitive early detection of anomalies in business-critical enterprise computing servers (called Electronic Prognostics) and software systems (where MSET detects performance anomalies from resource-contention issues and complex memory leaks), storage systems, and networks.

The MSET advantages (versus conventional machine learning approaches such as neural networks and support vector machines) include:

  • Higher prognostic accuracy
  • Lower false-alarm probabilities
  • Lower missed-alarm probabilities
  • Lower overhead compute cost

Much of this is crucial for real-time dense-sensor streaming prognostics. It will also be crucial for the new Internet of Things (IoT), where sensors are ubiquitous, to discriminate between real problems and an inexpensive sensor failure.

How Does MSET Work?

The MSET framework consists of a training phase and a monitoring phase. The training procedure is used to characterize the monitored equipment using historical, error-free operating data covering the envelope of possible operating regimes for the system variables under surveillance. This training procedure evaluates the available training data and selects an optimal subset of the data observations (memory vectors) that are determined to best characterize the monitored asset’s normal operation.

It creates a stored model of the equipment that is used in the monitoring phase to estimate the expected values of the signals under surveillance. In the monitoring step, new incoming observations for all the asset signals are used in conjunction with the trained MSET model to estimate the expected values of the signals.
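Oracle’s production MSET implementation is proprietary, but purely to illustrate the train-then-monitor flow just described, here is a simplified, kernel-based sketch in Python; the memory-vector selection, the Gaussian similarity function, and the residual thresholds are all stand-ins for the real technique’s more sophisticated machinery:

    import numpy as np

    def select_memory_vectors(train, n_mem=50):
        # Pick observations spanning the operating envelope: the rows holding each
        # signal's min and max, plus rows spread evenly across the training set.
        idx = set()
        for j in range(train.shape[1]):
            idx.add(int(np.argmin(train[:, j])))
            idx.add(int(np.argmax(train[:, j])))
        idx.update(np.linspace(0, len(train) - 1, n_mem, dtype=int).tolist())
        return train[sorted(idx)]

    def similarity(a, b, h=1.0):
        # Gaussian-kernel similarity between every row of a and every row of b.
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        return np.exp(-(d / h) ** 2)

    class SimpleMsetLikeMonitor:
        def fit(self, train, n_mem=50, h=1.0):
            # "Training": store memory vectors and the inverse similarity matrix,
            # then record how noisy the residuals are on healthy data.
            self.h = h
            self.D = select_memory_vectors(train, n_mem)
            self.G_inv = np.linalg.pinv(similarity(self.D, self.D, h))
            self.resid_std = (train - self.estimate(train)).std(axis=0)
            return self

        def estimate(self, X):
            # "Monitoring": estimate the expected value of every signal for each
            # new observation as a weighted combination of the memory vectors.
            w = self.G_inv @ similarity(self.D, X, self.h)
            return (self.D.T @ w).T

        def alerts(self, X, k=5.0):
            # Flag signals whose residuals drift far beyond their training noise.
            return np.abs(X - self.estimate(X)) > k * self.resid_std

Fitting the monitor on a matrix of healthy telemetry (rows are observations, columns are signals) and calling alerts() on incoming observations flags signals that have wandered outside their learned correlations; the real MSET adds a purpose-built nonlinear similarity operator and statistical fault-detection tests on the residuals.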

How Is MSET Used?

We mentioned one use case in this article already. But MSET has been gaining in popularity because its advantages scale to truly big-data streaming analytics, a vital capability for IoT use cases. Even in MSET’s earlier days, some of the biggest challenges involved massive sensor fleets on critical assets. Today, one large (refrigerator-sized) enterprise server has 3400 internal sensors (about the same number as a commercial nuclear reactor), and one medium-sized data center contains a million sensors.

We’re now finding beneficial spinoff prognostic applications in multiple industries where the number of sensors has been growing with IoT digital transformation initiatives. For example, one jumbo jet now contains 75,000 sensors, one transmission grid for a medium-sized US utility comprises over 50,000 critical assets (each containing multiple sensors), and one modern oil refinery contains a million sensors.

Most MSET use cases are real-time, but some customers extract just as much prognostic value by storing all of the telemetry in a “data historian” file and running MSET once or twice per day. MSET is flexible and equally valuable for real-time surveillance or batch-mode prognostics on data-historian signals.

Benefits of MSET

Some of the advantages of MSET for big data prognostic applications include:

  • Extremely accurate estimates, with uncertainty bounds that are usually only one to two percent of the standard deviation of the raw sensor input signals
  • Extremely high sensitivity for detecting subtle disturbances in noisy process variables, combined with extremely low false positives and false negatives (a toy sketch of this tradeoff follows the list)
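
As a toy illustration of that sensitivity/false-alarm tradeoff (again a sketch under assumed values, not MSET itself), the Python fragment below simulates a voltage signal whose drift is only about twice the sensor noise. A per-sample threshold on the raw residuals struggles with a disturbance this subtle, while averaging the residuals over a short window shrinks the noise so a proportionally tighter threshold can catch it.

# Toy illustration of residual-based detection sensitivity (not MSET itself).
# All signal values, the window size, and the thresholds are assumptions.
import numpy as np

rng = np.random.default_rng(1)
nominal, noise_sd = 12.0, 0.05                       # e.g. a 12 V rail with sensor noise
signal = nominal + rng.normal(0.0, noise_sd, 500)
signal[300:] += np.linspace(0.0, 2 * noise_sd, 200)  # slow drift up to ~2x the noise level

residuals = signal - nominal                         # stand-in for "observed minus estimated"

# Per-sample 3-sigma rule applied directly to the noisy residuals.
per_sample_alarms = np.flatnonzero(np.abs(residuals) > 3 * noise_sd)

# Window-averaged residuals: noise shrinks by sqrt(window), so a proportionally
# tighter 3-sigma threshold becomes sensitive to the small drift.
window = 25
smoothed = np.convolve(residuals, np.ones(window) / window, mode="valid")
windowed_alarms = np.flatnonzero(np.abs(smoothed) > 3 * noise_sd / np.sqrt(window))

print("per-sample alarm indices:", per_sample_alarms)
print("first windowed alarm ends at sample:",
      windowed_alarms[0] + window - 1 if len(windowed_alarms) else None)

MSET’s actual detection machinery is more sophisticated than a fixed threshold, but the same principle, accurate estimates plus statistical treatment of the residuals, is behind the accuracy and low alarm rates described above.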

Oracle’s MSET-based prognostic innovations help increase component reliability margins and meet system availability goals. At the same time, these innovations reduce (through improved root-cause analysis) costly sources of “no trouble found” events that have become a significant issue across enterprise computing and other industries.

The benefits of Oracle’s MSET approach to big-data prognostics carry over to other fields, including oil and gas and utilities, where proactive maintenance of business-critical assets is essential. It reduces operations and maintenance costs, improves the up-time availability of revenue-generating assets, and improves safety margins for life-critical systems.

If you have further questions about MSET and its capabilities, feel free to contact us. And if you’d like to try building a data lake and using machine learning on the data, Oracle offers a free trial.


Autonomous vs. Automated

By: Ilona Gabinsky

Principal Product Marketing Manager

The invention of the telephone in 1876 made an immense impact on human communication, paving the way for an explosion of inventions and business opportunities.

In the late 1970s the dawn of the personal computer ushered in an even more intimate connection with technology. The Internet became the ultimate technological disruption in 1991, when it enabled easy communication around the globe with personalized content delivery.

Fast-forward to 2007 when the iPhone let us carry our computer in our pocket and engage with technology anytime, anywhere. Since then, technological disruption has been the norm, not the exception, driven in large part by advances in Artificial Intelligence (AI) and machine learning. In 2008, Tesla made the dream of the electric car real, and just last year Elon Musk showcased the first fully autonomous self-driving car.

When it comes to cars, some might think of autonomous, automated, and self-driving as interchangeable terms, but there’s a critical difference to note. An automated car takes over limited functions, such as cruise control, but the driver must still keep overall manual control of the vehicle. An autonomous car, by contrast, eliminates the need for human interaction, so the driver can sit back and enjoy the ride, reducing the stress of the commute and freeing them to focus on other important activities. And it’s thanks to AI that an autonomous car has a level of intelligence and independence that only machine learning can bring.

Machine learning is a subset of artificial intelligence with the sophistication to discover hidden opportunities, accelerate tedious processes, and identify which data insights matter. And the good news is that it’s not limited to cars.

At Oracle OpenWorld last year, Oracle Chairman of the Board and CTO Larry Ellison unveiled his vision for the world’s first autonomous database cloud that is self-driving, self-securing, and self-repairing. “This is the most important thing Oracle has done in a long, long time. The same way self-driving cars open a new world of possibilities in our life, self-driving database will bring the world of technology to a different level – reduced risk & cost, unprecedented availability, performance, security, flexibility.”

Building on the next generation of the industry-leading Oracle Database 18c, the Oracle Autonomous Database Cloud uses ground-breaking machine learning to eliminate human labor, human error, and manual tuning. The result? Unprecedented availability, high performance, and security, all for a much lower cost.

The Oracle Autonomous Database is “self-driving,” meaning that it autonomously upgrades and patches itself while running. No human intervention required.

The Oracle Autonomous Database, like the technological innovations that preceded it, didn’t just happen overnight. Oracle has been developing sophisticated database automation for decades and has invested thousands of engineer-years in automating key database functions, as this journey map shows.

Oracle Database 18c, the latest generation of the world’s most popular database, is now available. It provides businesses of all sizes with access to the world’s fastest, most scalable, and most reliable database technology.

Oracle’s self-driving database disrupts the world of database management in the same way self-driving cars are changing the way we commute and revolutionizing the transportation industry. What does it mean for Oracle customers when the world’s #1 database becomes the world’s first autonomous database? The payoff is huge. Because no human intervention is needed, the Oracle Autonomous Database eliminates mundane management tasks and reduces labor, costs, and errors, all while increasing security and availability.

Game-changing technologies of the past, such as the phone, the computer, and the internet, fundamentally changed lives and opened the door to new opportunities and innovations, allowing us to dream big. The Oracle Autonomous Database Cloud is revolutionizing how data is managed, enabling faster, easier data access and helping to unlock the potential of your data so your business can benefit, and dream big.

Read more about the Oracle Autonomous Database Cloud.
