Accelerating SQL Queries that Span Hadoop and Oracle Database

By: Peter Jeffcock

Big Data Product Marketing

It’s hard to deliver “one fast, secure SQL query on all your data”. Look around and you’ll find plenty of “SQL on Hadoop” implementations that are unaware of any data that isn’t on Hadoop. Then there are solutions that combine the results of two different SQL queries, written in two different dialects and run mostly independently on two different platforms. Those may work, but the person writing the SQL is effectively responsible for optimizing the joint query and implementing its parts in those two different dialects. Even if you get all of that right, the end result is more I/O, more data movement and lower performance.

Big Data SQL is different in several ways. (Start with this blog to get the details.) From the viewpoint of the user, you write one query, in a modern, fully functional dialect of SQL. The data can be located in multiple places (Hadoop, NoSQL databases and Oracle Database), and software, not a human, does all the planning and optimization needed for performance.

Under the covers, one of the key things Big Data SQL does to make queries run faster is minimize I/O and data movement. It does that by pushing as much processing as possible down to where the data is located. Big Data SQL 3.0 completes that task: all the processing that can be pushed down now is pushed down. I’ll give an example in the next post.

What this means is cross-platform queries that are as easy to write, and as fast, as queries written for a single platform. Big Data SQL 3.0 further improves the “fast” part of “one fast, secure SQL query on all your data”. We’d encourage you to test it against anything else out there, whether it’s a true cross-platform solution or something that runs on just one platform.


Delegation and (Data) Management

By: Peter Jeffcock

Big Data Product Marketing

Every business book you read talks about delegation. It’s a core requirement for successful managers: surround yourself with good people, delegate authority and responsibility to them, and get out of their way. It turns out that delegation is a guiding principle for Big Data SQL as well. I’ll show you how, without resorting to code. (If you want code examples, start here.)

Imagine a common situation: customer data about payments and billing lives in your data warehouse, while data derived from log files about customer access to your online platform is stored in Hadoop. Perhaps you’d like to see whether customers who access their accounts online are better at paying up when their bills come due. You might start by finding customers who are behind on payments but have accessed their accounts online in the last month. That means querying your data warehouse and Hadoop together.
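To make the scenario concrete, here is a minimal sketch of what such a query might look like. All table and column names (billing, web_access_log, customer_id and so on) are hypothetical; assume billing is an ordinary table in Oracle Database and web_access_log exposes the Hadoop data as an external table, as described below.

    SELECT b.customer_id,
           b.amount_overdue
    FROM   billing b                                  -- ordinary table in Oracle Database
    JOIN   web_access_log w                           -- external table over data in Hadoop
           ON w.customer_id = b.customer_id
    WHERE  b.days_overdue > 0                         -- behind on payments
    AND    w.last_access_date >= SYSDATE - 30;        -- accessed account in the last month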

Big Data SQL uses enhanced Oracle external tables to access data on other platforms like Hadoop, so your cross-platform query looks like a query on two tables in Oracle Database. This is important because it means that, from the viewpoint of the user (or application) generating the SQL, there’s no practical difference between data in Oracle Database and data in Hadoop.
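As an illustration, an external table over a Hive table might be declared roughly like this. The ORACLE_HIVE access driver is part of Big Data SQL; the table, column and source names here are hypothetical, and the exact access parameters may differ in your environment.

    CREATE TABLE web_access_log (
      customer_id       NUMBER,
      last_access_date  DATE
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE                               -- Big Data SQL access driver for Hive
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (
        com.oracle.bigdata.tablename=web_logs        -- source Hive table (hypothetical)
      )
    )
    REJECT LIMIT UNLIMITED;

Once defined, the external table is queried like any other Oracle table, as in the join sketched earlier.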

But under the covers there are differences, because some of the data is on a remote platform. How you process that data to minimize both data movement and I/O is key to maximizing performance.

Big Data SQL delegates work to Smart Scan software that runs on Hadoop (derived from Exadata’s Smart Scan software). Smart Scan on Hadoop does its own local scan, returning only the rows and columns required to complete the query, which reduces data movement, potentially quite dramatically. And using storage indexing, we can avoid unnecessary I/O as well. For example, if we’ve indexed a data block and know that its minimum value of “days since accessed accounts online” is 34, then we know that none of the customers in that block has accessed their accounts in the last month (30 days), so the block can be skipped without being read. Together, these two techniques increase performance.
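Here is a hypothetical sketch of how that pruning plays out (table, column names and index values are invented for illustration):

    -- Query: customers who accessed their accounts in the last 30 days.
    SELECT customer_id
    FROM   web_access_log
    WHERE  days_since_last_access <= 30;

    -- Storage index kept per block of Hadoop data (maintained automatically):
    --   block 1: MIN(days_since_last_access) = 34, MAX = 210
    --   block 2: MIN(days_since_last_access) =  2, MAX =  90
    -- Block 1's minimum (34) already exceeds 30, so no row in it can match:
    -- the block is skipped without any I/O. Only block 2 is actually read.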

Big Data SQL 3.0 goes one step further, because there’s another opportunity for delegation. Projects like ORC and Parquet, for example, provide efficient columnar storage formats on Hadoop. So if your data is stored in one of them, Big Data SQL’s Smart Scan can delegate work to their readers, further increasing performance. This is the kind of optimization that the fastest SQL on Hadoop implementations do, which is why we think that with Big Data SQL you can get performance comparable to anything else out there.
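For example, if the web logs sit in a Hive table stored as Parquet (a hypothetical HiveQL declaration is sketched below), the columnar layout lets the reader return only the columns a query needs, and Smart Scan can hand that work off rather than scanning whole rows itself.

    -- HiveQL (hypothetical): store the web logs in Parquet's columnar format
    CREATE TABLE web_logs_parquet
      STORED AS PARQUET
      AS SELECT * FROM web_logs;
    -- A query touching only customer_id and last_access_date now reads just
    -- those two columns from disk, instead of every column of every row.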

But remember, with Big Data SQL you can also use the SQL skills you already have (no need to learn a new dialect), your applications can access data in Hadoop and NoSQL databases using the same SQL they already use (no need to rewrite applications), and the security policies in Oracle Database can be applied to data in Hadoop and NoSQL (no need to write code implementing a separate security policy). Hence the tagline: One Fast, Secure SQL Query on All Your Data.
