information from a UNIX syslog file in the Bluemix Time Series
Database. I also show you how to use queries on that database to create a
dashboard to present the information from that file graphically.
The title’s not mine. It comes from a video done for us by ESG, based on their white paper, which looks at the TCO of building your own Hadoop cluster vs buying one ready-built (Oracle Big Data Appliance). You should watch or read, depending on your preference, or even just check out the infographic. The conclusion could be summed up as “better, faster, cheaper, pick all three”. Which is not what you’d expect. But they found that it’s better (quicker to deploy, lower risk, easier to support), faster (from 2X to 3X faster than a comparable DIY cluster) and cheaper (45% cheaper if you go with list pricing).
So while you may not think that an engineered system like the Big Data Appliance is the right system for you, it should always be on your shortlist. Compare it with building your own – you’ll probably be pleasantly surprised.
There’s a lot more background in the paper in particular, but let me highlight a few things:
– We have seen some instances where other vendors offer huge discounts and actually beat the BDA price. If you see this, check two things. First, will that discount be available for all future purchases or is this just a one-off discount. And second, remember to include the cost that you incur to setup, manage, maintain and patch the system.
– Consider performance. We worked with Intel to tune Hadoop for this specific configuration. There are something like 500 different parameters on Hadoop that can impact performance one way or the other. That tuning project was a multi-week exercise with several different experts. The end result was performance of nearly 2X, sometimes up to 3X faster than a comparable, untuned DIY cluster. Do you have the resources and expertise to replicate this effort? Would a doubling of performance be useful to you?
– Finally, consider support. A Hadoop cluster is a complex system. Sometimes problems arise that result from the interaction of multiple components. It can be really hard to figure those out, particularly when multiple vendors are involved for different pieces. When no single component is “at fault” it’s hard to find somebody to fix the overall system. You’d never buy a computer with 4 separate support contracts for operating system, CPU, disk and network card – you’d want one contract for the entire system. The same can be true for your Hadoop clusters as well.
Leading into 2016, Oracle made ten big data predictions, and one in particular around security. We are nearly four months into the year and we’ve seen these predictions coming to light.
Early February saw the creation of the Federal Privacy Council, “which will bring together the privacy officials from across the Government to help ensure the implementation of more strategic and comprehensive Federal privacy guidelines. Like cyber security, privacy must be effectively and continuously addressed as our nation embraces new technologies, promotes innovation, reaps the benefits of big data and defends against evolving threats.”
The European Union General Data Protection Regulation is a reform of EU’s 1995 data protection rules (Directive 95/46/EC). Their Big Data fact sheet was put forth to help promote the new regulations. “A plethora of market surveys and studies show that the success of providers to develop new services and products using big data is linked to their capacity to build and maintain consumer trust.” As a timeline, the EU expects adoption in Spring 2016 and enforcement will begin two years later in Spring 2018.
Earlier this month, the Federal Communications Commission announced a proposal to restrict Internet providers’ ability to share the information they collect about what their customers do online with advertisers and other third parties.
Infosecurity Magazine article highlights the challenge of data growth and the requirement for classification: “As storage costs dropped, the attention previously shown towards deleting old or unnecessary data has faded. However, unstructured data now makes up 80% of non-tangible assets, and data growth is exploding. IT security teams are now tasked with protecting everything forever, but there is simply too much to protect effectively – especially when some of it is not worth protecting at all.”
The three benefits of classification highlighted include the ability to raise security awareness, prevent data loss, and address records management regulations. All of these are legitimate benefits of data classification that organizations should consider. Case in point, Oracle customer Union Investment increased agility and security by automatically processing investment fund data within their proprietary application, including complex asset classification with up to 500 data fields, which were previously distributed to IT staff using spreadsheets.
This is sort of a no-brainer. We know more breaches are coming, such as here, here and here. And we know companies increase security spending after they experience a data breach or witness one close to home. Most organizations now know that completely eliminating the possibility of a data breach is impossible, and therefore, appropriate detective capabilities are more important than ever. We must act as if the bad guys are on our network and then detect their presence and respond accordingly.
See the rest of the Enterprise Big Data Predictions, 2016.
It’s hard to deliver “one fast, secure SQL query on all your data”. If you look around you’ll find lots of “SQL on Hadoop” implementations which are unaware of data that’s not on Hadoop. And then you’ll see other solutions that combine the results of two different SQL queries, written in two different dialects, and run mostly independently on two different platforms. That means that while they may work, the person writing the SQL is effectively responsible for optimizing that joint query and implementing the different parts in those two different dialects. Even if you get the different parts right, the end result is more I/O, more data movement and lower performance.
Big Data SQL is different in several ways. (Start with this blog to get the details). From the viewpoint of the user you get one single query, in a modern, fully functional dialect of SQL. The data can be located in multiple places (Hadoop, NoSQL databases and Oracle Database) and software, not a human, does all the planning and optimization to accelerate performance.
Under the covers, one of the key things it tries to do is minimize I/O and minimize data movement so that queries run faster. It does that by trying to push down as much processing as possible to where the data is located. Big Data SQL 3.0 completes that task: now all the processing that can be pushed down, is pushed down. I’ll give an example in the next post.
What this means is cross-platform queries that are as easy to write, and as highly performant, as a query written just for one platform. Big Data SQL 3.0 further improves the “fast” part of “one fast, secure SQL query on all your data”. We’d encourage you to test it against anything else out there, whether it’s a true cross-platform solution or even something that just runs on one platform.
Every business book you read talks about delegation. It’s a core requirement for successful managers: surround yourself with good people, delegate authority and responsibility to them, and get out of their way. It turns out that this is a guiding principle for Big Data SQL as well. I’ll show you how. And without resorting to code. (If you want code examples, start here).
Imagine a not uncommon situation where you have customer data about payments and billing in your data warehouse, while data derived from log files about customer access to your online platform is stored in Hadoop. Perhaps you’d like to see if customers who access their accounts online are any better at paying up when their bills come due. To do this, you might want to start by determining who is behind on payments, but has accessed their account online in the last month. This means you need to query both your data warehouse and Hadoop together.
Big Data SQL uses enhanced Oracle external tables for accessing data in other platforms like Hadoop. So your cross-platform query looks like a query on two tables in Oracle Database. This is important, because it means from the viewpoint of the user (or application) generating the SQL, there’s no practical difference between data in Oracle Database, and data in Hadoop.
But under the covers there are differences, because some of the data is on a remote platform. How you process that data to minimize both data movement and I/O is key to maximizing performance.
Big Data SQL delegates work to Smart Scan software that runs on Hadoop (derived from Exadata’s Smart Scan software). Smart Scan on Hadoop does its own local scan, returning only the rows and columns that are required to complete that query, thus reducing data movement, potentially quite dramatically. And using storage indexing, we can avoid some unnecessary I/O as well. For example, if we’ve indexed a data block and know that the minimum value of “days since accessed accounts online” is 34, then we know that none of the customers in that block has actually accessed their accounts in the last month (30 days). So this kind of optimization reduces I/O. Together, these two techniques increase performance.
Big Data SQL 3.0 goes one step further, because there’s another opportunity for delegation. Projects like ORC or Parquet, for example, are efficient columnar data stores on Hadoop. So if your data is there, Big Data SQL’s Smart Scan can delegate work to them, further increasing performance. This is the kind of optimization that the fastest SQL on Hadoop implementations do. Which is why we think that with Big Data SQL you can get performance that’s comparable to anything else that’s out there.
But remember, with Big Data SQL you can also use the SQL skills you already have (no need to learn a new dialect), your applications can access data in Hadoop and NoSQL using the same SQL they already use (don’t have to rewrite applications), and the security policies in Oracle Database can be applied to data in Hadoop and NoSQL (don’t have to write code to implement a different security policy). Hence the tagline: One Fast, Secure SQL Query on All Your Data.