Internet of Things and Big Data – Better Together

By: Peter Jeffcock

Big Data Product Marketing

What's the difference between the Internet of Things and Big Data? That's not really the best question to ask, because the two are much more alike than they are different. They also complement each other strongly, which is one reason we've written a white paper on their convergence.

Big data is all about enabling organizations to use more of the data around them: things customers write in social media; log files from applications and processes; sensor and device data. And that's where IoT comes in: one way to think of it is as one more source of big data.

But IoT is more than that. It's about collecting all that data, analyzing it in real time for events or patterns of interest, and making sure any new insight is integrated into the rest of your business. When you add the rest of big data to IoT, there's much more data to work with and powerful big data analytics to generate additional insights.

It's best to look at an example. Using IoT you can track and monitor assets like trucks, engines, HVAC systems, and pumps, and correct problems as you detect them. With big data, you can analyze all the information you have about failures and start to uncover the root causes. Combine the two and you can do more than react to problems as they occur: you can predict them and fix them before they happen. You go from being reactive to being proactive.
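To make that concrete, here's a minimal sketch of the kind of query you might run once sensor readings (landed from IoT devices) and failure records sit side by side. The table and column names are purely illustrative.

    -- Hypothetical tables: sensor_readings holds IoT telemetry,
    -- failure_history holds the warehouse record of past breakdowns.
    -- Flag assets whose average vibration over the last seven days
    -- exceeds the average level seen in the week before past failures.
    SELECT s.asset_id,
           AVG(s.vibration) AS recent_vibration
    FROM   sensor_readings s
    WHERE  s.reading_time > SYSDATE - 7
    GROUP  BY s.asset_id
    HAVING AVG(s.vibration) > (SELECT AVG(f.vibration_before_failure)
                               FROM   failure_history f);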

Check out this infographic. The last data point, down at the bottom right-hand side, may be the most important one: only 8% of businesses are fully capturing and analyzing IoT data in a timely fashion.

Nobody likes to arrive last to a party and find the food and drink all gone. This party's just getting started. You should be asking every vendor you deal with how they can help you take advantage of IoT and big data; they really are better together, and there's plenty of opportunity. The next post will highlight three customers who are taking advantage of that opportunity.

Related:

DIY Hadoop: Proceed At Your Own Risk

Could your security and performance be in jeopardy?

Nearly half (3.2 billion, or 45%) of the seven billion people in the world used the Internet in 2015, according to a BBC news report. If you think all those people generate a huge amount of data (in the form of website visits, clicks, likes, tweets, photos, online transactions, and blog posts), wait for the data explosion that will happen when the Internet of Things (IoT) meets the Internet of People. Gartner, Inc. forecast that twice as many Internet-connected gadgets (6.4 billion of them, everything from light bulbs to baby diapers to connected cars) would be in use worldwide in 2016, up 30 percent from 2015, and that the number will exceed 20 billion by 2020.

Companies of all sizes and in virtually every industry are struggling to manage the exploding amounts of data. To cope with the problem, many organizations are turning to solutions based on Apache Hadoop, the popular open-source software framework for storing and processing massive datasets. But purchasing, deploying, configuring, and fine-tuning a do-it-yourself (DIY) Hadoop cluster to work with your existing infrastructure can be much more challenging than many organizations expect, even if your company has the specialized skills needed to tackle the job.

But as both business and IT executives know all too well, managing big data involves far more than just dealing with storage and retrieval challenges; it requires addressing a variety of privacy and security issues as well. Beyond the brand damage that companies like Sony and Target have experienced in the last few years from data breaches, there's also the likelihood that companies that fail to secure the life cycle of their big data environments will face regulatory consequences. Early last year, the Federal Trade Commission released a report on the Internet of Things that contains guidelines to promote consumer privacy and security. The FTC's document, Careful Connections: Building Security in the Internet of Things, encourages companies to implement a risk-based approach and take advantage of best practices developed by security experts, such as using strong encryption and proper authentication.

While not calling for new legislation (due to the speed of innovation in the IoT space), the FTC report states that businesses and law enforcers have a shared interest in ensuring that consumers’ expectations about the security of IoT products are met. The report recommends several “time-tested” security best practices for companies processing IoT data, such as:

  • Implementing “security by design” by building security into your products and services at the outset of your planning process, rather than grafting it on as an afterthought.
  • Implementing a defense-in-depth approach that incorporates security measures at several levels (one such layer is sketched below).
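To make the layering idea concrete, here's a minimal sketch of one such layer: encryption at rest in Oracle Database using an encrypted tablespace. It assumes Transparent Data Encryption is licensed and a keystore is already configured; the tablespace name and file path are illustrative.

    -- One layer of a defense-in-depth design: encrypt data at rest.
    -- Assumes a TDE keystore (wallet) has already been created and opened.
    CREATE TABLESPACE secure_iot_data
      DATAFILE '/u01/app/oracle/oradata/secure_iot01.dbf' SIZE 500M
      ENCRYPTION USING 'AES256'
      DEFAULT STORAGE (ENCRYPT);

    -- Further layers (authentication, auditing, network encryption)
    -- would sit on top of this; encryption alone is not enough.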

Business and IT executives who try to follow the FTC's big data security recommendations are likely to run into roadblocks, especially if they're trying to integrate Hadoop with their existing IT infrastructure. The main problem with Hadoop is that it wasn't originally built with security in mind; it was developed solely to address massive distributed data storage and fast processing, which leads to the following threats:

  • DIY Hadoop. A do-it-yourself Hadoop cluster presents inherent risks, especially since it's often developed without adequate security by a small group of people in a laboratory-type setting, closed off from a production environment. As a cluster grows from small project to advanced enterprise Hadoop, every aspect of that growth (patching, tuning, verifying versions across Hadoop modules, OS libraries, utilities, user management, and so forth) becomes more difficult and time-consuming.
  • Unauthorized access. Built under the principle of "data democratization" (so that all data is accessible by all users of the cluster), Hadoop has had trouble meeting compliance standards such as the Health Insurance Portability and Accountability Act (HIPAA) and the Payment Card Industry Data Security Standard (PCI DSS). That's due to the lack of access controls on data, including password controls, file and database authorization, and auditing.
  • Data provenance. With open source Hadoop, it has been difficult to determine where a particular dataset originated and what data sources it was derived from. That means you can end up basing critical business decisions on analytics drawn from suspect or compromised data.

2X Faster Performance than DIY Hadoop

In his keynote at Oracle OpenWorld 2015, Intel CEO Brian Krzanich described work Intel has been doing with Oracle to build high-performing datacenters using the pre-built Oracle Big Data Appliance, an integrated, optimized solution powered by the Intel Xeon processor family. Specifically, he referred to recent benchmark testing by Intel engineers showing that an Oracle Big Data Appliance solution with some basic tuning achieved nearly two times better performance than a DIY cluster built on comparable hardware.

Not only is it faster, it is also designed to meet the security needs of the enterprise. Oracle Big Data Appliance automates the steps required to deploy a secure cluster, including complex tasks like setting up authentication, data authorization, encryption, and auditing. This dramatically reduces the time required to both set up and maintain a secure infrastructure.

Do-it-yourself (DIY) Apache Hadoop clusters are appealing to many business and IT executives because of the apparent cost savings from using commodity hardware and free software distributions. As I've shown, despite the initial savings, DIY Hadoop clusters are not always a good option for organizations looking to get up to speed on an enterprise big data solution, from both a security and a performance standpoint.

Find out how your company can move to an enterprise Big Data architecture with Oracle’s Big Data Platform at https://www.oracle.com/big-data.

Related:

The Surprising Economics of Engineered Systems

By: Peter Jeffcock

Big Data Product Marketing

The title's not mine. It comes from a video done for us by ESG, based on their white paper, which looks at the TCO of building your own Hadoop cluster vs. buying one ready-built (Oracle Big Data Appliance). You should watch or read, depending on your preference, or even just check out the infographic. The conclusion could be summed up as "better, faster, cheaper, pick all three", which is not what you'd expect. But they found that it's better (quicker to deploy, lower risk, easier to support), faster (2X to 3X faster than a comparable DIY cluster) and cheaper (45% cheaper if you go with list pricing).

So while you may not think that an engineered system like the Big Data Appliance is the right system for you, it should always be on your shortlist. Compare it with building your own – you’ll probably be pleasantly surprised.

The paper in particular has a lot more background, but let me highlight a few things:

– We have seen some instances where other vendors offer huge discounts and actually beat the BDA price. If you see this, check two things. First, will that discount be available for all future purchases, or is it just a one-off? And second, remember to include the cost you'll incur to set up, manage, maintain, and patch the system.

– Consider performance. We worked with Intel to tune Hadoop for this specific configuration. There are something like 500 different parameters in Hadoop that can affect performance one way or the other. That tuning project was a multi-week exercise involving several different experts. The end result was performance nearly 2X, and sometimes up to 3X, faster than a comparable untuned DIY cluster. Do you have the resources and expertise to replicate this effort? Would a doubling of performance be useful to you?

– Finally, consider support. A Hadoop cluster is a complex system, and sometimes problems arise from the interaction of multiple components. Those can be really hard to figure out, particularly when multiple vendors are involved for different pieces. When no single component is "at fault", it's hard to find somebody to fix the overall system. You'd never buy a computer with four separate support contracts for operating system, CPU, disk, and network card; you'd want one contract for the entire system. The same holds for your Hadoop clusters.

Related:

Predictions for Big Data Security in 2016

Leading into 2016, Oracle made ten big data predictions, including one specifically about security. We are nearly four months into the year, and we've already seen these predictions coming to light.

Increase in regulatory protections of personal information

Early February saw the creation of the Federal Privacy Council, “which will bring together the privacy officials from across the Government to help ensure the implementation of more strategic and comprehensive Federal privacy guidelines. Like cyber security, privacy must be effectively and continuously addressed as our nation embraces new technologies, promotes innovation, reaps the benefits of big data and defends against evolving threats.”

The European Union General Data Protection Regulation is a reform of the EU's 1995 data protection rules (Directive 95/46/EC). The EU's big data fact sheet was put forth to help promote the new regulation: "A plethora of market surveys and studies show that the success of providers to develop new services and products using big data is linked to their capacity to build and maintain consumer trust." As for the timeline, the EU expects adoption in spring 2016, with enforcement beginning two years later, in spring 2018.

Earlier this month, the Federal Communications Commission announced a proposal to restrict Internet providers’ ability to share the information they collect about what their customers do online with advertisers and other third parties.

Increased use of classification systems that categorize data into groups with pre-defined policies for access, redaction and masking.

An Infosecurity Magazine article highlights the challenge of data growth and the requirement for classification: "As storage costs dropped, the attention previously shown towards deleting old or unnecessary data has faded. However, unstructured data now makes up 80% of non-tangible assets, and data growth is exploding. IT security teams are now tasked with protecting everything forever, but there is simply too much to protect effectively – especially when some of it is not worth protecting at all."

The three benefits of classification highlighted include the ability to raise security awareness, prevent data loss, and address records management regulations. All of these are legitimate benefits of data classification that organizations should consider. Case in point: Oracle customer Union Investment increased agility and security by automatically processing investment fund data within its proprietary application, including complex asset classification with up to 500 data fields that were previously distributed to IT staff in spreadsheets.
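To show what a pre-defined redaction policy can look like in practice, here's a minimal sketch using Oracle Data Redaction (part of Oracle Advanced Security). The schema, table, and column names are hypothetical.

    -- Hypothetical objects: redact the tax ID column for ordinary queries.
    -- Data classified as "restricted" gets a policy like this attached to it.
    BEGIN
      DBMS_REDACT.ADD_POLICY(
        object_schema => 'FUNDS',
        object_name   => 'INVESTOR_ACCOUNTS',
        column_name   => 'TAX_ID',
        policy_name   => 'redact_tax_id',
        function_type => DBMS_REDACT.FULL,   -- replace the value entirely
        expression    => '1=1');             -- apply to every query
    END;
    /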

Continuous cyber-threats will prompt companies both to tighten security and to audit access to and use of data.

This is something of a no-brainer. We know more breaches are coming, such as here, here and here. And we know companies increase security spending after they experience a data breach or witness one close to home. Most organizations now know that completely eliminating the possibility of a data breach is impossible; appropriate detective capabilities are therefore more important than ever. We must act as if the bad guys are already on our network, detect their presence, and respond accordingly.
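As a small illustration of what auditing access and use of data can mean in Oracle Database, here's a sketch of a unified audit policy (available in Oracle Database 12c and later); the schema and table names are invented.

    -- Record reads and changes to a sensitive billing table.
    CREATE AUDIT POLICY billing_access_pol
      ACTIONS SELECT ON finance.billing,
              UPDATE ON finance.billing,
              DELETE ON finance.billing;

    -- Switch the policy on for all users.
    AUDIT POLICY billing_access_pol;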

See the rest of the Enterprise Big Data Predictions, 2016.

Image Source: http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Related:

Accelerating SQL Queries that Span Hadoop and Oracle Database

By: Peter Jeffcock

Big Data Product Marketing

It's hard to deliver "one fast, secure SQL query on all your data". Look around and you'll find plenty of "SQL on Hadoop" implementations that are unaware of data that's not in Hadoop. And then there are solutions that combine the results of two different SQL queries, written in two different dialects and run mostly independently on two different platforms. They may work, but the person writing the SQL is effectively responsible for optimizing that joint query and implementing its parts in two different dialects. Even if you get the parts right, the end result is more I/O, more data movement, and lower performance.

Big Data SQL is different in several ways. (Start with this blog to get the details.) From the viewpoint of the user you get one single query, in a modern, fully functional dialect of SQL. The data can be located in multiple places (Hadoop, NoSQL databases, and Oracle Database), and software, not a human, does all the planning and optimization needed to accelerate performance.
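To give a feel for it, a single statement might join customer data in Oracle Database with click data stored in Hadoop and exposed as an external table; the table and column names below are invented for illustration.

    -- customers is an ordinary Oracle table; web_clicks is a Big Data SQL
    -- external table over data in Hadoop. The query treats them identically.
    SELECT c.cust_id,
           c.last_payment_date,
           COUNT(*) AS visits_last_30_days
    FROM   customers c
    JOIN   web_clicks w ON w.cust_id = c.cust_id
    WHERE  w.click_time > SYSDATE - 30
    GROUP  BY c.cust_id, c.last_payment_date;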

Under the covers, one of the key things it tries to do is minimize I/O and minimize data movement so that queries run faster. It does that by trying to push down as much processing as possible to where the data is located. Big Data SQL 3.0 completes that task: now all the processing that can be pushed down, is pushed down. I’ll give an example in the next post.

What this means is cross-platform queries that are as easy to write, and as performant, as a query written for just one platform. Big Data SQL 3.0 further improves the "fast" part of "one fast, secure SQL query on all your data". We'd encourage you to test it against anything else out there, whether it's a true cross-platform solution or something that runs on just one platform.

Related:

Delegation and (Data) Management

By: Peter Jeffcock

Big Data Product Marketing

Every business book you read talks about delegation. It’s a core requirement for successful managers: surround yourself with good people, delegate authority and responsibility to them, and get out of their way. It turns out that this is a guiding principle for Big Data SQL as well. I’ll show you how. And without resorting to code. (If you want code examples, start here).

Imagine a not uncommon situation where you have customer data about payments and billing in your data warehouse, while data derived from log files about customer access to your online platform is stored in Hadoop. Perhaps you’d like to see if customers who access their accounts online are any better at paying up when their bills come due. To do this, you might want to start by determining who is behind on payments, but has accessed their account online in the last month. This means you need to query both your data warehouse and Hadoop together.

Big Data SQL uses enhanced Oracle external tables for accessing data in other platforms like Hadoop. So your cross-platform query looks like a query on two tables in Oracle Database. This is important, because it means from the viewpoint of the user (or application) generating the SQL, there’s no practical difference between data in Oracle Database, and data in Hadoop.
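For readers who do want a peek at the plumbing, a Big Data SQL external table over a hypothetical Hive table of online-access data might look roughly like this; the names, columns, and access parameters are illustrative, not taken from a real system.

    -- Illustrative sketch: expose a Hive table to Oracle Database through
    -- the ORACLE_HIVE access driver so it can be queried like any table.
    CREATE TABLE customer_access_ext (
      cust_id           NUMBER,
      days_since_access NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (
        com.oracle.bigdata.tablename=weblogs.customer_access
      )
    )
    REJECT LIMIT UNLIMITED;

    -- Customers who are behind on payments but accessed their account
    -- online in the last month; billing is a hypothetical warehouse table.
    SELECT b.cust_id
    FROM   billing b
    JOIN   customer_access_ext a ON a.cust_id = b.cust_id
    WHERE  b.days_overdue > 0
      AND  a.days_since_access <= 30;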

But under the covers there are differences, because some of the data is on a remote platform. How you process that data to minimize both data movement and I/O is key to maximizing performance.

Big Data SQL delegates work to Smart Scan software that runs on Hadoop (derived from Exadata’s Smart Scan software). Smart Scan on Hadoop does its own local scan, returning only the rows and columns that are required to complete that query, thus reducing data movement, potentially quite dramatically. And using storage indexing, we can avoid some unnecessary I/O as well. For example, if we’ve indexed a data block and know that the minimum value of “days since accessed accounts online” is 34, then we know that none of the customers in that block has actually accessed their accounts in the last month (30 days). So this kind of optimization reduces I/O. Together, these two techniques increase performance.

Big Data SQL 3.0 goes one step further, because there's another opportunity for delegation. ORC and Parquet, for example, are efficient columnar storage formats on Hadoop. So if your data is stored there, Big Data SQL's Smart Scan can delegate work to them, further increasing performance. This is the kind of optimization that the fastest SQL-on-Hadoop implementations do, which is why we think that with Big Data SQL you can get performance comparable to anything else out there.

But remember, with Big Data SQL you can also use the SQL skills you already have (no need to learn a new dialect), your applications can access data in Hadoop and NoSQL using the same SQL they already use (don’t have to rewrite applications), and the security policies in Oracle Database can be applied to data in Hadoop and NoSQL (don’t have to write code to implement a different security policy). Hence the tagline: One Fast, Secure SQL Query on All Your Data.
