How Graph Analytics Works: Six Degrees of Kevin Bacon

From a technical perspective, the term “graph analytics” means using a graph format to perform analysis of relationships between data based on strength and direction. That might be a bit hard to understand for the uninitiated, particularly when the traditional idea of data analysis brings up images of poring over spreadsheets—very big spreadsheets, when you’re looking at big data with petabytes or exabytes of data.

So let’s break down that definition piece by piece to offer some clarity. Assuming you know the general idea behind analytics, what is the difference when we add the word “graph” to it? Consider the general statement above:

“Using a graph format to perform analysis of relationships between data based on strength and direction.”

Drilling that statement down into individual pieces, we can look at segments to gain a greater understanding of the definition.

“Using a graph format”: Technically, a graph is a set of nodes (aka vertices or points) connected by edges (aka links or lines).

“Analysis of relationships”: Graph analytics excels at delivering insights from relationships. The visual nature of the method makes it much easier to identify unexpected relationships and derive insights more quickly than with, say, a tabular format. While you may be able to reach the same conclusion by analyzing a spreadsheet of data, a graph format can bring it about with far less effort. The phrase “a picture is worth a thousand words” applies here, and with computing tools designed to maximize the capabilities of graph analytics, insights can be reached much more efficiently.

“Based on strength and direction”: If you consider data points to be nodes in a graph, then the edges connecting those points define the relationships between them. Thus, the strength of a relationship can be derived from the density of the edge (as in, two points that share a dozen relationships have a denser edge than two points with a single connection), and further insight can come from the direction of the edge (the visual layout of nodes can translate into spatial data, where physical distance offers insight into the node).
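To make the vocabulary concrete, here is a minimal Python sketch of a graph in which direction is stored explicitly and strength is simply the number of parallel edges between two nodes. The class name `Graph`, its methods, and the node labels are illustrative assumptions, not part of any particular graph product.

```python
from collections import defaultdict


class Graph:
    """A tiny directed multigraph: parallel edges between two nodes are counted."""

    def __init__(self):
        # edge_counts[source][target] = number of edges from source to target
        self.edge_counts = defaultdict(lambda: defaultdict(int))

    def add_edge(self, source, target):
        self.edge_counts[source][target] += 1
        self.edge_counts[target]  # register the target as a node, even with no outgoing edges

    def nodes(self):
        return set(self.edge_counts)

    def strength(self, source, target):
        """Number of edges pointing from source to target (0 if there are none)."""
        return self.edge_counts.get(source, {}).get(target, 0)


g = Graph()
g.add_edge("A", "B")
g.add_edge("A", "B")  # a second relationship makes the A -> B edge "denser"
g.add_edge("B", "C")

print(g.strength("A", "B"))  # 2
print(g.strength("B", "A"))  # 0, because direction matters
```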


A Real-World Example of Graph Analytics

The type of insight provided by graph analytics doesn’t have to be a complex technical concept; in fact, one of the easiest ways to explain graph analytics is through a party game that pretty much everyone has played at one point or another: Six Degrees of Kevin Bacon.

If you’re one of the few people on the planet who’s never heard of it, the idea behind Six Degrees of Kevin Bacon came about in the 1990s based on the theory that every actor was connected to Kevin Bacon through six degrees of working relationships. When playing with friends, the challenge is to minimize the number of degrees connecting the actor to Kevin Bacon—functionally, this is the same thing as running a query via graph analytics.

Imagine every actor working in Hollywood as a node in a massive graph, with Kevin Bacon at the center of it. Edges are drawn for every film connection between actors. We want to run a query to find the relationships that connect Kevin Bacon to another random node (actor). For this example, let’s pick Pedro Pascal (Game of Thrones, The Mandalorian). Pascal’s lengthy list of high-profile work means that he shares the cast list with many other notable actors, creating nearly limitless paths for connecting to Kevin Bacon. However, the goal is to find the shortest path to Kevin Bacon.

To do this, we run a query that analyzes the various paths (or, if you’re doing this in a party setting, you just think really hard), ultimately generating this output:

  • Kevin Bacon (node) was in Crazy, Stupid, Love (edge) with Julianne Moore (node).
  • Julianne Moore (node) was in Kingsman: The Golden Circle (edge) with Pedro Pascal.


The website The Oracle of Bacon (no relation to Oracle, of course) is an online database project built around this game. The site’s database uses a graph algorithm known as breadth-first search, and in this instance the site would give Pedro Pascal a Bacon Number of two, because there are two edges between him and Kevin Bacon. Thus, the shortest path connecting Pedro Pascal to Kevin Bacon runs through Julianne Moore. That’s a pretty easy example, given Pedro Pascal’s current high profile. But what if we pick someone a little more obscure, someone whose career predates Kevin Bacon’s breakout period in the 1980s? For this example, let’s use Wendell Corey, a character actor who worked in the 1940s, ’50s, and ’60s.

If we were using an analytics tool, we would submit a query to search for the number of relationships between Kevin Bacon and Wendell Corey. This produces a Bacon Number of three:

  • Kevin Bacon (node) was in Animal House (edge) with Tim Matheson (node).
  • Tim Matheson (node) was in The Apple Dumpling Gang Rides Again (edge) with Audrey Totter (node).
  • Audrey Totter (node) was in Any Number Can Play (edge) with Wendell Corey (node).
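The two lookups above can be sketched as a breadth-first search in Python. This is only an illustration under stated assumptions: the graph contains just the five film connections mentioned in this article, and the function names `build_graph` and `shortest_path` are our own. It is not the code behind The Oracle of Bacon.

```python
from collections import deque

# Co-appearances taken from the examples in this article: (actor, actor, film).
CO_APPEARANCES = [
    ("Kevin Bacon", "Julianne Moore", "Crazy, Stupid, Love"),
    ("Julianne Moore", "Pedro Pascal", "Kingsman: The Golden Circle"),
    ("Kevin Bacon", "Tim Matheson", "Animal House"),
    ("Tim Matheson", "Audrey Totter", "The Apple Dumpling Gang Rides Again"),
    ("Audrey Totter", "Wendell Corey", "Any Number Can Play"),
]


def build_graph(co_appearances):
    """Actors are nodes; each shared film is an undirected edge."""
    graph = {}
    for actor_a, actor_b, film in co_appearances:
        graph.setdefault(actor_a, []).append((actor_b, film))
        graph.setdefault(actor_b, []).append((actor_a, film))
    return graph


def shortest_path(graph, start, goal):
    """Return a list of (actor, film, actor) hops, or None if no path exists."""
    if start == goal:
        return []
    previous = {start: None}  # node -> (node we came from, film used)
    queue = deque([start])
    while queue:
        actor = queue.popleft()
        for neighbor, film in graph.get(actor, []):
            if neighbor in previous:
                continue  # already reached by a path that is at least as short
            previous[neighbor] = (actor, film)
            if neighbor == goal:
                hops = []
                node = goal
                while previous[node] is not None:
                    came_from, via_film = previous[node]
                    hops.append((came_from, via_film, node))
                    node = came_from
                hops.reverse()
                return hops
            queue.append(neighbor)
    return None


graph = build_graph(CO_APPEARANCES)
to_pascal = shortest_path(graph, "Kevin Bacon", "Pedro Pascal")
to_corey = shortest_path(graph, "Kevin Bacon", "Wendell Corey")
print(len(to_pascal), len(to_corey))  # 2 3
```

Because breadth-first search explores all actors one edge away before any actor two edges away, the first time it reaches the target it has found a path with the fewest edges, which is exactly the Bacon Number.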


Using a breadth-first search, relationships can be found quickly and efficiently even between seemingly unconnected actors. How else can the Kevin Bacon example demonstrate elements of graph analytics? The principle behind Six Degrees is to minimize the number of edges between nodes. To try a different type of graph analysis, we can look at the strength of relationships. In this case, if the nodes are Kevin Bacon and other actors, we can assign a single edge for every film they appear in together. Thus, the strongest relationship between Kevin Bacon and another actor comes down to the number of edges between them, that is, the number of films their filmographies share.
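As a rough sketch of this strength calculation, the Python below counts one edge per shared film and then looks up an actor’s strongest relationship. The cast lists are entirely hypothetical placeholders (“Film 1”, “Actor A”, and so on), and the function names are our own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cast lists, used only to illustrate the counting.
CASTS = {
    "Film 1": ["Actor A", "Actor B", "Actor C"],
    "Film 2": ["Actor A", "Actor B"],
    "Film 3": ["Actor A", "Actor C", "Actor D"],
    "Film 4": ["Actor A", "Actor B"],
}


def edge_strengths(casts):
    """Count one edge per film for every pair of actors who share that film."""
    strengths = Counter()
    for cast in casts.values():
        for pair in combinations(sorted(set(cast)), 2):
            strengths[pair] += 1
    return strengths


def strongest_partner(strengths, actor):
    """Return (partner, shared_film_count) for the actor's strongest relationship."""
    partners = Counter()
    for (first, second), count in strengths.items():
        if first == actor:
            partners[second] += count
        elif second == actor:
            partners[first] += count
    return partners.most_common(1)[0] if partners else None


strengths = edge_strengths(CASTS)
print(strongest_partner(strengths, "Actor A"))  # ('Actor B', 3)
```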


Distance between nodes offers another dimension for analysis. For example, if we place Kevin Bacon at the center of the graph, the placement of each node (its coordinates) can be based on how recently that actor’s film with Kevin Bacon was made. In this case, Jill Hennessy co-starred in the series City on a Hill, which was made in 2019, so her node in this analysis would be placed immediately next to Kevin Bacon at the center. Queen Latifah appeared in the film Beauty Shop with Kevin Bacon in 2005, which would place her node farther out. With the nodes distributed this way, a more refined analysis could be run based on connections made in the last ten years.
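One way to picture this is to treat distance from the center as the number of years since the most recent shared title. The sketch below uses the two connections named above. The reference year of 2021 is an assumption chosen purely for illustration, and the function names are our own.

```python
# (co-star, title, year) connections to Kevin Bacon mentioned in this article.
CONNECTIONS = [
    ("Jill Hennessy", "City on a Hill", 2019),
    ("Queen Latifah", "Beauty Shop", 2005),
]


def distances_from_center(connections, reference_year):
    """Distance = years between the reference year and the most recent shared title."""
    latest = {}
    for actor, _title, year in connections:
        latest[actor] = max(year, latest.get(actor, year))
    return {actor: reference_year - year for actor, year in latest.items()}


def within_radius(distances, max_years):
    """Actors whose most recent connection falls inside the given window."""
    return sorted(actor for actor, gap in distances.items() if gap <= max_years)


REFERENCE_YEAR = 2021  # assumed "now" for this example
distances = distances_from_center(CONNECTIONS, REFERENCE_YEAR)
print(distances)                     # {'Jill Hennessy': 2, 'Queen Latifah': 16}
print(within_radius(distances, 10))  # ['Jill Hennessy']
```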


The context of the query can change as well if you take the focus off Kevin Bacon. Degree centrality compares a node’s volume of relationships with the largest volume of relationships in the graph (mathematically, a percentage calculated as the node’s number of relationships divided by the largest number of relationships held by any node). It can easily be determined using graph analytics, which shows us, by the way, that Kevin Bacon does not have the largest degree centrality among actors listed in IMDb. Consider the logistics of trying to calculate such a thing by analyzing data in a tabular format versus a graph format, where nodes and edges naturally take care of a significant amount of the prep work.
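A small Python sketch of this calculation follows. It uses the normalization described above (each node’s number of relationships divided by the largest number found in the graph); note that some graph libraries normalize by the number of other nodes instead. The edge list and node names are hypothetical.

```python
def degree_centrality(edges):
    """Map each node to its degree divided by the largest degree in the graph."""
    neighbors = {}
    for node_a, node_b in edges:
        neighbors.setdefault(node_a, set()).add(node_b)
        neighbors.setdefault(node_b, set()).add(node_a)
    degrees = {node: len(linked) for node, linked in neighbors.items()}
    largest = max(degrees.values(), default=0)
    if largest == 0:
        return {}
    return {node: degree / largest for node, degree in degrees.items()}


# Hypothetical graph: "hub" has worked with everyone, the others with fewer people.
EDGES = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"),
]

centrality = degree_centrality(EDGES)
print(centrality["hub"], centrality["a"], centrality["c"])  # 1.0 0.5 0.25
```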

Digging Deeper into Graph Analytics

Six Degrees of Kevin Bacon provides a fun and accessible example of graph analytics, but there’s much more to this method from both a functional and technical perspective. To learn more about the how, what, and why behind graph analytics, check out What Is Graph Analytics for an in-depth explanation.

For more about how you can benefit from Oracle Big Data, visit Oracle’s Big Data page—and don’t forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.


Find the Truth With Data: 5 Fraud Detection Use Cases

According to Ernst and Young, $8.2 billion a year is lost to the marketing, advertising, and media industries through fraudulent impressions, infringed content, and malvertising.

The combination of fake news, trolls, bots and money laundering is skewing the value of information and could be hurting your business.

It’s avoidable.

By using graph technology and the data you already have on hand, you can discover fraud through detectable patterns and stop their actions.

We collaborated with Sungpack Hong, Director of Research and Advanced Development at Oracle Labs to demonstrate five examples of real problems and how graph technology and data are being used to combat them.


But first, a refresher on graph technology.

What Is Graph Technology?

With graph technology, the basic premise is that you store, manage, and query data in the form of a graph. Your entities become vertices (as illustrated by the red dots). Your relationships become edges (as represented by the red lines).

What Is Graph Technology

By analyzing these fine-grained relationships, you can use graph analysis to detect anomalies with queries and algorithms. We’ll talk about these anomalies later in the article.

The major benefit of graph databases is that they’re naturally indexed by relationships, which provides faster access to data (as compared with a relational database). You can also add data without doing a lot of modeling in advance. These features make graph technology particularly useful for anomaly detection—which is mainly what we’ll be covering in this article for our fraud detection use cases.

How to Find Anomalies with Graph Technology

Gartner 5 Layers of Fraud Detection

If you take a look at Gartner’s 5 Layers of Fraud Protection, you can see that they break the analysis to discover fraud into two categories:

  • Discrete data analysis where you evaluate individual users, actions, and accounts
  • Connected analysis where relationships and integrated behaviors facilitate the fraud

It’s this second category based on connections, patterns, and behaviors that can really benefit from graph modeling and analysis.

Through connected analysis and graph technology, you would:

  • Combine and correlate enterprise information
  • Model the results as a connected graph
  • Apply link and social network analysis for discovery

Now we’ll discuss examples of ways companies can apply this to solve real business problems.

Fraud Detection Use Case #1: Finding Bot Accounts in Social Networks

In the world of social media, marketers want to see what they can discover from trends. For example:

  • If I’m selling this specific brand of shoes, how popular will they be? What are the trends in shoes?
  • If I compare this brand with a competing brand, how do the results mirror actual public opinion?
  • On social media, are people saying positive or negative things about me? About my competitors?

Of course, all of this information can be incredibly valuable. At the same time, it can mean nothing if it’s all inaccurate and skewed by how much other companies are willing to pay for bots.

In this case, we worked with Oracle Marketing Cloud to ensure the information they’re delivering to advertisers is as accurate as possible. We sought to find the fake bot accounts that are distorting popularity.

As an example, there are bots that retweet certain target accounts to make them look more popular.

To determine which accounts are “real,” we created a graph between accounts with retweet counts as the edge weights to see how many times these accounts are retweeting their neighboring accounts. We found that the unnaturally popularized accounts exhibit different characteristics from naturally popular accounts.

Here is the pattern for a naturally popular account:

Naturally Popular Social Media Account

And here is the pattern for an unnaturally popular account:

Unnaturally Popular Social Media Account

When all of these accounts are analyzed, certain accounts show an obviously unnatural deviation. And by using graphs and relationships, we can find even more bots by:

  • Finding accounts with a high retweet count
  • Inspecting how other accounts are retweeting them
  • Finding other accounts that get retweets only from these bots
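The three steps above can be sketched in Python as follows. This is a simplified illustration, not the analysis that was actually run: the account names, retweet counts, and both thresholds (`min_total` and `heavy_edge`) are hypothetical, and the rule “a few accounts retweeting very heavily looks unnatural” is an assumed stand-in for the patterns shown in the figures.

```python
from collections import defaultdict

# Hypothetical weighted edges: (retweeter, retweeted account, retweet count).
RETWEETS = [
    ("bot1", "promoted", 40), ("bot2", "promoted", 38), ("bot3", "promoted", 41),
    ("alice", "promoted", 1),
    ("bot1", "promoted2", 35), ("bot2", "promoted2", 37),
] + [(f"fan{i}", "celebrity", 2) for i in range(30)]


def inbound(retweets):
    """incoming[target][source] = total retweets of target by source."""
    incoming = defaultdict(dict)
    for source, target, count in retweets:
        incoming[target][source] = incoming[target].get(source, 0) + count
    return incoming


def suspected_bots(incoming, min_total, heavy_edge):
    """Steps 1 and 2: for highly retweeted accounts, flag unusually heavy retweeters."""
    bots = set()
    for sources in incoming.values():
        if sum(sources.values()) < min_total:
            continue  # not a highly retweeted account
        bots.update(src for src, count in sources.items() if count >= heavy_edge)
    return bots


def boosted_only_by(incoming, bots):
    """Step 3: accounts whose retweets come exclusively from suspected bots."""
    return sorted(
        target for target, sources in incoming.items()
        if sources and set(sources) <= bots
    )


incoming = inbound(RETWEETS)
bots = suspected_bots(incoming, min_total=50, heavy_edge=30)
print(sorted(bots))                     # ['bot1', 'bot2', 'bot3']
print(boosted_only_by(incoming, bots))  # ['promoted2']
```

In this toy data, “celebrity” also has a high retweet count, but it is spread across many light retweeters, so none of them is flagged.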

Fraud Detection Use Case #2: Identifying Sock Puppets in Social Media

In this case, we used graph technology to identify sock puppet accounts (online identities used for purposes of deception; here, different accounts posting the same set of messages) that were working to make certain topics or keywords look more important by making it seem as though they were trending.

Sock Puppet Accounts in Social Media

To discover the bots, we had to augment the graph from Use Case #1. Here we:

  • Added edges between the authors with the same messages
  • Counted the number of repeated messages and filtered to discount accidental unison
  • Applied heuristics to avoid n² edge generation per same message

Because the messages were always identical, we were able to create subgraphs using those edges and apply a connected components algorithm to them.
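Here is a hedged Python sketch of that idea. The heuristic shown for avoiding n² edges, linking each author to a shared message node rather than to every other author, is one possible choice and is not necessarily the one used in this project. The posts, account names, and thresholds are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (author, message) posts.
POSTS = [
    ("sock1", "Buy BrandX now!"), ("sock2", "Buy BrandX now!"), ("sock3", "Buy BrandX now!"),
    ("sock1", "BrandX is trending"), ("sock2", "BrandX is trending"), ("sock3", "BrandX is trending"),
    ("puppet_a", "Vote for Topic Y"), ("puppet_b", "Vote for Topic Y"),
    ("puppet_a", "Topic Y forever"), ("puppet_b", "Topic Y forever"),
    ("human1", "Buy BrandX now!"),    # accidental unison: only one repeated message
    ("human2", "Nice weather today"),
]


def sock_puppet_groups(posts, min_authors=2, min_shared=2):
    authors_by_message = defaultdict(set)
    for author, message in posts:
        authors_by_message[message].add(author)

    # Keep only messages posted by several distinct authors.
    duplicated = {m: a for m, a in authors_by_message.items() if len(a) >= min_authors}

    # Discount accidental unison: an author must repeat several duplicated messages.
    repeats = Counter()
    for authors in duplicated.values():
        repeats.update(authors)
    suspects = {author for author, n in repeats.items() if n >= min_shared}

    # Union-find over author and message nodes; one edge per (author, message)
    # instead of one edge per pair of authors.
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for message, authors in duplicated.items():
        for author in authors & suspects:
            parent[find(("author", author))] = find(("message", message))

    # Connected components, reported as groups of authors.
    groups = defaultdict(set)
    for node in list(parent):
        kind, name = node
        if kind == "author":
            groups[find(node)].add(name)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


print(sock_puppet_groups(POSTS))
# [['puppet_a', 'puppet_b'], ['sock1', 'sock2', 'sock3']]
```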

Sock Puppet Groups

As a result of all of the analysis that we ran on a small sampling, we discovered that what we thought were the most popular brands actually weren’t—our original list had been distorted by bots.

See the image below – the “new” most popular brands barely even appear on the “old” most popular brands list. But they are a much truer reflection of what’s actually popular. This is the information you need.

Brand Popularity Skewed by Bots

After one month, we revisited the identified bot accounts just to see what had happened to them. We discovered:

  • 89% were suspended
  • 2.2% were deleted
  • 8.8% were still serving as bots

Fraud Detection Use Case #3: Circular Payment

A common pattern in financial crimes, a circular money transfer essentially involves a criminal sending money to himself or herself but hiding it as a series of valid transfers between “normal” accounts. These “normal” accounts are actually fake accounts. They typically share certain information because they are generated from stolen identities (email addresses, addresses, etc.), and it’s this related information that makes graph analysis such a good fit to discover them.

For this use case, you can use graph representation by creating a graph from transactions between entities as well as from entities that share some information, including email addresses, passwords, addresses, and more. Once we create a graph out of it, all we have to do is write a simple query and run it to find all customers whose accounts have similar information and who are sending money to one another.
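As an illustration of what such a query does, the Python sketch below looks for transfer cycles and keeps only those in which neighboring accounts on the cycle share some piece of information. In a graph database this would typically be a pattern-matching query; the plain-Python version, the account data, and the function names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical transfers (sender, receiver) and account details.
TRANSFERS = [
    ("acct1", "acct2"), ("acct2", "acct3"), ("acct3", "acct1"),  # a circular payment
    ("acct6", "acct7"), ("acct7", "acct6"),                      # a cycle between unrelated accounts
    ("acct4", "acct5"),
]
PROFILES = {
    "acct1": {"email": "x@example.com", "address": "1 Main St"},
    "acct2": {"email": "y@example.com", "address": "1 Main St"},
    "acct3": {"email": "z@example.com", "address": "1 Main St"},
    "acct6": {"email": "p@example.com", "address": "5 Oak Ave"},
    "acct7": {"email": "q@example.com", "address": "8 Elm Rd"},
}


def find_cycles(transfers, max_length=5):
    """Return simple cycles, each listed once starting from its smallest account id."""
    graph = defaultdict(set)
    for sender, receiver in transfers:
        graph[sender].add(receiver)
    cycles = []

    def walk(start, node, path):
        for nxt in sorted(graph.get(node, ())):
            if nxt == start:
                cycles.append(path[:])
            elif nxt > start and nxt not in path and len(path) < max_length:
                walk(start, nxt, path + [nxt])

    for start in sorted(graph):
        walk(start, start, [start])
    return cycles


def shares_attribute(acct_a, acct_b, profiles):
    """True if the two accounts have the same value for at least one field."""
    a, b = profiles.get(acct_a, {}), profiles.get(acct_b, {})
    return any(key in b and b[key] == value for key, value in a.items())


def suspicious_cycles(transfers, profiles, max_length=5):
    flagged = []
    for cycle in find_cycles(transfers, max_length):
        pairs = zip(cycle, cycle[1:] + cycle[:1])  # consecutive accounts, wrapping around
        if all(shares_attribute(a, b, profiles) for a, b in pairs):
            flagged.append(cycle)
    return flagged


print(find_cycles(TRANSFERS))                  # [['acct1', 'acct2', 'acct3'], ['acct6', 'acct7']]
print(suspicious_cycles(TRANSFERS, PROFILES))  # [['acct1', 'acct2', 'acct3']]
```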

Circular Payments Graph Technology

Fraud Detection Use Case #4: VAT Fraud Detection

Because Europe has so many borders with different rules about who pays tax to which country when products are crossing borders, VAT (Value Added Tax) fraud detection can get very complicated.

In most cases, the importer should pay the VAT and if the products are exported to other countries, the exporter should receive a refund. But when there are other companies in between, deliberately obfuscating the process, it can get very complicated. The importing company delays paying the tax for weeks and months. The companies in the middle are paper companies. Eventually, the importing company vanishes and that company doesn’t pay VAT but is still able to get payment from the exporting company.

VAT Fraud Detection

This can be very difficult to decipher, but not with graph analysis. You can easily create a graph from the transactions that shows who the resellers are and who is creating the companies.

In this real-life analysis, Oracle Practice Manager Wojciech Wcislo looked at the flow and how the flow works to identify suspicious companies. He then used an algorithm in Oracle Spatial and Graph to identify the middle man.

The graph view of VAT fraud detection:

Graph View of VAT Fraud Detection

A more complex view:

Complex View of Graph Technology and Anomaly Detection

In that case, you would:

  • Identify importers and exporters via a simple query
  • Aggregate VAT invoice items as edge weights
  • Run the Fattest Path algorithm

You will then discover common “Middle Man” nodes where the flows are aggregated.
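To give a feel for these steps, here is a minimal Python sketch that aggregates invoice amounts into edge weights and then finds a fattest path, meaning the path whose smallest edge weight is as large as possible. It uses a widest-path variant of Dijkstra’s algorithm, which is one common way to compute such a path and is not a reproduction of the Oracle Spatial and Graph implementation. The company names and amounts are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical VAT invoice items: (seller, buyer, amount).
INVOICES = [
    ("importer", "shell_a", 100), ("importer", "shell_a", 50),
    ("shell_a", "shell_b", 140),
    ("shell_b", "exporter", 130),
    ("importer", "retailer", 20),
    ("retailer", "exporter", 200),
]


def aggregate_invoices(invoices):
    """Sum invoice amounts per (seller, buyer) pair to get edge weights."""
    graph = defaultdict(dict)
    for seller, buyer, amount in invoices:
        graph[seller][buyer] = graph[seller].get(buyer, 0) + amount
    return graph


def fattest_path(graph, source, target):
    """Return (path, width), where width is the smallest edge weight on the path."""
    best = {source: float("inf")}  # widest known bottleneck to each node
    previous = {}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for nxt, capacity in graph.get(node, {}).items():
            candidate = min(width, capacity)
            if candidate > best.get(nxt, 0):
                best[nxt] = candidate
                previous[nxt] = node
                heapq.heappush(heap, (-candidate, nxt))
    if target not in best:
        return None, 0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


graph = aggregate_invoices(INVOICES)
path, width = fattest_path(graph, "importer", "exporter")
print(path, width)  # ['importer', 'shell_a', 'shell_b', 'exporter'] 130
print(path[1:-1])   # candidate middle-man nodes: ['shell_a', 'shell_b']
```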

Fraud Detection Use Case #5: Money Laundering and Financial Fraud

Conceptually, money laundering is pretty simple. Dirty money is passed around to blend it with legitimate funds and then turned into hard assets. This was the kind of process discovered in the Panama Papers analysis.

These tax evasion schemes often rely on false resellers and brokers who are able to apply for tax refunds to avoid payment.

But graphs and graph databases provide relationship models. They let you apply pattern recognition, classification, statistical analysis, and machine learning to these models, which enables more efficient analysis at scale against massive amounts of data.

In this use case, we’ll look more specifically at Case Correlation. Whenever there are transactions that regulations dictate are suspicious, those transactions get a closer look from human investigators. The goal here is not to inspect each individual activity separately but rather to group these suspicious activities together through previously known connections.

Money Laundering and Financial Fraud

To find these correlations through a graph-based approach, we implemented this flow with general graph machinery, using a pattern-matching query (path finding) and a connected components graph algorithm (with filters).
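A simplified Python sketch of filtered connected components for case correlation is shown below. The flagged transactions, accounts, link types, and the `allowed_types` filter are all hypothetical. The point is only that changing the filter changes how suspicious activities are grouped into cases.

```python
from collections import defaultdict, deque

# Hypothetical data: flagged transaction -> account, and known links between accounts.
FLAGGED = {"tx1": "acct_a", "tx2": "acct_b", "tx3": "acct_c", "tx4": "acct_d"}
KNOWN_LINKS = [
    ("acct_a", "acct_b", "same_owner"),
    ("acct_b", "acct_c", "same_address"),
    ("acct_c", "acct_d", "same_bank"),
]


def correlate_cases(flagged, links, allowed_types):
    """Group flagged transactions whose accounts are connected by allowed link types."""
    neighbors = defaultdict(set)
    for acct_a, acct_b, link_type in links:
        if link_type in allowed_types:  # the filter: ignore links that do not count
            neighbors[acct_a].add(acct_b)
            neighbors[acct_b].add(acct_a)

    by_account = defaultdict(list)
    for transaction, account in flagged.items():
        by_account[account].append(transaction)

    seen = set()
    cases = []
    for account in sorted(by_account):
        if account in seen:
            continue
        seen.add(account)
        queue = deque([account])
        case = []
        while queue:  # breadth-first walk over one connected component
            current = queue.popleft()
            case.extend(by_account.get(current, []))
            for nxt in neighbors.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        cases.append(sorted(case))
    return cases


strict = correlate_cases(FLAGGED, KNOWN_LINKS, {"same_owner", "same_address"})
loose = correlate_cases(FLAGGED, KNOWN_LINKS, {"same_owner", "same_address", "same_bank"})
print(strict)  # [['tx1', 'tx2', 'tx3'], ['tx4']]
print(loose)   # [['tx1', 'tx2', 'tx3', 'tx4']]
```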

Through this method, this company didn’t have to create their own custom case correlation engine because they could use graph technology, which has improved flexibility. This flexibility is important because different countries have different rules.

Conclusion

In today’s world, the scammers are getting ever more inventive. But the technology is too. Graph technology is an excellent way to discover the truth in data, and it is a tool that’s rapidly becoming more popular. If you’d like to learn more, you can find white papers, software downloads, documentation and more on Oracle’s Big Data Spatial and Graph pages.

And if you’re ready to get started with exploring your data now, we offer a free guided trial that enables you to build and experiment with your own data lake.


How can I get speaker_labels from speech-to-text node?

I want to get speaker_labels from the speech-to-text node in Node-RED on Bluemix, but I can’t.

I placed the speech-to-text node in my flow and checked the “SpeakerLabels” checkbox on the node.

I chose the Japanese Broadband model.

However, msg.fullresult didn’t contain “speaker_labels”.

What should I do to get speaker_labels from the speech-to-text node’s output?
