What Is Fraud Analytics?

Overview

Fraud analytics is the practice of leveraging data science techniques and analysis to assist in detecting potential fraud, either before a transaction is completed or after it has occurred.

Fraud detection systems combine data analytics, statistics, machine learning and artificial intelligence to identify fraud risk factors, predict fraudulent transactions and send fraud alerts in real time. These systems can save companies hundreds of millions of dollars each year.

As an example, the insurance provider Highmark Inc. secured an estimated $245 million in savings in 2021 thanks to its fraud analytics and detection system.

Traditionally, simple rule-based systems were used to identify fraud, particularly in the banking, finance and insurance industries. Since those early days, data science has revolutionized fraud analytics, increasing the precision and speed with which companies identify and respond to fraud. Fraud detection is now used in nearly every industry, from governmental agencies to retail and media companies.

Some of the most common use cases for fraud analytics include detecting:

  • Fraudulent transactions, scams, and bribery.
  • Platform abuse and spam.
  • Card Not Present (CNP) transactions.
  • Invalid clicks on ad platforms.
  • Phishing scams on messaging / social media platforms.
  • Account takeovers or SIM swapping (when stolen user data is used to gain access to an account).
  • Malware attacks and “Man in the Middle” attacks.

How Do You Use Data Analytics to Detect Fraud?

When a customer tries to process a transaction, data can help determine whether the transaction is legitimate. Useful data points include device type, location or country of origin, account verification status, and whether the requested transaction fits the customer's normal behavior patterns.

This authentication data falls into four categories:

  • Knowledge - Data that the customer must know, e.g. a password, SSN, or account number.
  • Possession - Data that can be input via something that the user possesses, like a mobile phone or authenticated device.
  • Inherence - Physical data that the user inputs, like a fingerprint or face ID.
  • Behavioral - Data about the actions that the user is taking, like the type of transaction or details about the transaction.

Types of Fraud Detection Systems

Fraud analytics systems leverage these data types using data science, automated rules, and/or machine learning to determine if a transaction is legitimate. The three most common types of fraud detection systems are rule-based, supervised learning, and unsupervised learning models.

1. Rule-based fraud detection - Rule-based systems have been used for more than twenty years, and are still widely deployed. These systems flag potential fraud when account activity breaks a predefined rule. For example, if the majority of a user's transactions occur in San Francisco, but transactions are suddenly being processed in Germany, a rule-based system would be triggered.

2. Supervised classifications - This type of fraud detection system leverages algorithmic learning. An algorithm is trained to detect fraud based on company-wide historical data and previous instances of fraud. Supervised learning models have greatly improved the precision of fraud detection systems.

3. Unsupervised classifications - Unsupervised models are traditionally used to uncover fraudulent tactics by clustering unlabeled data. Hidden relationships in the data can be detected, and as a result, new or emerging fraud tactics can be discovered.

Rule-based systems offer distinct advantages, as they are easier to deploy, well documented and cost-effective. However, the rule sets for these systems have grown extremely complex and, at the same time, often fail to adapt to hidden threats. As a final handicap, they tend to produce a high number of false positives and/or missed fraudulent transactions, because they cannot look beyond the binary triggers of the rules in place.
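The location rule from the San Francisco example can be sketched in a few lines of Python. This is only an illustration under assumed names: the function `is_unusual_location`, the `min_share` threshold, and the list-of-strings history format are hypothetical and do not come from any particular fraud product.

```python
from collections import Counter


def is_unusual_location(history, new_location, min_share=0.5):
    """Return True when new_location differs from the user's dominant location.

    history: list of location strings from past transactions (assumed layout).
    min_share: fraction of past transactions one location needs in order to
    count as the user's usual location (an illustrative threshold).
    """
    if not history:
        return False  # no baseline yet, so this rule stays silent
    home, count = Counter(history).most_common(1)[0]
    if count / len(history) < min_share:
        return False  # no single dominant location to compare against
    return new_location != home


history = ["San Francisco"] * 9 + ["Oakland"]
print(is_unusual_location(history, "Germany"))        # True
print(is_unusual_location(history, "San Francisco"))  # False
```

The rule is a binary trigger: a legitimate trip to Germany would be flagged in exactly the same way as a stolen card, which is one source of the false positives mentioned above.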

Machine learning models (both supervised and unsupervised) provide the ability to analyze data at scale, and as a result, they have increased precision. Unsupervised models in particular can adapt to and identify changing threats on an ongoing basis, finding associations that human monitors would not connect.

What Jobs Require Fraud Analytics Experience?

Fraud analysts are data analysts who specialize in identifying suspicious activity in customer transactions. Typically, fraud analysts are responsible for investigating theft and fraud; performing risk management; developing, deploying and maintaining fraud detection systems; and conducting fraud market research.

A key responsibility for a fraud analyst is maintaining fraud analytics models and databases, as any downtime can cause security issues for businesses and customers.

Fraud analysts are employed in a wide range of industries, though finance and banking, insurance, government and retail tend to employ the most. Many job roles and titles fall under the umbrella of fraud analyst, including:

  • Credit card fraud analyst
  • Fraud data scientist
  • Fraud risk analyst
  • Fraud intelligence analyst
  • Data quality analyst

Traditionally, fraud analysts have extensive experience in data science, statistics, and machine learning, though some analysts do start in finance and/or accounting. The majority of fraud analytics roles require a bachelor’s degree, as well as 1-4 years of data analytics and/or fraud experience.

Fraud Analytics Case Study

For risk analyst roles, fraud analytics case studies are used to test a candidate’s knowledge of fraud detection. Fraud case study questions include specific information about the case, and the candidate must then use the provided information to propose a solution to the problem.

Here is an example of a fraud analytics case study question:

1. How would you build a model to detect fraud on a banking platform, and then send the customer an approve/deny text message when fraud is detected?

When answering a question like this, you should start with clarifying questions to the interviewer like:

  • How accurate is our data? Is all of the data labeled carefully? How much fraud are we not detecting if customers do not even know they’re being defrauded?
  • How much do we care about interpretability? A highly accurate model may not be the best choice if we learn nothing from it. If our customers are being compromised without us even knowing, an opaque model gives us nothing to learn from and nothing to guide future feature engineering.
  • What are the costs of misclassification? If we look at precision versus recall, we can understand which metrics we care about given the business problem at hand.

Next, make assumptions about the case study. We can assume that low recall would be a disaster in a fraud scenario. If the model produces many false negatives, fraudulent purchases would go under the radar, with consumers not even knowing they were being defrauded. This could cost the bank thousands of dollars in lost revenue, given that it would have to refund the cost to the consumer, plus the potential reputational risk.

Meanwhile, if precision were low, customers would think their accounts were being defrauded all the time. Valid transactions would keep being flagged as fraudulent, and customers would continue to get text messages until they switched to another bank. This is an annoying situation when a customer knows that the transaction is valid.

Since the question specifies a text message that lets the customer approve or deny the transaction, a false positive costs only a quick confirmation. It might therefore make sense to optimize for recall in order to minimize risk and avoid costly fraudulent charges.
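To make the precision versus recall trade-off concrete, here is a minimal sketch that computes both metrics for the fraud class. The labels are made up for illustration, and the helper name `precision_recall` is an assumption rather than part of any library.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (fraud = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 1 = fraud, 0 = legitimate
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```

In this toy example, a recall of 0.75 means one in four fraudulent transactions slips through, while a precision of 0.6 means two of every five alerts would be unnecessary text messages.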

Finally, ask yourself this question: What model works well on an imbalanced dataset? Generally, tree models come to mind.

More Fraud Analytics Case Studies

Here are more fraud case studies you can use to prepare for fraud and risk analyst interviews. These questions include credit card fraud modeling, platform abuse case studies and anomaly detection cases.

2. You are given a dataset of 600,000 credit card transactions. Use that dataset to build a fraud detection model.

To start, we would need to analyze the dataset to get a clearer picture:

  • Exactly how frequently do fraudulent transactions occur?
  • How are we getting the fraud data?
  • Are users reporting fraud that support staff then determine is not fraud? What about vice versa?

Hint: After reviewing these credit card transactions, the most important step is deciding how to label which data points are fraudulent, since that label is our response variable. Once we can identify fraud with high confidence, we can extract features and build a model.

3. If given a univariate dataset, how would you design a function to detect anomalies? What if the data is bivariate?

This type of question gets asked early in interviews to determine your confidence with anomaly detection, which is widely used in fraud detection and risk mitigation.

To answer this question, first provide a definition of a univariate dataset. Univariate means one variable. For example, the list below gives the travel time in hours from your city to 10 other cities:

12, 27, 11, 41, 35, 22, 18, 43, 26, 10

This kind of single-column dataset is called a univariate dataset. Anomaly detection is a way to discover unexpected values in datasets. An anomaly is a data point that differs markedly from the normal data. For example, in the dataset below, one data point is obviously far higher than the rest:

12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10

With that visual, how would you design a function to flag that value?
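One possible answer, sketched below, is a modified z-score built on the median and the median absolute deviation (MAD). This is only one of several reasonable approaches. It is used here because a single extreme value can inflate the mean and standard deviation enough to mask itself. The function name `find_anomalies` and the cutoff of 3.5 (a commonly quoted rule of thumb) are assumptions for illustration.

```python
from statistics import median


def find_anomalies(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to scale by, so this sketch flags nothing
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


travel_hours = [12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10]
print(find_anomalies(travel_hours))  # [76767676]
```

For bivariate data, the same check could be applied to each column separately, although that would miss points that are only unusual in combination.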

4. Write a SQL query to identify potentially fraudulent ATM transactions.

Note: SQL questions are common in fraud analytics interviews. These questions test your ability to pull metrics that can help solve fraud detection problems.

In this case study question, you are informed about an ATM robbery at a bank. Some unauthorized withdrawals were made, and you are asked to investigate these transactions.

However, the only information you have to begin with is that there was more than one withdrawal, that they were all performed in 10-second gaps, and that no legitimate transactions were performed between two fraudulent withdrawals.

For the query, you should retrieve all user IDs in ascending order whose transactions have exactly a 10-second gap.

Hint: Since we need to identify users making transactions that occur exactly ten seconds apart, it is useful to first order the transactions by created_at.
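The hint can be sketched with Python's built-in `sqlite3` module. The table name `bank_transactions`, its columns other than `created_at`, and the sample rows are assumptions, since the actual schema is not given here. The query also relies on SQLite window functions (`LAG` and `LEAD`), which require SQLite 3.25 or later.

```python
import sqlite3

# Hypothetical schema and rows: the real table layout is not specified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bank_transactions (user_id INTEGER, created_at TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO bank_transactions VALUES (?, ?, ?)",
    [
        (7, "2024-01-01 08:59:00", 40.0),
        (9, "2024-01-01 09:00:10", 200.0),
        (3, "2024-01-01 09:00:20", 200.0),
        (4, "2024-01-01 09:00:30", 200.0),
        (5, "2024-01-01 09:03:00", 60.0),
    ],
)

QUERY = """
WITH stamped AS (
    -- Convert timestamps to epoch seconds so gaps are simple subtractions
    SELECT user_id, CAST(strftime('%s', created_at) AS INTEGER) AS ts
    FROM bank_transactions
),
gaps AS (
    -- Order all transactions by time and measure the gap on each side
    SELECT
        user_id,
        ts - LAG(ts) OVER (ORDER BY ts) AS gap_before,
        LEAD(ts) OVER (ORDER BY ts) - ts AS gap_after
    FROM stamped
)
SELECT DISTINCT user_id
FROM gaps
WHERE gap_before = 10 OR gap_after = 10
ORDER BY user_id;
"""

suspicious = [row[0] for row in conn.execute(QUERY)]
print(suspicious)  # [3, 4, 9]
```

Checking the gap on both sides of each row ensures that the first and last withdrawals in a run are both returned, not just the ones in the middle.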

5. Given a dataset, how would you determine if users are creating multiple accounts to upvote their own comments?

More context: You are provided three tables representing forum users.

From here, answer these questions: What metrics would you use to investigate this problem? How would you write a query to represent the percentage of users who are acting fraudulently?

One metric that could provide insight is the number of users who have upvoted the same account multiple times but have never commented themselves.
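As a rough sketch of that metric, the Python below counts how often each voter upvotes each author and flags voters who repeatedly upvote one author without ever commenting. The three tables are not described here, so the `users`, `comments`, and `upvotes` structures, their columns, and the `min_votes` cutoff are all hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical data standing in for the three forum tables.
users = [1, 2, 3, 4, 5]
comments = [(10, 1), (11, 1), (12, 1), (13, 2)]  # (comment_id, author_id)
upvotes = [                                       # (voter_id, comment_id)
    (3, 10), (3, 11), (3, 12),  # user 3 only ever upvotes user 1
    (2, 10),
    (4, 13),
]


def suspicious_voters(comments, upvotes, min_votes=2):
    """Return voters who upvoted one author at least min_votes times
    and have never written a comment themselves."""
    author_of = dict(comments)
    commenters = set(author_of.values())
    pair_counts = Counter((voter, author_of[cid]) for voter, cid in upvotes)
    flagged = {
        voter
        for (voter, _author), n in pair_counts.items()
        if n >= min_votes and voter not in commenters
    }
    return sorted(flagged)


flagged = suspicious_voters(comments, upvotes)
pct = 100 * len(flagged) / len(users)
print(flagged, pct)  # [3] 20.0
```

A flagged account is only a lead for investigation. Some genuine users never comment and simply like one author's posts, so this metric would likely need to be combined with other signals.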