Facing fraudsters who “don’t play by the rules”, here is how we deal with them

Time:2021-9-19


“I, XXX, pay…” Have you ever received a message like this?

In fact, fraud that exploits information technology has a far longer history than you might imagine. Even before the Internet was born, the “Nigerian Prince” scam, carried out through paper letters and faxes, had spread all over the world. Today, online fraud based on social engineering arrives through many channels and in ever more varied forms, which makes it very hard to guard against.

Today, we look at the problem from an enterprise's perspective: how to use technical means to stop online swindlers from committing fraud through the company’s own platform.

===

Fraudulent users and malicious accounts can cost enterprises billions of dollars in lost revenue every year. Although many enterprises use rule-based filters to block malicious activity in their systems, such filters are often fragile and cannot capture every kind of malicious behavior.

On the other hand, some solutions (such as graph techniques) are particularly good at detecting fraudsters and malicious users. Fraudsters can adjust their activities to deceive rule-based systems or simple feature-based models, but it is difficult to forge the graph structure, especially the relationships between users and other entities in the transaction or interaction log. Graph neural networks (GNNs) can combine the information in the graph structure with the attributes of users or transactions, extract meaningful representations, and finally distinguish malicious users and events from legitimate ones.

This article describes how to use Amazon SageMaker and the Deep Graph Library (DGL) to train a GNN model that detects malicious users or fraudulent transactions. Users who prefer a fully managed Amazon AI service for fraud detection can also consider Amazon Fraud Detector, which significantly lowers the difficulty of identifying potentially fraudulent online activities such as the creation of fake accounts or online payment fraud.

The rest of this article focuses on how to use Amazon SageMaker for data preprocessing and model training. To train a GNN model, we first need to build a heterogeneous graph from the information in the transaction table or access log. A heterogeneous graph is a graph that contains different types of nodes and edges. If nodes represent users or transactions, each node can have several different kinds of relationships with other users or entities (such as device identifiers, institutions, applications, IP addresses, and so on).

Here are some use cases applicable to this solution:

  • A financial network in which users transact with other users and with specific financial institutions or applications.
  • A game network in which users continuously interact with other users and with different games or devices.
  • A social network with many different types of links between users.

The following figure shows the basic architecture of a heterogeneous financial transaction network.


A GNN can incorporate user features (such as demographic information) or transaction features (such as activity frequency). In other words, we can use the features of nodes and edges as metadata to enrich the representation of the heterogeneous graph. Once the nodes, relationships, and their associated features have been established in the heterogeneous graph, a GNN model can be trained to use both the node or edge features and the graph structure to classify nodes as malicious or legitimate. Model training is semi-supervised: some nodes in the graph must be labeled in advance as fraudulent or legitimate. Using this labeled subset as the training signal, we gradually learn the parameters of the GNN model. The trained GNN model can then predict labels for the remaining unlabeled nodes in the graph.

Architecture

The complete solution architecture uses Amazon SageMaker to run the processing jobs and training jobs. The Amazon SageMaker jobs can be triggered automatically by an AWS Lambda function that responds to Amazon Simple Storage Service (Amazon S3) put events, or triggered manually by running the cells in the sample Amazon SageMaker notebook. The following figure is a visual representation of this architecture:


The full implementation is available in the GitHub repo, together with an AWS CloudFormation template that launches the whole architecture in your AWS account.

Preparing for GNN fraud detection: data preprocessing

In this section, we describe how to preprocess the sample dataset and identify the relationships between nodes in the heterogeneous graph.

Dataset

In this use case, we use the IEEE-CIS fraud dataset to benchmark the modeling method. This is an anonymized dataset containing up to 500,000 transactions between users. The dataset contains two main tables:

  • Transactions table: contains information about transactions or interactions between users.
  • Identity table: contains the access log, device, and network information of the user performing each transaction.

We can use a subset of these transactions and their labels as the supervision signal for model training. For transactions in the test dataset, the labels are masked during training. The task of the model is clear: predict which of the masked transactions are fraudulent and which are legitimate.

The following example code copies the data to the Amazon S3 bucket that Amazon SageMaker uses to access the dataset during preprocessing and training (run it in a Jupyter notebook cell):
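
The masking step can be sketched as follows. This is a minimal illustration in plain Python, not the solution's own code: the helper name `split_labels`, the transaction IDs, and the labels are all hypothetical.

```python
def split_labels(labels, test_ids):
    """Split labels into a visible training part and a hidden test part.

    labels:   dict mapping transaction ID -> 1 (fraud) or 0 (legitimate)
    test_ids: set of transaction IDs whose labels must stay hidden
    """
    train = {tid: y for tid, y in labels.items() if tid not in test_ids}
    hidden = {tid: y for tid, y in labels.items() if tid in test_ids}
    return train, hidden


# Hypothetical toy data: five transactions, two of them held out for testing.
labels = {"t1": 0, "t2": 1, "t3": 0, "t4": 0, "t5": 1}
test_ids = {"t4", "t5"}

train_labels, hidden_labels = split_labels(labels, test_ids)
print(sorted(train_labels))   # IDs the model may learn from
print(sorted(hidden_labels))  # IDs the model must predict
```

The hidden labels are only consulted after training, when the predictions for the held-out transactions are scored.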

# Replace with an S3 location or local path to point to your own dataset
raw_data_location = 's3://sagemaker-solutions-us-west-2/Fraud-detection-in-financial-networks/data'
bucket = 'SAGEMAKER_S3_BUCKET'
prefix = 'dgl'
input_data = 's3://{}/{}/raw-data'.format(bucket, prefix)
!aws s3 cp --recursive $raw_data_location $input_data
# Set S3 locations to store processed data for training and post-training results and artifacts respectively
train_data = 's3://{}/{}/processed-data'.format(bucket, prefix)
train_output = 's3://{}/{}/output'.format(bucket, prefix)

Although fraudsters try to cover up their malicious activities, such behavior still leaves distinctive patterns in the graph structure, such as high node degree or a tendency for activity to cluster. The following sections explain how to perform feature extraction and graph construction so that the GNN model can use these patterns to predict fraud.

Feature extraction

Feature extraction consists of numerically encoding the categorical features and then applying a series of transformations to the numerical columns. For example, we apply a logarithmic transformation to the transaction amount to capture the relative magnitude of the amount, and categorical attributes can be converted to numerical form through one-hot encoding. For each transaction, the feature vector contains the attributes from the transaction table, including information such as the time delta from previous transactions, name and address matches, and match counts.

Graph construction

To build the complete interaction graph, we split the relational information in the data into one edge list per relation type. Each edge list describes a bipartite graph between transaction nodes and one other entity type. These entity types are the identifying attributes associated with a transaction. For example, we can create an entity type for the kind of card used in the transaction (debit or credit), for the IP address of the device that completed the transaction, and for the device ID or operating system of that device. The entity types used for graph construction include all attributes in the identity table and a subset of attributes in the transaction table, such as credit card information or email domain. The heterogeneous graph is then assembled from the edge list for each relation type and the feature matrix of the nodes.
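
The two transformations can be sketched as follows. This is a minimal example in plain Python so that it runs without dependencies; the solution's own script uses pandas. The use of `log1p`, the helper names, and the `card_type` column are assumptions made for illustration, and the actual script may use a different logarithm or offset.

```python
import math


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def extract_features(rows, cat_col, amount_col):
    """Turn raw transaction rows into numeric feature vectors."""
    # Fix the category order so every vector has the same layout.
    categories = sorted({row[cat_col] for row in rows})
    features = []
    for row in rows:
        # log(1 + x) compresses large amounts and is defined for 0.
        log_amount = math.log1p(row[amount_col])
        features.append([log_amount] + one_hot(row[cat_col], categories))
    return categories, features


# Hypothetical rows; the values and the card_type column are illustrative.
rows = [
    {"TransactionAmt": 0.0, "card_type": "debit"},
    {"TransactionAmt": 99.0, "card_type": "credit"},
]
categories, features = extract_features(rows, "card_type", "TransactionAmt")
print(categories)
print(features)
```

Each vector starts with the log-scaled amount, followed by one indicator per category in sorted order.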
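
As a rough sketch of this step, the snippet below builds one edge list per entity column from a few transaction rows. It is written in plain Python for illustration only; the helper name `build_edge_lists` and the sample values are hypothetical, and the solution's preprocessing script may handle missing values differently.

```python
def build_edge_lists(rows, id_col, entity_cols):
    """Return {entity column: [(transaction ID, entity value), ...]}."""
    edge_lists = {col: [] for col in entity_cols}
    for row in rows:
        for col in entity_cols:
            value = row.get(col)
            # A missing attribute creates no edge for this transaction.
            if value is not None:
                edge_lists[col].append((row[id_col], value))
    return edge_lists


# Hypothetical transactions with two identifying attributes each.
rows = [
    {"TransactionID": 1, "card6": "debit", "P_emaildomain": "a.com"},
    {"TransactionID": 2, "card6": "debit", "P_emaildomain": None},
    {"TransactionID": 3, "card6": "credit", "P_emaildomain": "a.com"},
]
edge_lists = build_edge_lists(rows, "TransactionID", ["card6", "P_emaildomain"])
for relation, edges in edge_lists.items():
    print(relation, edges)
```

Transactions 1 and 3 never interact directly, yet both connect to the entity node `a.com`, so they become two-hop neighbors in the graph. Shared entities of this kind are what lets a GNN propagate information between related transactions.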

Using Amazon SageMaker Processing

You can use Amazon SageMaker Processing to perform the data preprocessing and feature extraction steps. Amazon SageMaker Processing is a feature of Amazon SageMaker that lets you run preprocessing and postprocessing workloads on fully managed infrastructure. For more details, see “Process Data and Evaluate Models”.

First, we need to define the container used by the Amazon SageMaker Processing job. This container should contain all the dependencies required by the data preprocessing script. Since the data preprocessing in this use case only needs the pandas library, a minimal Dockerfile is enough to define the container. See the following code for details:

FROM python:3.7-slim-buster
RUN pip3 install pandas==0.24.2
ENV PYTHONUNBUFFERED=TRUE
ENTRYPOINT ["python3"]

You can run the following code to build the container and push the resulting image to an Amazon Elastic Container Registry (Amazon ECR) repository:

import boto3
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-preprocessing-container'
ecr_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region, ecr_repository)
!bash data-preprocessing/container/build_and_push.sh $ecr_repository docker

Once the data preprocessing container is ready, we can create an Amazon SageMaker ScriptProcessor, which sets up the processing job environment using the preprocessing container. We can then use the ScriptProcessor to run the Python script that implements data preprocessing in the environment defined by the container. The processing job ends after the Python script finishes and has saved the preprocessed data back to Amazon S3. The whole process is fully managed by Amazon SageMaker. When running the ScriptProcessor, we can pass arguments to the data preprocessing script to specify which columns in the transaction table should be treated as identity columns and which are categorical features. All other columns are assumed to be numerical feature columns. See the following code for details:

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput
# `role` is the SageMaker execution role, assumed to be defined earlier in the notebook
script_processor = ScriptProcessor(command=['python3'],
                                   image_uri=ecr_repository_uri,
                                   role=role,
                                   instance_count=1,
                                   instance_type='ml.r5.24xlarge')
# Run the preprocessing script, mapping S3 locations to paths inside the container
script_processor.run(code='data-preprocessing/graph_data_preprocessor.py',
                     inputs=[ProcessingInput(source=input_data,
                                             destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(destination=train_data,
                                               source='/opt/ml/processing/output')],
                     arguments=['--id-cols', 'card1,card2,card3,card4,card5,card6,ProductCD,addr1,addr2,P_emaildomain,R_emaildomain',
                                '--cat-cols', 'M1,M2,M3,M4,M5,M6,M7,M8,M9'])
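
On the receiving side, a script might turn these arguments into column roles along the following lines. This is a hypothetical sketch and not the code in graph_data_preprocessor.py: the helper names `parse_args` and `split_columns` and the sample column list are assumptions. Only the two flag names and the rule that all remaining columns are numerical come from the text above.

```python
import argparse


def parse_args(argv):
    """Parse the column-role arguments passed to the preprocessing script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--id-cols", type=str, default="")
    parser.add_argument("--cat-cols", type=str, default="")
    return parser.parse_args(argv)


def split_columns(all_columns, args):
    """Assign every column to exactly one role: identity, categorical, numerical."""
    # Strip whitespace so that 'M1, M2' and 'M1,M2' are treated the same.
    id_cols = [c.strip() for c in args.id_cols.split(",") if c.strip()]
    cat_cols = [c.strip() for c in args.cat_cols.split(",") if c.strip()]
    assigned = set(id_cols) | set(cat_cols)
    # Everything not named explicitly is assumed to be numerical.
    num_cols = [c for c in all_columns if c not in assigned]
    return id_cols, cat_cols, num_cols


# Hypothetical invocation with a shortened column list.
args = parse_args(["--id-cols", "card1,addr1", "--cat-cols", "M1, M2"])
columns = ["card1", "addr1", "M1", "M2", "TransactionAmt", "C1"]
id_cols, cat_cols, num_cols = split_columns(columns, args)
print(id_cols, cat_cols, num_cols)
```

Stripping whitespace around each name makes the parsing tolerant of stray spaces in the comma-separated lists.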

The following example code shows the output results of the Amazon sagemaker processing job stored in Amazon S3:

from sagemaker.s3 import S3Downloader
# List every object the processing job wrote under the processed-data prefix
processed_files = S3Downloader.list(train_data)
print("===== Processed Files =====")
print('\n'.join(processed_files))

Output:
===== Processed Files =====
s3://graph-fraud-detection/dgl/processed-data/features.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceInfo_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_DeviceType_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_P_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_ProductCD_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_R_emaildomain_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_TransactionID_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_addr2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card1_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card2_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card3_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card4_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card5_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_card6_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_01_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_02_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_03_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_04_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_05_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_06_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_07_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_08_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_09_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_10_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_11_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_12_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_13_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_14_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_15_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_16_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_17_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_18_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_19_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_20_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_21_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_22_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_23_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_24_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_25_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_26_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_27_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_28_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_29_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_30_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_31_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_32_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_33_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_34_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_35_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_36_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_37_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/relation_id_38_edgelist.csv
s3://graph-fraud-detection/dgl/processed-data/tags.csv
s3://graph-fraud-detection/dgl/processed-data/test.csv

Each relation edge list file represents a different type of edge used to construct the heterogeneous graph during training. features.csv contains the final transformed features of the transaction nodes, while tags.csv contains the node labels used as the supervision signal for training. test.csv contains the TransactionID values that make up the test dataset for evaluating model performance. The labels of these nodes are masked during training so that they cannot influence the model's predictions.

GNN model training

Now we can use the Deep Graph Library (DGL) to create the graph and define the GNN model, and then use Amazon SageMaker to launch the infrastructure that trains the GNN. Specifically, we use a relational graph convolutional network (R-GCN) model to learn embeddings for the nodes in the heterogeneous graph, followed by a fully connected layer for the final node classification.

Hyperparameters

To train the GNN model, we also need to fix a set of hyperparameters before training, such as the type of graph to construct, the type of GNN model to use, the network architecture, the optimizer, and the optimization parameters. See the following code for details:
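
To make the idea of a relational graph convolution more concrete, here is a minimal sketch of one such layer in plain Python. It is not the DGL implementation used by the solution: the weights are fixed by hand rather than learned, the function names are hypothetical, and messages flow only in the direction in which the edges are listed. Each node combines its own transformed features with the mean of its neighbors' transformed features, computed separately for each relation type.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges_by_relation, weights, self_weight):
    """One relational graph convolution step with mean aggregation and ReLU.

    features:          {node: feature vector}
    edges_by_relation: {relation: [(source, destination), ...]}
    weights:           {relation: weight matrix applied to source features}
    self_weight:       weight matrix applied to a node's own features
    """
    out = {}
    for node, h in features.items():
        total = matvec(self_weight, h)  # self-loop term
        for relation, edges in edges_by_relation.items():
            neighbors = [src for src, dst in edges if dst == node]
            for src in neighbors:
                message = matvec(weights[relation], features[src])
                # Normalize by the number of neighbors under this relation.
                total = [t + m / len(neighbors) for t, m in zip(total, message)]
        out[node] = [max(0.0, t) for t in total]  # ReLU
    return out


# Hypothetical toy graph: two transactions that used the same card.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"t1": [1.0, 0.0], "t2": [0.0, 2.0], "card": [0.0, 0.0]}
edges_by_relation = {"uses_card": [("t1", "card"), ("t2", "card")]}
weights = {"uses_card": identity}

updated = rgcn_layer(features, edges_by_relation, weights, identity)
print(updated["card"])
```

After one layer, the card node holds the average of the two transactions' features. Stacking a second layer with reverse edges would pass that summary back to each transaction, which is how information about one transaction can reach another through a shared entity.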

edges = ",".join(map(lambda x: x.split("/")[-1], [file for file in processed_files if "relation" in file]))
params = {'nodes' : 'features.csv',
 'edges': 'relation*.csv',
 'labels': 'tags.csv',
 'model': 'rgcn',
 'num-gpus': 1,
 'batch-size': 10000,
 'embedding-size': 64,
 'n-neighbors': 1000,
 'n-layers': 2,
 'n-epochs': 10,
 'optimizer': 'adam',
 'lr': 1e-2
 }

The code above shows only some of the hyperparameters. For more details on the hyperparameters and their default values, see estimator_fns.py in the GitHub repo.

Training the model with Amazon SageMaker

After the hyperparameters are defined, we can start the training process. The training job uses DGL, with MXNet as the back-end deep learning framework, to define and train the GNN model. Amazon SageMaker provides framework estimators with the deep learning framework environment already set up, which makes training the GNN model much easier. For more details on training GNN models with DGL on Amazon SageMaker, see “Train a Deep Graph Network”.

Now we can create an Amazon SageMaker MXNet estimator and pass in the model training script, the hyperparameters, and the required number and type of training instances. Next, we call fit on the estimator and pass it the location of the training data on Amazon S3. See the following code for details:

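from sagemaker.mxnet import MXNet
# `role` and `sess` (a SageMaker session) are assumed to be defined earlier in the notebook.
# Note: SageMaker Python SDK v2 renames train_instance_count / train_instance_type
# to instance_count / instance_type.
estimator = MXNet(entry_point='train_dgl_mxnet_entry_point.py',
                  source_dir='dgl-fraud-detection',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.p2.xlarge',
                  framework_version="1.6.0",
                  py_version='py3',
                  hyperparameters=params,
                  output_path=train_output,
                  code_location=train_output,
                  sagemaker_session=sess)
# Start the training job; 'train' is the channel name mapped to the processed data
estimator.fit({'train': train_data})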

Results

After training, the GNN model has learned to distinguish legitimate transactions from fraudulent ones. The training job produces a pred.csv file containing the model's predictions for the transactions in test.csv. The ROC curve shows the relationship between the true positive rate and the false positive rate at various thresholds, and the area under the curve (AUC) can be used as an evaluation metric. As the figure below shows, the GNN model we trained outperforms both a fully connected feedforward network and gradient boosted trees that use the same features but do not take full advantage of the graph structure.
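
For readers who want to see how the AUC metric is defined, the following sketch computes it directly from labels and predicted scores. It is a plain Python illustration with made-up values, not the evaluation code of the solution. Its pairwise loop is quadratic in the number of examples, so a sort-based method or a library routine would be a better fit for a dataset of this size.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random fraudulent example receives a
    higher score than a random legitimate one (ties count as one half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fraud) and model scores for four transactions.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of fraudulent from legitimate transactions.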


Summary

In this article, we explained how to build a heterogeneous graph from user transactions and activities, how to use the graph and other collected features to train a GNN model, and finally how to predict whether transactions are fraudulent. The article also described how to use DGL and Amazon SageMaker to define and train GNN models with high predictive performance. For the complete implementation of this project and details of other GNN models, see the GitHub repo.

In addition, we showed how to use Amazon SageMaker Processing to extract meaningful features and relationships from the raw transaction data logs. You can deploy the CloudFormation template provided in the example directly and pass in your own dataset to detect malicious users and fraudulent transactions in your data.
