Amazon SageMaker: marking suspicious medical insurance claims



We have published a number of technical articles about Amazon SageMaker. Today, let's work through an example that shows how simple and powerful the service is for machine learning: detecting insurance fraud with SageMaker.


According to estimates from the National Health Care Anti-Fraud Association (NHCAA), medical insurance fraud causes about $68 billion in losses annually in the United States, roughly 3% of national health care expenditure ($2.26 trillion). This is a conservative figure; some estimates put the loss as high as 10% of annual health care expenditure, or $230 billion.

Medical insurance fraud inevitably drives up consumers' premiums and out-of-pocket expenses, harms their interests, and can reduce their insurance coverage.

Identifying a claim as fraudulent may require a complex and detailed investigation. This article shows how to train an Amazon SageMaker model to flag anomalous fee-for-service inpatient claims for further fraud investigation. The solution does not require labeled data; it uses unsupervised machine learning (ML) to build a model that locates suspicious claims.

Anomaly detection in this domain is difficult because of the following challenges:

  • The difference between normal and abnormal data is often not obvious, and anomaly detection methods can be application specific. For example, in clinical data a small deviation may mark an outlier, while in marketing applications a much larger deviation is required.
  • Noise in the data may appear as deviations in attribute values or as missing values. Noise can mask outliers, or cause ordinary deviations to be flagged as outliers.
  • It can be difficult to explain outliers clearly and convincingly.

This solution uses Amazon SageMaker, which helps developers and data scientists build, train, and deploy ML models. Amazon SageMaker is a fully managed service that covers the entire ML workflow: labeling and preparing data, choosing an algorithm, training the model, tuning and optimizing it for deployment, and making predictions.

In addition, we can work end to end in an Amazon SageMaker Jupyter notebook. For more information, see the GitHub repository.

Solution overview

In this example, we will use Amazon SageMaker to do the following:

  1. Download and visualize the dataset using a Jupyter notebook;
  2. Clean the data locally and view data samples in the Jupyter notebook;
  3. Perform feature engineering on the text columns using word2vec;
  4. Fit a principal component analysis (PCA) model to the preprocessed dataset;
  5. Score the whole dataset;
  6. Apply a threshold to the scores to identify any suspicious or unusual claims.

Download and visualize the dataset using a Jupyter notebook

This article uses the 2008 Medicare inpatient claims dataset. The dataset, CMS 2008 BSA Inpatient Claims PUF, is a publicly available Basic Stand Alone (BSA) inpatient public use file (PUF).

The Jupyter notebook for this article explains how to download the dataset; for more information, see the GitHub repository.

The dataset contains a claim primary key as a record index and seven analysis variables, covering basic beneficiary attributes and claim-related variables. However, since the file does not provide a beneficiary ID, we cannot correlate claims from the same beneficiary, although the dataset contains sufficient information to model this solution.

In terms of features, this is a minimal dataset. It lacks some desirable features (for example, the postal code of the treatment facility). More data could be added to build a richer feature set and continuously improve the accuracy of this solution.

You can download a copy of the dataset, or access it through the GitHub repository.

Next, we analyze the seven analysis variables, clean the data in each variable by fixing null values, and replace the ICD-9 diagnosis and procedure codes with their corresponding descriptions.

Clean the column names

Clean the column names by following the steps below.

  1. Open the file ColumnNames.csv
  2. Remove all spaces and double quotes

This produces readable names for the encoded columns, and you are ready to start processing the dataset. See the following code example:

colnames = pd.read_csv("./data/ColumnNames.csv")
colnames[colnames.columns[-1]] = colnames[colnames.columns[-1]].map(lambda x: x.replace('"', '').strip())


The following table shows the column names used in this project in this dataset.

The following are the characteristics of the dataset used:

  • Medicare inpatient claims from 2008
  • Each record is an inpatient claim filed by a Medicare beneficiary (a 5% sample)
  • The identity of the beneficiary is not provided
  • The ZIP code of the patient treatment facility is not provided
  • The file contains eight variables: one primary key and seven analysis variables
  • A data dictionary needed to interpret the codes in the dataset is provided

Visualize the dataset

As the following screenshot makes clear, it is difficult to distinguish abnormal records from normal ones by visual inspection. Even with statistical techniques it is not easy, because of the following challenges:

  • Effectively modeling normal and abnormal objects is hard; the boundary between normal data and outliers is usually unclear.
  • Outlier detection methods are application specific. For example, in clinical data a small deviation may mark an outlier, while in marketing applications a larger deviation is required.
  • Noise in the data may appear as deviations in attribute values or even missing values. Noise can mask outliers, or cause ordinary deviations to be flagged as outliers.
  • It may be difficult to explain outliers in an interpretable way.

The following screen shot shows a sample record in the dataset:

Clean the data locally and view data samples in the Jupyter notebook

Generating column statistics on a dataset

The following command checks each column for null values:

# check null value for each column
df_cms_claims_data.isnull().mean()

In the output, the ICD9 primary procedure code column returns 0.469985, meaning roughly 47% of its entries are null (NaN). NaN stands for "not a number", the floating-point value produced when the result of a computation cannot be expressed as a number. This indicates that the null values in ICD9 primary procedure code need to be fixed.

Replace the ICD9 procedure code

To replace the null values, execute the following code, which also changes the column type from float to int64, since the dataset encodes all procedure codes as integers:

# fill NaN with -1 for "No Procedure Performed"
procedure_na = -1
df_cms_claims_data['ICD9 primary procedure code'].fillna(procedure_na, inplace=True)

# convert procedure code from float to int64
df_cms_claims_data['ICD9 primary procedure code'] = df_cms_claims_data['ICD9 primary procedure code'].astype(np.int64)

Analyze the gender and age data

Next, analyze the gender and age distributions. Perform the following steps to draw a bar chart for each of the gender and age fields:

  1. Read the gender / age dictionary CSV files
  2. Join the beneficiary category codes to the age group / gender definitions to describe the distribution of age groups in the claims dataset
  3. Plot the gender / age distributions in the dataset as bar charts
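Steps 1 and 2 above might look like the following sketch. The column names, codes, and counts here are hypothetical stand-ins for the actual dataset and dictionary file, not values taken from the CMS data:

```python
import pandas as pd

# hypothetical stand-ins for the claims data and the age-group dictionary
claims = pd.DataFrame({'Beneficiary Age category code': [1, 2, 2, 3, 3, 3]})
age_dict = pd.DataFrame({'code': [1, 2, 3],
                         'age_group': ['Under_65', '65_to_84', '85_and_Older']})

# join the category codes to their age-group definitions
merged = claims.merge(age_dict, left_on='Beneficiary Age category code',
                      right_on='code')

# distribution of claims across age groups
dist = merged['age_group'].value_counts()
# dist.plot(kind='bar')  # draws the bar chart in the notebook
```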

The following screenshot shows a bar chart of the distribution of age groups. The distribution of claims is slightly unbalanced: the Under_65 and 85_and_Older groups have higher proportions of claims. Since these two categories are open-ended and cover wider age ranges, this imbalance can be ignored.

The following screenshot shows the gender bar chart, which is also slightly unbalanced: the proportion of claims from women is slightly higher. Because the imbalance is small, it too can be ignored.

Analyze the length of stay, payment code, and payment amount data

At this stage, the length of stay code, DRG quintile payment code, and DRG quintile payment amount data need no conversion. The data is already cleanly encoded, and any imbalance in it may itself carry a signal the model can use to capture anomalies, so no further imbalance analysis is required.

Use word2vec to perform feature engineering on the text columns

There are seven analysis variables in the dataset. Of these, we use patient age, patient gender, length of stay, DRG quintile payment code, and DRG quintile payment amount directly as features without further conversion. These fields need no feature engineering: they are encoded as integers on which mathematical operations can safely be applied.

However, we still need to extract useful features from the diagnosis and procedure descriptions. The diagnosis and procedure fields are also encoded as integers, but applying mathematical operations to those codes distorts their meaning. For example, the average of two procedure or diagnosis codes may equal the code of a third, completely unrelated procedure or diagnosis. The technique discussed in this article encodes the procedure and diagnosis description fields in a more meaningful way: the continuous bag-of-words (CBOW) model, a specific word2vec implementation of word embedding.

Word embedding converts words into numbers. There are many ways to convert text into numbers, such as frequency counts and one-hot encoding, but most traditional methods generate a sparse matrix, which is inefficient both contextually and computationally.

Word2vec is a shallow neural network that maps words to target variables; during training, the network learns weights that act as the vector representations of the words.

The CBOW model predicts a word from its surrounding context (a sentence, for example). The dense vector representations of words that word2vec learns carry semantic meaning.

Text preprocessing for the diagnosis and procedure descriptions

The following code performs text processing on the diagnosis descriptions so that common abbreviations carry their full meaning for word embedding:

a) Convert to lowercase

b) Apply replacements:

  i. Replace '&' with 'and'
  ii. Remove commas
  iii. Replace 'w/o' with 'without'
  iv. Replace ' w ' with ' with '
  v. Replace ' maj ' with ' major '
  vi. Replace ' proc ' with ' procedure '
  vii. Replace 'o.r.' with 'operating room'

c) Split the phrase into words

d) Return the list of words

# function to run preprocessing on diagnosis descriptions
def text_preprocessing(phrase):
    phrase = phrase.lower()
    phrase = phrase.replace('&', 'and')
    #phrase = phrase.replace('non-', 'non') # ensures non-critical doesn't become {'non', 'critical'}
    phrase = phrase.replace(',', '')
    phrase = phrase.replace('w/o', 'without').replace(' w ', ' with ').replace('/', ' ')
    phrase = phrase.replace(' maj ', ' major ')
    phrase = phrase.replace(' proc ', ' procedure ')
    phrase = phrase.replace('o.r.', 'operating room')
    sentence = phrase.split(' ')
    return sentence
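As a quick check, here is the function applied to a made-up DRG-style phrase (the function body is repeated so the example runs on its own):

```python
def text_preprocessing(phrase):
    # same replacements as the function above, repeated so the example is self-contained
    phrase = phrase.lower()
    phrase = phrase.replace('&', 'and')
    phrase = phrase.replace(',', '')
    phrase = phrase.replace('w/o', 'without').replace(' w ', ' with ').replace('/', ' ')
    phrase = phrase.replace(' maj ', ' major ')
    phrase = phrase.replace(' proc ', ' procedure ')
    phrase = phrase.replace('o.r.', 'operating room')
    return phrase.split(' ')

tokens = text_preprocessing('HEART FAILURE & SHOCK W MCC')
# tokens == ['heart', 'failure', 'and', 'shock', 'with', 'mcc']
```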

After tokenizing and preprocessing the diagnosis descriptions, the output is passed to word2vec to generate word embeddings.

Generate word embeddings for single words

To generate an embedding for each single word in the preprocessed diagnosis descriptions, complete the following steps:

  1. Train the word2vec model to convert the preprocessed procedure and diagnosis descriptions into features, and use the Python visualization library seaborn (imported as sns) to visualize the results in 2D space.
  2. Use CBOW to extract feature vectors from the preprocessed diagnosis and procedure code descriptions.
  3. Train the word2vec model locally on the Amazon SageMaker Jupyter notebook instance for the diagnosis and procedure descriptions.
  4. Use the model to extract a fixed-length word vector for each word in the procedure and diagnosis descriptions.

This article uses word2vec from the gensim package. For more information, see gensim 3.0.0. After completing the steps above, the final vector for each word contains 72 floating-point numbers, which serves as the feature vector for that token in the diagnosis and procedure descriptions.

Generate word embeddings for procedure and diagnosis description phrases

After obtaining a vector for each word, new embeddings can be generated for whole phrases.

  1. Using the mean of all word vectors in a description, construct a new vector for each complete diagnosis and procedure description phrase. These new vectors become the feature set for the diagnosis and procedure description fields in the dataset. See the following code example:

# training word2vec model on diagnosis description tokens
model_drg = Word2Vec(tmp_diagnosis_tokenized, min_count=1, size=72, window=5, iter=30)
  2. Take the mean of all word vectors in each phrase. This generates the embedding for the complete diagnosis description phrase. See the following code example:

# iterate through the list of tokens in each diagnosis phrase
index, values = [], []  # collectors (initialization omitted in the original excerpt)
for i, v in pd.Series(tmp_diagnosis_tokenized).items():
    # calculate mean of all word embeddings in each diagnosis phrase
    index.append(i)
    values.append(model_drg[v].mean(axis=0))

tmp_diagnosis_phrase_vector = pd.DataFrame({'Base DRG code': index, 'DRG_VECTOR': values})
  3. Expand the diagnosis description vectors into features. See the following code example:

# expand tmp_diagnosis_phrase_vector into a dataframe
# every scalar value in a phrase vector is treated as a feature
diagnosis_features = tmp_diagnosis_phrase_vector['DRG_VECTOR'].apply(pd.Series)

# rename each variable in diagnosis_features with the prefix DRG_F
diagnosis_features = diagnosis_features.rename(columns=lambda x: 'DRG_F' + str(x + 1))

# view the diagnosis_features dataframe
diagnosis_features.head()
The following screenshot shows the generated word embeddings. They are abstract values, however, and do not help with visualization.

  4. Repeat the same process for the procedure codes to obtain the feature set for the procedure descriptions, shown in the screenshot below.


Visualize the diagnosis and procedure description vectors

This article uses a technique called t-SNE to visualize the word embedding results in 2D or 3D (from the original multidimensional space). The following screenshot shows a t-SNE diagram: a 2D projection of the word vectors generated by the word2vec algorithm.

Even with identical training parameters, word2vec models and t-SNE diagrams are not necessarily identical from run to run, because each new training session is randomly initialized.

There is no ideal shape for a t-SNE graph, but avoid patterns in which all the words fall into one tight cluster very close together. The plot below looks good.

Repeat the process for the procedure descriptions. The following screenshot shows the 2D projection after preprocessing and applying word2vec. Again, this plot looks good.
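A minimal sketch of such a projection using scikit-learn's TSNE. Random vectors stand in for the word2vec embeddings, and the parameter values are illustrative, not those used in the notebook:

```python
import numpy as np
from sklearn.manifold import TSNE

# random 72-dimensional vectors standing in for the word2vec embeddings
rng = np.random.RandomState(0)
word_vectors = rng.normal(size=(50, 72))

# project to 2D; perplexity must be smaller than the number of points
tsne = TSNE(n_components=2, perplexity=5, random_state=0, init='random')
projection = tsne.fit_transform(word_vectors)

# plt.scatter(projection[:, 0], projection[:, 1])  # draws the 2D diagram
```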

Gather all the feature sets into the final training feature set

Next, aggregate the features extracted from the seven analysis variables into the final feature set, using the standard Python data science libraries.
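A minimal sketch of that aggregation with pandas. The small frames and column names here are illustrative placeholders for the real feature sets, not the actual data:

```python
import numpy as np
import pandas as pd

# stand-ins: directly usable fields plus the two embedding-derived feature sets
direct_features = pd.DataFrame({'age': [1, 2], 'gender': [2, 1], 'length_of_stay': [3, 5]})
diagnosis_features = pd.DataFrame(np.ones((2, 3)), columns=['DRG_F1', 'DRG_F2', 'DRG_F3'])
procedure_features = pd.DataFrame(np.zeros((2, 3)), columns=['PRCDR_F1', 'PRCDR_F2', 'PRCDR_F3'])

# column-wise concatenation yields the final training feature set
X = pd.concat([direct_features, diagnosis_features, procedure_features], axis=1)
```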

Fit a principal component analysis (PCA) model to the preprocessed dataset

The next step demonstrates how to use PCA for anomaly detection, following the approach of "A Novel Anomaly Detection Scheme Based on Principal Component Classifier".

Split the data into training data and test data

Before using PCA for anomaly detection, we need to split the data into training and test sets, making sure the random split covers payment distributions of all sizes. Here the split is stratified by the DRG quintile payment amount code, with 30% for testing and 70% for training. See the following code example:

from sklearn.model_selection import StratifiedShuffleSplit

# strata holds the DRG quintile payment amount code used for stratification
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
splits = sss.split(X, strata)
for train_index, test_index in splits:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
The next step is to standardize the data to avoid being dominated by high-scale variables.

Standardize the data based on the training sample

Because the PCA algorithm trained next maximizes the orthogonal variance in the data, standardize the training data to zero mean and unit variance before performing PCA. This makes the PCA algorithm invariant to such scaling transformations and prevents high-scale variables from dominating the PCA projection. See the following code example:

from sklearn.preprocessing import StandardScaler

n_obs, n_features = X_train.shape
scaler = StandardScaler().fit(X_train)
X_stndrd_train = scaler.transform(X_train)

So far, we have completed feature extraction and standardization of the dataset. We can now use Amazon SageMaker PCA for anomaly detection: it reduces the number of variables and ensures that the resulting variables are independent of each other.

Amazon SageMaker PCA is an unsupervised ML algorithm that reduces the dimensionality (number of features) of a dataset while retaining as much information as possible. It does this by finding a new set of features, called components, which are combinations of the original features and uncorrelated with each other. They are also constrained so that the first component explains the largest possible variability in the data, the second component the second largest, and so on.

The model that Amazon SageMaker PCA trains on the data captures how the variables relate to each other (the covariance matrix), the directions in which the data is spread (the eigenvectors), and the relative importance of those directions (the eigenvalues).
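These three quantities can be illustrated locally with NumPy; random data here stands in for the standardized feature set:

```python
import numpy as np

rng = np.random.RandomState(1)
X_std = rng.normal(size=(200, 5))        # stand-in for the standardized features

cov = np.cov(X_std, rowvar=False)        # how the variables relate to each other
eigvals, eigvecs = np.linalg.eigh(cov)   # directions of spread and their importance

# like Amazon SageMaker PCA, eigh returns the eigenvalues in ascending order
```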

Convert the data into a binary stream and upload it to Amazon S3

Before starting the Amazon SageMaker training job, convert the data into a binary stream and upload it to Amazon S3. See the following code example:

# convert data to a binary stream
matrx_train = X_stndrd_train.astype('float32')

import io
import sagemaker.amazon.common as smac

buf_train = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf_train, matrx_train)
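smac.write_numpy_to_dense_tensor serializes the array into the RecordIO-protobuf format that the built-in algorithms expect. As a dependency-free illustration of the same idea — streaming an array into an in-memory binary buffer — here NumPy's own serialization stands in for the SageMaker helper:

```python
import io
import numpy as np

matrx_train = np.arange(12, dtype='float32').reshape(4, 3)

buf = io.BytesIO()
np.save(buf, matrx_train)   # np.save stands in for smac.write_numpy_to_dense_tensor
buf.seek(0)

restored = np.load(buf)     # the buffer round-trips the array intact
```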


Call the Amazon SageMaker fit function to start the training job

The next step is to call the Amazon SageMaker fit function to start the training job. See the following code example:

import sagemaker

#Initiate an Amazon SageMaker session
sess = sagemaker.Session()

#Create an Amazon SageMaker Estimator for Amazon SageMaker PCA.
#The container parameter holds the image URI of the Amazon SageMaker PCA
#algorithm; container, role, and output_location are assumed to be defined
#earlier in the notebook, and the instance settings below are illustrative.
pca = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.c4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=sess)

#Specify hyperparameters: feature_dim must match the number of input features;
#num_components is the number of principal components to compute
#(the values shown are illustrative)
pca.set_hyperparameters(feature_dim=n_features,
                        num_components=n_features - 1,
                        mini_batch_size=500)

#Start training by calling the fit function
pca.fit({'train': s3_train_data})

Calling the fit function triggers the creation of a separate training instance. This lets us choose one instance type for training and another for building and testing.

Score the whole dataset

Download and decompress the trained PCA model

After the training job completes, Amazon SageMaker writes the model artifacts to the specified S3 output location. We can download and unpack the returned PCA model artifacts and use them for dimensionality reduction.

The Amazon SageMaker PCA model artifacts contain the principal components (the eigenvectors, in ascending order of eigenvalue) together with their eigenvalues. A component's eigenvalue equals the standard deviation that the component explains; the square of the eigenvalue equals the variance it explains. Therefore, to calculate the fraction of the data's variance explained by each component, square its eigenvalue and divide by the sum of the squares of all the eigenvalues.

If you want the component that explains the most variance to appear first, reverse the returned order.
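A small sketch of that calculation; the eigenvalues below are made-up illustrative values, listed in the ascending order the model returns:

```python
import numpy as np

# hypothetical eigenvalues returned by the PCA model, in ascending order
eigenvalues = np.array([0.5, 1.2, 2.0, 3.5])

# square each eigenvalue and divide by the sum of all squares
explained_variance_ratio = eigenvalues**2 / np.sum(eigenvalues**2)

# reverse so the component explaining the most variance comes first
explained_variance_ratio = explained_variance_ratio[::-1]
```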

Plot the PCA components to reduce the dimensionality further

We can use PCA to reduce the dimensionality of the problem. With n features we obtain n − 1 components, but as the figure below shows, many components contribute little to explaining the variance of the data. We keep only the leading components that together explain 95% of the data's variance.

Thirteen components explain 95.08% of the data's variance. The red dashed line in the figure below highlights the cutoff for 95% of the data variance.

Calculate the Mahalanobis distance to give each claim an anomaly score

This article uses the Mahalanobis distance of each point as its anomaly score. The top α fraction of points is treated as outliers, where α depends on the desired detection sensitivity; here we take the top 1%, i.e. α = 0.01. The (1 − α) quantile of the distance distribution therefore becomes the threshold for judging a data point anomalous.

The following figure is generated from the Mahalanobis distances over the feature set output by the Amazon SageMaker PCA algorithm. The red line marks the anomaly detection threshold for the chosen sensitivity.

Using the outlier scores derived from the Mahalanobis distance and the chosen sensitivity, we can mark each claim with an "is abnormal" true / false flag. Records where the flag is true reach the anomaly threshold and should be considered suspicious; records where it is false do not reach the threshold and are not considered suspicious. This separates the anomalous claims from the standard ones.
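The scoring and flagging can be sketched with NumPy. Random data stands in for the PCA-projected feature set, with α = 0.01 as above:

```python
import numpy as np

rng = np.random.RandomState(0)
X_pca = rng.normal(size=(1000, 3))        # stand-in for the projected feature set

# Mahalanobis distance of each point from the mean
mean = X_pca.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_pca, rowvar=False))
diff = X_pca - mean
scores = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# the (1 - alpha) quantile of the scores is the anomaly threshold
alpha = 0.01
threshold = np.quantile(scores, 1 - alpha)

is_abnormal = scores > threshold          # True marks a suspicious claim
```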

Apply a threshold to the scores to identify any suspicious or unusual claims

Plot and analyze the abnormal records

Using only mathematical techniques, applied in the order of operations performed above on the CMS claims dataset and without any labeled data, we can label anomalous claim records.

The following screen shot shows an example of a standard record.

The following screen shot shows an example of an exception record.

Now that the standard data is separated from the anomalous data, we can treat any data point whose "abnormal" flag is true as suspicious and investigate it in depth.

Expert investigation can then confirm whether a claim is genuinely anomalous. If you want to go further and form your own explanations, hypotheses, or models, you can plot pairwise feature maps between the different variables (such as age, gender, length of stay, quintile code, quintile payment amount, procedure, and diagnosis code).

For a basic analysis, we can draw the pairwise feature maps with the seaborn library. The following screenshot shows the pairwise feature maps in one figure, with standard claims in blue and anomalous claims in orange, overlaid on each other. As you can see, the orange dots are either asymmetric with respect to the blue dots or isolated from them (no blue dots nearby).

The pairwise feature maps highlighted in red show the asymmetric patterns; between the blue and orange are regions containing orange dots but no blue ones. You can dig deeper into these plots and analyze the underlying data to find patterns or form hypotheses. Since no labeled data is provided in this article, such hypotheses are hard to test. Over time, however, labeled data may accumulate, and it can then be used to test the hypotheses and improve the accuracy of the model.


This article demonstrated how to build a model that flags suspicious claims. The model can serve as a starting point when building processes that support payment integrity, and it can be extended by introducing more data or adding new data sources. As it absorbs more data, its results and performance improve.

Using such a model helps minimize fraud: the risk of being flagged deters false claims and reduces users' health care costs. If you want to try out the techniques described in this article in your own Amazon SageMaker Jupyter notebook, the GitHub repository provides the notebook and related components.

