SageMaker Model Monitor and Debugger: invaluable assistants for understanding the “black box” of convolutional neural networks



Many systems can only be improved and enhanced once their internal principles and mechanisms are understood. Some, however, are inherently “incomprehensible”, such as convolutional neural networks (CNNs). So how can we improve them further?

CNNs perform extremely well on practical tasks such as image classification and object detection. They have therefore been adopted in a variety of application scenarios, including detecting traffic signs and objects on the street in autonomous driving, accurately classifying anomalies in image-based data in healthcare, and inventory management in retail.

However, a CNN operates like a black box. If we cannot understand the reasoning behind its predictions, we are likely to run into problems in practice. Moreover, after a model is deployed, the data used for inference may follow a completely different distribution from the data the model was trained on. This phenomenon, often referred to as data drift, can lead to incorrect model predictions. In such cases, understanding and explaining why the model mispredicts becomes the only way out of the fog.

Techniques such as class activation maps and saliency maps help us see intuitively how a CNN model makes its decisions. These maps take the form of heat maps that highlight the parts of the input image that determine the prediction. The following example comes from the German Traffic Sign dataset: the left image is fed into a fine-tuned ResNet model, which predicts that the image belongs to class 25 (road work). The right image shows the input overlaid with the heat map, where red marks the pixels most strongly correlated with the class-25 prediction and blue marks the least correlated ones.
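Producing such an overlay is straightforward once a saliency map is available. The following is a minimal sketch; the overlay_heatmap helper and the simple red-blue colormap are illustrative assumptions, not code from the repo:

```python
import numpy as np

# Minimal sketch of a heat-map overlay (hypothetical helper): blend a
# normalized saliency map onto an RGB image so that high-relevance
# pixels appear red and low-relevance pixels appear blue.
def overlay_heatmap(image, saliency, alpha=0.5):
    # image: HxWx3 floats in [0, 1]; saliency: HxW non-negative scores
    s = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    heat = np.stack([s, np.zeros_like(s), 1.0 - s], axis=-1)  # red=high, blue=low
    return (1 - alpha) * image + alpha * heat

img = np.random.rand(64, 64, 3)
sal = np.random.rand(64, 64)
out = overlay_heatmap(img, sal)
print(out.shape)  # (64, 64, 3); values stay in [0, 1]
```

Because the result is a convex combination of two images in [0, 1], it can be displayed directly without clipping.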

If the model makes wrong predictions without an obvious reason, visualizing the CNN’s decisions can help. It can also help you determine whether you need to add more representative samples to the training dataset, or reveal bias in the dataset. For example, if an object detection model is responsible for finding obstacles in road traffic, but the training dataset only contains samples collected during the summer, the model will probably not produce good inference results in winter scenes because it has never seen objects covered in ice and snow.

In this post, we deploy a traffic sign classification model and set up Amazon SageMaker Model Monitor to automatically detect unexpected model behavior, such as consistently low prediction scores or over-prediction of certain image classes. When Model Monitor detects a problem, we use Amazon SageMaker Debugger to obtain visual explanations of the deployed model. You can update the inference endpoint to emit tensors during inference and use those tensors to compute saliency maps. To reproduce the steps and results covered in this post, clone the amazon-sagemaker-analyze-model-predictions repo into your own Amazon SageMaker notebook instance or Amazon SageMaker Studio, and then run the notebook.

Define a SageMaker model

This post uses a trained ResNet18 model that distinguishes between 43 traffic signs using the German Traffic Sign dataset [2]. Given an input image, the model outputs the probabilities of the different image classes, each corresponding to a different type of traffic sign. We have fine-tuned the model and uploaded its weights to the GitHub repo.

Before deploying the model to Amazon SageMaker, you need to archive its weights and upload them to Amazon Simple Storage Service (Amazon S3). Enter the following code in a Jupyter notebook cell:

sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')

You can use Amazon SageMaker hosting services to set up a persistent endpoint and obtain predictions from the model. To do this, define a PyTorchModel object that pulls the model archive from the Amazon S3 path. Define an entry_point file, pretrained_model.py, that implements the model_fn and transform_fn functions. These functions are used during hosting to make sure the model is loaded correctly inside the inference container and that incoming requests are handled properly. See the following code:

from sagemaker.pytorch.model import PyTorchModel
model = PyTorchModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                     role=role,
                     framework_version='1.5.0',
                     entry_point='pretrained_model.py')
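For orientation, here is a minimal, self-contained sketch of the contract those two hosting functions fulfill. The real pretrained_model.py in the repo loads the ResNet18 weights with PyTorch; the stand-in model below is a hypothetical scorer used only to show the shape of the interface:

```python
import json

# Hedged sketch of the two SageMaker hosting functions. In the real
# script, model_fn loads the ResNet18 weights from model_dir; here a
# dummy callable stands in for the network.
def model_fn(model_dir):
    # load weights from model_dir in the real script
    return lambda image: [0.1, 0.7, 0.2]

def transform_fn(model, request_body, content_type, accept_type):
    data = json.loads(request_body)   # deserialize the input payload
    prediction = model(data)          # run inference
    return json.dumps(prediction), accept_type

model = model_fn('/opt/ml/model')
body, _ = transform_fn(model, json.dumps([[0.0]]), 'application/json', 'application/json')
print(body)  # '[0.1, 0.7, 0.2]'
```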

Set up Model Monitor and deploy the model

Model Monitor automatically monitors machine learning models in production and alerts you when it detects data quality issues. In this solution, we capture the inputs and outputs of the endpoint and create a monitoring schedule so that Model Monitor can inspect the collected data and the corresponding model predictions. The DataCaptureConfig API specifies the fraction of model inputs and outputs that Model Monitor stores in a target Amazon S3 bucket. In the following example, the sampling percentage is set to 50%:

from sagemaker.model_monitor import DataCaptureConfig
data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=50,
    destination_s3_uri='s3://' + sagemaker_session.default_bucket() + '/endpoint/data_capture')

To deploy the endpoint to an ml.m5.xlarge instance, enter the following code:

predictor = model.deploy(initial_instance_count=1,
                         instance_type='ml.m5.xlarge',
                         data_capture_config=data_capture_config)
endpoint_name = predictor.endpoint

Run inference on test images

Now you can invoke the endpoint with an input payload containing a serialized image. The endpoint calls the transform_fn function to preprocess the data before running inference, and returns the predicted class of the image as a list of integers encoded in JSON format. See the following code:

# invoke endpoint with payload
response = runtime.invoke_endpoint(EndpointName=endpoint_name, Body=payload)
response_body = response['Body']
# get results
result = json.loads(response_body.read().decode())

Now we can visualize the test images and their predicted classes. In the following visualization, traffic sign images are sent to the endpoint for prediction, and the label above each image is the prediction returned by the endpoint. As shown in the figure below, the endpoint correctly predicts class 23 (slippery road).

As shown in the figure below, the endpoint also correctly predicts class 25 (road work).

Create a Model Monitor schedule

Next, we demonstrate how to set up a monitoring schedule with Model Monitor. Model Monitor provides a built-in container for creating a baseline that computes constraints and statistics such as mean, quantiles, and standard deviation. You can then launch a monitoring schedule that periodically starts processing jobs to inspect the collected data, compare it against the given constraints, and generate violation reports.

For this use case, we create a custom container that performs a simple model sanity check: it runs an evaluation script that counts the predicted image classes. If the model predicts a particular traffic sign consistently more often than the other classes, or if its confidence scores are consistently low, that indicates a problem.

For example, given an input image, the model returns a list of predicted classes ranked by confidence score. If the top three predictions correspond to three unrelated classes, each with a confidence score below 50% (for example, a stop sign first, a left turn sign second, and a 180 km/h speed limit sign third), none of the three predictions should be trusted.
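The check described above can be sketched in a few lines; the is_suspicious helper and its threshold parameterization are illustrative, not the repo's actual evaluation script:

```python
# Sketch of the sanity check described above (hypothetical helper):
# flag a prediction when none of the top-k classes is confident.
def is_suspicious(probs, threshold=0.5, k=3):
    top_k = sorted(probs, reverse=True)[:k]
    return all(p < threshold for p in top_k)

print(is_suspicious([0.35, 0.30, 0.20, 0.15]))  # True: no confident class
print(is_suspicious([0.80, 0.10, 0.05, 0.05]))  # False: top class is confident
```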

For details on building the custom container and uploading it to Amazon Elastic Container Registry (Amazon ECR), see the notebook. The following code creates a ModelMonitor object in which we specify the location of the Docker image in Amazon ECR and the environment variables needed by the evaluation script. The container’s entry point file is the evaluation script.

monitor = ModelMonitor(
    role=role,
    image_uri='%s.dkr.ecr.us-west-2.amazonaws.com/...' % my_account_id,  # URI of the custom container in Amazon ECR
    instance_count=1,
    instance_type='ml.m5.xlarge')

Next, define a monitoring schedule and attach it to the endpoint. The schedule runs the custom container once per hour. See the following code:

from sagemaker.model_monitor import CronExpressionGenerator, MonitoringOutput
from sagemaker.processing import ProcessingOutput

destination = 's3://' + sagemaker_session.default_bucket() + '/endpoint/monitoring_schedule'
processing_output = ProcessingOutput(output_name='model_outputs', source='/opt/ml/processing/outputs', destination=destination)
output = MonitoringOutput(source=processing_output.source, destination=processing_output.destination)

monitor.create_monitoring_schedule(
    endpoint_input=predictor.endpoint,
    output=output,
    schedule_cron_expression=CronExpressionGenerator.hourly())

As mentioned earlier, the script evaluation.py performs a simple model sanity check by counting the model predictions. Model Monitor saves the model inputs and outputs as JSON Lines files in Amazon S3. These files are downloaded into the processing container under /opt/ml/processing/input. You can then load the predictions via ['captureData']['endpointOutput']['data']. See the following code:

for file in files:
    content = open(file).read()
    for entry in content.split('\n'):
        if entry:
            prediction = json.loads(entry)['captureData']['endpointOutput']['data']

You can track the status of the processing jobs in CloudWatch and SageMaker Studio. In the following screenshot, SageMaker Studio shows that no issues were found.

Capture unexpected model behavior

After defining the schedule, we can monitor the model in near real time. To verify that this setup captures unexpected behavior, we first have to force wrong predictions. To do so, we use the AdvBox toolkit [3], which adds pixel-level perturbations so that the model can no longer recognize the correct class. Such perturbations are known as adversarial attacks, and the changes are usually imperceptible to humans. We converted several test images that were correctly predicted as stop signs. In the following image sets, the left image is the original, the middle is the adversarial image, and the right is the difference between the two. The original and adversarial images look nearly identical, yet the adversarial image is not classified correctly during inference.

The following figure shows another misclassified traffic sign:

When Model Monitor schedules the next processing job, it analyzes the predictions that were captured and stored in Amazon S3. The job counts the predicted image classes; if a single class is predicted more than 50% of the time, there is a problem. Because we sent adversarial images to the endpoint, you can now see an abnormal count for class 14 (stop). You can track the status of the processing job in SageMaker Studio. In the following screenshot, SageMaker Studio shows that the last scheduled job found an issue.

You can get more details from the Amazon CloudWatch logs: the processing job prints a dictionary whose keys are the 43 image classes and whose values are the corresponding counts. For example, in the following output, the endpoint predicted image class 9 (no passing) twice, while class 14 (stop) shows an abnormal count: of 400 predictions in total, the endpoint predicted class 14 322 times, which is above the 50% threshold. The dictionary values are also stored as CloudWatch metrics, so you can chart the metric data in the CloudWatch console.

Warning: Class 14 ('Stop sign') predicted more than 80 % of the time which is above the threshold
Predicted classes {9: 2, 19: 2, 25: 1, 14: 322, 13: 5, 5: 1, 8: 10, 18: 1, 31: 4, 26: 8, 33: 4, 36: 4, 29: 20, 12: 8, 22: 4, 6: 4}
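The job's counting logic can be reproduced from the dictionary above with a few lines of standard-library Python (the exact script in the container may differ):

```python
from collections import Counter

# Recreate the distribution check from the log output above
# (the 50% threshold is the one described in the post).
predicted = Counter({9: 2, 19: 2, 25: 1, 14: 322, 13: 5, 5: 1, 8: 10, 18: 1,
                     31: 4, 26: 8, 33: 4, 36: 4, 29: 20, 12: 8, 22: 4, 6: 4})
total = sum(predicted.values())               # 400 predictions in total
top_class, count = predicted.most_common(1)[0]
if count / total > 0.5:
    print(f"Warning: class {top_class} predicted {100 * count / total:.1f}% of the time")
```

Running this on the captured counts prints a warning for class 14 at 80.5%, matching the log line above.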

Now that the processing job has found a problem, the next step is to understand why. Looking at the earlier test images, there is no significant difference between the original and adversarial images. To better understand what the model actually “sees”, we can use the technique from the paper “Full-Gradient Representation for Neural Network Visualization” [1]. It assigns importance scores to both the input features and the intermediate feature maps. Next, we show how to configure Debugger to easily extract these variables (tensors) without modifying the model itself, and then describe in detail how to use the tensors to compute the saliency map.

Create a Debugger hook configuration

To extract tensors, we need to update the pre-trained model’s Python script from the step of building the Amazon SageMaker PyTorch model. In model_fn we create a Debugger hook configuration that defines a regular expression, include_regex; tensors whose names match this expression are collected. Computing the saliency map requires the biases and gradients of intermediate layers (such as the BatchNorm and downsampling layers) and of the model input. To capture these tensors, specify the following regular expression:

'.*bn|.*bias|.*downsample|.*ResNet_input|.*image'
The tensors are stored in Amazon SageMaker’s default bucket. See the following code:

def model_fn(model_dir):
    # load model
    model = resnet.resnet18()

    # hook configuration
    save_config = smd.SaveConfig(mode_save_configs={
        smd.modes.PREDICT: smd.SaveConfigMode(save_interval=1)
    })
    hook = Hook("s3://" + sagemaker_session.default_bucket() + "/endpoint/tensors",
                save_config=save_config,
                include_regex='.*bn|.*bias|.*downsample|.*ResNet_input|.*image')

    # register hook
    hook.register_module(model)

    # set mode
    hook.set_mode(smd.modes.PREDICT)
    return model

Using the new entry point, create a new PyTorchModel object:

model = PyTorchModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                     role=role,
                     framework_version='1.3.1',
                     entry_point='')  # the modified model script containing the Debugger hook

Update the existing endpoint with the new PyTorchModel object, which uses the modified model script containing the Debugger hook:

predictor = model.deploy(
    instance_type='ml.m5.xlarge',
    initial_instance_count=1)

Now, for every inference request, the endpoint records the tensors and uploads them to Amazon S3. We can use them to compute saliency maps and obtain visual explanations from the model.

Analyze incorrect predictions using Debugger

The output of a classification model is typically an array of probabilities between 0 and 1, where each entry corresponds to a label in the dataset. For example, in the case of MNIST (10 classes), the model might produce the following prediction for an input image of the digit 8: [0.08, 0, 0, 0, 0, 0, 0.12, 0, 0.5, 0.3], meaning the image is predicted to be a 0 with 8% probability, a 6 with 12% probability, an 8 with 50% probability, and a 9 with 30% probability. To generate the saliency map, we take the class with the highest probability (here, 8) and map the score back through the network to identify the neurons that were important for this prediction. A CNN consists of many layers, so an importance score is computed for each intermediate value to reflect how it contributed to the current prediction.
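Selecting the starting class for the saliency map is a single argmax over that probability vector:

```python
import numpy as np

# The probability vector from the MNIST example above (10 classes)
probs = np.array([0.08, 0, 0, 0, 0, 0, 0.12, 0, 0.5, 0.3])
predicted_class = int(np.argmax(probs))  # class whose score is mapped back through the network
print(predicted_class)  # 8
```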

You can determine importance scores by computing the gradients of the model output with respect to the input: the gradient indicates how much the output changes when the input changes. To record them, register a backward hook on the layer outputs and trigger a backward call during inference. We have configured the Debugger hook to capture the relevant tensors.
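To build intuition for what such a gradient measures, here is a framework-free finite-difference sketch; the score function is a made-up stand-in for the logit of the predicted class, not part of the actual model:

```python
# Finite-difference sketch of "gradient of the output w.r.t. the input":
# how much the score changes when each input value is nudged slightly.
def score(x):
    # made-up stand-in for the logit of the predicted class
    return 3.0 * x[0] + 0.5 * x[1]

def input_gradient(f, x, eps=1e-6):
    grads = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps                      # nudge one input dimension
        grads.append((f(bumped) - f(x)) / eps)
    return grads

print([round(g, 3) for g in input_gradient(score, [1.0, 2.0])])  # [3.0, 0.5]
```

Inputs with larger gradient magnitude influence the score more, which is exactly what the saliency map visualizes per pixel.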

After updating the endpoint and running some inference requests, you can create a trial object to access, query, and filter the data saved by Debugger. See the following code:

from smdebug.trials import create_trial
trial = create_trial('s3://' + sagemaker_session.default_bucket() + '/endpoint/tensors')

Using Debugger, you can access the data via trial.tensor().value(). For example, to get the bias tensor of the first BatchNorm layer for the first inference request, enter the following code:

trial.tensor('ResNet_bn1.bias').value(step_num=0, mode=modes.PREDICT)

The function trial.steps(mode=modes.PREDICT) returns the list of available steps, where each step corresponds to one recorded inference request.

In the following steps, we compute the saliency map using the FullGrad method, which aggregates the input gradients and the feature-level bias gradients.

Calculate the implicit bias

In the FullGrad method, the BatchNorm layers of ResNet18 introduce implicit biases. You can compute the implicit bias from the running mean, running variance, and layer weights. See the following code:

weight = trial.tensor(weight_name).value(step_num=step, mode=modes.PREDICT)
running_var = trial.tensor(running_var_name).value(step_num=step, mode=modes.PREDICT)
running_mean = trial.tensor(running_mean_name).value(step_num=step, mode=modes.PREDICT)
implicit_bias = - running_mean / np.sqrt(running_var) * weight
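To see where this formula comes from: in evaluation mode, BN(x) = weight * (x - mean) / sqrt(var + eps) + bias, so at x = 0 the output equals the explicit bias plus exactly this implicit term. A quick check with made-up numbers (epsilon ignored for brevity):

```python
import numpy as np

# Toy verification (assumed values, not real layer statistics)
weight = np.array([2.0, 0.5])
bias = np.array([1.0, -1.0])
running_mean = np.array([0.4, -0.2])
running_var = np.array([4.0, 0.25])

implicit_bias = -running_mean / np.sqrt(running_var) * weight
bn_at_zero = weight * (0.0 - running_mean) / np.sqrt(running_var) + bias

# BN(0) equals explicit bias + implicit bias
print(np.allclose(bn_at_zero, bias + implicit_bias))  # True
```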

Multiply bias and gradient

The bias is the sum of the explicit and implicit biases. You can retrieve the output gradients of the feature maps and compute the product of bias and gradient. See the following code:

gradient = trial.tensor(gradient_name).value(step_num=step, mode=modes.PREDICT)
bias = trial.tensor(bias_name).value(step_num=step, mode=modes.PREDICT)
bias = bias + implicit_bias
bias_gradient = normalize(np.abs(bias * gradient))

Interpolation and aggregation

The dimensions of intermediate layers usually differ from those of the input image, so we need to interpolate them. We interpolate every bias gradient and sum up the results; the sum is the saliency map that we overlay as a heat map on the original input image. See the following code:

saliency_map = np.zeros((image_size, image_size))
for channel in range(bias_gradient.shape[1]):
    interpolated = scipy.ndimage.zoom(bias_gradient[0, channel, :, :],
                                      image_size / bias_gradient.shape[2], order=1)
    saliency_map += interpolated


In this section, we show some examples of adversarial images that the model classified as stop signs. The image on the right shows the model input overlaid with the saliency map. Red marks the parts that most influenced the model’s prediction and likely indicates where the pixel perturbations are. Under the influence of the perturbation, the model no longer focuses on the relevant object features, and the confidence scores of most predictions are very low.



For comparison, we also ran inference on the original (non-adversarial) images. In the following image sets, the left image is the adversarial image with the saliency map for the class “stop”. The right image is the original input (non-adversarial) with the saliency map for the correctly predicted class. On the non-adversarial images, the model clearly focuses only on the relevant object features and is therefore more likely to predict the correct class. On the adversarial images, the model takes many other features into account besides the relevant object, which is evidently caused by the random pixel perturbations.


This post showed how to use Amazon SageMaker Model Monitor and Amazon SageMaker Debugger to automatically detect unexpected model behavior and obtain visual explanations from a CNN. For more details, see the GitHub repo.


[1] Suraj Srinivas, François Fleuret, “Full-Gradient Representation for Neural Network Visualization,” Advances in Neural Information Processing Systems (NeurIPS), 2019.

[2] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, Christian Igel, “The German Traffic Sign Recognition Benchmark: A Multi-class Classification Competition,” International Joint Conference on Neural Networks (IJCNN), 2011.

[3] Dou Goodman, Hao Xin, Wang Yang, Wu Yuesheng, Xiong Junfeng, Zhang Huan, “Advbox: A Toolbox to Generate Adversarial Examples that Fool Neural Networks.”
