Curious about how Groundlight’s computer vision software can detect anomalies in industrial manufacturing? Below we describe an approach based on future frame prediction, a deep learning and computer vision technique for learning patterns in video.
What Is Anomaly Detection in Industrial Manufacturing?
Can we detect industrial process anomalies the moment they occur? Anomaly detection is the task of identifying when something unusual is happening. It is crucial in areas like manufacturing automation, where detecting process failures is essential to preventing human injuries and damage to equipment.
How Groundlight Uses Computer Vision to Monitor Manufacturing Processes
At Groundlight, we've developed a prototype computer vision system designed to monitor ongoing manufacturing processes through a live video feed and raise alerts if they go awry. Of course, what it means to go awry is context-dependent, since every manufacturing process is different. This open-ended nature makes anomaly detection extremely challenging.
To make the problem tractable, we let our system observe demonstration video of the process operating correctly, which illustrates how events should unfold under normal conditions. The demonstration footage is used to train a deep learning model to predict the next frame of video given the current state of the video stream. The system can then detect anomalies by comparing each frame to its prediction. An anomaly is therefore defined as a deviation from predictions. The beauty of the method lies in its simplicity: it only requires demonstrations of the process running correctly. It can also leverage recent advances in generative AI, where models have improved dramatically at producing realistic images and video [1].
Experiments on video of a repetitive machine tending task show our prototype system can flag anomalies in the manufacturing process after observing just 30 minutes of nominal training footage while raising no false alarms.
An Example of How Groundlight’s Computer Vision Software Detects Anomalies
The example video below shows how our system's anomaly score spikes when a human operator threads the same part multiple times instead of just once, as was always the case in the training footage. The video also renders the system's predictions about what is likely to happen in the current frame given what it was observing a few moments ago.
The left-hand panel of the video shows the actual frame, the middle panel plots the anomaly score, and the right-hand panel renders the model's prediction about how the frame should look under normal circumstances:
Anomaly scores and predicted video for the "double cycle" anomaly. Left panel: actual frame, middle panel: plotted anomaly score, right panel: model's prediction about how the frame should look under normal circumstances
How Anomaly Detection by Future Frame Prediction Works
Following Liu et al. (2018) [3], we reduce the problem of anomaly detection in video to the task of future frame prediction, that of predicting the next frame in video based on the frames leading up to it. With an accurate frame prediction model, our system can spot anomalies by measuring the extent to which the actual frame differs from the predicted frame. When the prediction error is large, the system has reason to believe something anomalous is afoot.
Future frame prediction is a well-studied problem in deep learning and computer vision [2, 3, 4]. Here, a video is treated as a sequence of frames, (x_1, …, x_T), where each frame x_t in the sequence is an image containing H×W pixels with three (RGB) color values that we scale to lie within the range 0 to 1.
The goal in future frame prediction is to learn a model, f_θ, that accurately predicts the next frame based on the ones that preceded it. Training the model works by finding parameters θ* that give a small total prediction error between the actual and predicted frames over the training video.
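In the notation above, this training objective can be sketched as follows. The exact form is an assumption: the sum is taken to start at t = 2 so that at least one earlier frame is available, and any normalization is omitted.

```latex
\theta^{*} = \arg\min_{\theta} \sum_{t=2}^{T} \mathrm{error}\big( x_t,\; f_{\theta}(x_1, \ldots, x_{t-1}) \big)
```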
Here, the output of the model at each time step t is a predicted image, x̂_t = f_θ(x_1, …, x_{t-1}), with the same size and shape as the original image, x_t. The error function is a measure of discrepancy between the original and predicted frame, and can be something simple like the sum of squared errors over all H×W pixels and their C = 3 color channels (RGB).
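For concreteness, a sum-of-squared-errors function of this kind can be written as below, indexing pixels by row h, column w, and color channel c. The channel index c is a name chosen here for illustration.

```latex
\mathrm{error}(x_t, \hat{x}_t) = \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \big( x_t(h, w, c) - \hat{x}_t(h, w, c) \big)^{2}
```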
However, as discussed below, this simple sum of squared errors is not appropriate for anomaly detection in video, since different regions of the frame exhibit different levels of variation. To handle these different scales of variation, our frame prediction model also learns the scale of the typical variation. That is, it learns to predict both a mean and a variance for every pixel location at every point in time. The prediction errors are then baselined against the predicted level of variation in order to generate the anomaly score.
U-Net Architecture for Future Frame Prediction
Like the authors of [3, 4], we base our frame prediction model on the U-Net architecture [5]. U-Nets have an encoder-decoder bottleneck architecture in which the encoder's hidden layers become progressively narrower while the subsequent decoder layers become progressively wider. What distinguishes U-Nets from other encoder-decoder architectures is that they have ResNet-style skip connections from the encoder layers to corresponding decoder layers. These skip connections provide a further way to model spatial dependencies in the unfolding scene, letting each pixel's color in the predicted future frame be a direct function of its color in the preceding frames.
Markov Assumption Used in Prediction
In practice, instead of using the full video history to predict the next frame of video, we make a Markov assumption: that the contents of the next frame depend only on the k most recent frames rather than the full video history. Therefore, we feed just the k most recent frames as inputs to the model when predicting the next frame, x̂_t = f_θ(x_{t-k}, …, x_{t-1}). The number of input channels to the U-Net is therefore C_in = 3k: three color channels for each of the k input frames.
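As a minimal sketch of how such an input could be assembled, the NumPy snippet below stacks the k frames preceding time t into a single array with 3k channels. The function name `stack_context` and the channels-last layout are assumptions made for illustration, not details of the actual system.

```python
import numpy as np

def stack_context(frames: np.ndarray, t: int, k: int) -> np.ndarray:
    """Return the k frames preceding index t as one (H, W, 3k) array.

    frames has shape (T, H, W, 3) with values scaled to [0, 1].
    """
    if t < k:
        raise ValueError("need at least k earlier frames to build the context")
    context = frames[t - k:t]                       # (k, H, W, 3): x_{t-k}, ..., x_{t-1}
    # Move the time axis next to the color axis, then merge the two.
    context = np.transpose(context, (1, 2, 0, 3))   # (H, W, k, 3)
    h, w = context.shape[:2]
    return context.reshape(h, w, 3 * k)             # oldest frame's RGB comes first

# Toy video: 10 frames of 4x5 pixels.
video = np.random.default_rng(0).random((10, 4, 5, 3))
model_input = stack_context(video, t=7, k=7)
print(model_input.shape)  # (4, 5, 21)
```

With k = 7 the stacked input has 21 channels, matching C_in = 3k.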
Mean-Variance Estimation and Training
In addition to predicting a point estimate of each pixel's RGB color, our model predicts a variance value to express its degree of uncertainty about the color at each image location. This type of regression is called mean-variance estimation [6, 7], and it allows us to model the fact that different regions of the scene vary by differing amounts. The number of output channels is therefore C_out = 4: three predicted outputs for the RGB color at each pixel and one more for the predicted variance.
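The snippet below sketches one way a raw four-channel output could be split into a color mean and a per-pixel variance. The sigmoid and softplus activations, the small variance floor, and the name `split_outputs` are all assumptions for illustration; the source does not say how the outputs are constrained.

```python
import numpy as np

def split_outputs(raw: np.ndarray, min_var: float = 1e-6):
    """Split a raw (H, W, 4) network output into mean colors and variances.

    Returns (mean, var) with mean of shape (H, W, 3) in [0, 1] and
    var of shape (H, W), strictly positive.
    """
    # Assumed choice: sigmoid keeps predicted colors in the [0, 1] pixel range.
    mean = 1.0 / (1.0 + np.exp(-raw[..., :3]))
    # Assumed choice: softplus, log(1 + exp(z)), keeps the variance positive.
    var = np.logaddexp(0.0, raw[..., 3]) + min_var
    return mean, var

raw_output = np.random.default_rng(1).normal(size=(4, 5, 4))
mean, var = split_outputs(raw_output)
print(mean.shape, var.shape)  # (4, 5, 3) (4, 5)
```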
Formally, the model predicts a pair (x̂_t, σ_t²) = f_θ(x_{t-k}, …, x_{t-1}), where x̂_t contains the predicted RGB colors at each pixel and σ_t² is an H×W array specifying the variance of each pixel. For training, we use a modified error function that baselines the prediction error against the predicted variance at each image location.
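A common loss for mean-variance estimation is the Gaussian negative log-likelihood, shown below up to constant factors as an illustration. Whether the system uses exactly this form, including the log-variance term, is an assumption rather than something stated here.

```latex
\mathrm{error}\big( x_t, \hat{x}_t, \sigma_t^{2} \big) = \sum_{h=1}^{H} \sum_{w=1}^{W} \left[ \sum_{c=1}^{C} \frac{\big( x_t(h, w, c) - \hat{x}_t(h, w, c) \big)^{2}}{\sigma_t^{2}(h, w)} + C \log \sigma_t^{2}(h, w) \right]
```

In a loss of this kind, the log term penalizes large variances, so the model cannot shrink the first term simply by predicting high uncertainty everywhere.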
Experiments on Machine Tending Tasks
To test our ability to detect anomalies, we trained the above model on 30 minutes of footage of a human operator performing a repetitive threading task on aluminum blocks. The process cycles roughly once every 6 seconds, so overall the model observes around 300 manual demonstrations of the task. The test data consists of 5 video clips totalling 246 seconds in length, three of which contain process anomalies and two of which do not. All video is downsampled to 3 frames per second, and frames are scaled down to 350x300 pixels. The number of context frames used to predict the next frame is k = 7, corresponding to just over 2 seconds of predictive context.
We trained the U-Net future frame prediction model for 6 epochs using the AdamW optimizer on a single NVIDIA GeForce RTX 3090 GPU. Training took 27 minutes, and this could be reduced with additional GPUs.
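The figures quoted in this setup follow from simple arithmetic, which the short script below checks. The variable names are illustrative, and the frame counts are derived here rather than reported in the experiments.

```python
train_minutes = 30
cycle_seconds = 6
fps = 3             # frame rate after downsampling
k = 7               # context frames
test_seconds = 246

demonstrations = train_minutes * 60 // cycle_seconds   # cycles seen in training
train_frames = train_minutes * 60 * fps                # training frames at 3 fps
context_seconds = k / fps                              # length of the context window
input_channels = 3 * k                                 # RGB channels per context frame
test_frames = test_seconds * fps                       # test frames at 3 fps

print(demonstrations)             # 300
print(train_frames)               # 5400
print(round(context_seconds, 2))  # 2.33
print(input_channels)             # 21
print(test_frames)                # 738
```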
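To generate the anomaly score at each time step, we compute the root mean normalized squared error for each frame: the squared prediction error at each pixel is divided by its predicted variance, averaged over the frame, and then square-rooted.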
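A form consistent with the name "root mean normalized squared error" and with the notation used earlier is given below. Averaging over all H×W×C values is an assumption.

```latex
\mathrm{score}_t = \sqrt{ \frac{1}{H W C} \sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} \frac{\big( x_t(h, w, c) - \hat{x}_t(h, w, c) \big)^{2}}{\sigma_t^{2}(h, w)} }
```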
Since the variance in the denominator is also learned, the anomaly score typically has a value that's close to 1.0 when the process is running normally. We can therefore flag an anomaly as a period in which the anomaly score is above some threshold substantially larger than 1.0. After examining the anomaly scores on a few minutes of held-out training video, we set the anomaly threshold to 1.25. This was sufficient to detect all 3 of the process anomalies in the roughly 4 minutes of test video without raising any false alarms.
Here is the system detecting an anomaly in which the human operator has failed to place an aluminum block underneath the threading head as it's being lowered.
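The scoring and thresholding steps can be sketched in NumPy as follows. This is an illustration on synthetic data, not the production code: the function names, the `eps` stabilizer, and the toy noise levels are assumptions, while the 1.25 threshold is the value quoted above.

```python
import numpy as np

def anomaly_score(frame, pred_mean, pred_var, eps=1e-6):
    """Root mean normalized squared error for one frame.

    frame, pred_mean: (H, W, 3) arrays; pred_var: (H, W) array of variances.
    """
    sq_err = (frame - pred_mean) ** 2
    # Broadcast the per-pixel variance across the three color channels.
    normalized = sq_err / (pred_var[..., None] + eps)
    return float(np.sqrt(normalized.mean()))

def flag_anomalies(scores, threshold=1.25):
    """Return True for every time step whose score exceeds the threshold."""
    return [s > threshold for s in scores]

rng = np.random.default_rng(0)
pred_mean = rng.random((64, 64, 3))
pred_var = np.full((64, 64), 0.01)   # predicted std of 0.1 at every pixel

# Nominal frame: deviations match the predicted std, so the score is near 1.
nominal = pred_mean + rng.normal(0.0, 0.1, size=pred_mean.shape)
# Anomalous frame: deviations are three times larger than predicted.
anomalous = pred_mean + rng.normal(0.0, 0.3, size=pred_mean.shape)

scores = [anomaly_score(f, pred_mean, pred_var) for f in (nominal, anomalous)]
flags = flag_anomalies(scores)
print(flags)  # [False, True]
```

Because the error is divided by the predicted variance, a frame whose deviations match the model's own uncertainty scores close to 1.0 regardless of how noisy that region of the scene is.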
Again, the left-hand panel shows the current frame that's being predicted, the middle panel shows the anomaly score, and the right-hand panel shows the frame predicted by the model:
Anomaly scores and predicted video for the "missing block" anomaly. Left panel: current frame being predicted, middle panel: anomaly score, right panel: frame predicted by the model
Visualizing the Predicted Variance
Since mean-variance estimation allows the model to predict the variances for each pixel, σ_t²(h, w), we can visualize which regions of the frame the model is most uncertain about. Below, we render the predicted variances for the same "missing block" anomaly video used above. White pixels correspond to areas where the model is unsure about which color to predict, while dark areas correspond to regions of certainty.
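One simple way to produce such a rendering is to rescale each variance map to 8-bit grayscale, as sketched below. The per-frame min-max scaling and the name `variance_to_grayscale` are assumptions; the actual visualization may use a different mapping.

```python
import numpy as np

def variance_to_grayscale(pred_var: np.ndarray) -> np.ndarray:
    """Map an (H, W) variance array to uint8 grayscale (brighter = more uncertain)."""
    lo, hi = float(pred_var.min()), float(pred_var.max())
    if hi == lo:
        # A constant map carries no contrast; render it as all dark.
        return np.zeros(pred_var.shape, dtype=np.uint8)
    scaled = (pred_var - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)

toy_var = np.array([[0.001, 0.010],
                    [0.050, 0.100]])
image = variance_to_grayscale(toy_var)
print(image.dtype, image[0, 0], image[1, 1])  # uint8 0 255
```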
Anomaly scores and predicted uncertainties for the "missing block" anomaly
Sensibly, the model seems most uncertain in areas of the frame that are changing frequently, such as the region where the hand is moving in and out of frame, and also the region where the threading head is moving up and down. Moreover, the model is highly uncertain at the end of the video, when the threading head is moving but the hand has oddly not reappeared to thread the next block.
Conclusion
Making industrial manufacturing robust to process anomalies is critical to ensuring its safety and cost-effectiveness. Here we’ve reduced the problem of spotting anomalies in industrial videos to future frame prediction and built a deep learning system based on this approach. Our system combines recent advances in generative AI with mean-variance estimation from statistical deep learning to produce a robust yet sensitive anomaly score. It is able to detect anomalies in a repetitive manufacturing process after training on 30 minutes of video filmed under normal operation.
Our ongoing work includes more testing in industrial settings and extending the frame prediction model with video vision transformers so it can leverage a longer context window in prediction [8].
--
References
[1] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE conference on computer vision and pattern recognition. 2022.
[2] Mathieu, Michael, Camille Couprie, and Yann LeCun. "Deep multi-scale video prediction beyond mean square error." International Conference on Learning Representations. 2016.
[3] Liu, Wen, et al. "Future frame prediction for anomaly detection–a new baseline." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[4] Carreira, João, et al. "Learning from One Continuous Video Stream." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation." Medical image computing and computer-assisted intervention. 2015.
[6] Kendall, Alex, and Yarin Gal. "What uncertainties do we need in Bayesian deep learning for computer vision?" Advances in neural information processing systems. 2017.
[7] Sluijterman, Laurens, Eric Cator, and Tom Heskes. "Optimal training of mean variance estimation neural networks." Neurocomputing (2024).
[8] Arnab, Anurag, et al. "ViViT: A video vision transformer." Proceedings of the IEEE/CVF international conference on computer vision. 2021.