Text: Tingbo Hou & Tyler Mullen, Software Engineers, Google Research
Technical reviewer: Wang Xingwei, front-end expert at Betta
Video conferencing is becoming more and more important in people's work and lives. We can improve the video experience by enhancing privacy or adding fun visual effects, while helping people stay focused on the content of the meeting. Our recently announced methods for blurring and replacing the background in Google Meet are a small step toward this goal.
We use machine learning (ML) to better highlight participants and de-emphasize their surroundings. While other solutions require installing additional software, Meet's features are powered by cutting-edge web ML technologies built with MediaPipe that work directly in your browser; no extra steps are required.
A key goal in developing these features was to provide real-time, in-browser performance on almost all modern devices. We achieved this by combining efficient on-device ML models, WebGL-based effect rendering, and web-based ML inference via XNNPACK and TFLite.
Background blur and background replacement, powered by MediaPipe on the web.
Overview of Our Web ML Solution
The new features in Meet are developed with MediaPipe, Google's open source framework for cross-platform, customizable ML solutions for live and streaming media, which also powers real-time on-device ML solutions such as hand, iris, and body pose tracking.
First, our solution separates the user from the background (more about our segmentation model later), computing a low-resolution mask with ML inference on each video frame. Optionally, we further refine the mask to align it with the image boundaries. The mask is then used to render the video output via WebGL2, with the background blurred or replaced.
WebML pipeline: all compute-heavy operations are implemented in C++/OpenGL and run in the browser via WebAssembly.
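The per-frame flow described above can be sketched with a runnable toy version. Grayscale frames are represented as lists of lists, and a brightness threshold stands in for the real TFLite segmentation model; the function names and logic are illustrative, not the actual MediaPipe API.

```python
# Toy sketch of the per-frame pipeline: downsample, segment at low
# resolution, upsample the mask, then composite over a new background.
# The threshold "model" is a stand-in for real ML inference.

def downsample(frame, factor=2):
    """Average-pool the frame by `factor` in each dimension."""
    h, w = len(frame), len(frame[0])
    return [
        [
            sum(frame[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]

def toy_segment(small, threshold=0.5):
    """Stand-in for ML inference: bright pixels count as foreground."""
    return [[1.0 if v > threshold else 0.0 for v in row] for row in small]

def upsample(mask, factor=2):
    """Nearest-neighbor upscale of the low-resolution mask."""
    return [
        [mask[y // factor][x // factor] for x in range(len(mask[0]) * factor)]
        for y in range(len(mask) * factor)
    ]

def process_frame(frame, background):
    """Segment, then composite the frame over a replacement background."""
    mask = upsample(toy_segment(downsample(frame)))
    return [
        [m * f + (1 - m) * b for m, f, b in zip(mr, fr, br)]
        for mr, fr, br in zip(mask, frame, background)
    ]
```

The key structural point is that inference happens at a reduced resolution, while compositing happens at the original resolution, mirroring the real pipeline's division of labor between the ML model and the WebGL2 shaders.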
In the current version, model inference is executed on the client's CPU for low power consumption and the widest device coverage. To achieve real-time performance, we designed efficient ML models whose inference is accelerated by the XNNPACK library, the first inference engine specifically designed for the new WebAssembly SIMD specification. Accelerated by XNNPACK and SIMD, the segmentation model runs at real-time speeds on the web.
Enabled by MediaPipe's flexible configuration, the background blur/replacement solution adapts its processing to the device's capability. On high-end devices it runs the full pipeline to deliver the best visual quality, whereas on low-end devices it maintains high performance by switching to lightweight ML models and bypassing mask refinement.
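This capability-based switching can be illustrated with a small selector. The tier heuristics, thresholds, and the two input sizes used here are hypothetical stand-ins, not Meet's actual detection logic.

```python
# Illustrative device-tier selection: high-end devices get the larger
# model input and mask refinement; low-end devices get the lightweight
# path. Thresholds here are invented for the example.

def select_pipeline(hardware_concurrency, supports_simd):
    """Pick model input size and refinement based on device capability."""
    high_end = supports_simd and hardware_concurrency >= 4
    return {
        "model_input": (256, 144) if high_end else (160, 96),
        "refine_mask": high_end,
    }
```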
Segmentation Model
An on-device machine learning model must be ultra-lightweight to achieve fast inference, low power consumption, and a small download size. For models running in the browser, the input resolution greatly affects the number of floating-point operations (FLOPs) needed to process each frame, and therefore it must also be small. We downsample the image to a smaller size before feeding it to the model. Recovering a segmentation mask as fine as possible from a low-resolution image adds to the challenge of the model design.
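A back-of-envelope calculation shows why the input resolution matters so much. Per-frame FLOPs for a convolutional model scale roughly linearly with the number of input pixels (an illustrative approximation; exact FLOPs depend on the architecture), so running the model at 256×144 instead of the full 1280×720 frame cuts compute by a factor of 25.

```python
# Rough compute-saving ratio from downsampling: FLOPs for a conv net
# scale approximately with pixel count.

def pixel_ratio(full, small):
    """Ratio of pixel counts between full-frame and model input."""
    return (full[0] * full[1]) / (small[0] * small[1])
```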
The overall segmentation network has a symmetric structure with respect to encoding and decoding: the decoder blocks (light green) share a symmetric layer structure with the encoder blocks (light blue). Specifically, channel-wise attention with global average pooling is applied in both the encoder and decoder blocks, which is friendly to efficient CPU inference.
Model architecture with a MobileNetV3 encoder (light blue) and a symmetric decoder (light green).
We modified MobileNetV3-small as the encoder, which has been tuned by network architecture search for the best performance with low resource requirements. To reduce the model size by 50%, we exported the model to TFLite using float16 quantization, resulting in a slight loss of weight precision but with no noticeable effect on quality. The resulting model has 193K parameters and is only 400KB in size.
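A quick sanity check confirms these numbers: 193K float32 weights would occupy roughly 772 KB, and storing them as float16 halves that to roughly 386 KB, consistent with the reported ~400 KB model size once graph structure and metadata overhead are included.

```python
# Weight storage for 193K parameters at float32 vs. float16 precision.

def model_weight_kb(n_params, bytes_per_weight):
    """Approximate weight storage in kilobytes (1 KB = 1000 bytes)."""
    return n_params * bytes_per_weight / 1000

fp32_kb = model_weight_kb(193_000, 4)  # float32: 772.0 KB
fp16_kb = model_weight_kb(193_000, 2)  # float16: 386.0 KB
```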
After segmentation, we use OpenGL shaders for video processing and effect rendering. The challenge is to render efficiently without introducing artifacts. In the refinement stage, we apply a joint bilateral filter to smooth the low-resolution mask.
Rendering effects with artifacts reduced. Left: the joint bilateral filter smooths the segmentation mask. Middle: separable filters remove halo artifacts in background blur. Right: light wrapping in background replacement.
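The refinement idea can be shown with a simplified one-dimensional joint bilateral filter: the low-resolution mask is smoothed with weights driven by the *guide* (the color frame), so mask values stop averaging across image edges and the mask snaps to object boundaries. The real implementation works in 2-D on the GPU; the parameters below are illustrative.

```python
import math

# Simplified 1-D joint bilateral filter: spatial Gaussian weights are
# multiplied by range weights computed from the guide image, so the
# filter smooths the mask without blurring across guide-image edges.

def joint_bilateral_1d(mask, guide, radius=2, sigma_s=1.0, sigma_r=0.1):
    out = []
    for i in range(len(mask)):
        num = den = 0.0
        for j in range(max(0, i - radius), min(len(mask), i + radius + 1)):
            w_s = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2))
            w_r = math.exp(-((guide[i] - guide[j]) ** 2) / (2 * sigma_r ** 2))
            num += w_s * w_r * mask[j]
            den += w_s * w_r
        out.append(num / den)
    return out
```

With a noisy mask and a guide containing a sharp edge, the output stays high on one side of the edge and low on the other instead of bleeding across it.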
The blur shader simulates a bokeh effect by adjusting the blur strength at each pixel in proportion to the segmentation mask values, similar to the circle of confusion (CoC) in optics. Pixels are weighted by their CoC radii, so that foreground pixels do not bleed into the background. We implemented separable filters for the weighted blur instead of the popular Gaussian pyramid, as this removes halo artifacts surrounding the person. For efficiency, the blur is performed at a low resolution and blended with the input frame at the original resolution.
Background blur examples.
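One horizontal pass of such a separable weighted blur can be sketched in one dimension. Each pixel's contribution is scaled by a blur weight (derived from the mask/CoC radius in the real shader), so sharp foreground pixels simply drop out of the weighted average and cannot bleed into the blurred background; this is a sketch of the idea, not Meet's actual shader code.

```python
# One 1-D pass of a separable weighted blur: pixels with zero weight
# (sharp foreground) contribute nothing to their neighbors' averages.
# A full implementation runs a horizontal and a vertical pass on the GPU.

def weighted_blur_1d(values, weights, radius=1):
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for j in range(max(0, i - radius), min(len(values), i + radius + 1)):
            num += weights[j] * values[j]
            den += weights[j]
        out.append(num / den if den > 0 else values[i])
    return out
```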
For background replacement, we adopt a compositing technique known as light wrapping to blend the segmented person with a customized background image. Light wrapping allows background light to spill over onto the foreground elements, making the compositing more immersive and helping to soften the segmentation edges. It also helps minimize halo artifacts when there is a large contrast between the foreground and the replaced background.
Background replacement examples.
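A toy single-pixel version illustrates the idea: the standard "over" composite, fg·α + bg·(1−α), is augmented near the matte edge by spilling a fraction of the background light onto the foreground. The spill strength and the edge-detection heuristic below are illustrative choices, not the actual Meet shader.

```python
# Toy light-wrapping composite for one grayscale pixel. `alpha` is the
# segmentation mask value; the edge factor peaks where alpha is
# mid-range, i.e. at the segmentation boundary.

def light_wrap(fg, bg, alpha, wrap_strength=0.4):
    edge = 4 * alpha * (1 - alpha)  # 0 in solid regions, 1 at alpha=0.5
    # Spill some background light onto the foreground near the edge.
    wrapped_fg = (1 - wrap_strength * edge) * fg + wrap_strength * edge * bg
    return alpha * wrapped_fg + (1 - alpha) * bg
```

Solid foreground and solid background pixels are unchanged; at the matte edge the result is pulled toward the background color, which is what softens the boundary.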
To optimize the experience across devices, we provide models with multiple input sizes (i.e., 256×144 and 160×96 in the current release), automatically selecting the best one according to the available hardware resources.
We evaluated model inference speed and the end-to-end pipeline on two common devices: a MacBook Pro 2018 with a 2.2 GHz 6-core Intel Core i7, and an Acer Chromebook 11 with an Intel Celeron N3060. For 720p input, the MacBook Pro runs the higher-quality model at 120 FPS and the end-to-end pipeline at 70 FPS; the Chromebook runs inference at 62 FPS using the lower-quality model, and the end-to-end pipeline at 33 FPS.
Model inference speed and end-to-end pipeline performance on high-end (MacBook Pro) and low-end (Chromebook) laptops.
To quantitatively evaluate model accuracy, we adopted the popular metrics of intersection-over-union (IoU) and boundary F-measure. Both models achieve good performance, especially considering how lightweight the networks are.
Model accuracy, evaluated by IoU and boundary F-score.
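Of the two metrics, IoU is straightforward to state precisely: the overlap between predicted and ground-truth masks divided by their union. A minimal implementation on flat binary masks:

```python
# Intersection-over-union for two binary masks, flattened to 0/1 lists.

def iou(pred, gt):
    """IoU = |pred AND gt| / |pred OR gt|; empty masks count as a match."""
    inter = sum(p and g for p, g in zip(pred, gt))
    union = sum(p or g for p, g in zip(pred, gt))
    return inter / union if union else 1.0
```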
We also released an accompanying model card for our segmentation model, which details our fairness evaluations. Our evaluation data contain images from 17 geographical subregions of the globe, with annotations for skin tone and gender. Our analysis shows that the model performs consistently across regions, skin tones, and genders, with only small deviations in IoU.
We have launched a new in-browser ML solution for blurring and replacing your background in Google Meet, in which ML models and OpenGL shaders run efficiently on the web. The features achieve real-time performance with low power consumption, even on low-power devices.
Acknowledgments: Special thanks to the members of the Meet team and others who worked on this project, especially Sebastian Jansson, Rikard Lundmark, Stephan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarson, Stéphane Hulaud, and all our team members who worked on the technology with us: Siargey Pisarchyk, Karthik Raveendran, Chris McClanahan, Marat Dukhan, Frank Barchard, Ming Guang Yong, Chuo-Ling Chang, Michael Hays, Camillo Lugaresi, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich, and Matthias Grundmann.