Technical editor: Mango, editorial department
Reported by SegmentFault (official account: SegmentFault)
Imagine watching a horror movie: the heroine walks warily through a dark basement, classic suspense music plays in the background, and some unseen, sinister creature wriggles in the shadows. Bang! It collides with an object.
Most of a film's sound effects are added in post-production, where an editor must match each sound perfectly to the picture.
Recently, researchers created an automated program called AutoFoley, which analyzes the motion in video frames and generates its own artificial sound effects to match the scene. In a survey, most respondents believed the synthesized sound effects were real.
The model is described in a study published in IEEE Transactions on Multimedia.
Using an AI model to automatically add sound to 1,000 short clips
AutoFoley's co-creators, Professor Jeff Prevost of the University of Texas at San Antonio and his Ph.D. student Sanchita Ghose, used AutoFoley to create sounds for 1,000 short clips capturing many common actions, such as rain falling, horses galloping, and clocks ticking.
Usually, these sound effects are recorded later by Foley artists in a studio, who strike and rub a wide range of objects to produce sounds. Recording the sound of breaking glass, for example, may require shattering glass repeatedly in the studio until the sound matches the video clip.
"Since the 1930s, using Foley art to add sound effects in post-production has been a complex part of film and television soundtracks, and without the controllable layers of realistic Foley tracks, a movie would seem empty and distant," says Jeff Prevost. "However, the Foley sound synthesis process also increases the creation time and cost of motion pictures."
Interested in the idea of an automated Foley system, Jeff Prevost and Sanchita Ghose set out to create a multi-layered machine learning program. They built two different models to recognize the actions in a video and determine the appropriate sound.
The first model extracts motion features from individual video frames to identify the action taking place and determine an appropriate sound effect.
The second model analyzes the temporal relationships of objects across frames. By using relational reasoning to compare different frames over time, it can predict what is happening in the video.
In the final step, the sound is synthesized to match the activity or motion predicted by one of the models.
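The pipeline above can be illustrated with a toy sketch of the relational-reasoning idea in the second model: score pairs of frames taken at different times, then aggregate the pair scores to gauge how much is changing. All names below are hypothetical, and the scoring function is a deliberately simple stand-in for the paper's actual deep network.

```python
# Toy sketch of temporal relational reasoning over frame features.
# Not the authors' actual model; a minimal illustration of the idea
# of comparing frame pairs across time.

from itertools import combinations

def pair_relation(feat_a, feat_b):
    """Toy 'relation' between two frame feature vectors:
    here, just the summed absolute change between them."""
    return sum(abs(a - b) for a, b in zip(feat_a, feat_b))

def temporal_relation_score(frames):
    """Aggregate relation scores over all frame pairs.
    A larger score suggests more motion across time."""
    return sum(pair_relation(frames[i], frames[j])
               for i, j in combinations(range(len(frames)), 2))

# Two toy clips: a static scene vs. one with changing features.
static_clip = [[0.5, 0.5]] * 4
moving_clip = [[0.0, 0.0], [0.3, 0.1], [0.6, 0.2], [0.9, 0.3]]

print(temporal_relation_score(static_clip))   # 0.0 (nothing changes)
print(temporal_relation_score(moving_clip))   # larger: features change over time
```

In the real system, such a score would feed a classifier over sound categories, and a synthesis stage would then generate audio aligned to the predicted activity.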
AutoFoley fooled 73% of respondents
AutoFoley works best for sounds that do not require precise temporal alignment with the video (e.g., rain, a crackling fire). However, when the visual scene contains actions with irregular timing (e.g., typing, thunderstorms), the program is more likely to fall out of sync with the video.
Jeff Prevost and Sanchita Ghose surveyed 57 local college students, asking them to identify which clips they believed contained a film's original soundtrack.
When evaluating tracks produced by the first model, 73% of the students surveyed chose the synthetic AutoFoley clip as the original rather than the actual original sound clip. When evaluating the second model, 66% of respondents chose the AutoFoley clip over the original.
"A limitation of our approach is that it requires the classification subject to be present across the entire video frame sequence," says Jeff Prevost. He also noted that AutoFoley currently relies on a limited set of Foley categories. Although the AutoFoley research is still at an early stage, the team believes these limitations will be addressed in future work.