# The story behind viral hit songs: an algorithm engineer's rational deconstruction of "Ant Ah Hey"

Date: 2021-06-03

Introduction: A few days ago, everyone's feeds were flooded with the viral "Ant Ah Hey" videos. The technology behind them is actually not complicated: it relies mainly on face swapping and automatic beat detection. The algorithm finds a song's rhythm points, and the face in the photo shakes and changes expression on those points, and just like that a viral short video is born. Today we will deconstruct these viral hits one by one, share music information retrieval algorithms, and look at viral songs rationally. Maybe the next creator of a viral hit is you!

**Author: Yi Shu**
**Reviewed by: Taiyi**

# What is music?

The Book of Rites says: "All sound is born of the human heart; when emotion stirs within, it takes form in sound; when sound forms a pattern, it is called music." Sound is a way of expressing feelings. In terms of music theory, music is usually composed of rhythm, melody, and harmony, all of which are closely related to mathematics. So we can say that music is a product of the combination of sensibility and rationality. Music information retrieval (MIR) aims to extract "information" from music and use algorithms to uncover its rational side.

# What information does music contain?

## 1. Rhythm

First, let's listen to a track:

https://y.qq.com/n/yqq/song/000ZCScz4g4BuM.html

Listening to it, you can't help nodding your head and tapping your feet along. The heavy drum hits are the rhythm points; that is a perceptual understanding of rhythm.

Strictly speaking, rhythm is the organized relationship between the durations of sounds. It is like the skeleton of a piece, supporting the music. For example, the "4/4" in the score below is the time signature, which describes how the music's durations are organized: a quarter note gets one beat, with 4 beats per bar. The first beat of each bar is usually the downbeat, whose "down" matches the conductor's downward gesture.

Other common time signatures include 2/4 and 3/4. These are simple meters, in which each beat naturally divides into two. There are also compound meters, in which each beat naturally divides into three. For example, 6/8 is a compound duple meter, counted as "one-two-three, two-two-three".

Much of the charm of jazz and soul lies in their use of rhythm. Unfortunately, a considerable number of Chinese pop songs are in 4/4, which makes many of them sound alike.

## 2. Pitch

Music is a kind of sound, so it too is produced by vibrating objects. The pitch of a sound is determined by its vibration frequency. Generally, the sound of an instrument or a human voice does not contain just one frequency; it can be decomposed into a superposition of several different frequencies.

These frequencies are multiples of a certain frequency, called the fundamental frequency, which determines the perceived pitch of the sound. The picture below shows the spectrogram of a male voice reading aloud; the blue box at the bottom marks the fundamental frequency at the current moment. The pitch of the human speaking voice is usually around 100 Hz to 200 Hz.
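
To make the idea concrete, here is a small numpy sketch (illustrative only, not tied to any particular MIR library): we synthesize a 150 Hz tone with a few harmonics, in the typical male-voice range mentioned above, and read the fundamental off the strongest FFT peak.

```python
import numpy as np

# Synthesize 0.5 s of a 150 Hz tone plus two harmonics, then locate the
# fundamental via the FFT. 150 Hz is an illustrative value inside the
# typical 100-200 Hz speaking range mentioned above.
sr = 8000
t = np.arange(int(0.5 * sr)) / sr
f0 = 150.0
signal = sum(a * np.sin(2 * np.pi * f0 * k * t)
             for k, a in [(1, 1.0), (2, 0.5), (3, 0.25)])

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / sr)

# The strongest peak sits at the fundamental (we gave it the largest
# amplitude); the harmonics appear at integer multiples of it.
peak = freqs[np.argmax(spectrum)]
print(round(peak))  # 150
```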

For pop songs, pitch often refers to the main melody, that is, the vocal part.

## 3. Chords

In modern music theory, the simultaneous sounding of multiple different pitches is called harmony, and a combination of three or more tones is called a chord. When the notes are played at the same time, it is a block chord; when they are played in succession, it is a broken chord.

However, pressing a few notes at random does not produce a pleasant sound. In twelve-tone equal temperament, an octave is divided into twelve equal parts, each called a semitone; each semitone's frequency is the twelfth root of 2 times that of the previous one. Chords are commonly classified as triads and seventh chords; the basic construction of triads is shown in the figure below.
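
The semitone ratio is easy to verify in a few lines of Python. The A4 = 440 Hz reference used here is the usual tuning convention, chosen for illustration:

```python
# Twelve-tone equal temperament: each semitone multiplies the frequency
# by 2**(1/12), so twelve semitones give exactly one octave (a factor of 2).
A4 = 440.0
SEMITONE = 2 ** (1 / 12)

def note_freq(steps_from_a4):
    """Frequency of the note `steps_from_a4` semitones above (or below) A4."""
    return A4 * SEMITONE ** steps_from_a4

# A C major triad (C4, E4, G4) lies 9, 5 and 2 semitones below A4.
c4, e4, g4 = note_freq(-9), note_freq(-5), note_freq(-2)
print(round(c4, 2), round(e4, 2), round(g4, 2))  # 261.63 329.63 392.0
```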

## 4. Sections

Like articles, music can be divided into sections, giving the expression of emotion room to develop. Sections are organized in various forms, including the following:

• AAA: the repetition of a single melody, simple and clear, common in religious music.

https://y.qq.com/n/yqq/song/004Yc5sF3Rq7Wn.html

• ABAB: two melodies alternate and repeat.

https://y.qq.com/n/yqq/song/000aKduC3mu0IL.html

Iron Man Music

• AABA: a contrasting part is added to the repeated melody to avoid monotony, as in this Christmas song.

https://y.qq.com/n/yqq/song/002zL7ur42FYtK.html

• Classical music

Some classical music has its own organizational forms, such as sonata form, whose structure consists of three parts: exposition, development, and recapitulation.

• Pop music

The familiar structure is "intro, (verse - chorus) × n, outro". Of course, pop songwriting is relatively free: sometimes the creator adds a bridge between two choruses to avoid monotony, or a pre-chorus between the verse and the chorus to make the emotional transition more natural.

# How to extract this information

The traditional approach is "feature extraction + classifier", where the features include both time-domain and frequency-domain features.

An audio signal is often characterized as a two-dimensional feature: one dimension is frequency, the other is time. This lets us feed audio features into a classifier the same way we treat images. However, audio features differ from images: images have local correlation, meaning neighboring pixels tend to have similar values, while correlation in a spectrum shows up across the overtones, and local similarity is relatively weak.

Here is an example. We rendered an audio clip from the chords (C, G, Am, F) of the first 10 seconds of Jay Chou's "Simple Love". The four spectral features shown below are, from top left to bottom right: short-time Fourier transform, Mel spectrogram, constant-Q transform, and chromagram. The first three can be understood as splitting the original audio signal through filter banks.

During feature extraction, some of the abstract semantics hidden in the music signal surface. In a chromagram, the vertical axis is pitch class (e.g., pitch class C includes C1, C2, C3, ...). The figure shows that the three brightest tones from 0s to 3s are C, E, and G, from which we can infer a C major chord; well-chosen features make the classification task of "chord detection" much easier.
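
As an illustrative sketch (plain numpy, not the feature extractor used in our system), here is how FFT energy can be folded into the twelve pitch classes of a chromagram. For a synthesized C major chord, the three brightest classes come out as C, E, and G, matching the inference above.

```python
import numpy as np

# Synthesize a C major chord (C4, E4, G4), take an FFT, then fold each
# bin's energy onto its nearest pitch class to get a 12-bin chroma vector.
sr = 22050
t = np.arange(sr) / sr                      # 1 second of audio
chord_hz = [261.63, 329.63, 392.00]         # C4, E4, G4
signal = sum(np.sin(2 * np.pi * f * t) for f in chord_hz)

mag = np.abs(np.fft.rfft(signal))
bin_freqs = np.fft.rfftfreq(len(signal), 1 / sr)

names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
chroma = np.zeros(12)
valid = bin_freqs > 20                      # ignore DC and sub-audio bins
# MIDI-style mapping: pitch class = (round(12*log2(f/440)) + 69) mod 12
pitch_class = (np.round(12 * np.log2(bin_freqs[valid] / 440.0)).astype(int) + 69) % 12
np.add.at(chroma, pitch_class, mag[valid] ** 2)

top3 = sorted(names[i] for i in np.argsort(chroma)[-3:])
print(top3)  # ['C', 'E', 'G']
```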

With the vigorous development of deep learning across fields, deep networks have gradually become the "classifier" of choice. But we should still suit the method to the task: for musical "information" that depends on context, such as rhythm, RNNs often work better.

# What can we do now?

Because our application scenarios include both real-time and offline use, many of the algorithms below have real-time versions and can be used in real-time audio and video communication.

## 1. Rhythm detection

As mentioned above, rhythm points are where the beats of a song fall. In the recently viral "Ant Ah Hey" videos, the magical head movements in the photos are designed around the song's rhythm. Below, we mark the beats detected by our algorithm with a metronome-like sound; anyone who has studied an instrument will recognize the "da-da" clicks. By using the beat detection algorithm to automatically find the rhythm points of other songs, you can make your own "Ant Ah Hey" template.
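
Our production beat tracker is not shown here, but the core idea of tempo estimation can be sketched in a few lines: autocorrelate an onset-strength envelope and pick the strongest lag in a plausible tempo range. The envelope below is synthetic (a pulse every 0.5 s, i.e. 120 BPM), standing in for one extracted from real audio.

```python
import numpy as np

# Toy tempo estimator: a pulse train at 120 BPM, recovered by
# autocorrelating the onset-strength envelope.
fps = 100                       # envelope frames per second
env = np.zeros(10 * fps)        # 10 s of onset strength
env[::50] = 1.0                 # a pulse every 0.5 s

ac = np.correlate(env, env, mode="full")[len(env) - 1:]
# Search lags from 0.25 s (240 BPM) down to 1.5 s (40 BPM).
lo, hi = int(0.25 * fps), int(1.5 * fps)
lag = lo + int(np.argmax(ac[lo:hi]))
bpm = 60.0 * fps / lag
print(bpm)  # 120.0
```

Real systems also track the beat *phase* (where each beat falls), typically with dynamic programming or an RNN over the envelope; the autocorrelation step above only recovers the period.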

Earlier we also mentioned the downbeat, which can be used to delimit bars. Our method detects not only beats but also downbeats. The following video shows the downbeats in "Ant Ah Hey".

Some students may wonder: what is the use of detecting beats and downbeats? Aren't they just time points?

In fact, they enable many ways to play with music. For example, Rhythm Master, a popular mobile music game from a few years ago, has players frantically tapping along with the melody and rhythm.

We evaluated our method on a public dataset (GTZAN) and an internal dataset of 100 popular songs; the results are shown in the table below. Our method achieves an F-measure above 0.8 on both datasets and shows a degree of robustness.
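
For reference, the F-measure commonly used in beat-tracking evaluation counts a detected beat as correct when it lands within a small tolerance window (±70 ms is the customary choice) of a not-yet-matched reference beat. A minimal implementation:

```python
def beat_f_measure(detected, reference, tol=0.07):
    """F-measure for beat tracking: a detection is a hit if it falls
    within +/- tol seconds of an unmatched reference beat."""
    matched = set()
    hits = 0
    for d in detected:
        for i, r in enumerate(reference):
            if i not in matched and abs(d - r) <= tol:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [0.5, 1.0, 1.5, 2.0]
det = [0.52, 1.01, 1.48, 2.3]   # three hits, one false alarm, one miss
print(round(beat_f_measure(det, ref), 3))  # 0.75
```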

## 2. Real-time pitch detection

Our method takes a frame of audio as input and outputs the pitch of the current frame, in hertz.

That sentence may sound a bit dry. What are these numbers actually good for?

Here is an example. Many karaoke apps show the content in the red box in the figure below: the pitch line of the song. Scoring a performance mainly comes down to how well the singer's pitch and rhythm match the original.

In this scenario, our pitch detection algorithm can analyze the user's singing in real time and produce a score.
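
The internals of our per-frame detector are not shown in this article; as a stand-in, here is the classic autocorrelation method: find the lag at which the frame best matches a shifted copy of itself, then convert that lag to hertz.

```python
import numpy as np

def frame_pitch(frame, sr, fmin=80.0, fmax=400.0):
    """Estimate the pitch (Hz) of one audio frame by autocorrelation,
    searching lags that correspond to fmin..fmax."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 150.0 * t)   # a 150 Hz test tone
print(round(frame_pitch(frame, sr)))    # close to 150
```

Production detectors refine this basic idea (e.g. YIN-style difference functions and octave-error handling), but the lag-to-hertz conversion is the same.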

In addition, some scenarios in real-time audio and video communication also rely on pitch detection, such as voice activity detection (VAD): if the current frame contains a pitch, a voice is likely present.

## 3. Section detection

The common section types in pop music are: intro, verse, pre-chorus, chorus, interlude, bridge, and outro.

We surveyed the section detection methods available. Many are based on the self-similarity matrix (SSM) and only segment the music structure, i.e., they can divide the song into time intervals but cannot assign a section type. Ullrich et al. proposed a supervised CNN-based method that detects section boundaries at different granularities. Our method achieves not only "segmentation" but also "classification": it outputs both the time interval and the type (one of the seven above) of each section.
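
To show what "based on a self-similarity matrix" means, here is a minimal numpy sketch with toy two-dimensional frame features: frames from the same section are similar, so the SSM shows bright blocks along the diagonal, and a simple novelty curve between neighboring frames peaks at section boundaries.

```python
import numpy as np

# Toy features: 10 frames of section A, 10 of section B, 10 of A again.
feats = np.array([[1, 0]] * 10 + [[0, 1]] * 10 + [[1, 0]] * 10, float)
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)

# Self-similarity matrix: cosine similarity between every pair of frames.
ssm = unit @ unit.T             # shape (30, 30); same-section blocks are 1

# Novelty curve: dissimilarity between adjacent frames peaks at boundaries.
novelty = 1.0 - np.einsum("ij,ij->i", unit[:-1], unit[1:])
boundaries = (np.where(novelty > 0.5)[0] + 1).tolist()
print(boundaries)  # [10, 20]
```

Real systems compute the SSM over chroma or embedding features and convolve a checkerboard kernel along the diagonal instead of comparing only adjacent frames, but the block structure is the same signal being exploited.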

We selected the top 100 songs from a music app's popularity chart as our evaluation set. The results and metrics are as follows:

These 100 songs include 19 DJ-remix tracks. Because such tracks have a small dynamic range and consistently high energy, the pounding beat masks many of the original features, so the section detection algorithm performs worse on them. After removing the DJ tracks, the algorithm's F_pairwise on the chorus reaches 0.863.

Considering that music is consumed faster and faster, that we sometimes need to clip excerpts to score a video, and that the chorus is often the most "catchy" part of a pop song, we packaged the algorithm as a "chorus detection" feature, which lets users extract all the choruses of a pop song with one click. The specific calling method can be found here.

Take Jay Chou's "Shuo Hao De Xing Fu Ne" (说好的幸福呢) as an example. The algorithm's output is shown in the figure below, with time in seconds. Enjoy the MV and feel the emotion and energy of the chorus.

## 4. Real-time sound scene recognition

We have introduced the effects and applications of several MIR algorithms above, but in which scenarios can they be used? In other words, how do we tell whether the current audio is a music scene at all?

To solve this problem, we provide a sound scene recognition capability that classifies the current audio as "voice, music, or noise", and further distinguishes gender (male | female) for voice.

Below is audio from a news broadcast. The sound scene recognition algorithm marks the music, male voice, female voice, and silent (zero-energy) segments.