Getting started with the audio world


Brief introduction

This article aims to give you a systematic overview of the audio world: how sound is generated, how it travels, how it is captured and analyzed, and how it is packaged into audio files. Skip any part you already understand.

Sound generation

Sound is a phenomenon: the perception of sound waves by the auditory organs of humans or animals. A sound wave is the process by which a vibrating object sets the surrounding medium vibrating, so that the vibration propagates. Neither sound nor sound waves are substances; like color and time, they are concepts. In everyday life people often confuse the two. Sound is subjective, while a sound wave is objective. The thief in the fable who covered his own ears while stealing a bell did not understand this: with his ears covered he could no longer hear the sound, but that did not mean the sound waves had ceased to exist.

Sound transmission

A sound wave is a mechanical wave. A vibrating object makes the surrounding medium vibrate; the molecules of the medium collide with one another and pass energy along, much like a Newton's cradle. As a rule of thumb, the stiffer the medium (the harder it is to compress), the faster sound travels through it. For example, sound travels faster in water than in air, and faster still in hard solids such as diamond. This is also why sound cannot be heard in outer space or on the moon: without a medium, it cannot propagate.

(Textbooks say there is no medium in space or on the moon. In fact there may be a little. Science is rigorous, but do not split hairs over this. Here is a learning strategy: do not get stuck on a point you do not understand and go hunting through all kinds of material to resolve it. If it does not block what comes next, write the question down and keep going. Once you know enough, the earlier question will resolve itself easily. When passing on knowledge we sometimes simplify a problem to make it easier to understand; the simplification may not match reality exactly, but it does not affect the overall picture. I try to stay as close to reality as possible, and I will say so whenever I simplify.)
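
The link between stiffness and speed can be sketched with the Newton-Laplace relation, v = sqrt(K / rho), where K is the bulk modulus (resistance to compression) and rho is the density. This relation is not stated in the original text; it is added here as an illustration, and the material constants below are approximate textbook values near 20 degrees Celsius, used as assumptions.

```python
import math

def speed_of_sound(bulk_modulus_pa: float, density_kg_m3: float) -> float:
    """Newton-Laplace relation for a fluid: v = sqrt(K / rho)."""
    return math.sqrt(bulk_modulus_pa / density_kg_m3)

# Approximate values (assumptions for illustration only).
AIR_BULK_MODULUS_PA = 1.42e5      # adiabatic bulk modulus of air
AIR_DENSITY_KG_M3 = 1.204
WATER_BULK_MODULUS_PA = 2.2e9
WATER_DENSITY_KG_M3 = 998.0

v_air = speed_of_sound(AIR_BULK_MODULUS_PA, AIR_DENSITY_KG_M3)
v_water = speed_of_sound(WATER_BULK_MODULUS_PA, WATER_DENSITY_KG_M3)

print(f"air:   {v_air:.0f} m/s")
print(f"water: {v_water:.0f} m/s")
```

Water is far denser than air, yet it is so much harder to compress that sound still travels several times faster in it.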

Sound acquisition and quantization

    Acquisition is covered only in outline here, because what follows mainly concerns algorithms that work on the quantized digital signal. Take a capacitive (condenser) pickup as an example. Its diaphragm receives the vibration of the air. The diaphragm and a fixed flat electrode form a capacitor, and a change in the distance between them changes the capacitance. When a voltage of fixed frequency and amplitude is applied across the capacitor, the current through it changes accordingly. Quantizing this current signal yields an audio signal (a digital signal) that is convenient to compute with and process. Three parameters describe the result: number of channels, sampling frequency and quantization bits.
    Number of channels: audio is either mono or two-channel. Two-channel audio is really two mono channels and is also known as stereo. Stereo data is twice as large as mono data. In practice, two samplers imitate the two human ears and sample from two directions. When we listen to a song, each ear hears the sound collected by one sampler, which creates a sense of space.
    Sampling frequency: the number of samples taken per second from the continuous signal to form the discrete signal, in hertz (Hz). To capture a signal, the sampling frequency must be more than twice the highest frequency in that signal. It does not matter if you do not understand this sentence yet; come back to it after reading the "Sound analysis" section below. Briefly, with fewer than two samples per cycle the collected data cannot reproduce the waveform. The standard sampling frequency is 44.1 kHz: the human hearing range is 20 Hz to 20 kHz, so the sampling rate must be greater than 40 kHz, and 44.1 kHz was chosen to leave some margin and to fit engineering conventions. (In theory, the higher the sampling frequency the better, but this is unnecessary. A higher sampling frequency means more audio data, which burdens the transmission and storage of audio files.)
    Quantization bits: quantization digitizes the amplitude axis of the analog audio signal, and the number of bits determines the dynamic range of the signal after digitization. Suppose I record a sound whose volume is just right and does not exceed the input range of my recorder. (What happens if it does exceed the maximum? The part beyond the maximum is recorded as the maximum value, so audio information is lost.) With 8-bit quantization, the largest amplitude in that recording is represented by 2^7 - 1, that is, 127 (the first bit is a sign bit that distinguishes positive from negative, so only 7 bits carry the value). With 16-bit quantization, the same largest amplitude is represented by 2^15 - 1, that is, 32767. Note that the 8-bit 127 is exactly as loud as the 16-bit 32767. (In theory, the more quantization bits the better, but this is unnecessary for the same reason as with the sampling frequency: it wastes storage space. Look at the following figure and ignore the horizontal time axis, which is determined by the sampling rate; focus on the vertical axis. When the amplitude is quantized, limited precision forces each value to the nearest available level above or below it. The result deviates from the original data, which affects the sound quality on playback.)
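
The sampling rule above (more than twice the signal frequency) can be checked numerically. The sketch below is an added illustration, not part of the original article: it samples a 30 kHz cosine at 44.1 kHz and compares it with a 14.1 kHz cosine sampled the same way. The helper name sample_cosine and the chosen frequencies are assumptions made for the example.

```python
import math

def sample_cosine(freq_hz: float, sample_rate_hz: float, count: int) -> list[float]:
    """Return `count` samples of a unit-amplitude cosine taken at sample_rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(count)]

SAMPLE_RATE_HZ = 44_100.0
TOO_HIGH_HZ = 30_000.0                       # above half the sampling rate
ALIAS_HZ = SAMPLE_RATE_HZ - TOO_HIGH_HZ      # 14,100 Hz

too_high = sample_cosine(TOO_HIGH_HZ, SAMPLE_RATE_HZ, 64)
alias = sample_cosine(ALIAS_HZ, SAMPLE_RATE_HZ, 64)

# Largest difference between the two sample sequences (only rounding error remains).
max_difference = max(abs(a - b) for a, b in zip(too_high, alias))
print(max_difference)
```

The two tones yield the same samples, so after sampling there is no way to tell that the original was 30 kHz. This is why content above half the sampling rate cannot be captured.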

Amplitude quantization deviation diagram
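
The quantization deviation and the clipping behavior described above can be reproduced with a few lines. This is a minimal sketch under simplifying assumptions: the input is a float in the range -1.0 to 1.0, the codes are symmetric (-127 to 127 for 8 bits), and the function names quantize and dequantize are made up for the example.

```python
def quantize(value: float, bits: int) -> int:
    """Map a value in [-1.0, 1.0] to a signed integer code with `bits` bits."""
    max_code = 2 ** (bits - 1) - 1            # 127 for 8 bits, 32767 for 16 bits
    clipped = max(-1.0, min(1.0, value))      # out-of-range input is held at the limit
    return round(clipped * max_code)

def dequantize(code: int, bits: int) -> float:
    """Map an integer code back to the range [-1.0, 1.0]."""
    return code / (2 ** (bits - 1) - 1)

original = 0.3
errors = {}
for bits in (8, 16):
    code = quantize(original, bits)
    errors[bits] = abs(dequantize(code, bits) - original)
    print(bits, code, errors[bits])

# A value beyond the input range is recorded as the maximum code.
clipped_code = quantize(1.7, 8)
print(clipped_code)
```

The 16-bit error is much smaller than the 8-bit error, while full scale maps to 127 and 32767 respectively, which is why those two codes represent the same loudness.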

The following figure shows a sound sampled and quantized over time. Next, I will explain how to read this figure.

Time domain audio waveform

Sound analysis

To make the analysis easier, I recorded a new sound and used software to cut its waveform into small pieces. As shown in the figure below, the audio is divided into five pieces: red, blue, green, yellow and pink. Suppose the recording is a greeting to a class ("Hello, students") spoken at a uniform speed; if you play only the red piece, you will hear just its first word ("each"). Let's enlarge the red piece and analyze it. (Do not worry about the units of the axes here. Read the horizontal axis as time and the vertical axis as amplitude. The units will be explained later, together with the algorithms.)

Time domain audio waveform, cut into red, blue, green, yellow and pink pieces

Red time domain audio waveform (the red piece of the figure above, enlarged)

    Seeing this picture, many people will wonder why the complex sounds they hear every day turn out to be data that simply goes up and down. How can such data carry sounds so rich and varied? Do not worry. These two questions are answered one at a time.
    This paragraph explains why the waveform goes up and down. In the red time domain waveform above, the audio data rises and falls, which reflects the physical fact that sound waves are produced by the back-and-forth vibration of objects. There are many everyday examples. The buzz of a mosquito is the sound wave produced by its flapping wings, which beat about 600 times per second, that is, 600 Hz. (One beat is counted from the wing lifting to the wing falling back to its starting position. The lifting and falling of the wing correspond to the forward and backward halves of the vibration, and together to one cycle on the waveform: a trough when the wing is lifted and a crest when it falls. How to read a cycle is covered later.) Because human hearing ranges from about 20 Hz to 20,000 Hz, we can hear a mosquito's wings. A butterfly beats its wings only 5 or 6 times a second, so we cannot hear it. (The sound of a bird's wings is actually produced by friction between the wings and the air, not by the flapping itself. Insect wings also rub against the air, but the friction is very small, and the main sound comes from the flapping.) This is why we can often work out how fast an object vibrates by capturing its sound and analyzing the frequency.
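
As an illustration of inferring a vibration rate from recorded samples, the sketch below counts upward zero crossings, one per cycle, in a synthetic 600 Hz tone. Zero-crossing counting is a technique chosen here for simplicity and is not named in the original; it only works for clean, single-tone signals. The function name and the synthetic data are assumptions.

```python
import math

def estimate_frequency(samples: list[float], sample_rate_hz: float) -> float:
    """Estimate frequency by counting upward zero crossings (one per cycle)."""
    crossings = sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev < 0.0 <= cur
    )
    duration_s = len(samples) / sample_rate_hz
    return crossings / duration_s

SAMPLE_RATE_HZ = 44_100
WINGBEAT_HZ = 600.0

# One second of a synthetic tone standing in for the mosquito recording.
# The phase offset of 1.0 rad keeps samples from landing exactly on zero.
tone = [
    math.sin(2 * math.pi * WINGBEAT_HZ * n / SAMPLE_RATE_HZ + 1.0)
    for n in range(SAMPLE_RATE_HZ)
]

estimate = estimate_frequency(tone, SAMPLE_RATE_HZ)
period_s = 1.0 / estimate        # duration of one wingbeat cycle
print(estimate, period_s)
```

Real recordings contain many frequencies at once, so counting crossings is not enough in practice; that is where the Fourier transform mentioned later comes in.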

The black line is the fundamental waveform

The white line is the harmonic waveform

    This paragraph explains how such up-and-down data carries rich and varied sounds. Because sound waves are mechanical waves, when several sounds are produced at the same time they arrive at our ears or at the recording device as one composite wave. In other words, the waveform we sample is the combined result of multiple sound sources. (Will sounds be lost because they combine? Basically not, unless interference occurs. Wave interference and diffraction are not discussed in this introductory chapter.) In fact, whether a person is speaking or a single instrument is playing, the sound itself is already a composite wave. Here is the key concept: most sounds we hear in reality are composite waves formed by superimposing many sine waves. Single sine waves do exist in reality, though. Strike a tuning fork and wait a few seconds: the sound wave produced once the fork settles into stable resonance can be regarded as a single sine wave. The Fourier transform can be used to separate the superimposed sine waves out of the composite wave we collected. For convenience later on, the sine wave with the lowest frequency is called the fundamental wave, and the others are collectively called harmonics. A harmonic whose frequency is twice that of the fundamental is called the second harmonic, and so on. (The Fourier transform itself is covered later, together with the algorithms.)
    Fundamental wave: in a complex periodic oscillation, the fundamental wave is the sine wave component whose period equals the longest period of the oscillation. The frequency corresponding to this period is called the fundamental frequency. When we speak of the frequency of a sound, we usually mean the frequency of its fundamental wave.
    Harmonic: a harmonic is any component, obtained by Fourier series decomposition of a periodic non-sinusoidal signal, whose frequency is an integer multiple (greater than one) of the fundamental frequency. These are usually called higher-order harmonics.
    Let's take another look at the red waveform. It is a composite wave. Its shape looks odd, but it is fairly regular. First we need to find the fundamental wave, which is drawn as the black line. The period and frequency of this sine wave can be read off at a glance, and the frequency of the fundamental wave is the frequency of the composite wave. In practice the fundamental does not have to be drawn and analyzed by eye. I drew it here only to make it easier to see. The Fourier transform, covered later, does this job.
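
To make the idea of separating a composite wave concrete, here is a minimal sketch that builds a wave from three sine waves and recovers them with a naive discrete Fourier transform. It is an added illustration under stated assumptions: the frequencies, amplitudes, sample rate and the 0.1 threshold are made up, and the slow O(N^2) transform is used only because it is short. Real code would normally use an FFT library.

```python
import cmath
import math

def dft_magnitudes(samples: list[float]) -> list[float]:
    """Naive discrete Fourier transform; returns one amplitude per frequency bin."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        # Scale so that a sine of amplitude A shows up as A in its bin.
        magnitudes.append(2 * abs(total) / n)
    return magnitudes

SAMPLE_RATE_HZ = 8_000
SAMPLE_COUNT = 800                    # 0.1 s of audio, so each bin is 10 Hz wide
PARTIALS = {200.0: 1.0, 400.0: 0.5, 600.0: 0.25}   # frequency (Hz) -> amplitude

# Composite wave: the sum of the three sine waves.
composite = [
    sum(a * math.sin(2 * math.pi * f * i / SAMPLE_RATE_HZ) for f, a in PARTIALS.items())
    for i in range(SAMPLE_COUNT)
]

magnitudes = dft_magnitudes(composite)
bin_hz = SAMPLE_RATE_HZ / SAMPLE_COUNT
peaks = [k * bin_hz for k, m in enumerate(magnitudes) if m > 0.1]
fundamental_hz = min(peaks)           # lowest component = fundamental wave
print(peaks, fundamental_hz)
```

The lowest peak is the fundamental wave; the peaks at two and three times that frequency are the second and third harmonics.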

Pitch, loudness and timbre

    If you have no background in music theory, please read this section carefully. Pitch, loudness and timbre are the three elements of musical sound. Here are the definitions a music teacher would typically give.
    Pitch: also called tone. When an object vibrates quickly, the pitch of the sound is high. When it vibrates slowly, the pitch is low.
    Loudness: also known as volume. It is the strength of the sound as felt by the human ear, a subjective impression of how big the sound is.
    Timbre: the perceived character of a sound. Because of timbre, sounds with the same pitch and the same intensity can still be told apart as coming from different instruments or different people.
    Next, here is what these three elements correspond to in the audio data we collect.
Pitch corresponds to the frequency of the fundamental wave. The higher the frequency, the sharper the sound; the lower the frequency, the deeper the sound.
Loudness corresponds to the vertical axis of the time domain diagram, that is, the amplitude. At a fixed frequency, the greater the amplitude, the greater the loudness. In addition, because loudness is a subjective impression, the human ear is not equally sensitive to all frequencies. The details are shown in the figure below.

Sensitivity of human ear to sound loudness at different frequencies
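
On the data side, amplitude is often expressed as a level in decibels relative to full scale (dBFS). The sketch below is an added illustration, not something the original defines: the function name peak_dbfs and the sample values are assumptions, and the result is a signal level, not perceived loudness, since it ignores the ear's frequency sensitivity.

```python
import math

def peak_dbfs(samples: list[int], bits: int) -> float:
    """Peak level in decibels relative to the largest positive code (full scale).

    Assumes at least one non-zero sample; silence has no finite dB value.
    """
    full_scale = 2 ** (bits - 1) - 1
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak / full_scale)

# Made-up sample values for illustration.
level_8bit = peak_dbfs([0, 90, 127, -60], bits=8)
level_16bit = peak_dbfs([0, 12000, 32767, -9000], bits=16)
level_half = peak_dbfs([0, 16384, -8000], bits=16)

print(level_8bit, level_16bit, level_half)
```

Both full-scale signals sit at the same level regardless of bit depth, and halving the amplitude lowers the level by about 6 dB.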

Timbre corresponds to the harmonic content. Even when two sounds share the same fundamental wave, different harmonics make the final synthesized waveforms different, as shown in the figure below. The two audio signals there have the same fundamental wave and amplitude but different harmonic composition.

Time domain waveform of audio signal with the same fundamental wave and amplitude but different harmonics
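
The same effect can be sketched by synthesizing two signals that share a fundamental wave but differ in their harmonics. The function name synthesize, the harmonic amplitudes and the sample rate are assumptions chosen for the example, not values taken from the figure.

```python
import math

def synthesize(harmonic_amplitudes: list[float], fundamental_hz: float,
               sample_rate_hz: int, count: int) -> list[float]:
    """Sum sine waves at 1x, 2x, 3x... the fundamental with the given amplitudes."""
    return [
        sum(
            amp * math.sin(2 * math.pi * fundamental_hz * (h + 1) * n / sample_rate_hz)
            for h, amp in enumerate(harmonic_amplitudes)
        )
        for n in range(count)
    ]

SAMPLE_RATE_HZ = 8_000
FUNDAMENTAL_HZ = 200.0
PERIOD_SAMPLES = int(SAMPLE_RATE_HZ / FUNDAMENTAL_HZ)   # 40 samples per cycle

# Same fundamental amplitude; energy placed in different harmonics.
signal_a = synthesize([1.0, 0.5, 0.0], FUNDAMENTAL_HZ, SAMPLE_RATE_HZ, 2 * PERIOD_SAMPLES)
signal_b = synthesize([1.0, 0.0, 0.5], FUNDAMENTAL_HZ, SAMPLE_RATE_HZ, 2 * PERIOD_SAMPLES)

# Both signals repeat every 40 samples, so they have the same pitch.
same_period = all(
    abs(w[n] - w[n + PERIOD_SAMPLES]) < 1e-9
    for w in (signal_a, signal_b)
    for n in range(PERIOD_SAMPLES)
)

# Their shapes differ, which is what the ear perceives as different timbre.
shape_difference = max(abs(a - b) for a, b in zip(signal_a, signal_b))
print(same_period, shape_difference)
```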

Audio data encapsulation

Common audio file formats include WAV, MP3 and AVI. I will cover one of them so that you can learn the rest on your own. MP3 is more common, but WAV normally stores the samples without lossy compression and is convenient for studying audio algorithms later, so I will cover WAV. A WAV file consists of three blocks: RIFF, format and data.

Riff block

Format block

Data block
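
The three blocks can be inspected with Python's standard library. The sketch below writes a tiny WAV into memory with the wave module and then unpacks the first 44 bytes with struct. It assumes the common canonical PCM layout (RIFF block, then a 16-byte format block, then the data block, with no extra blocks in between); real files may contain additional blocks, and the variable names are the ones commonly used for these fields rather than names taken from the figures.

```python
import io
import struct
import wave

# Build a tiny 16-bit mono WAV in memory so the example needs no external file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)              # bytes per sample, i.e. 16 bits
    writer.setframerate(44_100)
    writer.writeframes(struct.pack("<4h", 0, 1000, -1000, 32767))
raw = buffer.getvalue()

# Canonical 44-byte header, little-endian.
(riff_id, riff_size, wave_id,                        # RIFF block
 fmt_id, fmt_size, audio_format, channels,           # format block
 sample_rate, byte_rate, block_align, bits_per_sample,
 data_id, data_size) = struct.unpack("<4sI4s4sIHHIIHH4sI", raw[:44])   # data block header

print(riff_id, wave_id, fmt_id, data_id)
print(channels, sample_rate, bits_per_sample, data_size)
```

Here byte_rate equals sample_rate * channels * bits_per_sample / 8, and data_size is the number of bytes of audio data that follow the header.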

The audio data that follows the ID and size fields in the data block deserves a closer look. There are two main cases: mono and two-channel. In this example file BitsPerSample is 16, so each sampling point uses 16 bits to represent its amplitude. For mono it is very simple: read 16 bits at a time, in order. For two channels, the left and right channels alternate. The first 16 bits are the first sampling point of the left channel, the second 16 bits are the first sampling point of the right channel, the third 16 bits are the second sampling point of the left channel, the fourth 16 bits are the second sampling point of the right channel, and so on.
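
The alternating layout can be undone with a short helper. This is a minimal sketch assuming 16-bit little-endian signed samples, as in the example file; the function name split_stereo_16bit and the sample values are made up for illustration.

```python
import struct

def split_stereo_16bit(data: bytes) -> tuple[list[int], list[int]]:
    """De-interleave little-endian 16-bit stereo PCM into (left, right) lists."""
    count = len(data) // 2                       # number of 16-bit values
    samples = struct.unpack(f"<{count}h", data[: count * 2])
    return list(samples[0::2]), list(samples[1::2])

# Three stereo sampling points stored as L1 R1 L2 R2 L3 R3 (made-up values).
payload = struct.pack("<6h", 10, -10, 20, -20, 30, -30)
left, right = split_stereo_16bit(payload)
print(left, right)
```

At 44.1 kHz, 16 bits and two channels, one second of audio occupies 44100 * 2 * 2 = 176,400 bytes in the data block.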


Congratulations! If you have understood most of this, you have made a start. Each topic in this article is only touched on briefly, but it will still be very helpful for understanding the audio algorithms you will learn next.