Ever envied other people's polished "song rooms" while your own "thatched hut" can't get rid of echo and muddy mixing? This time we take you through building a karaoke feature in practice: breaking down the application scenarios, improving product performance, and handing you the "bricks and tiles" for the road of "building your own house". With the most practical tools, your karaoke app can shed its embarrassing echo and take off in high quality. With the right technology, you too can become the "king of karaoke"!
A senior engineer from Tencent Cloud University was invited to share this course, which explains how to build a karaoke application from three aspects:
- Application scenarios and product performance of the karaoke feature
- Technical implementation of the karaoke feature
- Development practice of the karaoke feature
Karaoke applications fall into two types: local recording and online song rooms.
Local recording consists mainly of two modules: singing with accompaniment, and recording with a mixer. The mixer fixes lagging-beat problems by shifting the vocal position, adjusts the volume of the vocal and the accompaniment, and applies reverberation, voice change, and other sound effects. The equalizer boosts or attenuates different frequency bands of the sound. Applications such as Changba (唱吧) and WeSing (全民K歌) use the local recording mode.
An online song room sends the room owner's voice and accompaniment, encoded over the network, to the audience in the room, as in a live broadcast.
Local recording process
The accompaniment file is decoded and sent to the playback device. While the accompaniment plays, the capture device records the singer's voice together with the accompaniment. Echo cancellation removes the accompaniment, leaving only the vocal, which is stored as a temporary vocal file. A second branch, in-ear monitoring, delivers the vocal back to the user's ear: during recording, the vocal (after voice change and other effects) is mixed with the accompaniment and sent to the playback device as the monitor mix. After recording, post-processing and mixing with the accompaniment are performed, and the final audio file is generated.
Common accompaniment file formats are MP3, Ogg, AAC, and WAV. Although most phones support MP3 decoding, Android devices have poor fault tolerance and usually cannot play malformed MP3 files (truncated or wrongly formatted). Some Android systems support Ogg playback, but iOS does not. In these cases the decoding libraries must be bundled with the app. Bundling many decoding libraries has a downside, though: an app may only ever use one music format, so the unused libraries needlessly bloat the installation package.
Solution: dynamic loading. Each decoding library is built as a separate dynamic library (.so, .dll, etc.). When playing music, if the dynamic library loads successfully, playback proceeds normally; if not, an error is returned indicating that the library does not exist. Developers can flexibly choose how many libraries to package, and on platforms that support dynamic download, such as Android and Windows, the needed library can be downloaded at runtime according to the file format.
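On Android, this dynamic-loading pattern can be sketched with `System.loadLibrary` and a fallback. This is a minimal illustration, not the actual SDK code; the library name used below is a hypothetical example:

```java
public class DecoderLoader {
    // Try to load an optional decoder library (e.g. "oggdec" would map to
    // liboggdec.so on Android). Returns true if the native library is available.
    // The library names are hypothetical examples, not real SDK artifacts.
    public static boolean tryLoad(String libName) {
        try {
            System.loadLibrary(libName);
            return true;   // decoder present: this format can be played
        } catch (UnsatisfiedLinkError e) {
            return false;  // decoder absent: report "library not found",
                           // optionally trigger a dynamic download instead
        }
    }
}
```

Usage at playback time might look like `if (!DecoderLoader.tryLoad("oggdec")) { /* fall back or download */ }`.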
"Playing while downloading" means the accompaniment file plays as it downloads. Stuttering can occur here; even local files may stutter during playback.
Causes of stuttering:
- The decoding thread is shared with other task threads, so decoding falls behind when those other tasks overload the thread. Decoding therefore needs a dedicated thread.
- No buffer in front of the dedicated thread. In today's concurrent systems, threads are scheduled by time-slice rotation; without a buffer, the decoding thread may not be scheduled at the exact moment the device needs data, causing a stutter.
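These two points can be sketched as a dedicated decoder thread feeding a bounded buffer that the playback side drains. This is a simplified illustration, not the SDK's internals; the frame type, queue size, and `decode` placeholder are assumptions:

```java
import java.util.Iterator;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DecodePipeline {
    // Bounded queue of decoded PCM frames: the playback side finds data ready
    // even if the decoder thread is briefly not scheduled.
    private final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(8);

    // Dedicated decoding thread: never shared with other tasks, so decoding
    // cannot be starved by unrelated work.
    public Thread startDecoder(Iterator<byte[]> encodedFrames) {
        Thread t = new Thread(() -> {
            try {
                while (encodedFrames.hasNext()) {
                    byte[] pcm = decode(encodedFrames.next());
                    buffer.put(pcm); // blocks when the buffer is full
                }
            } catch (InterruptedException ignored) { }
        }, "audio-decoder");
        t.start();
        return t;
    }

    // Called from the playback side; null means underrun (a potential stutter).
    public byte[] nextFrame() { return buffer.poll(); }

    // Placeholder for a real codec call.
    private byte[] decode(byte[] encoded) { return encoded; }
}
```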
The recording part needs echo cancellation: what is being played should not be captured. This is usually a built-in system function; for example, with the speakerphone on during a call, the other party's voice is not captured and sent back to them. System echo cancellation has limitations, though: it typically only works in call mode and has no effect with media volume, and some devices simply report failure. Moreover, when echo cancellation is enabled in call mode, the sampling rate drops, because at high sampling rates the captured signal contains rich frequency content and echo cancellation then places high demands on the algorithm and the device. In a karaoke scenario, 16 kHz cannot meet users' needs, so a self-developed echo cancellation running at a 44.1 kHz sampling rate is used.
At time T0 the accompaniment is decoded and played. The interval between playback being issued and the user actually hearing the sound is the playback delay. On iOS the playback delay is small; on Android it is relatively large, generally hundreds of milliseconds. The interval from the moment the user starts singing until the vocal (with accompaniment) reaches memory at time T1 is the capture delay, which is similar in magnitude to the playback delay; on Android devices both delays are generally 100~200 ms. The accompaniment played at T0 is captured at T1, so to align vocal and accompaniment the interval between T0 and T1 must be calculated, either as playback delay plus capture delay, or as one total delay. To measure the total delay, for example, play a test sound and compare the offset between the played and captured signals. This method is quite accurate and commonly used in the lab, but it is not suitable for the online environment: it is hardly acceptable to play a test tone before the user sings. In that case a segmented calculation can be used. The minimum buffer used when fetching data is the mini buffer; its duration can be computed from the bit rate, number of channels, and so on. After obtaining the mini buffer, testing on 100 mainstream models showed that the playback delay is about twice the mini buffer duration; the capture delay can be obtained the same way. Vocal-accompaniment alignment mainly computes the playback and capture delays at the start of playback, and realigns after a pause.
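The mini-buffer duration and the resulting delay estimate are simple arithmetic. A sketch, where the "2×" factor is the empirical value from the model testing described above, not a universal constant:

```java
public class DelayEstimate {
    // Duration in ms of a PCM buffer:
    // bytes / (sampleRate * channels * bytesPerSample) * 1000.
    public static double bufferMs(int bufferBytes, int sampleRateHz,
                                  int channels, int bytesPerSample) {
        double bytesPerSecond = (double) sampleRateHz * channels * bytesPerSample;
        return bufferBytes / bytesPerSecond * 1000.0;
    }

    // Empirical estimate from testing ~100 mainstream models:
    // playback delay is roughly twice the minimum buffer duration.
    public static double estimatedPlaybackDelayMs(int minBufferBytes, int sampleRateHz,
                                                  int channels, int bytesPerSample) {
        return 2.0 * bufferMs(minBufferBytes, sampleRateHz, channels, bytesPerSample);
    }
}
```

For example, a 3528-byte minimum buffer at 44.1 kHz, mono, 16-bit PCM lasts 40 ms, giving an estimated playback delay of about 80 ms.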
Reverberation is produced when sound from a source reflects off surfaces; the reflected sound combines with the direct sound from the source.
Factors influencing reverberation: the distance, number, and material of the reflectors. The distance determines when the reflected sound arrives: an echo in a room arrives quickly, while an echo in a valley takes a long time. The number of reflectors determines the number of reflected signals: a valley echo is clear and distinguishable, while a room echo is hard to separate. The material determines the reverberation time: with many absorbent reflectors, more of the signal is absorbed and the sound dies out quickly.
The following compares the original-sound and "ethereal" modes. In ethereal mode the gap between the direct sound and the reflected sound is long, like a valley echo.
The original sound is like speaking in a room: the long line on the left is the direct sound, and the reflections arrive so soon afterwards that they are hard to distinguish. A room contains many objects with strong sound absorption, so the reverberation time is short. The arrival time of the first reflection, the number of reflections, and the reverberation duration together determine the degree of reverberation. These three factors are adjusted across 8 reverberation modes, demonstrated in the demo below.
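A minimal way to see how these factors shape reverb is a feedback comb filter, the basic building block of algorithmic reverberation. This is a sketch, not the GME implementation: the delay models the first reflection's arrival time and the feedback models material absorption (and thus reverb duration):

```java
public class CombReverb {
    // Feedback comb filter: each output sample is the direct sound plus a
    // decayed copy of the output from delaySamples ago (the "reflection").
    public static float[] apply(float[] input, int delaySamples, float feedback) {
        float[] out = new float[input.length];
        for (int n = 0; n < input.length; n++) {
            float reflected = (n >= delaySamples) ? out[n - delaySamples] * feedback : 0f;
            out[n] = input[n] + reflected; // direct sound + decayed reflection
        }
        return out;
    }
}
```

Feeding an impulse through it shows the decaying echo train: larger `delaySamples` spreads the echoes apart (valley-like), larger `feedback` makes them ring longer (hard, non-absorbent surfaces).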
The equalizer scales sounds at different frequencies. The figure below shows a recording: after a Fourier transform it can be seen that the sound is composed of many frequencies. This recording has most of its energy below 1000 Hz, shrinks after that, and is almost zero above 16000 Hz. Based on the frequency plot we can amplify or attenuate individual frequency bands. If the low-frequency content is full, the low-frequency signal is appropriate; a weak low-frequency signal makes the sound thin, a strong one makes it thick, and strong high frequencies make it bright. A typical equalizer supports the following 10 bands; the numbers below mark band boundaries, e.g. the label 31 denotes the band covering roughly 31–62 Hz.
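The 10 bands and the meaning of a per-band gain can be illustrated as follows. A sketch assuming the standard octave-band centers used by typical 10-band equalizers (not values taken from the slide):

```java
public class Equalizer {
    // Standard 10 octave-band center frequencies (Hz) of a typical 10-band EQ,
    // each roughly double the previous one.
    public static final int[] CENTERS_HZ =
            {31, 62, 125, 250, 500, 1000, 2000, 4000, 8000, 16000};

    // Convert a per-band gain in dB to a linear amplitude factor:
    // +6 dB roughly doubles the amplitude, -6 dB roughly halves it.
    public static double dbToLinear(double gainDb) {
        return Math.pow(10.0, gainDb / 20.0);
    }
}
```

Boosting a band means multiplying that band's signal by `dbToLinear(gainDb)`; for example a +20 dB boost multiplies amplitude by 10.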
Voice change
Voice change supports 12 types: loli, uncle, naughty child, cold, trapped beast, ethereal, fat boy, heavy metal, foreigner, heavy machinery, strong current, and local dialect. Depending on the desired sound characteristics, the techniques used vary: loli and uncle are implemented by raising or lowering the pitch, while ethereal is implemented with reverberation.
Below are spectrograms of the original voice and the loli effect. The x axis is time and the y axis is frequency; color represents the intensity of a given frequency at a given time. The original voice occupies a narrow frequency range, below 16000 Hz. The loli voice's spectrum is stretched upward: some low-frequency components are shifted into high frequencies, around 14000 Hz. You can try it with QQ voice messages; like the karaoke feature of GME, it was developed by the Tencent audio and video laboratory.
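The raise-the-pitch idea behind the loli effect can be sketched with simple linear-interpolation resampling. This is an illustration, not GME's algorithm; note that plain resampling also changes duration, while production voice changers use time-scale modification (e.g. a phase vocoder) to keep duration constant:

```java
public class PitchShift {
    // Resample by reading the input at a stretched rate.
    // factor > 1 raises pitch ("loli"); factor < 1 lowers it ("uncle").
    public static float[] resample(float[] in, float factor) {
        int outLen = (int) (in.length / factor);
        float[] out = new float[outLen];
        for (int i = 0; i < outLen; i++) {
            float pos = i * factor;           // fractional read position
            int i0 = (int) pos;
            int i1 = Math.min(i0 + 1, in.length - 1);
            float frac = pos - i0;
            out[i] = in[i0] * (1 - frac) + in[i1] * frac; // linear interpolation
        }
        return out;
    }
}
```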
The sending side of online karaoke is similar to local recording, with encoding added. As the flow chart shows, after encoding, the data is sent as packets and received at the other end, then finally decoded and played.
Considerations for a karaoke room: synchronizing the accompaniment, vocals, and lyrics. This differs from the accompaniment-vocal alignment above: although the accompaniment and vocal are synchronized locally, unstable network delay during transmission would let them drift apart at the receiver, so accompaniment and voice must be mixed before transmission. Lyrics synchronization is timestamp synchronization: the lyrics are displayed according to the playback time of the currently received audio. Lyrics are rendered word by word and line by line, which demands tight timing.
There are two ways to synchronize timestamps:
1. Send a signaling message at the start, and again on pause or end. The receiver determines the accompaniment position by accumulating a timer from the reception time. The advantage is that the audio frame format does not need to change; if the audio frame format is not easily extensible, this method can be used.
2. If the audio frame format is extensible and timestamp accuracy matters, place the accompaniment's current timestamp in the header or tail of each audio frame, so that it travels with the frame. This works well in practice, and it is the method we currently use.
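Embedding the timestamp in the frame can be sketched as follows. The frame layout here (a 4-byte millisecond timestamp prepended to the encoded payload) is a hypothetical example, not the actual wire format:

```java
import java.nio.ByteBuffer;

public class FrameTimestamp {
    // Prepend a 4-byte accompaniment timestamp (ms) to an encoded audio frame.
    public static byte[] pack(int timestampMs, byte[] encodedFrame) {
        ByteBuffer buf = ByteBuffer.allocate(4 + encodedFrame.length);
        buf.putInt(timestampMs); // header: accompaniment position in ms
        buf.put(encodedFrame);   // payload: encoded audio
        return buf.array();
    }

    // Read the 4-byte header back on the receiving side; the receiver uses
    // this value to drive word-by-word lyrics display.
    public static int unpackTimestampMs(byte[] frame) {
        return ByteBuffer.wrap(frame).getInt();
    }
}
```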
Delay control mainly concerns the time for the speaker's voice to reach the listener over the network in a live-broadcast scenario. Part of this is playback and capture delay, which is device-related and hard to control. The other part is network transmission: by adding background servers, each user can connect to a nearby node and data travels over the closest route.
Another type of delay comes from the network packet buffer. Audio data is transmitted over UDP, which guarantees neither continuity nor order: three audio packets may arrive intermittently and out of sequence, so a buffer is needed to smooth them out, and the buffer adds delay. Different scenarios therefore need different tuning. For example, in pure live-broadcast mode there is no interaction: the audience does not talk to the broadcaster, and we only need the broadcaster's voice and picture to stay in sync. Even a few seconds of transmission delay is not really noticeable, so in this case avoiding stutter comes first and the delay can be increased slightly.
The mic-connect (连麦) mode lets the audience interact with the broadcaster during the live broadcast, which demands high real-time performance. Mic-connect therefore requires low delay, while slight stuttering can be tolerated. The relevant parameters need to be adjusted for each scenario.
When there is only one person recording, record locally and upload. If multiple people participate, server-side recording is used: the backend decodes and mixes everyone's voices into a single file, and when recording completes it returns the file's server address for playback.
The following is a practical exercise: implementing a simple local recording function.
1. Initialization

```java
// Initialize the SDK and log in
ITMGContext.GetInstance(this).Init(String.valueOf(mAppId), mUserId);
// Set the delegate class to receive callbacks and events
ITMGContext.GetInstance(this).SetTMGDelegate(new MyDelegate());
// The Poll function must be called periodically; this helper class can do it,
// or you can call Poll yourself from a periodic function
EnginePollHelper.createEnginePollHelper();
```

2. Enter the room

```java
// Obtain the authentication info; ideally this is generated on the server
byte[] authbuff = AuthBuffer.getInstance().genAuthBuffer(mAppId, mRoomId, mUserId, mAppKey);
// Enter the room
ITMGContext.GetInstance(this).EnterRoom(mRoomId, 2, authbuff);
```

3. Karaoke-related interfaces

```java
/* Function: start recording
 * Parameters:
 *   type:        karaoke scene, ITMG_AUDIO_RECORDING_KTV
 *   dstFile:     path of the target file used to save the recorded vocal
 *   accMixFile:  normally an accompaniment without the original vocal,
 *                used to mix with the voice into the final music file
 *   accPlayFile: the music file used for playback; normally the same file as
 *                accMixFile, but when the user does not know the song well it
 *                can be a version that includes the original vocal
 */
int StartRecord(int type, String dstFile, String accMixFile, String accPlayFile);

/* Function: stop recording */
int StopRecord();

/* Function: pause recording */
int PauseRecord();

/* Function: resume recording */
int ResumeRecord();

/* Function: set the music file for playback, generally used to switch between
 * the original vocal and the pure accompaniment
 * Parameters: accPlayFile, the music file used for playback
 */
int SetAccompanyFile(String accPlayFile);

/* Function: get the length of the accompaniment file */
int GetAccompanyTotalTimeByMs();

/* Function: get the current recording time */
int GetRecordTimeByMs();

/* Function: jump the recording position to the specified time.
 * If the time is earlier than the current position, the overlapping part is
 * re-recorded; if later, the unrecorded gap is filled with silence.
 * Parameters: timeMs, the target time in ms
 */
int SetRecordTimeByMs(int timeMs);

/* Function: set the sound effect
 * Parameters: type, the effect type; see ITMG_KaraokeType
 */
int SetRecordKaraokeType(int type);

/* Function: get the length of the recorded file */
int GetRecordFileDurationByMs();

/* Function: start previewing the recorded file */
int StartPreview();

/* Function: stop previewing the recorded file */
int StopPreview();

/* Function: pause previewing the recorded file */
int PausePreview();

/* Function: resume previewing the recorded file */
int ResumePreview();

/* Function: set the current preview position
 * Parameters: time, the position in the preview file, in ms
 */
int SetPreviewTimeByMs(int time);

/* Function: get the current preview position */
int GetPreviewTimeByMs();

/* Function: set the mix weights of the voice and the accompaniment
 * Parameters:
 *   mic: scale of the voice; 1.0 is the original volume, < 1.0 attenuates, > 1.0 amplifies
 *   acc: scale of the accompaniment; 1.0 is the original volume, < 1.0 attenuates, > 1.0 amplifies
 */
int SetMixWieghts(float mic, float acc);

/* Function: set the offset of the voice relative to the accompaniment,
 * generally used to fix the voice lagging behind the beat
 * Parameters: time, offset of the voice relative to the accompaniment in ms;
 *             > 0 moves it later, < 0 moves it earlier
 */
int AdjustAudioTimeByMs(int time);

/* Function: mix the recorded voice and the accompaniment into one file */
int MixRecordFile();

/* Function: cancel the mix operation */
int CancelMixRecordFile();
```

Events to listen for:

```java
/* Function: recording-completed callback.
 * Triggered when accompaniment playback ends or StopRecord is called.
 * Parameters:
 *   result:   recording result error code, 0 means success
 *   filepath: path of the target file, as passed to StartRecord
 *   duration: length of the recorded file, in ms
 */
ITMG_MAIN_EVENT_TYPE_RECORD_COMPLETED

/* Function: preview-completed callback.
 * Triggered when preview playback ends or StopPreview is called.
 * Parameters: result, the playback result error code
 */
ITMG_MAIN_EVENT_TYPE_RECORD_PREVIEW_COMPLETED

/* Function: mix-completed callback.
 * Triggered when the mixed file is finished; if CancelMixRecordFile is called
 * before completion, this callback does not fire.
 * Parameters:
 *   result:   error code of the mix result, 0 means success
 *   filepath: path of the target file, as passed to StartRecord
 *   duration: length of the recorded file, in ms
 */
ITMG_MAIN_EVENT_TYPE_RECORD_MIX_COMPLETED
```
For detailed interface description, see: https://cloud.tencent.com/doc…
To bring developers the most practical, popular, and cutting-edge video tutorials, please let us hear your needs. Thank you for your time! Click to fill in the [questionnaire].
Tencent Cloud University is Tencent Cloud's one-stop learning and growth platform for cloud-ecosystem users. Every week, Tencent Cloud University invites internal technical experts to share the latest industry technology trends for free.