Design and Business Practice of the Mafengwo Video Editing Framework on iOS


(Original content from the Mafengwo Technology WeChat official account, ID: mfwtech)

Readers familiar with Mafengwo will know that when you tap the publish button on the Mafengwo app's home page, the content types you can publish have been simplified to "image & text" and "video".

For a long time, image-and-text formats such as travel notes, Q&A, and travel guides have been Mafengwo's strength. Short video was elevated to sit alongside image-and-text content because, for today's mobile Internet users, short videos are more authentic and intuitive, denser in information, and more immersive; they have become a basic need. To give travel users a better content interaction experience and to enrich and complete the original content ecosystem, Mafengwo doubled down on the short video field.

Nowadays, a large number of short videos are published on Mafengwo every day, covering many kinds of local experience scenes such as food, shopping, attractions, and accommodation. Mafengwo hopes the platform's short video content will be not only "good-looking" but also "useful". "Useful" here means not only providing users with helpful travel information, but also making short video creation easier through technology.

To this end, the video editing feature of the Mafengwo travel app provides two editing modes: "custom editing" and "template creation". With templates, users can quickly create polished videos in the same style as a template video; they can also enter "custom editing" mode to express their own creativity and generate personalized videos.

This article focuses on the video editing feature in the iOS version of the Mafengwo travel app, and shares the design and business practice of our team's video editing framework.


Part.1 Requirement Analysis

As mentioned above, what we need to build is a video editing feature that supports both "custom editing" and "template creation".

Figure 1: Product schematic


First, let’s sort out the functions that need to be provided in the “Custom Editing” mode.

  • Video splicing: splicing multiple videos into one video sequence

  • Picture slideshow: combining multiple pictures into a video

  • Video trimming: removing the content in a given time range of a video

  • Variable speed: adjusting the playback speed of a video

  • Background music: adding background music and mixing it with the original video's audio

  • Reverse playback: playing a video's frames in reverse order

  • Transitions: adding transition effects where two spliced videos meet

  • Frame editing: frame rotation, canvas partitioning, background color, and overlays such as filters, stickers, and text

With these functions we can meet the needs of the "custom editing" mode and let users complete their own creations through our video editing feature. But to further lower the barrier to video editing and make producing a polished video easy, we also need to support the "template creation" mode. That is, "template videos" are provided to users; a user only needs to select videos or pictures to create a video with the same editing effects as the template, realizing "one-click editing".

After supporting the “template creation” mode, the final flow chart of our video editing function is as follows:

Figure 2: Complete flow chart


As shown in the figure, besides the media files there is an additional input, template A, which describes how to edit the media files selected by the user. There is also an additional output, template B, which describes the edits the user ultimately made. Template B solves the problem of where "template videos" come from: a template can be produced by the operations team, or it can come from a video a user created in "custom editing" mode. When other users browse that user's published video, they can quickly create a video of the same style.

From the requirement analysis above, we can conclude that our video editing feature mainly needs two capabilities:

  1. Conventional video editing

  2. Describing how a video was edited

The division into these two capabilities sets the direction for the design of the video editing framework.


Part.2 Framework Design

Conventional video editing is the basic capability a video editing framework must provide to support the "custom editing" mode. Describing how a video was edited means abstractly modeling the conventional editing capability, recording what edits were applied to a video, and then turning that description back into concrete editing operations; this supports the "template creation" mode. Our editing framework therefore divides into two main modules:

  • Edit module

  • Description module

Between the two modules, a conversion module is needed to complete the bidirectional conversion between them. The following is a sketch of the video editing framework we need:

Figure 3: Video Editing Framework Diagram


  • Edit module: the specific functions can be added iteratively as business requirements evolve. The functions we currently need to support are listed in the figure.

  • Description module: a description model is needed to describe the media materials and the various editing functions. The model also needs to be saved as a file so that it can be transmitted and distributed; we call this the description file.

  • In addition, on top of the description file, a "template" in "template creation" mode also needs operational information such as a title and a cover image. So we need to provide an operation-processing function that lets our operations colleagues turn description files into templates.

  • Conversion module: responsible for abstracting concrete editing operations into a description file, and for parsing a description file back into concrete editing operations. Ensuring the correctness of this abstraction and parsing is critical.

Video editing has good implementation options on every development platform, such as AVFoundation provided by iOS, the widely used third-party open source library GPUImage, and FFmpeg. The specific choice can be made based on business scenarios and project planning; the solution we currently use on iOS is Apple's native AVFoundation. How we implement our video editing framework with AVFoundation is described in detail below, starting with the design and implementation of the individual modules.


Part.3 Module Function and Implementation

3.1 Description Module

3.1.1 Functional Division

First, analyzing the specific functions required by the "custom editing" mode, we find that they fall into two categories: segment editing and frame editing.

  • Segment editing: treats video segments as the editing objects; it is not concerned with frame content but edits at the level of whole segments, and includes the following functions:

Figure 4: Segment editing


  • Frame editing: treats the frame content as the editing object, and includes the following functions:

Figure 5: Frame editing


3.1.2 Video Editing Description Model

With this division of editing functions, we need a video editing description model to describe "what edits are made to the video". We define the following concepts:

  • Timeline: a one-way, monotonically increasing line of time points, starting from 0.

  • Track: a container that uses the timeline as its coordinate system; it stores the content material and frame-editing effects required at each time point. Tracks have types, and a track supports only one type.

  • Segment: a portion of a track, i.e. the part between two time points on the track's timeline. Segments also have types, consistent with the type of the track they belong to.

Track Type List:

Among them, the "video", "picture", and "audio" tracks provide the frame and sound content, while the other track types describe specific frame-editing functions. An effect-type track can specify several frame edits, such as rotation and canvas partitioning.

Combining this with the division of editing functions: the objects of segment editing are the segments in a track, and the objects of frame editing are the content materials stored in a track.

With the three concepts of timeline, track, and segment, plus the division into segment editing and frame editing, we can describe the video editing process at an abstract level as follows:

Figure 6: Video Editing Description Model Diagram


As shown in the figure above, through this model, we have been able to fully describe “what edits are made to the video”:

  • Create a 60-second video from videos, pictures, and music, corresponding to tracks 1, 2, and 3 respectively. It also has transition and filter effects, specified by tracks 4 and 5 (other effects work the same way and are not described separately).

  • The video's frames are composed of the [0-20] segment of track 1, the [15-35] segment of track 2, the [30-50] segment of track 1, and the [45-60] segment of track 2.

  • The whole video [0-60] has background music, specified by track 3.

  • There are transition effects at [15-20], [30-35], and [45-50], specified by track 4.

  • The [15-35] section has a filter effect, specified by track 5.
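The concepts above can be sketched as a small data model. The following Swift sketch is illustrative only (the type names `TrackType`, `Segment`, and `Track` are ours, not part of any framework), populated with track 4's transition segments from the example:

```swift
// Illustrative model of the description concepts: timeline, track, segment.
enum TrackType: String {
    case video, photo, audio, transition, filter, effect
}

struct Segment {
    var position: Double   // start point on the timeline, in seconds
    var duration: Double   // length of the segment, in seconds
    var subtype: String?   // e.g. "fade_black" for a transition segment
}

struct Track {
    var type: TrackType    // a track supports exactly one type
    var name: String
    var segments: [Segment]
}

// Track 4 from the figure: transitions at [15-20], [30-35], [45-50]
// on the shared timeline.
let transitionTrack = Track(
    type: .transition,
    name: "track_4",
    segments: [15.0, 30.0, 45.0].map {
        Segment(position: $0, duration: 5, subtype: "fade_black")
    }
)
```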

3.1.3 Description Files and Templates

With the video editing description model above, we also need a concrete file to store and distribute the model, i.e. the description file; we implement it as a JSON file. We also need to provide the operation-processing capability, so that our operations colleagues can add operational information to a description file and generate a template.

  • Description file: a JSON file generated from the video editing description model

Here is an example:

    "tracks": [{
            "type": "video",
            "name": "track_1",
            "duration": 20,
            "segments": [{
                "position": 0,
                "duration": 20
            }, ...]
        }, {
            "type": "photo",
            "name": "track_2",
            "duration": 20,
            "segments": [{
                "position": 15,
                "duration": 20
            }, ...]
            "type": "audio",
            "name": "track_3",
            "duration": 60,
            "segments": [{
                "position": 0,
                "duration": 60
        }, {
            "type": "transition",
            "name": "track_4",
            "duration": 5,
            "segments": [{
                "subtype": "fade_black",
                "position": 15,
                "duration": 5
            }, ...]
        }, {
            "type": "filter",
            "name": "track_5",
            "duration": 20,
            "segments": [{
                "position": 15,
                "duration": 20
        }, ...

  • Template: a JSON file consisting of a description file plus some business information

Here is an example:

    "Title": "template title"
    "Thumbnail": "cover address"
    "Description": "Introduction to Templates"
    "Profile": {description file //
        "tracks": [...]   

Through the video editing description model, the description file, and the template, combined with the converter, we can generate a description file that records a user's editing behavior from the "custom editing" functions. Conversely, by parsing a description file and editing the user's selected material according to it, we can quickly generate a video with the same editing behavior as recorded in the file.

3.2 Editing Module

3.2.1 Introduction to AVFoundation

Audio and video editing in AVFoundation breaks down into four stages: material composition, audio processing, video processing, and export.

(1) Material composition: AVMutableComposition

An AVMutableComposition is a collection of one or more tracks, each of which stores source media file information, such as audio and video, along the timeline.

// AVMutableComposition API for creating a new AVMutableCompositionTrack
- (nullable AVMutableCompositionTrack *)addMutableTrackWithMediaType:(AVMediaType)mediaType preferredTrackID:(CMPersistentTrackID)preferredTrackID;

Each track consists of a series of track segments, each of which stores part of the media data of the source file: a URL, a track identifier, a time mapping, and so on.

// Some properties of AVMutableCompositionTrack

/* provides a reference to the AVAsset of which the AVAssetTrack is a part  */
AVAsset *asset;

/* indicates the persistent unique identifier for this track of the asset  */
CMPersistentTrackID trackID;

/* the AVCompositionTrackSegments that make up the track */
NSArray *segments;

The URL specifies the source container file, the track ID specifies the source track to use, and the time mapping specifies the time range of the source track as well as its time range on the composition track.

// Time mapping of AVCompositionTrackSegment
CMTimeMapping timeMapping;

// Definition of CMTimeMapping
typedef struct {
	CMTimeRange source; // eg, media.  source.start is kCMTimeInvalid for empty edits.
	CMTimeRange target; // eg, track.
} CMTimeMapping;
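To make these pieces concrete, here is a brief Swift sketch (error handling elided; `sourceURL` and the 10-second range are assumptions for illustration) that creates a composition track and inserts a source time range, which is what produces the time mapping described above:

```swift
import AVFoundation

// Build a composition whose video track holds the first 10 seconds of
// a source file; inserting the time range is what creates the track
// segment's time mapping (source range -> target range).
func makeComposition(from sourceURL: URL) -> AVMutableComposition {
    let composition = AVMutableComposition()
    let videoTrack = composition.addMutableTrack(
        withMediaType: .video,
        preferredTrackID: kCMPersistentTrackID_Invalid)

    let sourceAsset = AVURLAsset(url: sourceURL)
    if let sourceVideoTrack = sourceAsset.tracks(withMediaType: .video).first {
        let range = CMTimeRange(start: .zero,
                                duration: CMTime(seconds: 10, preferredTimescale: 600))
        try? videoTrack?.insertTimeRange(range, of: sourceVideoTrack, at: .zero)
    }
    return composition
}
```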

Figure 7: How AVMutableComposition assembles a new video

(Source: Apple Official Developer Document)


(2) Audio mixing: AVMutableAudioMix

Through AVMutableAudioMixInputParameters, an AVMutableAudioMix can specify the volume of any track over any time range.

// APIs related to AVMutableAudioMixInputParameters

CMPersistentTrackID trackID;

- (void)setVolumeRampFromStartVolume:(float)startVolume toEndVolume:(float)endVolume timeRange:(CMTimeRange)timeRange;
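As an example, a two-second fade-in on a background music track could be expressed like this (a sketch; `musicTrack` is assumed to be an audio track obtained elsewhere):

```swift
import AVFoundation

// A two-second fade-in on a music track: volume ramps 0 -> 1 over [0, 2]s.
func fadeInMix(for musicTrack: AVAssetTrack) -> AVMutableAudioMix {
    let params = AVMutableAudioMixInputParameters(track: musicTrack)
    params.setVolumeRamp(fromStartVolume: 0.0,
                         toEndVolume: 1.0,
                         timeRange: CMTimeRange(start: .zero,
                                                duration: CMTime(seconds: 2,
                                                                 preferredTimescale: 600)))

    // The mix is later handed to an AVPlayerItem (preview) or an
    // AVAssetExportSession (export).
    let audioMix = AVMutableAudioMix()
    audioMix.inputParameters = [params]
    return audioMix
}
```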

Figure 8: Audio mix schematic

(Source: Apple Official Developer Document)


(3) Video rendering: AVMutableVideoComposition

We can also use AVMutableVideoComposition to process the video tracks in a composition directly. When processing a video composition, you can specify parameters such as render size, scale, and frame rate for the final output video. Through video composition instructions (AVMutableVideoCompositionInstruction, etc.), we can modify the video's background color and apply layer instructions.

These layer instructions (AVMutableVideoCompositionLayerInstruction, etc.) can apply transforms and transform ramps, and set opacity and opacity ramps, on the video tracks in a composition. In addition, you can apply animation effects from the Core Animation framework by setting the video composition's animationTool property.
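A small Swift sketch of an instruction with a layer instruction (the five-second fade-out and the render settings are illustrative choices, not values from the article):

```swift
import AVFoundation

// One instruction covering [0, 5]s that fades the given track out,
// plus the video composition's render settings.
func fadeOutComposition(for videoTrack: AVAssetTrack) -> AVMutableVideoComposition {
    let instruction = AVMutableVideoCompositionInstruction()
    instruction.timeRange = CMTimeRange(start: .zero,
                                        duration: CMTime(seconds: 5, preferredTimescale: 600))

    let layer = AVMutableVideoCompositionLayerInstruction(assetTrack: videoTrack)
    layer.setOpacityRamp(fromStartOpacity: 1.0,
                         toEndOpacity: 0.0,
                         timeRange: instruction.timeRange)
    instruction.layerInstructions = [layer]

    let videoComposition = AVMutableVideoComposition()
    videoComposition.instructions = [instruction]
    videoComposition.renderSize = CGSize(width: 1280, height: 720)
    videoComposition.frameDuration = CMTime(value: 1, timescale: 30) // 30 fps
    return videoComposition
}
```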

Figure 9: AVMutableVideoComposition processing video

(Source: Apple Official Developer Document)


(4) Export: AVAssetExportSession

The export step is relatively simple: assign the processing objects created in the previous steps to an export session object, and it produces the final file.
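The step above can be sketched as follows (the preset, file type, and print statements are illustrative; the composition, audio mix, and video composition come from the previous stages):

```swift
import AVFoundation

// Hand the objects built in the previous steps to an export session
// and write the final file to outputURL.
func export(composition: AVMutableComposition,
            audioMix: AVAudioMix?,
            videoComposition: AVVideoComposition?,
            to outputURL: URL) {
    guard let session = AVAssetExportSession(
        asset: composition,
        presetName: AVAssetExportPresetHighestQuality) else { return }

    session.outputURL = outputURL
    session.outputFileType = .mp4
    session.audioMix = audioMix
    session.videoComposition = videoComposition

    session.exportAsynchronously {
        switch session.status {
        case .completed: print("export finished")
        case .failed:    print("export failed: \(session.error?.localizedDescription ?? "unknown")")
        default:         break
        }
    }
}
```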

Figure 10: Export process

(Source: Apple Official Developer Document)


3.2.2 Implementation of the Editing Module

Building on the AVFoundation framework, we implement the following roles in the video editing module:

  • Track: there are two types, video and audio, which store frames and sound.

  1. The picture track is an extension of the video track: a video track is generated from an empty video file, and the selected pictures are supplied to the mixer as frame images.

  2. Overlay track: AVFoundation provides AVVideoCompositionCoreAnimationTool, which makes it easy to apply Core Animation content to video frames. For the text function, we create a series of preview views with UIKit on the preview side and convert them into the tool's CALayers at export time.

  • Segment: a period of time in a track, serving as the object of segment editing.

  • Instruction: associated with specified video segments; it processes images and draws each frame.

  1. An instruction can be associated with multiple video tracks, and obtains frames from those tracks within a specified time range as the objects of frame editing.

  2. The concrete frame editing inside instructions is implemented with the Core Image framework. Core Image provides built-in real-time image processing capabilities; effects Core Image does not support are implemented with custom CIKernels.

  • Audio mixer: for adding music; we use AVMutableAudioMix.

  • Video mixer: the final video file we want contains one video track and one audio track. The mixer converts the input media resources into tracks, edits segments and assembles instructions according to the user's operations or the converted description model, mixes the audio tracks, and provides real-time preview and final composition together with AVPlayerItem and AVAssetExportSession.
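As an illustration of the Core Image-based frame editing mentioned above, an instruction can run each frame through a filter before drawing it. This sketch uses the built-in sepia filter as a stand-in; the custom-CIKernel path is analogous:

```swift
import CoreImage

// Apply a built-in sepia filter to one frame, as an instruction's
// frame-editing step might (the real pipeline pulls frames from tracks).
func editedFrame(from frame: CIImage) -> CIImage {
    guard let filter = CIFilter(name: "CISepiaTone") else { return frame }
    filter.setValue(frame, forKey: kCIInputImageKey)
    filter.setValue(0.8, forKey: kCIInputIntensityKey)
    return filter.outputImage ?? frame
}
```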

With these roles, the video editing module on the iOS side is implemented as follows:

Figure 11: Schematic diagram of video editing module


As shown in the figure above, the mixer contains two video tracks and one audio track. In general, each input video or picture file would generate its own video track, so in theory the mixer could hold many video tracks. We keep only two video tracks and one audio track for two reasons: first, to stay within the device's limit on the number of video decoders, which is described in detail later; second, to support the transition function.

The instruction sequence consists of several consecutive instructions over time. Each instruction consists of a time range, the tracks its frames come from, and the frame-editing effects to apply. The segment editing functions splice the instruction segments, while the frame editing functions process the frame image within each instruction segment. The mixer's preview function shows editing changes to the user in real time; once the editing effect is confirmed, the final video file is produced through the mixer's composition function.

3.2.3 Converter

With the editing module implemented, we can already support the "custom editing" mode. Finally, the converter connects the description model and the editing module, completing support for the "template creation" mode. The converter itself is relatively simple: the JSON description file is parsed into a data model, and the mixer creates its internal track model from the user-selected material and the description model, splicing the instruction segments.

In the other direction, at composition and export time, the mixer assembles its internal track model and instruction information into a data model and generates a JSON description file.
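Because the description file is JSON, the parsing half of the converter can be little more than a `Codable` round trip. This sketch mirrors the example description file above; the type and function names are ours:

```swift
import Foundation

// Data model mirroring the description file's JSON structure.
struct SegmentModel: Codable {
    let position: Double
    let duration: Double
    let subtype: String?
}

struct TrackModel: Codable {
    let type: String
    let name: String
    let duration: Double
    let segments: [SegmentModel]
}

struct DescriptionFile: Codable {
    let tracks: [TrackModel]
}

// Parse: description file -> data model (the mixer builds its
// internal track model from this).
func parse(_ json: Data) throws -> DescriptionFile {
    try JSONDecoder().decode(DescriptionFile.self, from: json)
}

// Serialize: data model -> description file, for the reverse direction.
func serialize(_ model: DescriptionFile) throws -> Data {
    try JSONEncoder().encode(model)
}
```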

Figure 12: Description model and editing module related transformation


Part.4 Pitfalls and Recent Planning

4.1 Pitfalls

In implementing the editing framework above we ran into many problems, most of which were time-consuming to locate because AVFoundation rarely reports clear error information. In summary, most were caused by misaligned track timelines. Beyond timeline alignment, here are a few issues worth considering during implementation, shared so that you can avoid the same pitfalls.

(1) Mixer Track Number Limitation

  • Problem: an AVMutableComposition can hold many tracks at once; multiple video tracks can coexist in one composition and preview normally through AVPlayer. So our initial editing module supported multiple video tracks in the mixer, as shown in the figure below. This multi-track structure previewed without problems, but export failed with an "unable to decode" error. The mixer structure before the change:

Figure 13: Mixer structure before conversion


  • Reason: after investigation we found that Apple devices limit the number of video decoders. On export, each video track uses a decoder, so if the number of video tracks exceeds the decoder limit, the export cannot proceed.

  • Solution: a track-model transformation converts the mixer's original multi-track structure into the current dual-track structure, so the number of decoders used at export never exceeds the limit.

(2) Performance optimization: implementing reverse playback

  • Problem: the original implementation exported a new video file whose frame sequence was the reverse of the original. If the original file is large and the user only trims out one clip, reversing that clip would still process the entire original file in reverse order to export a new video, a very time-consuming operation.

  • Solution: fetch the video file's frames by time point. For reverse playback, only convert the normal time point into the reversed time point; the video file itself is untouched. This is like operating on an array by remapping indices instead of actually reversing the array.
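The time-point remapping amounts to mirroring the timestamp within the clip, analogous to indexing an array from the back. A minimal sketch:

```swift
import CoreMedia

// Map a playback time to the source time for reverse play: with a clip
// of length `clipDuration`, playback time t reads the frame at
// (clipDuration - t). No frames are re-encoded or reordered.
func reversedSourceTime(for time: CMTime, clipDuration: CMTime) -> CMTime {
    CMTimeSubtract(clipDuration, time)
}

// e.g. in a 10-second clip, playback time 2 s fetches the frame at 8 s.
```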

(3) Performance optimization: reducing memory usage

  • Problem: during preview, full-size original images consume a lot of memory; after adding several HD pictures, memory warnings appear during preview.

  • Solution: without affecting the user experience, preview with low-resolution images and use the original images only at export.
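One way to do the downsampling (a sketch, not necessarily the article's exact approach) is to bound the pixel size at decode time with ImageIO's thumbnail API, so the full-resolution bitmap is never decoded for preview:

```swift
import ImageIO

// Decode an image at a bounded pixel size for preview; the original
// file on disk stays untouched and is used again at export time.
func previewImage(at url: URL, maxPixelSize: Int = 1024) -> CGImage? {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else { return nil }
    let options: [CFString: Any] = [
        kCGImageSourceCreateThumbnailFromImageAlways: true,
        kCGImageSourceCreateThumbnailWithTransform: true,  // honor EXIF orientation
        kCGImageSourceThumbnailMaxPixelSize: maxPixelSize
    ]
    return CGImageSourceCreateThumbnailAtIndex(source, 0, options as CFDictionary)
}
```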

4.2 Recent Planning

At present this video editing framework works well in the iOS Mafengwo travel app. It supports business iteration and can quickly be extended with more frame-editing functions, though of course there are still details to optimize.

In the near future, we will explore some novel and interesting video editing scenarios by combining machine learning with AR technology, providing users with more personalized travel recording tools.

The author is Zhao Chengfeng, an iOS R&D engineer at Mafengwo.