1. Background
Video transcoding is the process of converting a video file into another video file through demuxing, decoding, filter processing, encoding, and muxing. Every day, a large number of original videos are uploaded to Bilibili, and then converted into multiple different resolutions by the transcoding system. The transcoded video has a lower bitrate while maintaining picture quality close to the original, which improves smoothness during network transmission and saves bandwidth; at the same time, after transcoding, various original videos are converted into relatively unified and standardized encoding specifications, which also greatly improves device compatibility during playback.
Currently, the most widely used server-side video transcoding framework in the industry is FFmpeg, which can process almost all formats of multimedia files. FFmpeg's core transcoding components are basic libraries that implement the atomic capabilities of muxing/demuxing, encoding/decoding, filtering, and algorithms. At the same time, FFmpeg also provides a directly runnable command-line tool ffmpeg that implements simple transcoding pipeline logic.
[Figure: Serial pipeline before FFmpeg 7]
With the expansion of transcoding business, the native ffmpeg command-line tool has also exposed many limitations:
When transcoding into multiple resolutions, the pipeline is serial (before FFmpeg 7.0). For VOD transcoding, scenarios with audio encoding and complex filters therefore cannot take full advantage of multi-core CPUs; for live streaming, the encoding of different resolutions may block each other.
In long-duration continuous transcoding scenarios such as live streaming, transcoding parameters cannot be dynamically updated through interaction.
Almost all transcoding pipeline control logic is crammed into 5 .c files, with unclear module division and high modification/maintenance costs.
When upgrading FFmpeg versions, code migration is difficult.
Therefore, considering the expansion of transcoding business, we need to develop our own transcoding core to replace the ffmpeg command-line tool.
2. Self-developed Transcoding Core Architecture
As shown in the figure below, our self-developed transcoding core uses the FFmpeg basic library as the underlying atomic capability for multimedia processing, abstracts each module of the transcoding pipeline, and derives different subclasses to handle the different business requirements of VOD and live streaming; the Controller module is the core module of transcoding, responsible for the frame scheduling logic of all pipelines, and is also implemented separately for VOD and live streaming. The scheduling logic for VOD is relatively simple, mainly including the input stream-output stream mapping relationship; while live streaming includes the scheduled frame pulling logic of the live control room, as well as message interaction logic (dynamic replacement of input, output, filters, etc. during transcoding), so its logic is relatively complex.
The figure below shows the module architecture diagram of the transcoding core for two-resolution transcoding. Pipeline A and Pipeline B are each regarded as one Pipeline, and one Pipeline corresponds to one transcoded output; the processing flow corresponding to a single audio/video stream within a single Pipeline is regarded as a Flow; each filtering/encoding/sampling/muxing operation inside a Flow is regarded as a Task.
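The Pipeline / Flow / Task hierarchy described above can be sketched as a small data model. This is only an illustration: the class fields and the task names (`scale_720p`, `sample`, and so on) are hypothetical and are not taken from the actual transcoding core.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One processing step inside a Flow (filter, encode, sample, or mux)."""
    kind: str


@dataclass
class Flow:
    """The processing chain for a single audio or video stream."""
    stream: str                                   # "video" or "audio"
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Pipeline:
    """One transcoded output, made up of one Flow per stream."""
    name: str
    flows: List[Flow] = field(default_factory=list)


def build_pipeline(name: str, height: int) -> Pipeline:
    # Hypothetical task names, chosen only to make the hierarchy visible.
    video = Flow("video", [Task(f"scale_{height}p"), Task("sample"),
                           Task("encode"), Task("mux")])
    audio = Flow("audio", [Task("resample"), Task("encode"), Task("mux")])
    return Pipeline(name, [video, audio])


# Two-resolution transcoding: one Pipeline per output.
pipelines = [build_pipeline("A", 720), build_pipeline("B", 480)]
for p in pipelines:
    for f in p.flows:
        print(p.name, f.stream, " -> ".join(t.kind for t in f.tasks))
```

Keeping the three levels separate is what lets a controller add or remove a whole Pipeline, or a single input or output, without touching the others.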
While the transcoding core is running, upper-layer services can issue dynamic instructions at any time to add or remove transcoding inputs, outputs, or an entire pipeline. In live transcoding, this greatly improves the startup speed of transcoded streams: when a user's stream disconnects and reconnects and the upstream node changes, the service layer can delete the original transcoding input and output through dynamic instructions and switch to the new upstream and downstream nodes, all without restarting the container. This saves 100% of the transcoding startup time and improves the coverage of transcoded resolutions after the streaming end disconnects and reconnects.
3. Module-level Controllable Serial/Parallel Pipeline
Different scenarios (VOD and live streaming) have different requirements for the process control of the transcoding pipeline. Each Task class in our self-developed transcoding core inherits from the same base class PipelineWorker, and can freely choose to run in serial or parallel mode. Workers in parallel mode will start a separate thread for frame processing.
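A minimal sketch of a base class with a switchable serial/parallel mode is shown below. Only the name PipelineWorker comes from the article; the methods (`push`, `process`, `close`), the `downstream` callback, and the `DoubleTask` subclass are assumptions made for illustration. Python threads also do not give true CPU parallelism for pure-Python work, so this shows the control flow only, not the performance behavior.

```python
import queue
import threading
from typing import Callable, List, Optional


class PipelineWorker:
    """Base class: process frames inline (serial) or on its own thread (parallel)."""

    _STOP = object()  # sentinel used to shut down the worker thread

    def __init__(self, parallel: bool, downstream: Optional[Callable] = None):
        self.parallel = parallel
        self.downstream = downstream
        self._queue: queue.Queue = queue.Queue()
        self._thread: Optional[threading.Thread] = None
        if parallel:
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def process(self, frame):
        raise NotImplementedError

    def push(self, frame):
        if self.parallel:
            self._queue.put(frame)   # hand off; the caller does not wait for process()
        else:
            self._handle(frame)      # serial: run in the caller's thread

    def close(self):
        if self._thread is not None:
            self._queue.put(self._STOP)
            self._thread.join()

    def _handle(self, frame):
        out = self.process(frame)
        if self.downstream is not None:
            self.downstream(out)

    def _run(self):
        while True:
            frame = self._queue.get()
            if frame is self._STOP:
                break
            self._handle(frame)


class DoubleTask(PipelineWorker):
    """Stand-in for a real Task such as a filter or an encoder."""

    def process(self, frame):
        return frame * 2


def run(parallel: bool) -> List[int]:
    results: List[int] = []
    worker = DoubleTask(parallel, downstream=results.append)
    for frame in range(5):
        worker.push(frame)
    worker.close()
    return results


print(run(parallel=False), run(parallel=True))
```

Both modes produce the same frames in the same order; the only difference is which thread does the work, which is what makes the mode a per-Task scheduling choice.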
The serial/parallel scheduling strategy for Tasks depends on business requirements and the internal parallelism of the Task.
3.1 Live Transcoding
Live transcoding has very high requirements for real-time performance. Filtering, encoding, and output network IO during transcoding are all modules that may block. If a serial pipeline is used, the risk of stuttering is very high, especially in one-input multiple-output scenarios, where multiple pipelines will block each other; this is why the ffmpeg command-line tool before FFmpeg 7.0 is not suitable for direct use in live transcoding. With parallel mode enabled, a Task drops frames when its internal queue reaches a threshold, to keep the transcoded stream as stable as possible.
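The queue-threshold frame dropping can be illustrated with a bounded queue. The article does not say which frame is dropped when the threshold is reached, so the policy below (reject the incoming frame) and the class name `DroppingQueue` are assumptions.

```python
from collections import deque


class DroppingQueue:
    """Bounded frame queue that drops frames once a threshold is reached."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.frames = deque()
        self.dropped = 0

    def push(self, frame) -> bool:
        """Return True if the frame was queued, False if it was dropped."""
        if len(self.frames) >= self.threshold:
            self.dropped += 1        # assumed policy: drop the incoming frame
            return False
        self.frames.append(frame)
        return True

    def pop(self):
        """Return the oldest queued frame, or None if the queue is empty."""
        return self.frames.popleft() if self.frames else None


q = DroppingQueue(threshold=3)
# Simulate a stalled consumer: five frames arrive and nothing is popped.
accepted = [q.push(n) for n in range(5)]
print(accepted, q.dropped)
```

The producer never blocks, so a slow Task in one pipeline costs that pipeline some frames instead of stalling the others.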
3.2 VOD Transcoding
VOD transcoding scenarios value transcoding performance more, and the serial/parallel mode needs to be determined according to the characteristics of the Task: if the internal parallelism of the Task is low, using parallel mode can increase CPU utilization and leverage the advantages of multi-core online containers; if the Task itself has high internal parallelism or overly simple logic, using parallel mode will instead degrade performance due to increased thread switching.
4. Dynamic Adaptive Transcoding
In live transcoding scenarios, since the FLV encapsulation format and RTMP protocol are used, the source live stream may change specifications such as resolution and frame rate at any time during streaming. This requires the transcoding core to have the ability to dynamically adapt transcoding parameters.
4.1 Resolution Adaptation
With the widespread adoption of live multi-person co-streaming business, there are more and more live streams with changing resolutions online. The FLV format can update encoding specifications at any time by refreshing the sequence header. After transcoding detects a resolution change, the filter, encoding, and muxing modules need to work together to ensure the normal refresh of the transcoded stream's resolution:
The figure below shows how the width and height parameters change for input streams with different aspect ratios after passing through the transcoding scale filter, which scales the picture while maintaining the original aspect ratio:
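As the figure is not reproduced here, the following sketch shows one common way to compute an aspect-preserving output size: fit the input inside a target box and round down to even dimensions. The function name `fit_size`, the 1280x720 box, and the even-rounding rule are assumptions, not the rules used by the transcoding core.

```python
from typing import Tuple


def fit_size(in_w: int, in_h: int, box_w: int, box_h: int) -> Tuple[int, int]:
    """Largest even-sized output with the input's aspect ratio that fits in the box."""
    # Integer cross-multiplication avoids floating-point rounding surprises.
    if box_w * in_h <= box_h * in_w:
        out_w = box_w                      # input is relatively wide: width-limited
        out_h = box_w * in_h // in_w
    else:
        out_h = box_h                      # input is relatively tall: height-limited
        out_w = box_h * in_w // in_h
    # Most encoders expect even dimensions for 4:2:0 video.
    return out_w - out_w % 2, out_h - out_h % 2


for w, h in [(1920, 1080), (1440, 1080), (2560, 1080), (1080, 1920)]:
    print((w, h), "->", fit_size(w, h, 1280, 720))
```

When the source resolution changes mid-stream, re-running a calculation like this on the new input size yields the new scale parameters.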
Parameter adaptation for filter groups is difficult to handle, because there are many types of filters and each relates its parameters to the input width and height differently. For example, the scale filter has only two parameters, target width and height, which depend only on the aspect ratio of the input frame; the overlay filter has parameters for watermark width and height as well as overlay coordinates, which are more complex to calculate. Therefore, we reuse FFmpeg's expression evaluation capability: the service layer can use placeholders, predefined input variables, and expressions so that filter parameters adapt dynamically to the input specifications, and the transcoding core only needs to maintain one set of parameter adaptation rules to cover almost all filters.
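To make the idea concrete, here is a small sketch of expression-driven parameters. It uses a tiny arithmetic evaluator written in Python as a stand-in for FFmpeg's own expression evaluator, which the transcoding core reuses. The variable names `main_w`, `main_h`, and `overlay_w` mirror those of FFmpeg's overlay filter; the rule itself (watermark at one eighth of the frame width, placed near the top-right corner) is a made-up example.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr: str, variables: dict) -> float:
    """Evaluate +, -, *, / over numbers and named variables (nothing else)."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))


# Hypothetical rule supplied once by the service layer.
RULE = {
    "overlay_w": "main_w / 8",
    "x": "main_w - overlay_w - main_w / 40",
    "y": "main_h / 40",
}


def resolve(rule: dict, main_w: int, main_h: int) -> dict:
    """Turn the rule into concrete overlay parameters for one input size."""
    variables = {"main_w": main_w, "main_h": main_h}
    variables["overlay_w"] = evaluate(rule["overlay_w"], variables)
    return {
        "overlay_w": int(variables["overlay_w"]),
        "x": int(evaluate(rule["x"], variables)),
        "y": int(evaluate(rule["y"], variables)),
    }


print(resolve(RULE, 1920, 1080))
print(resolve(RULE, 1280, 720))
```

The same rule yields correct parameters for every input size, so a resolution change only requires re-evaluating the expressions rather than issuing new filter settings.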
4.2 Frame Rate Adaptation
Most PC streaming tools maintain a fixed frame rate when pushing streams, but for users streaming from mobile devices, the streaming tool may dynamically adjust the frame rate within a certain range according to network conditions, and changing frame rates pose a challenge to frame sampling logic.
Currently, there are two types of video frame sampling algorithms: CFR (Constant Frame Rate) and VFR (Variable Frame Rate). For live streaming scenarios with variable frame rates, live transcoding uses VFR instead of the commonly used CFR sampling. This is mainly because the CFR algorithm is not flexible enough for this scenario and exposes the following problems:
When performing CFR sampling with fixed 60/30fps specifications, if the input frame rate is 50/25fps, the target rate is not an integer multiple or divisor of the input rate, which leads to uneven frame copying/dropping and causes stuttering (see the figure below).
When performing CFR sampling using the minimum value of the input stream's initial frame rate and the target frame rate, if the source frame rate increases midway, frames that should not be dropped will be dropped, making playback more stuttered than the source stream; if the source frame rate decreases midway, extra copied frames will be generated, resulting in a waste of encoding resources and bitrate.
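The uneven copying and dropping in the first problem above can be reproduced with a simplified CFR model, in which each output tick shows the latest source frame whose timestamp is not after the tick. This is a deliberately reduced model for illustration, not the exact algorithm of any particular tool; the function names are hypothetical.

```python
from typing import List


def cfr_pick(in_fps: int, out_fps: int, n_out: int) -> List[int]:
    """Index of the source frame shown at each of the first n_out output ticks."""
    # Output tick k occurs at time k / out_fps; the latest source frame at or
    # before that time has index floor(k * in_fps / out_fps).
    return [k * in_fps // out_fps for k in range(n_out)]


def gaps(indices: List[int]) -> List[int]:
    """Step between consecutive picks: 0 = frame copied, 2 or more = frames dropped."""
    return [b - a for a, b in zip(indices, indices[1:])]


print("50 -> 30:", gaps(cfr_pick(50, 30, 7)))   # mixed steps of 1 and 2
print("25 -> 30:", gaps(cfr_pick(25, 30, 7)))   # a step of 0 means a copied frame
print("50 -> 25:", gaps(cfr_pick(50, 25, 4)))   # constant step: uniform sampling
```

A constant step means motion advances evenly between output frames; a mix of steps is what viewers perceive as stutter.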
With VFR sampling, uneven sampling still occurs when the source frame rate is higher than the configured frame rate (see the two figures above). To address this, we added the VFR-HALF sampling method, which is triggered when the source fps meets certain conditions. For example, if the source frame rate is 50fps and the target frame rate is 30fps, the actual target frame rate is adjusted to 25fps (one frame sampled out of every two) to ensure uniform sampling, which is also the method used by YouTube VOD.
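The 50fps to 25fps example can be expressed as a small rule. The article does not state the exact trigger conditions for VFR-HALF, so the rule below is an assumption: keep one frame in every n, with the smallest integer n for which the resulting rate does not exceed the target. The function names are hypothetical.

```python
import math


def keep_interval(source_fps: int, target_fps: int) -> int:
    """Assumed rule: smallest n such that source_fps / n <= target_fps."""
    if source_fps <= target_fps:
        return 1                      # nothing to drop, frames pass through
    return math.ceil(source_fps / target_fps)


def effective_fps(source_fps: int, target_fps: int) -> float:
    """Output frame rate after keeping one frame in every n."""
    return source_fps / keep_interval(source_fps, target_fps)


def keep_frame(index: int, n: int) -> bool:
    """Uniform decimation: keep frames 0, n, 2n, ..."""
    return index % n == 0


n = keep_interval(50, 30)
kept = [i for i in range(10) if keep_frame(i, n)]
print(n, effective_fps(50, 30), kept)
```

Because every kept frame is the same distance from its neighbors, the output is uniform even though it falls short of the configured 30fps.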
5. Additional Bitstream Information Management
Additional bitstream information mainly refers to data in the bitstream that is independent of the compressed frame data; in AVC encoding, this is SEI (Supplemental Enhancement Information). Currently, both VOD and live services on Bilibili rely on the ability to process SEI information. Live quiz games and interactive touch games during New Year's Eve live streams are all implemented through SEI information, and HDR videos for VOD also use color grading information stored in SEI.
The ffmpeg command-line tool has always been relatively conservative in handling SEI information. Before ffmpeg 5, ffmpeg directly discarded SEI information during decoding; after ffmpeg 5, ffmpeg saves the decoded SEI information in the frame structure. Although it has the ability to write SEI information, this writing capability is entirely implemented by the corresponding encoder.
The SEI processing logic of ffmpeg is tightly bound to the internal implementation of each encoder, so in the current scenario where self-developed code, GPU, and heterogeneous transcoding are used in combination, it cannot cover all encoders and cannot guarantee that all encoders write SEI in a consistent way. Our self-developed transcoding core changes when SEI is written: it adds a bitstream filter (BSF) after encoding to handle SEI writing uniformly for all AVC/HEVC/AV1 streams, and during video frame sampling it decides, according to service configuration, whether to discard or merge the SEI information from dropped frames.
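The discard-or-merge choice during frame sampling can be sketched as follows. The `Frame` type, the `keep_every` decimation, and the decision to attach merged SEI to the next kept frame are all assumptions for illustration; the article only states that the behavior is configurable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    pts: int
    sei: List[bytes] = field(default_factory=list)


def sample(frames: List[Frame], keep_every: int, merge_sei: bool) -> List[Frame]:
    """Keep one frame in every keep_every; optionally carry SEI from dropped frames."""
    kept: List[Frame] = []
    pending: List[bytes] = []             # SEI collected from dropped frames
    for i, frame in enumerate(frames):
        if i % keep_every == 0:
            # Assumed merge policy: dropped-frame SEI rides on the next kept frame.
            kept.append(Frame(frame.pts, pending + frame.sei))
            pending = []
        elif merge_sei:
            pending.extend(frame.sei)
    # SEI still pending here (from dropped frames at the very end) is lost in this sketch.
    return kept


source = [Frame(pts=i, sei=[f"s{i}".encode()]) for i in range(4)]
merged = sample(source, keep_every=2, merge_sei=True)
discarded = sample(source, keep_every=2, merge_sei=False)
print([f.sei for f in merged])
print([f.sei for f in discarded])
```

Merging suits SEI that carries business events that must not be lost, while discarding suits SEI that is only meaningful for the exact frame it arrived with.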
In live transcoding, we also use SEI to record the entire lifecycle of transcoded stream production. Through the SEI information obtained by the player, we can accurately analyze the latency and other core indicators of each production node along the path from the streaming client LiveJi (original: 直播姬) to the end user, providing data support for indicator optimization.
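As a rough illustration of this kind of analysis, suppose each production node appends its name and a timestamp to an SEI payload. The JSON layout, the field names `node` and `ts_ms`, and the node names below are entirely hypothetical; the actual payload format is not described in the article, and real measurements would also depend on clock synchronization between nodes.

```python
import json
from typing import List, Tuple


def hop_latencies(sei_payload: bytes) -> List[Tuple[str, str, int]]:
    """Return (from_node, to_node, latency_ms) for each consecutive pair of nodes."""
    # Assumed payload: a JSON list of {"node": str, "ts_ms": int} records.
    records = sorted(json.loads(sei_payload), key=lambda r: r["ts_ms"])
    return [(a["node"], b["node"], b["ts_ms"] - a["ts_ms"])
            for a, b in zip(records, records[1:])]


payload = json.dumps([
    {"node": "client", "ts_ms": 1000},
    {"node": "ingest", "ts_ms": 1080},
    {"node": "transcode", "ts_ms": 1230},
    {"node": "player", "ts_ms": 1900},
]).encode()

for src, dst, ms in hop_latencies(payload):
    print(f"{src} -> {dst}: {ms} ms")
```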
6. Summary and Outlook
The original intention behind our self-developed transcoding core was to cover business scenarios that are difficult to implement with the ffmpeg command line. As business requirements become more diverse, its application scenarios are also increasing. Since it was first applied to the live control room business in 2020, the self-developed transcoding core has covered the live control room business and the live transcoding business, and it is currently being rolled out to VOD transcoding.
Going forward, we will continue to focus on two goals, improving user experience and improving transcoding system efficiency:
We have previously cooperated with the company's AI and picture quality teams to complete the development of functions such as AI live subtitles and game event dashboards. In the future, we will also integrate more new AI-related functions and features into VOD and live transcoding.
During the gradual (gray) rollout of VOD streaming transcoding with our self-developed transcoding core, it has already shown significant performance improvements; in the future, we will optimize the pipeline serial/parallel strategy at a finer granularity to maximize the utilization of machine resources.
-End-
Author | DogHunter
This is a discussion topic separated from the original topic at https://www.bilibili.com/read/cv36362725/