July 31, 2009

Multicore Video Decoding with MPlayer, Part 2

Article Source Linux Developer Network
July 31, 2009, 7:14 am

Last time around we took a look at using ffmpeg-mt, the multithreaded version of ffmpeg, to speed up high definition h264 decoding. This time around we'll try to use more cores again: the hundreds of stream processing units that live on your graphics card.

For this article you will need a recent NVidia graphics card (a GeForce 8 or later) and the closed source NVidia graphics drivers. NVidia recently added the Video Decode and Presentation API for Unix (VDPAU), which allows the GPU on your graphics card to perform some of the video stream decoding work.

If you have the closed source NVidia drivers installed and obtain MPlayer from its subversion repository then VDPAU will be detected during ./configure. Once MPlayer is configured, compiled, and installed, the main thing you might have to do is update your ~/.mplayer/codecs.conf file to tell MPlayer about the VDPAU codecs.

I find it convenient to have two MPlayer commands: running "mplayer" runs the non-VDPAU version, and "mplayer-vdpau" uses the GPU for decoding. This way, if GPU decoding of a video file fails for any reason, I can quickly revert to CPU-only decoding. The mplayer-vdpau file is a simple shell script, shown below, which runs mplayer with a different codecs.conf file that enables the VDPAU codecs.

$ cat /.../bin/mplayer-vdpau
mplayer -vo vdpau -codecs-file ~/.mplayer/codecs-vdpau.conf "$@"

For testing, I'll use the same Intel Q6600 2.4GHz quad core machine as the first article and an NVidia 250 GTS graphics card. Version 185.18.14 of the closed source NVidia drivers was used. Note that although this card is a "250", it uses the same chip as the 9800 series. The 250 GTS is a nice card for this article as it can be had for around $140 at the time of writing. Wikipedia contains a nice overview of NVidia cards for those unfamiliar with them.

I'll use the same video files as the first article. You might like to skim over the first article for the details, but in brief there are two freely available 1920x1080 animation files: Big Buck Bunny (BBB), in its 1920x1080 H.264 version, and Elephants Dream (ED). Because Elephants Dream is only offered in MPEG4 and not h264, the first 5,000 frames were transcoded in the first article to provide a nice high quality h264 file. For comparisons on decoding video of real life scenes, the trailers for "The Bourne Ultimatum" (TBU) and "I Am Legend" (IAL) from h264info.com were used.

Unfortunately, monitoring how much of a hard time a GPU is having performing a task is still a black art. For a real time process like video decoding, it is either fast enough or not. There is no "gpu-top" command available yet which allows you to see that only 60% of the GPU processing power was needed to decode in real time.

We can of course see what impact using the GPU has on the CPU power necessary for the decode. I'll reuse the figures for mplayer-single (the normal single threaded mplayer) from the first article of the series as the baseline and compare how much CPU is required when VDPAU is in use.

The command below was used for non-VDPAU decoding. For benchmarking the GPU decoding, the -vc ffh264 option was removed from the command line and the mplayer-vdpau wrapper script shown above was used instead of calling mplayer directly. One hitch is that you cannot use VDPAU decoding with the -vo null option. Thus all benchmarking in this article had to be performed with video being displayed to the screen.

mplayer -vc ffh264 -benchmark -nosound
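The "VC Runtime" and "VC Percent" figures quoted in this article come from the summary that -benchmark prints when playback ends. As a rough sketch, the two helper functions below pull those numbers out of MPlayer's output. The names vc_runtime and vc_percent are made up for illustration, and the patterns assume summary lines beginning with "BENCHMARKs: VC:" and "BENCHMARK%: VC:"; check the output of your own build before relying on them.

```shell
# Hypothetical helpers: read mplayer output on stdin and print the video
# codec (VC) figures from the -benchmark summary.
# ASSUMPTION: the summary lines look like
#   BENCHMARKs: VC:   <seconds>s VO: ...
#   BENCHMARK%: VC: <percent>% VO: ...

# Print the VC runtime in seconds.
vc_runtime() {
    sed -n 's/.*BENCHMARKs: VC: *\([0-9.][0-9.]*\)s.*/\1/p'
}

# Print the VC share of total runtime as a percentage.
vc_percent() {
    sed -n 's/.*BENCHMARK%: VC: *\([0-9.][0-9.]*\)%.*/\1/p'
}
```

To use one of them, append 2>&1 | vc_runtime to the benchmark command line. Note that piping the output hides MPlayer's normal status line.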

When using mplayer with VDPAU, the actual mplayer process might not show up in the most CPU hungry processes listed by top. Instead you will likely see the X process at the top position.

As an initial test, I looked at 15 seconds of video starting at 29 seconds into ED264. The extra mplayer parameters are -ss 29 -endpos 15. The results are shown below. Notice that the wall-clock time required to decode the 15 seconds of video drops to 7.8 seconds with VDPAU, as opposed to 10.8 seconds for the CPU alone. It is also telling that the VC Runtime and Percent drop so significantly for the VDPAU version, which should coincide with the amount of user time dropping to almost zero. During the execution of the benchmarks below, the mplayer process was shown to use 100% of a single core for CPU decoding. For the VDPAU version, X was using about 60%.

Command         VC Runtime (s)   VC Percent (%)   real (s)   user (s)
mplayer                    9.4             90.2       10.8        9.8
mplayer-vdpau              1.3             17.8        7.8        0.2
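To put the "real" column in perspective, dividing the CPU-only wall-clock time by the VDPAU time gives the speedup for this clip. The snippet below is only a small worked calculation; the function name speedup is made up for illustration.

```shell
# speedup CPU_SECONDS VDPAU_SECONDS
# Print how many times faster the second run finished, to two decimals.
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.2f\n", cpu / gpu }'
}

# Wall-clock ("real") times from the table above: prints 1.38
speedup 10.8 7.8
```

In other words, the VDPAU run finished this 15 second benchmark roughly 1.4 times faster, even though the video had to be displayed to the screen.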

I modified part of dstat to obtain the rolling CPU utilization of both the X process and all running mplayer processes. Using VDPAU output with mplayer results in two mplayer processes being created for a single playback. Other than the video playback, the system was under no load.
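The modified dstat is not reproduced here. As a stand-in, the sketch below shows one way per-process CPU usage could be sampled on Linux: read the user and system tick counters from /proc/PID/stat twice and convert the difference to a percentage of one core. All function names are hypothetical, and this is not the dstat change described above.

```shell
# cpu_ticks PID
# Print user+system clock ticks consumed so far by PID (Linux /proc only).
# Fields 14 (utime) and 15 (stime) of /proc/PID/stat become fields 12 and 13
# once the leading "pid (comm) " part is stripped; it is stripped first
# because a process name may contain spaces.
cpu_ticks() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# ticks_to_percent BEFORE AFTER TICKS_PER_SECOND INTERVAL_SECONDS
# Convert a tick difference into a percentage of a single core.
ticks_to_percent() {
    awk -v b="$1" -v a="$2" -v hz="$3" -v dt="$4" \
        'BEGIN { printf "%.1f\n", 100 * (a - b) / (hz * dt) }'
}

# cpu_percent PID INTERVAL_SECONDS
# Sample PID twice, INTERVAL_SECONDS apart, and print its CPU usage.
cpu_percent() {
    cp_hz=$(getconf CLK_TCK)
    cp_before=$(cpu_ticks "$1")
    sleep "$2"
    cp_after=$(cpu_ticks "$1")
    ticks_to_percent "$cp_before" "$cp_after" "$cp_hz" "$2"
}
```

Calling cpu_percent once per second for the X server's PID and for each mplayer PID gives a rough rolling view. The PIDs can be looked up with pgrep, bearing in mind that the X server binary may be called X or Xorg depending on the system.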

For CPU testing, mplayer was executed with mplayer -nosound; for VDPAU testing the command was mplayer-vdpau -nosound. The results for "The Bourne Ultimatum" trailer are shown below. With CPU-only decoding, the X server sat just below 20% and mplayer peaked at around 50%, reaching 60% of a single CPU core at one stage, for a combined peak of almost 80% of a single core. With VDPAU, the mplayer processes take hardly any CPU themselves, but the X server uses a significant amount, breaking slightly through 40% a few times. On the whole, VDPAU reduced the total CPU requirement from about 60% to 40%.



The trailer for "I Am Legend" was harder for the system to decode, resulting in peaks of 88% for the mplayer process when using CPU-only decoding. As you can see from the graph of the CPU-only decode shown below, not only were there some peaks above 80% CPU for mplayer, but a fair amount of time was also spent between 60 and 80% CPU. Add to this the nearly 20% used by the X server and you are getting close to 100% utilization of a single CPU core. In contrast, the VDPAU decoded run did breach 40% CPU usage a few times, but remained below 50% the whole time.

One of the major differences between using the CPU and VDPAU for decoding is consistency: VDPAU decoding managed to stay between 20 and 40% CPU for the duration of playback. In contrast, CPU-only decoding ranged from about 20 to 90% CPU for mplayer alone. This can be explained by the banners used in a preview. The VDPAU decode will be using the GPU much less at these times, but we are only monitoring the CPU usage of VDPAU.



Next up was the Elephants Dream (ED) recoded 5,000 frame h264 file (ED264). The graphs are shown below. Notice that with CPU-only decoding, the CPU usage of the mplayer process was getting dangerously close to 100% midway through the stream. Relative to the first few benchmarks for VDPAU, there were a few peaks in the CPU usage for the X process, hitting 60% a few times. During VDPAU decoding, the peaks where X jumped above 40% also caused the mplayer process to increase its CPU requirement. Overall, VDPAU decoding needed about 45-50% combined CPU on average, peaking at 90%. CPU-only decoding needed 120%, and perhaps the mplayer process could have used more CPU at times but was limited to a single CPU core.



Only the first three minutes of Big Buck Bunny (BBB), in its 1920x1080 H.264 version, were used for the benchmarks shown below. The -endpos 3:00 option was used to impose this limit.

There are large sections of BBB where not a great deal of the screen is changing. This can be seen in the CPU-only decoding graph, where mplayer requires less than 30% of the CPU during playback. There are a few large spike zones, however, where mplayer jumps up towards 75% of CPU. In contrast, VDPAU decoding has the X process hovering around 40% CPU usage, this time ranging between 30% and 50%. Using VDPAU has removed the big CPU requirement spikes that are present in CPU-only decoding.



The benchmarks presented have always been in terms of CPU reduction during real-time video playback. As a ballpark figure, using VDPAU with an NVidia 250 GTS card on an Intel Q6600 equipped machine resulted in about a 40% reduction in the CPU required for real-time playback. In terms of the first article on ffmpeg-mt, this 40% average reduction translates into VDPAU decoding running at about 165% of the speed of CPU-only decoding. This figure of roughly 160% is in the same speedup bracket that ffmpeg-mt would give you on a two core machine. Unfortunately, ffmpeg-mt and VDPAU can't be combined to achieve better performance, because using VDPAU with mplayer relies on a specific h264 codec to decode the video.
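The step from a CPU reduction to a relative speed can be made explicit. Assuming decoding speed scales inversely with the CPU time needed per frame, a reduction of r percent corresponds to a relative speed of 100 / (1 - r/100) percent. The snippet below is a small worked calculation under that assumption, with a made-up function name.

```shell
# reduction_to_speed PERCENT_REDUCTION
# Print the relative speed (in percent of the baseline) implied by a given
# reduction in CPU time, assuming speed is inversely proportional to CPU time.
reduction_to_speed() {
    awk -v r="$1" 'BEGIN { printf "%.0f\n", 100 / (1 - r / 100) }'
}

# A 40% reduction in CPU time: prints 167
reduction_to_speed 40
```

The result lands in the same ballpark as the rounded 160 to 165% figures quoted above.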

Note that VDPAU support is still being updated in the NVidia drivers, with two fairly major mentions of it in driver releases so far in 2009. Perhaps the amount of processing that happens in the X server will decline with future driver or MPlayer releases.