Monday, July 20, 2009

GPU accelerated H.264 decoding

A while ago I bought a new camera and had to learn that my dual-core machine can't play footage encoded at the highest quality level in realtime (the second-highest quality works). Fortunately, ffplay can't do that either, so it's not gmerlin's fault.

H.264 decoding on the graphics card can be done with VDPAU or VAAPI. The former is Nvidia-specific, and libavcodec can use it for H.264. The latter is vendor-independent (it can use VDPAU as a backend on Nvidia cards), but H.264 decoding with VAAPI is not supported by ffmpeg yet.

In principle I prefer vendor-independent solutions, but since I need H.264 support and ATI cards suck on Linux anyway, I tried VDPAU first.

The implementation in my libavcodec video frontend was straightforward after studying the MPlayer source. The VDPAU codecs are completely separate from the other codecs. They can simply be selected, e.g. with avcodec_find_decoder_by_name("h264_vdpau"). Then one must supply callback functions for get_buffer, release_buffer and draw_horiz_band. That's because the rendering targets are no longer frames in memory but rather handles to data structures on the GPU. See here and here for the details.
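To illustrate why callbacks are needed when the render target is a handle rather than a memory buffer, here is a minimal, self-contained sketch of the pattern. It does not use the real libavcodec or VDPAU headers: every type and function name in it (surface_handle, surface_pool, decoder_callbacks, decode_one_frame and so on) is a hypothetical stand-in, and the three callbacks mirror get_buffer, release_buffer and draw_horiz_band in spirit only.

```c
#include <stddef.h>

/* Hypothetical stand-ins: NOT the real libavcodec or VDPAU types. */
typedef unsigned int surface_handle;   /* plays the role of a GPU surface id */
#define INVALID_SURFACE 0u
#define POOL_SIZE 4

typedef struct {
    surface_handle handle;             /* opaque handle, not a pointer to pixels */
    int in_use;
} surface_slot;

typedef struct {
    surface_slot pool[POOL_SIZE];
    int bands_drawn;                   /* counts render requests, for illustration */
} surface_pool;

/* The callbacks the application hands to the (pretend) decoder. */
typedef struct {
    surface_handle (*get_buffer)(surface_pool *);
    void (*release_buffer)(surface_pool *, surface_handle);
    void (*draw_horiz_band)(surface_pool *, surface_handle);
} decoder_callbacks;

void pool_init(surface_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        p->pool[i].handle = (surface_handle)(i + 1);  /* pretend ids 1..POOL_SIZE */
        p->pool[i].in_use = 0;
    }
    p->bands_drawn = 0;
}

/* Hand out the first free surface, or INVALID_SURFACE if none is left. */
static surface_handle my_get_buffer(surface_pool *p)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (!p->pool[i].in_use) {
            p->pool[i].in_use = 1;
            return p->pool[i].handle;
        }
    }
    return INVALID_SURFACE;
}

/* Mark a surface as free again once the decoder no longer references it. */
static void my_release_buffer(surface_pool *p, surface_handle h)
{
    for (int i = 0; i < POOL_SIZE; i++) {
        if (p->pool[i].handle == h)
            p->pool[i].in_use = 0;
    }
}

/* In this sketch, "rendering" into the surface is just counted. */
static void my_draw_horiz_band(surface_pool *p, surface_handle h)
{
    if (h != INVALID_SURFACE)
        p->bands_drawn++;
}

decoder_callbacks make_callbacks(void)
{
    decoder_callbacks cb = { my_get_buffer, my_release_buffer, my_draw_horiz_band };
    return cb;
}

/* A fake "decode one frame": ask for a target, render into it, return it. */
surface_handle decode_one_frame(const decoder_callbacks *cb, surface_pool *p)
{
    surface_handle target = cb->get_buffer(p);
    if (target != INVALID_SURFACE)
        cb->draw_horiz_band(p, target);
    return target;
}
```

The point of the sketch is that the decoder never sees pixel memory: it only receives and returns handles, so allocation and reuse of the targets have to be delegated to the application through the callbacks.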

After decoding, the image data is copied to memory by calling VdpVideoSurfaceGetBitsYCbCr. This of course causes a severe slowdown. A much better way would be to keep the frames in graphics memory as long as possible. But this needs to be done in a much more generic way: images can be VDPAU or VAAPI video surfaces, OpenGL textures or whatever. Implementing generic support for video frames that are not in regular RAM will be another project.


ddennedy said...

What kind of performance hit are you seeing compared to non-VDPAU when you pull the decoded image back into CPU RAM?

burkhard said...

Decoding time of a 40 sec clip (1000 frames PAL) drops from 49 sec to 35 sec. Not much improvement (28%), but it makes realtime playback possible in my case.
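
The numbers above can be checked with a little arithmetic. The sketch below assumes the PAL frame rate of 25 frames per second (which is what makes 1000 frames a 40 sec clip); the function names are made up for illustration.

```c
#include <stdbool.h>

/* Percentage reduction in decoding time, e.g. from 49 s to 35 s. */
double time_saved_percent(double before_sec, double after_sec)
{
    return 100.0 * (before_sec - after_sec) / before_sec;
}

/* Frames decoded per second of wall-clock time. */
double decode_fps(int frames, double decode_sec)
{
    return frames / decode_sec;
}

/* Realtime playback needs decoding at least as fast as the clip's frame rate. */
bool is_realtime(int frames, double decode_sec, double clip_fps)
{
    return decode_fps(frames, decode_sec) >= clip_fps;
}
```

With 1000 frames, 49 sec gives about 20.4 frames per second, which is below the 25 needed, while 35 sec gives about 28.6 frames per second, which is above it. The time saved works out to a little over 28%.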

I read a forum post by an Nvidia guy who admitted that copying the frames back is still suboptimal in the current driver versions. Let's hope this improves in the future.

One should also note that at least my card (GeForce 8500 GT) is limited to one decoder instance.