Thursday, December 11, 2008

How gmerlin-avdecoder works

If you study multimedia decoding software like xine, ffmpeg or MPlayer, you'll find that they all work surprisingly similarly. Gmerlin-avdecoder is no exception here. The important components are shown in the image below:



Input
The input module obtains the data. Examples of data sources are regular files, DVDs, DVB or network streams. Usually the data is delivered as raw bytes. For DVDs and VCDs, however, read and seek operations are sector based. Since both formats require that each sector starts with a syncpoint (an MPEG pack header), having sector based data in the demultiplexer speeds up several things.

Demultiplexer
This is where the compressed frames are extracted from the container. During initialization, the demultiplexer creates the track table (bgav_track_table_t). This contains the tracks (in most cases just one). Each track contains the streams for audio, video and subtitles. For most containers the demultiplexer already knows the audio and video formats. For others, the codec must detect them. This means you should never trust the formats before you have called bgav_start().

In some cases (DVD, VCD, DVB) the input already knows the complete track layout and which demultiplexer to use. The initialization of the demultiplexer can then skip the stream detection. In the general case the demultiplexer is selected according to the file content (i.e. the first few bytes). Some formats (MPEG, mp3) can have garbage before the first detection pattern, so we must repeatedly skip bytes before checking for one of these.

The demultiplexer has a routine, which reads the next packet from the input. Depending on the format, this involves decoding a packet header and extracting the compressed data, which can be handled by the codec later on. Some formats (rm, asf, MPEG-2 transport streams, Ogg) use 2-layer multiplexing. There are top-level packets which contain subpackets.

If the format is well designed, we also know the timestamps and duration of the packet and whether the packet contains a keyframe. Not all formats are well designed though (or encoders are buggy); in that case we must do a lot of black magic to get as much information as possible.

Most demultiplexers are native implementations. As a fallback for very obscure formats we also support demultiplexing with libavformat.

Buffers
Gmerlin-avdecoder is strictly pull-based. If a video codec requests a packet, but the next packet in the stream belongs to an audio stream, it must be stored for later usage. For this, we have buffers, which are just chained lists of packets. They can grow dynamically. This approach makes the decoding process mostly insensitive to badly interleaved streams.
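To illustrate, here is a minimal sketch of such a packet buffer; all names and fields are invented and much simpler than the real gmerlin-avdecoder structures:

/* Sketch of a packet buffer: a chained list of demultiplexed packets.
   Names and fields are invented for illustration, not the actual
   gmerlin-avdecoder API. */

#include <stdint.h>
#include <stdlib.h>

typedef struct packet_s
{
  uint8_t * data;
  int       data_len;
  int64_t   pts;
  int       keyframe;
  struct packet_s * next;
} packet_t;

typedef struct
{
  packet_t * first; /* oldest packet (read side)  */
  packet_t * last;  /* newest packet (write side) */
} packet_buffer_t;

/* Called by the demultiplexer for packets of streams which didn't
   request them right now */
static void packet_buffer_put(packet_buffer_t * b, packet_t * p)
{
  p->next = NULL;
  if(b->last)
    b->last->next = p;
  else
    b->first = p;
  b->last = p;
}

/* Called on behalf of the codec when it needs the next packet of its
   stream; returns NULL if the demultiplexer must produce more packets */
static packet_t * packet_buffer_get(packet_buffer_t * b)
{
  packet_t * p = b->first;
  if(!p)
    return NULL;
  b->first = p->next;
  if(!b->first)
    b->last = NULL;
  return p;
}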

Interleaved vs. noninterleaved
The demultiplexing method described above reads the file strictly sequentially. This has the advantage that we never seek in the stream so we can do this for non-seekable sources.

Some files (more than you might think) are however completely non-interleaved, e.g. all audio packets come after all the video packets. These always have a global index though. In this case, if a video packet is requested, the demultiplexer seeks to the packet start and reads the packet. This mode, which is also used in sample accurate mode, only works for seekable sources.

Codecs
These convert packets to A/V frames. In most cases, one packet equals one frame. In some cases (mostly for MPEG streams), the codec must first assemble frames from packets or split packets containing multiple frames. The codec outputs the gavl frames, which are handled by the application.

Codecs are selected according to fourccs. For formats which don't have fourccs, we either invent them or use the fourccs from AVI or Quicktime.

Video codecs must take care of timestamps. For MPEG streams the timestamps at multiplex level (with a 90 kHz clock) must be converted to timestamps based on the video framerate. Audio codecs must do buffering because the application can decide how many samples to read at once.
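Here is a sketch of that buffering for a mono float stream; the names are invented and real codecs of course work on multichannel gavl audio frames:

/* Sketch of the sample buffering an audio codec needs: the codec
   decodes whole frames (e.g. 1152 samples per mp3 frame) but the
   caller may request any number of samples. Mono float samples,
   invented names. */

#include <string.h>

#define FRAME_SAMPLES 1152

typedef struct
{
  float buf[FRAME_SAMPLES];
  int   buf_len;  /* valid samples in buf             */
  int   buf_pos;  /* samples already handed to caller */
} audio_codec_t;

/* Stand-in for the real decoder: produce 3 frames of silence, then EOF */
static int decode_next_frame(audio_codec_t * c)
{
  static int frames_left = 3;
  if(frames_left-- <= 0)
    return 0;
  memset(c->buf, 0, sizeof(c->buf));
  c->buf_len = FRAME_SAMPLES;
  return 1;
}

/* Return exactly num samples if possible, less at the end of stream */
static int read_samples(audio_codec_t * c, float * out, int num)
{
  int done = 0;
  while(done < num)
  {
    int n;
    if(c->buf_pos >= c->buf_len)   /* buffer empty: decode next frame */
    {
      if(!decode_next_frame(c))
        break;
      c->buf_pos = 0;
    }
    n = c->buf_len - c->buf_pos;
    if(n > num - done)
      n = num - done;
    memcpy(out + done, c->buf + c->buf_pos, n * sizeof(float));
    c->buf_pos += n;
    done += n;
  }
  return done;
}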

Text subtitles
There are no codecs for text subtitles. Each packet contains the string (converted to UTF-8 by the demultiplexer), the presentation timestamp and the duration.

Reading a Video frame
If an application calls bgav_read_video, the following happens (a toy model of this chain follows the list):
  • The core calls the decode function of the codec
  • The codec checks if there is already a decoded frame available. This is the case after initialization because some codecs need to decode the first picture to detect the video format
  • If no frame is left, the codec decodes one. For this it will most likely need a new packet
  • The codec requests a packet. Either the packet buffer already has one, or the demultiplexer must get one
  • In streaming mode, the demultiplexer gets the next packet from the input and puts it into the packet buffer of the stream to which it belongs. If the stream is not used, no packet is produced and its data is skipped. This is repeated until the end is reached or we find a packet for the video stream. If the end is reached, the demultiplexer signals EOF for the whole track.
  • In non-streaming mode the demultiplexer knows which stream requested the packet. It seeks to the position of the next packet and reads it. EOF is signaled per stream.
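To make the chain more concrete, here is a self-contained toy model of the streaming-mode case. All names are invented, the packet buffer holds at most 8 packets and the "decoding" is trivial; it only shows who pulls from whom:

/* Toy model of the pull chain behind bgav_read_video(): codecs pull
   packets, the demultiplexer reads sequentially and buffers packets
   of other streams. All names are invented for illustration. */
#include <stdio.h>

typedef struct { int stream; int value; } packet_t;

/* A fake "file": interleaved audio (stream 0) and video (stream 1) packets */
static const packet_t file[] = {
  {0,10}, {0,11}, {1,20}, {0,12}, {1,21}, {1,22}, {0,13}, {1,23}
};
static int file_pos = 0;

/* Per-stream packet buffer (a fixed queue here; the real one grows).
   No overflow check: the toy file never buffers more than 8 packets. */
typedef struct { packet_t q[8]; int head, len; } pbuf_t;
static pbuf_t buf[2];

static void pbuf_put(pbuf_t * b, packet_t p)
{
  b->q[(b->head + b->len++) % 8] = p;
}

static int pbuf_get(pbuf_t * b, packet_t * p)
{
  if(!b->len)
    return 0;
  *p = b->q[b->head];
  b->head = (b->head + 1) % 8;
  b->len--;
  return 1;
}

/* Demultiplexer (streaming mode): read sequentially until a packet for
   'want' shows up, buffering everything else. Returns 0 at EOF. */
static int demux_request_packet(int want, packet_t * ret)
{
  while(file_pos < (int)(sizeof(file)/sizeof(file[0])))
  {
    packet_t p = file[file_pos++];
    if(p.stream == want) { *ret = p; return 1; }
    pbuf_put(&buf[p.stream], p);
  }
  return 0;
}

/* "Codec": check the packet buffer first, then ask the demultiplexer */
static int read_frame(int stream, int * out)
{
  packet_t p;
  if(!pbuf_get(&buf[stream], &p) && !demux_request_packet(stream, &p))
    return 0; /* EOF */
  *out = p.value; /* "decoding" is trivial here */
  return 1;
}

int main(void)
{
  int v;
  while(read_frame(1, &v)) printf("video frame %d\n", v); /* pull video */
  while(read_frame(0, &v)) printf("audio frame %d\n", v); /* then audio */
  return 0;
}

Pulling only the video stream first fills the audio buffer as a side effect, which is exactly what makes badly interleaved files bearable.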
Index building
Building file indices for sample accurate access can happen in different ways depending on the container format. In the end, we need byte positions in the file, the associated timestamps (in output timescale) and keyframe flags. The following modes are supported:
  • MPEG mode: The codec must build the index. This involves parsing the frames (only needed parts) to extract timing information with sample accuracy. Codecs supporting this mode are libmpeg2, libavcodec, libmad, liba52 and faad2.
  • Simple mode: The demultiplexer knows about the output timescale, gets precise timestamps and one packet equals one frame. Then no codec is needed for building the index.
  • A mix of the above. E.g. in flv, timestamps are always in milliseconds. This is precise for video streams. For audio streams (mostly mp3), we need MPEG mode.
B-frames are omitted from the index. That's because no one will use them as a seekpoint anyway, and the PTS become strictly monotonic. This lets us do a fast binary search in the index, but the demultiplexer must be prepared for packets not contained in the index.
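As a sketch, an index entry could carry something like the following (field names are illustrative, not the actual gmerlin-avdecoder structures), and seeking then becomes a binary search for the last entry at or before the target time, followed by stepping back to the preceding keyframe:

/* Sketch of a file-index entry for sample accurate access.
   Field names are illustrative only. */
#include <stdint.h>

typedef struct
{
  int64_t  position;   /* Byte position of the packet in the file      */
  int64_t  pts;        /* Timestamp in output timescale (e.g. samples) */
  uint32_t stream_id;  /* Which stream the packet belongs to           */
  int      keyframe;   /* Keyframe flag (B-frames are not stored)      */
} index_entry_t;

/* Find the entry to seek to: the last entry with pts <= time,
   then step back to the preceding keyframe. Assumes the entries
   of one stream with strictly increasing pts. */
static int seek_index(const index_entry_t * e, int num, int64_t time)
{
  int lo = 0, hi = num - 1, ret = -1;
  while(lo <= hi)
  {
    int mid = (lo + hi) / 2;
    if(e[mid].pts <= time) { ret = mid; lo = mid + 1; }
    else                     hi = mid - 1;
  }
  while((ret > 0) && !e[ret].keyframe)
    ret--;
  return ret; /* -1 if time is before the first entry */
}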

Friday, December 5, 2008

Downscaling algorithms

The theory
Downscaling images is a commonly needed operation, e.g. if HD videos or photos from megapixel cameras are converted for burning on DVD or uploading to the web. Mathematically, image scaling is exactly the same as audio samplerate conversion (only in 2D, but it can be decomposed into two subsequent 1D operations). All these have in common that samples of the destination must be calculated from the source samples, which are given on a (1D or 2D) grid with a different spacing (temporally or spatially). The reciprocals of the grid spacings are the sample frequencies.

In the general case (i.e. if the resample ratio is arbitrary) one will end up interpolating the destination samples from the source samples. The interpolation method (nearest neighbor, linear, cubic...) can be configured in better applications.

One might think we are done here, but unfortunately we aren't. Interpolation is the second step of downscaling. Before that, we must make sure that the sampling theorem is not violated. The sampling theorem requires the original signal to be band-limited with a cutoff frequency of half the sample frequency. This cutoff frequency is also called the Nyquist frequency.

So if we upscale an image (or resample an audio signal to a higher sample frequency), we can assume that the original was already band-limited with half the lower sample frequency, so we have no problem. If we downsample, we must first apply a digital low-pass filter to the image. Low-pass filtering of images is the same as blurring.

The imagemagick solution
What pointed me to this subject in the first place was this post in the gmerlin-help forum (I knew about sampling theory before, I simply forgot about it when programming the scaler). The suggestion, which is implemented in ImageMagick, was to simply widen the filter kernel by the inverse scaling ratio. For linear downscaling to 1/2 size this would mean doing an interpolation involving 4 source pixels instead of 2. The implementation is extremely simple, it's just one additional variable for calculating the number and values of filter coefficients. This blurs the image because it does some kind of averaging involving more source pixels. Also the amount of blurring is proportional to the inverse scale factor, which is correct. The results actually look ok.

The gavl solution
I thought about what the correct way to do this is. As explained above, we must blur the image with a defined cutoff frequency and then interpolate it. For the blur filter, I decided to use a Gaussian low-pass because it seems best suited for this.

The naive implementation is to blur the source image into a temporary video frame and then interpolate into the destination frame. It can however be done much faster and without a temporary frame, because the 2 operations are both convolutions. And convolution has the nice property that it's associative. This means that we can convolve the blur coefficients with the interpolation coefficients, resulting in the filter coefficients for the combined operation. These are then used on the images.
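A minimal sketch of that kernel combination, written as a plain discrete convolution of two coefficient arrays (illustration only, not the actual gavl code):

/* Combine a blur kernel and an interpolation kernel into one filter
   by discrete convolution: out has na + nb - 1 taps. Plain textbook
   convolution, not the actual gavl code. */
#include <stdio.h>

static void convolve(const float * a, int na,
                     const float * b, int nb,
                     float * out /* na + nb - 1 entries */)
{
  int i, j;
  for(i = 0; i < na + nb - 1; i++)
    out[i] = 0.0f;
  for(i = 0; i < na; i++)
    for(j = 0; j < nb; j++)
      out[i + j] += a[i] * b[j];
}

int main(void)
{
  /* 3 tap Gaussian-like blur and 2 tap linear interpolation */
  const float blur[3]   = { 0.25f, 0.5f, 0.25f };
  const float interp[2] = { 0.5f, 0.5f };
  float combined[4];
  int i;

  convolve(blur, 3, interp, 2, combined);
  for(i = 0; i < 4; i++)
    printf("%f\n", combined[i]);
  return 0;
}

For the 3 tap blur and 2 tap linear interpolation above, the combined kernel is [0.125 0.375 0.375 0.125], i.e. one slightly wider, blurrier filter applied in a single pass.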

The difference
The 2 solutions have a lot in common. Both run in 1 pass and blur the image according to the inverse scaling ratio. The difference is that the imagemagick method simply widens the filter kernel by a factor while gavl widens the filter by convolving it with a low-pass.

Examples
During my research I found this page. I downloaded the sample image (1000x1000) and downscaled it to 200x200 with different methods.

First the scary ones:

OpenGL (Scale mode linear, GeForce 8500 GT, binary drivers)


XVideo (GeForce 8500 GT, binary drivers)


Firefox 3.0.4 (that's why I never let the browser scale when writing html)


Gimp (linear, indexed mode)

gavl (downscale filter: GAVL_DOWNSCALE_FILTER_NONE)


Now it gets better:

Gimp (linear, grayscale mode)

mplayer -vf scale=200:200 scale.mov
The movie was made with qtrechunk from the png file.

gavl with imagemagick method (downscale filter: GAVL_DOWNSCALE_FILTER_WIDE)

gavl with gaussian preblur (downscale filter: GAVL_DOWNSCALE_FILTER_GAUSS)


Blogger thumbnail (400x400). Couldn't resist uploading the original size image to blogger to see what happens. Not bad, but not 200x200.

Saturday, November 29, 2008

Video4vimeo

Everyone uploads videos nowadays. Specialists use vimeo because youtube quality sucks. So the project goal was to create a video file for upload to Vimeo with Gmerlin-transcoder, optimize the whole process and fix all bugs.

The footage

I had to make sure that I own the copyright of the example clip and that nobody's privacy is violated. So I decided to make a short video featuring a toilet toy I bought in Tokyo in 2003. A friend of mine went to the shop a few years later, but it was already sold out.

The equipment

My camera is a simple mini-DV one. It's only SD but since the gmerlin architecture is nicely scalable, the same encoder settings (except the picture size) should apply for HD as well. I connected the camera via firewire and recorded directly to the PC (no tape involved) with Kino.



Capture format

The camera sends DV frames (with encapsulated audio) via firewire to the PC. This format is called raw DV (extension .dv). The Kino user can choose whether to wrap the DV frames into AVI or Quicktime or export them raw. Since the raw DV format is completely self-contained, it was chosen as the input format for Gmerlin-transcoder. Wrapping DV into another container only makes sense for toolchains which cannot handle raw DV.

Quality considerations

My theory is that the crappy quality of many web-video services is partly due to financial considerations of the service providers (crappy files need less space on the server and less bandwidth for transmission), but partly also due to people making mistakes when preparing their videos. Here are some things, which should be kept in mind:

1. You never do the final compression
In forums you often see people asking: How can I convert to flv for upload on youtube? The answer is: Don't do it. Even if you do it, it's unlikely that the server will take your video as it is. Many video services are known to use ffmpeg for importing the uploaded files, which can read much more than just flv. Install ffmpeg to check if it can read your files.

Compression parameters should be optimized for invisible artifacts in the file you upload. That's because the final compression (out of your control) will add more artifacts. And 2nd generation artifacts look even uglier; the results can be seen in many places on the web.

2. Minimize additional conversions on the server
If you scale your video to the same size it will have on the server, chances are good that the server won't rescale it. The advantage is that scaling will happen for the raw material, resulting in minimal quality loss. Scaled video looks ugly if the original has compression artifacts, which would be the case if you let the server scale.

3. Don't forget to deinterlace
Interlaced video compressed in progressive mode looks extraordinarily ugly. Even more disappointing is that many people apparently forget to deinterlace. Even the crappiest deinterlacer is better than nothing.

4. Minimize artifacts by prefiltering
If, for whatever reason, artifacts are unavoidable you can minimize them by doing a slight blurring of the source material. Usually this shouldn't be necessary.

Format conversion

All video format conversions can be done in a single pass by the Crop & Scale filter. This gives maximum speed, smallest rounding errors and smallest blurring.

Deinterlacing

Sophisticated deinterlacing algorithms are only meaningful if the vertical resolution should be preserved. In our case, where the image is scaled down anyway, it's better to let the scaler deinterlace. Doing scaling and deinterlacing in one step also decreases the overall blurring of the image.


Scaling
Image size for Vimeo in SD seems to be 504x380. It's the size of their flash widget and also the size of the .flv video. Square pixels are assumed.



Cropping
The aspect ratio of PAL DV is a bit larger than 4:3. Also, 504x380 with square pixels is not exactly 4:3. Experiments have shown that cropping 10 pixels each from the left and right borders removes the black borders at the top and bottom. If your source material has a different size, these values will be different as well.


Chroma placement
Chroma placement for PAL DV is different from H.264 (which has the same chroma placement as MPEG-2). Depending on the gavl quality settings, this fact is either ignored or another video scaler is used for shifting the chroma locations later on. I thought that could be done smarter.

Since the gavl video scaler can do many things at the same time (it already does deinterlacing, cropping and scaling) it can also do chroma placement correction. For this, I made the chroma output format of the Crop & scale filter configurable. If you set this to the format of the final output, subsequent scaling operations are avoided.


Since ffmpeg doesn't care about chroma placement it's probably unnecessary that we do. On the other hand, our method has zero overhead and does practically no harm.

Audio
Vimeo wants audio to be sampled at 44.1 kHz, while most cameras record at 48 kHz. The following settings take care of that:


Encoding

The codecs are H.264 for video and AAC for audio. Not only are they recommended by vimeo, they indeed give the best results for a given bitrate.

For some reason, vimeo doesn't accept the AAC streams in Quicktime files created by libquicktime. Apple Quicktime, mplayer and ffmpeg accept them and I found lots of forum posts describing exactly the same problem. So I believe that this is a vimeo problem.

The solution I found is simple: Use mp4 instead of mov. People think mp4 and mov are identical, but that's not true. At least in this case it makes a difference. The compressed streams are, however, the same for both formats.

Format

The make streamable option is probably unnecessary, but I allow people to download the original .mp4 file and maybe they want to watch it while downloading.

Audio codec


The default quality is 100, I increased that to 200. Hopefully this isn't the reason vimeo rejects the audio when in mov. The Object type should be Low (low complexity). Some decoders cannot decode anything else.

Video codec


I decreased the maximum GOP size to 30 as recommended by Vimeo. B-frames still screw up some decoders, so I didn't enable them. All other settings are default.


I encode with constant quality. In quicktime, there is no difference between CBR and VBR video, so the decoder won't notice. Constant quality also has the advantage that this setting is independent from the image size. The quantizer parameter was decreased from 26 to 16 to increase quality. It could be decreased further.

Bugs

The following bugs were fixed during that process:
  • Reading raw DV files was completely broken. I broke it when I implemented DVCPROHD support last summer.
  • Chroma placement for H.264 is the same as for MPEG-2. This is now handled correctly by libquicktime and gmerlin-avdecoder.
  • Blending of text subtitles onto video frames in the transcoder was broken as well. It's needed for the advertisement banner at the end.
  • Gmerlin-avdecoder always signalled the existence of timecodes for raw DV. This is ok if the footage comes from a tape, but when recording on the fly my camera produces no timecodes. This resulted in a Quicktime file with a timecode track, but without timecodes. Gmerlin-avdecoder was modified to report timecodes only if the first frame actually contains a timecode.
  • For making the screenshots, I called
    LANG="C" gmerlin_transcoder
    This switched the GUI to English, except for the items belonging to libquicktime. I found that libquicktime translated the strings way too early (the German strings were saved in the gmerlin plugin registry). I made a change to libquicktime so that the strings are only translated for the GUI widget. Internally they are always English.



The result


Wednesday, November 19, 2008

Make your webcam suck less

Every webcam sucks. Not because of the webcam itself, but because of the way it's handled by the software. Some programs support only RGB formats, others work only in YUV. Supporting all pixelformats directly in the hardware would increase the price of these low-end articles. Supporting all pixelformats in the drivers would mean having something similar to gavl in the Linux kernel. The Linux kernel developers don't want this because it belongs in userspace. They are right IMO. And since not all programs have proper pixelformat support, you can always find an application which doesn't support your cam.

Other problems are that some webcams flip the image horizontally (for reasons I don't want to research). Furthermore, some programs aren't really smart when detecting webcams. They stop at the first device they can't handle (which can be a TV card instead of a webcam).

So the project was to make a webcam device at /dev/video0, which supports as many pixelformats as possible and allows image manipulation (like horizontal flipping).

The solution involved the following:
  • Wrote a V4L2 input module for the real webcam (not directly necessary for this project though).
  • Fixed my old webcam tool camelot. Incredible how software breaks, if you don't maintain it for some time.
  • Added support for gmerlin filters in camelot: These can not only correct the image flipping, they provide tons of manipulation options. Serious ones and funny ones.
  • Added an output module for vloopback. It's built into camelot and provides the webcam stream through a video4linux (1, not 2) device. It supports most video4linux pixelformats because it has the conversion power of gavl behind it. Vloopback is not in the standard kernel. I got it from svn with

    svn co http://www.lavrsen.dk/svn/vloopback/trunk/ vloopback
A tiny initialization script (to be called as root) initializes the kernel modules:
#!/bin/sh
# Remove modules if they were already loaded
rmmod pwc
rmmod vloopback

# Load the pwc module, real cam will be /dev/video3
modprobe pwc dev_hint=3

# Load the vloopback module, makes /dev/video1 and /dev/video2
modprobe vloopback dev_offset=1

# Link /dev/video2 to /dev/video0 so even stupid programs find it
ln -sf /dev/video2 /dev/video0
Instead of the pwc module, you must load the one appropriate for your webcam. Not sure if all webcam drivers support the dev_hint option.

My new webcam works with the following applications:
These are all I need for now. Not working are kopete, flash and Xawtv.

Thursday, November 13, 2008

Gmerlin pipelines explained

Building multimedia software on top of gavl saves a lot of time others spend on writing optimized conversion routines (gavl already has more than 2000 of them) and bullet-proof housekeeping functions.

On the other hand, gavl is a low-level library, which leaves lots of architectural decisions to the application level. And this means, that gavl will not provide you with fully featured A/V pipelines. Instead, you have to write them yourself (or use libgmerlin and take a look at include/gmerlin/filters.h and include/gmerlin/converters.h).

I'm not claiming to have found the perfect solution for the gmerlin player and transcoder, but nevertheless here is how it works:

Building blocks
The pipelines are composed of
  • A source plugin, which gets A/V frames from a media file, URL or a hardware device
  • Zero or more filters, which somehow change the A/V frames
  • A destination plugin. In the player it displays video or sends audio to the soundcard. For the transcoder, it encodes into media files.
  • Format converters: These are inserted on demand between any two of the above elements
Asynchronous pull approach
The whole pipeline is pull-based. Pull-based means that each component requests data from the preceding component. Asynchronous means that (in contrast to plain gavl) we make no assumption on how many frames/samples a component needs at the input for producing one output frame/sample. This makes it possible to do things like framerate conversion or framerate-doubling deinterlacing. As a consequence, filters and converters which remember previous frames need a reset function to forget about them (the player e.g. calls them after seeking).

Unified callbacks
In modular applications it's always important that modules know as little as possible about each other. For A/V pipelines this means that each component gets data from the preceding component using a unified callback, no matter if it's a filter, converter or source. There are prototypes in gmerlin/plugin.h
typedef int (*bg_read_audio_func_t)(void * priv, gavl_audio_frame_t * frame,
                                    int stream, int num_samples);

typedef int (*bg_read_video_func_t)(void * priv, gavl_video_frame_t * frame,
                                    int stream);
These are provided by input plugins, converters and filters. The stream argument is only meaningful for media files which have more than one audio or video stream. How the pipeline is exactly constructed (e.g. if intermediate converters are needed) matters only during initialization, not in the time critical processing loop.

Asynchronous vs synchronous
As noted above, some filter types are only realizable if the architecture is asynchronous. Another advantage is that for a filter, the input and output frames can be the same (in-place conversion). E.g. the timecode tweak filter of gmerlin looks like:
typedef struct
{
  bg_read_video_func_t read_func;
  void * read_data;
  int read_stream;

  /* Other stuff */
  /* ... */
} tc_priv_t;

static int read_video_tctweak(void * priv, gavl_video_frame_t * frame,
                              int stream)
{
  tc_priv_t * vp;
  vp = (tc_priv_t *)priv;

  /* Let the preceding element fill the frame, return 0 on EOF */
  if(!vp->read_func(vp->read_data, frame, vp->read_stream))
    return 0;

  /* Change frame->timecode */
  /* ... */

  /* Return success */
  return 1;
}
A one-in-one-out API would need to memcpy the video data only for changing the timecode.

Of course in some situations outside the scope of gmerlin, asynchronous pipelines can cause problems. This is especially the case in editing applications, where frames might be processed out of order (e.g. when playing backwards). How to solve backwards playback for filters, which use previous frames, is left to the NLE developers. But it would make sense to mark gmerlin filters, which behave synchronously (most of them actually do), as such so we know we can always use them.

Sunday, November 9, 2008

Release plans

Time to make releases of the packages. The current status is the following:


  • Gmerlin-mozilla is practically ready, though some well hidden bugs still cause it to crash sometimes. Also the scripting interface could be further extended.
  • gavl is ready. New features since the last version are timecode support, image transformation and a contributed varispeed capable audio resampler.
  • gmerlin-avdecoder got lots of fixes, support for newer ffmpegs, a demuxer for redcode files and RTP/RTSP support. The latter was the most difficult to implement. It still needs some work for better recovery after packet loss in UDP mode. Since all important features are implemented now, gmerlin-avdecoder will get the version 1.0.0.
  • The GUI player can now import directories with the option "watch directory". This will cause the album to be synchronized with the directory each time it is opened. The plan is to further extend this so that even an opened album is regularly synchronized via inotify. Apart from this, the gmerlin package is ready.

Thursday, November 6, 2008

Introducing gmerlin-mozilla

What's the best method to check if your multimedia architecture is really completely generic and reusable? One way is to write a firefox plugin for video playback and beat on everything until it no longer crashes. The preliminary result is here:



And here are some things I think are worth noting:

How a plugin gets it's data
There are 2 methods:
  • firefox handles the TCP connection and passes data via callbacks (NPP_WriteReady, NPP_Write). The gmerlin plugin API got a callback based read interface for this.
  • the plugin opens the URL itself
The first method has the advantage that protocols not supported by gmerlin but by firefox (e.g. https) will work. The disadvantage is that passing the data from firefox to the input thread of the player will almost lock up the GUI, because firefox spends most of its time waiting until the player can accept more data. I found no elegant way to prevent this. Thus, such streams are written to a temporary file and read by the input thread. Local files are recognized as such and opened by the plugin.

Emulating other plugins
Commercial plugins (like Realplayer or Quicktime) have lots of gimmicks. One of these lets you embed multiple instances of the plugin, where one will show up as the video window, another one as the volume slider etc. Gmerlin-mozilla handles these pragmatically: The video window always has a toolbar (which can hide after the mouse was idle), so additional control widgets are not initialized at all. They will appear as grey boxes.

Of course not all oddities are handled correctly yet, but the infrastructure for doing this is there.

Scriptability
Another gimmick is to control the plugin from a JavaScript GUI. While the older scripting API (XPCOM) was kind of bloated and forced the programmer into C++, the new method (npruntime) is practically as versatile, but much easier to support (even in plain C). Basically, the plugin exports an object (an NPObject), which has (among others) functions for querying the supported methods and properties. Other functions exist for invoking methods, or setting and getting properties. Of course not all scripting commands are supported yet.

GUI
A GUI for a web-plugin must look sexy, that's clear. Gmerlin-mozilla has a GUI similar to the GUI player (which might look completely unsexy for some). But in contrast to other free web-plugins it's skinnable, so there is at least a chance to change this.

Some GUI widgets had to be updated and fixed before they could be used in the plugin. Most importantly, timeouts (like for the scrolltext) have to be removed from the event loop before the plugin is destroyed, otherwise a crash happens afterwards.

The fine thing is that firefox also uses gtk-2 for its GUI, so having Gtk widgets works perfectly. If the browser isn't gtk-2 based, the plugin won't load.

Embedding technique
Gmerlin-mozilla needs XEmbed. Some people hate XEmbed, but I think it's pretty well designed as long as you don't expect too much from it. The Gmerlin X11 display plugin already supports XEmbed because it always opens its own X11 connection. After I fixed some things, it embeds nicely into firefox.

Configuration
The GUI should not be bloated by exotic buttons, which are rarely used. Therefore most of the configuration options are available via the right-click menu. Here, you can also select fullscreen mode.

Thursday, October 16, 2008

Streaming through the NAT

The only missing network streaming protocol for gmerlin-avdecoder was RTSP/RTP, so I decided to implement it. Some parts (the ones needed for playing the Real-rtsp variant) were already there, but the whole RTP stuff was missing.

The advantage of these is that they are well documented in RFCs. With some knowledge about how sockets and their API work, implementation was straightforward. Special about RTSP/RTP is that there is one RTSP connection (usually TCP port 554) which acts like a remote control, while the actual A/V data is delivered over RTP, which usually uses UDP. To make things more complicated, each stream is transported over its own UDP socket, with another socket used for QoS info. Playing a normal movie with audio and video then needs 4 UDP sockets.

Once the basic functions were implemented, I opened 4 UDP ports on my DSL router and I could play movies :)

Then I stumbled across something strange:
  • My code worked completely predictably regarding the router configuration. When I closed the ports on the router (or changed the ports in my code), it stopped working
  • Both ffmpeg and vlc (which, like MPlayer, uses live555 for RTSP) always work in UDP mode, no need to manually open the UDP ports. Somehow they make my router forward the incoming RTP packets to my machine.
So the question was: How?

After spending some time with wireshark and strace I made sure that I set up my sockets the same way as the other applications and that the RTSP requests were the same. When gmerlin-avdecoder still didn't make it through the NAT (with me almost freaking out), I decided to take a look at some TCP packets which were marked with the string "TCP segment of a reassembled PDU". I noticed that these occur only in the wireshark dump of gmerlin-avdecoder, not in the others.

After googling a bit, the mystery was solved:
  • The Router (which was found to be a MIPS-based Linux box) recognizes the RTSP protocol. By parsing the client_port field of the SETUP request it knows which UDP ports it must open and forward to the client machine.
  • The "TCP segment of a reassembled PDU" packets are small pieces belonging to one larger RTSP request.
  • If the SETUP line is not in the same TCP packet as the line which defines the transport, the recognition by the router will fail.
  • Wireshark fooled me by assembling the packets belonging to the same request into a larger one and displaying it together with the pieces (this feature can be turned off in the wireshark TCP configuration).
  • The fix was simple: I write the whole request into one string and send this string at once (see the sketch after this list). Finally the router automagically sends the RTP packets to gmerlin-avdecoder.
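For illustration, a sketch of what the fix amounts to; the URL, the transport line and the error handling are simplified, and a robust version would also loop over partial send()s:

/* Sketch of the fix: assemble the complete RTSP SETUP request in one
   buffer and hand it to the kernel with a single send(), so that NAT
   helpers which inspect individual packets see the whole request.
   Simplified for illustration. */
#include <stdio.h>
#include <sys/socket.h>

static int send_setup_request(int fd, const char * url,
                              int rtp_port, int cseq)
{
  char req[1024];
  int len = snprintf(req, sizeof(req),
                     "SETUP %s RTSP/1.0\r\n"
                     "CSeq: %d\r\n"
                     "Transport: RTP/AVP;unicast;client_port=%d-%d\r\n"
                     "\r\n",
                     url, cseq, rtp_port, rtp_port + 1);
  if(len < 0 || len >= (int)sizeof(req))
    return -1;
  /* One send() call for the whole request instead of line-by-line writes */
  return (send(fd, req, len, 0) == len) ? 0 : -1;
}

Sending the request line by line with separate write() calls is what produced the split "TCP segment of a reassembled PDU" packets seen in wireshark.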
What did I learn through this process? Most notably that TCP is stream-based only for the client and the server. Any hardware between these 2 only sees packets. Applications relying on intelligent network hardware must indeed take care which data will end up in which packet.

You might think that it's actually no problem to open UDP ports on the router, and doing such things manually is better than automatically. But then you'll have many people running their clients with the same UDP ports, which makes attacks easier for the case that gmerlin-avdecoder has a security hole. Much better is to choose the ports randomly. Then, we can also have multiple clients in the same machine. The live555 library uses random ports, ffmpeg doesn't.

Wednesday, October 8, 2008

Globals in libs: If and how

As a follow-up to this post I want to concentrate on the cases where global variables are tolerable and how this should be done.

Tolerable as global variables are data whose initialization must be done at runtime and takes a significant amount of time. One example is the libquicktime codec registry. Its creation involves scanning the plugin directory, comparing the contents with a registry and loading all modules (with a time consuming dlopen) for which the registry entries are missing or outdated. This is certainly not something which should be done per instance (i.e. for each opened file). Other libraries have similar things.

The next question is how they can be implemented. A simple goal is that the library must be linkable with a plugin (i.e. a dynamic module) instead of an executable. This means that repeated loading and unloading (from different threads) must work without any problems. A well designed plugin architecture knows as little as possible about the plugins, so having global reference counters for each library a plugin might link in is not possible.

Global initialization and cleanup functions


Many libraries have functions like libfoo_init() and libfoo_cleanup(), which are to be called before the first and after the last use of other functions from libfoo, respectively. This causes problems for a plugin, which has no idea if this library has already been loaded/initialized by another plugin (or by another instance of itself). Also, before a plugin is unloaded there is no way to find out if libfoo_cleanup() can safely be called or if this will crash another plugin. Omitting the libfoo_cleanup() call creates a memory leak if the libfoo_init() function allocated memory. From this we find that the global housekeeping functions are ok if either:

  • Initialization doesn't allocate any resources (i.e. the cleanup function is either a noop or missing) and
  • Initialization is (thread safely) protected against multiple calls

or:

  • Initialization and cleanup functions maintain an internal (thread safe) reference counter, so that only the first init and last cleanup call will actually do something (a minimal sketch follows below)
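A minimal sketch of such a reference counted pair for a hypothetical libfoo:

/* Sketch of reference counted global init/cleanup for a hypothetical
   libfoo: only the first init and the last cleanup do real work. */
#include <pthread.h>

static pthread_mutex_t init_mutex = PTHREAD_MUTEX_INITIALIZER;
static int init_count = 0;

void libfoo_init(void)
{
  pthread_mutex_lock(&init_mutex);
  if(init_count++ == 0)
  {
    /* Allocate global resources here */
  }
  pthread_mutex_unlock(&init_mutex);
}

void libfoo_cleanup(void)
{
  pthread_mutex_lock(&init_mutex);
  if((init_count > 0) && (--init_count == 0))
  {
    /* Free global resources here */
  }
  pthread_mutex_unlock(&init_mutex);
}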


Initialization on demand, cleanup automatically


This is how the libquicktime codec registry is handled. It meets the above goals but doesn't need any global functions. Initialization on demand means that the codec registry is initialized before it's accessed for the first time. Each function which accesses the registry starts with a call to lqt_registry_init(). The subsequent registry access is enclosed by lqt_registry_lock() and lqt_registry_unlock(). These 3 functions do the whole magic and look like:


static int registry_init_done = 0;
pthread_mutex_t codecs_mutex = PTHREAD_MUTEX_INITIALIZER;

void lqt_registry_lock()
{
  pthread_mutex_lock(&codecs_mutex);
}

void lqt_registry_unlock()
{
  pthread_mutex_unlock(&codecs_mutex);
}

void lqt_registry_init()
{
  /* Variable declarations omitted */
  /* ... */

  lqt_registry_lock();
  if(registry_init_done)
  {
    lqt_registry_unlock();
    return;
  }

  registry_init_done = 1;

  /* Lots of stuff */
  /* ... */

  lqt_registry_unlock();
}


We see that protection against multiple calls is guaranteed. The protection mutex itself is initialized from the very beginning (before the main function is called).

While this initialization should work on all POSIX systems, automatic freeing is a bit more tricky and only possible with gcc (I don't know if other compilers have similar features). The best time for freeing global resources is right before the library is unloaded. Most binary formats let you mark functions which should be called before unmapping the library (in ELF files, this is done by putting them into the .fini section). In the sourcecode, this looks like:

#if defined(__GNUC__)

static void __lqt_cleanup_codecinfo() __attribute__ ((destructor));

static void __lqt_cleanup_codecinfo()
{
  lqt_registry_destroy();
}

#endif

Fortunately the dlopen() and dlclose() functions maintain reference counts for each module. So the cleanup function is guaranteed to be called by the dlclose() call which unloads the last instance of the last plugin linked to libquicktime.

I regularly check my programs for memory leaks with valgrind. Usually (i.e. after I fixed my own code) all remaining leaks come from libraries, which miss some of the goals described above.

Remove globals from libs

Global variables in libraries are bad, everyone knows that. Maybe I'll make another post explaining when they can be tolerated and how this can be done. But for now we assume that they are simply bad.

One common mistake is to declare static data (like tables) as non-const. This makes them practically variables. If the code never changes them, they cause no problem in terms of thread safety. But unfortunately the dynamic linker doesn't know that, so they will be mapped into r/w pages when the library is loaded. And those pages will, of course, not be shared between applications so you end up with predictable redundant blocks in your precious RAM.

Cleaning this up is simple: Add const to all declarations where it's missing. But how does one find all these declarations in a larger sourcetree in a reasonable time? The ELF format is well documented and there are numerous tools to examine ELF files.

Let's take the following C file and pretend it's a library built from hundreds of source files with 100000s of code lines:

struct s
{
  char * str;
  char ** str_list;
  int i;
};

struct s static_data_1 =
{
  "String1",
  (char*[]){ "Str1", "Str2" },
  1,
};

char * static_string_1 = "String2";

int zeroinit = 0;
Now there are 2 sections in an ELF file, which need special attention: The .data section contains statically initialized variables. The .bss section contains data which is initialized to zero. After compiling the file with gcc -c the sizes of the sections can be obtained with:
# size --format=SysV global.o
global.o  :
section              size   addr
.text                   0      0
.data                  56      0
.bss                    4      0
.rodata                26      0
.comment               42      0
.note.GNU-stack         0      0
Total                 128
So we have 56 bytes in .data and 4 bytes in .bss. After successful cleanup all these should ideally end up in the .rodata section (read-only statically initialized data). Since we have 100000 lines of code, the next step is to find the variable names (linker symbols) contained in the sections:
# objdump -t global.o

global.o: file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    df *ABS*            0000000000000000 global.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .rodata          0000000000000000 .rodata
0000000000000020 l    O  .data            0000000000000010 __compound_literal.0
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000000 g    O  .data            0000000000000018 static_data_1
0000000000000030 g    O  .data            0000000000000008 static_string_1
0000000000000000 g    O  .bss             0000000000000004 zeroinit
Now we know that the variables static_data_1, static_string_1 and zeroinit are affected.

The symbol __compound_literal.0 comes from the expression (char*[]){ "Str1", "Str2" }. The bad news is that compound literals are lvalues according to the C99 standard, so they won't be assumed const by gcc. You can declare them const, but they'll still be in the .data section, at least for gcc-Version 4.2.3 (Ubuntu 4.2.3-2ubuntu7). The cleaned up file looks like:
struct s
{
  const char * str;
  char ** const str_list;
  int i;
};

static const struct s static_data_1 =
{
  "String1",
  (char*[]){ "Str1", "Str2" },
  1,
};

char const * const static_string_1 = "String2";

const int zeroinit = 0;
The resulting symbol table:
0000000000000000 l    df *ABS*            0000000000000000 global1.c
0000000000000000 l    d  .text            0000000000000000 .text
0000000000000000 l    d  .data            0000000000000000 .data
0000000000000000 l    d  .bss             0000000000000000 .bss
0000000000000000 l    d  .rodata          0000000000000000 .rodata
0000000000000010 l    O  .rodata          0000000000000018 static_data_1
0000000000000000 l    O  .data            0000000000000010 __compound_literal.0
0000000000000000 l    d  .note.GNU-stack  0000000000000000 .note.GNU-stack
0000000000000000 l    d  .comment         0000000000000000 .comment
0000000000000040 g    O  .rodata          0000000000000008 static_string_1
0000000000000048 g    O  .rodata          0000000000000004 zeroinit
Larger libraries have huge symbol tables, so you will of course filter it with:

grep \\.data | grep -v __compound_literal

So if you want to contribute to a library which needs some cleanup, and you are of the "I know just a little C but I want to help"-type, this is a good idea for a patch :)

2 ssh servers on the same port

A TCP port can only be used by one server process for incoming connections. If another process wants to listen on the same port, it will get an "address already in use" error from the OS. If you know the background it's pretty clear why it must be so.

But imagine a case like the following:
  • You want to make a linux machine reachable via ssh
  • From the same subnet passwords are sufficient
  • From outside only public key authentication is allowed
  • Your users are already happy if they get their ssh clients working on Windows XP. You don't want to bother them (and indirectly yourself as the admin) with nonstandard port numbers.
  • Your sshd doesn't support different configurations depending on the source address.
At first glance, this looks unsolvable. But if you have an iptables firewall (and you will have one if the machine is reachable worldwide), there is a little known trick called port redirection.

You run 2 ssh servers: The external one (with public key authentication) listens on port 22, the internal one (with passwords) listens e.g. on port 2222. Then you configure your iptables such that incoming packets from the subnet to port 22 are redirected to port 2222. The corresponding lines in the firewall script look like:


# Our Subnet
SUB_NET="192.168.1.0/24"

# iptables command
IPTABLES=/usr/sbin/iptables

# default policies, flush all tables etc....
...

# ssh from our subnet (redirect to port 2222 and let them through)
$IPTABLES -t nat -A PREROUTING -s $SUB_NET -p tcp --dport 22 \
-j REDIRECT --to-ports 2222
$IPTABLES -A INPUT -p tcp -s $SUB_NET --syn --dport 2222 -j ACCEPT

# ssh from outside
$IPTABLES -A INPUT -p tcp -s ! $SUB_NET --syn --dport 22 -j ACCEPT


I have this configuration on 2 machines for many months now with zero complaints so far.

Wednesday, September 24, 2008

The fastest loop in C

Sure, practically nobody knows assembler nowadays. But if you use a certain construct very often in time critical code, you might get curious about the fastest way to do it. One example is loops. I tested this with a loop with the following properties:
  • Loop count is unknown at compile time, so the loop cannot be unrolled
  • The loop index i is not used within the loop body. In particular it doesn't matter if it's incremented or decremented
This kind of loop occurs very often in gavl colorspace conversion routines. The following C file has 3 functions which do exactly the same thing:

void do_something();

static int num;

void loop_1()
{
  int i;
  for(i = 0; i < num; i++)
    do_something();
}

void loop_2()
{
  int i = num;
  while(i--)
    do_something();
}

void loop_3()
{
  int i = num+1;
  while(--i)
    do_something();
}

This code can be compiled with gcc -O2 -S. The resulting loop bodies are the following:

loop 1:

.L17:
xorl %eax, %eax
addl $1, %ebx
call do_something
cmpl %ebp, %ebx
jne .L17
loop 2:
.L11:
xorl %eax, %eax
addl $1, %ebx
call do_something
cmpl %ebp, %ebx
jne .L11
loop 3:
.L5:
xorl %eax, %eax
call do_something
subl $1, %ebx
jne .L5
As you see, the winner is loop 3. Here, the whole logic (without initialization) needs 2 machine instructions:
  • Decrement ebx
  • Jump if the result is nonzero
I doubt it can be done faster. You also see that loop 2 looks simpler than loop 1 in C, but the compiler produces exactly the same code. I changed the innermost gavl pixelformat conversion loops to loop 3 and could even measure a tiny speedup for very simple conversions.

Monday, September 22, 2008

Fun with the colormatrix

Some years ago, when I wrote lemuria, I was always complaining that OpenGL doesn't have a colormatrix. Later I found that some extensions provide this, but not the core API. Today I think the main problem was that I used a graphics system (OpenGL) which is optimized to look as realistic as possible for a visualization which should look as surrealistic as possible :)

Later, when I concentrated on more serious work like developing high quality video filters, I rediscovered the colormatrix formalism. Now what is that all about and what's so exciting about it? If you assume each pixel to be a vector of color channels, multiplying the vector by a matrix is simply a linear transformation of each pixel. In homogeneous coordinates and in RGBA or YUVA (i.e. 4 channel) colorspace, it can be completely described by a 4x5 matrix. Now there are lots of commonly used video filters which can be described by a colormatrix multiplication:

Brightness/Contrast (Y'CbCrA)

[ c  0  0  0  b - (1+c)/2 ]
[ 0  1  0  0  0           ]
[ 0  0  1  0  0           ]
[ 0  0  0  1  0           ]

b and c are between 0.0 and 2.0.

Saturation/Hue rotation (Y'CbCrA)

[ 1  0          0          0  0 ]
[ 0  s*cos(h)  -s*sin(h)   0  0 ]
[ 0  s*sin(h)   s*cos(h)   0  0 ]
[ 0  0          0          1  0 ]

h is between -pi and pi (0 is neutral). s is between 0.0 (grayscale) and 2.0 (oversaturated).

Invert single RGB channels (RGBA)

[ -1  0  0  0  1 ]
[  0  1  0  0  0 ]
[  0  0  1  0  0 ]
[  0  0  0  1  0 ]
This inverts only the first (red) channel. Inverting other channels is done by changing any of the other lines accordingly.

Swap RGB channels (RGBA)

[ 0  0  1  0  0 ]
[ 0  1  0  0  0 ]
[ 1  0  0  0  0 ]
[ 0  0  0  1  0 ]
This swaps the first (red) and 3rd (blue) channel. It can be used to rescue things if some buggy routine confused RGB and BGR. Other swapping schemes are trivial to do.

RGB gain (RGBA)

[ gr  0   0   0  0 ]
[ 0   gg  0   0  0 ]
[ 0   0   gb  0  0 ]
[ 0   0   0   1  0 ]

gr, gg and gb are in the range 0.0..2.0.
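For concreteness, this is how such a 4x5 matrix acts on one 4 channel pixel in homogeneous coordinates (floating point and unoptimized; as noted in the final remarks below, the real routines for integer pixelformats use integer matrices with the proper ranges and offsets):

/* Sketch: apply a 4x5 colormatrix to one float pixel (RGBA or Y'CbCrA)
   in homogeneous coordinates. Illustration only, not the optimized
   gavl routines. */
static void apply_colormatrix(const float m[4][5],
                              const float * in  /* 4 channels */,
                              float * out       /* 4 channels */)
{
  int row;
  for(row = 0; row < 4; row++)
    out[row] = m[row][0] * in[0] + m[row][1] * in[1] +
               m[row][2] * in[2] + m[row][3] * in[3] +
               m[row][4]; /* the constant (homogeneous) column */
}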

There are of course countless other filters possible, like generating an alpha channel from (inverted) luminance values etc. Now, what do you do if you want to make e.g. a brightness/contrast filter working on RGBA images? The naive method is to transform the RGBA values to Y'CbCrA, do the filtering and transform back to RGBA. And this is what can be optimized in a very elegant way by using some simple matrix arithmetics. Instead of performing 2 colorspace conversions in addition to the actual filter for each pixel, you can simply transform the colormatrix (M) from Y'CbCrA to RGBA:






Untransformed pixel in RGBA:     (p)
Untransformed pixel in Y'CbCrA:  (RGBA->Y'CbCrA) * (p)
Transformed pixel in Y'CbCrA:    (M) * (RGBA->Y'CbCrA) * (p)
Transformed pixel in RGBA:       (Y'CbCrA->RGBA) * (M) * (RGBA->Y'CbCrA) * (p)
The matrices (RGBA->Y'CbCrA) and (Y'CbCrA->RGBA) are the ones which convert between RGBA and Y'CbCrA. Since matrix multiplication is associative you can combine the 3 matrices to a single one during initialization:

(M)_RGBA = (Y'CbCrA->RGBA) * (M)_Y'CbCrA * (RGBA->Y'CbCrA)

If you have generic number-crunching routines which do the vector-matrix multiplication for all supported pixelformats, you can reduce the conversion overhead in your processing pipeline significantly.
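A sketch of that one-time combination; the 4x5 matrices are treated as 5x5 matrices whose implicit last row is (0 0 0 0 1), and the actual conversion matrices are not spelled out here:

/* Sketch: combine (Y'CbCrA->RGBA) * M * (RGBA->Y'CbCrA) into a single
   4x5 matrix once at init time. Illustration only. */
static void colormatrix_multiply(const float a[4][5], const float b[4][5],
                                 float dst[4][5])
{
  int row, col, k;
  for(row = 0; row < 4; row++)
    for(col = 0; col < 5; col++)
    {
      dst[row][col] = 0.0f;
      for(k = 0; k < 4; k++)
        dst[row][col] += a[row][k] * b[k][col];
      if(col == 4)                 /* implicit last row of b: (0 0 0 0 1) */
        dst[row][col] += a[row][4];
    }
}

/* (M)_RGBA = (Y'CbCrA->RGBA) * (M)_Y'CbCrA * (RGBA->Y'CbCrA) */
static void build_combined_matrix(const float rgba_to_ycbcra[4][5],
                                  const float ycbcra_to_rgba[4][5],
                                  const float m_ycbcra[4][5],
                                  float m_rgba[4][5])
{
  float tmp[4][5];
  colormatrix_multiply(m_ycbcra, rgba_to_ycbcra, tmp);
  colormatrix_multiply(ycbcra_to_rgba, tmp, m_rgba);
}

Applying the combined matrix per pixel then costs exactly the same as applying M alone.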

Another trick is to combine multiple transformations in one matrix. The gmerlin video equalizer filter does this for brightness/contrast/saturation/hue rotation.

A number of gmerlin filters use the colormatrix method. Of course, for native pixelformats (image colorspace = matrix colorspace), the transformation is done directly, since it usually needs far fewer operations than a generic matrix-vector multiplication. But for foreign colorspaces, the colormatrix is transformed to the image colorspace as described above.

Some final remarks:
  • The colormatrix multiplication needs all channels for each pixel. Therefore it doesn't work with subsampled chroma planes. An exception is the brightness/contrast/saturation/hue filter, because luminance and chroma operations are completely separated here.
  • For integer pixelformats the floating point matrix is converted to an integer matrix, where the actual ranges/offsets for Y'CbCr are taken into account.
  • In practically all cases, the color values can over- or underflow. The processing routines must do proper clipping.

Sunday, September 21, 2008

Back from California

My job sometimes requires me to attend international conferences and workshops. It's pretty exciting, but also very hard work. You have to prepare presentations, posters and articles for the conference digest. You must concentrate on others' presentations (ignoring your jet lag) and the brain gets much more input per day than usual. Some people, who wish me a nice holiday before I leave, don't really know what this is about :)

This year, I had the honor to make a 2 weeks trip to Southern California, first to San Diego and then to Pasadena. It was my first visit to the US so here are some differences I noticed with respect to Old Europe. Of course these are highly subjective and based on what I saw in two weeks in just a tiny corner of a huge country.

Shops, Restaurants
If you are used to the unfriendliness of German shop employees and waiters, you'll be positively surprised in the US. They always have some friendly words for you. They might not be serious with that, but they do it well enough so it works. This is definitely something where Germans can learn from the US.

Public transport
As a resident of a big German city, I can live perfectly well without owning a car. Of course people here always complain about the local trains being too expensive, finishing operation too early in the night etc. But this is still paradise compared to the US. At the bus stops in San Diego I saw mostly people who didn't really look prosperous. It seems that everyone who can afford a car buys one. For a reason.

International news
If you want to learn about a foreign society it's a good idea to watch their TV programs. I learned that American TV news (aside from being extremely hysteric) mostly deal with domestic issues. I talked to an American colleague about that. I was quite surprised that he told me exactly, what I already thought: Americans are self-centered. Maybe a bit of information about foreign countries and societies (especially the ones you plan to bomb) would make some things go more smoothly.

Freedom
Both Western European societies and the US appreciate personal freedom. But the definitions of freedom seem to be somewhat different. In Europe you can usually drink alcohol in public and on many beaches you can decide yourself how much you wear. Americans want to be able to buy firearms, drive big cars and put their dishes into the trashcan after eating. In Europe, I don't miss any of the American freedoms. Not sure about the vice versa.

Shows
Shows of any kind in the US have definitely another dimension. Germans seem to be way too modest to do something like the killer-whale show in the San Diego Seaworld or the shows in the Universal studios in Hollywood.

Sunday, September 7, 2008

Image transformation

Lots of graphics and video effects can be described by a coordinate transformation: There is a rule for calculating destination coordinates from the source coordinates. The rule completely describes the type of transformation.

The most prominent family of transformations are the affine transforms, where the transform rule can be described by a vector matrix multiplication:

(x_dst, y_dst) = (A) * (x_src, y_src)

With this one can implement e.g. scaling and rotation. If the vectors and matrix are given in homogeneous coordinates, one can also shift the image by a vector. Other transforms can be lens distortion, perspective distortion, wave effects and much more.

Now the question: how can such a transform be implemented for digital images? The algorithm is straightforward (a sketch follows the list):
  • Find the inverse transform. For affine transforms, it will simply be the inverse matrix. For other transforms it might not be that easy. The inverse transform will give you the source coordinates as a function of the destination coordinates
  • For each destination pixel, find the coordinates in the source image
  • If the source coordinates are fractional (they usually will be), interpolate the destination pixel from the surrounding source pixels
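Here is a minimal sketch of that algorithm for a single channel float image with bilinear interpolation; the real gavl_image_transform_t adds configurable interpolation, subsampled planes and the border handling described below:

/* Minimal sketch of the generic approach: a user supplied inverse
   transform maps destination to source coordinates, then the pixel is
   bilinearly interpolated. Single channel float image, no border
   handling; illustration only. */

typedef void (*inverse_transform_t)(void * priv,
                                    double x_dst, double y_dst,
                                    double * x_src, double * y_src);

static void transform_image(const float * src, int src_w, int src_h,
                            float * dst, int dst_w, int dst_h,
                            inverse_transform_t func, void * priv)
{
  int x, y;
  for(y = 0; y < dst_h; y++)
    for(x = 0; x < dst_w; x++)
    {
      double xs, ys, fx, fy;
      int xi, yi;

      /* Pixel centers are at (x + 0.5, y + 0.5) */
      func(priv, x + 0.5, y + 0.5, &xs, &ys);
      xs -= 0.5;
      ys -= 0.5;

      xi = (int)xs;
      yi = (int)ys;
      /* Leave the destination pixel untouched if the source
         neighborhood falls outside the image */
      if((xs < 0.0) || (ys < 0.0) ||
         (xi + 1 >= src_w) || (yi + 1 >= src_h))
        continue;

      fx = xs - xi;
      fy = ys - yi;
      dst[y * dst_w + x] =
        (float)((1.0-fy) * ((1.0-fx) * src[ yi   *src_w + xi] +
                                  fx * src[ yi   *src_w + xi+1]) +
                     fy  * ((1.0-fx) * src[(yi+1)*src_w + xi] +
                                  fx * src[(yi+1)*src_w + xi+1]));
    }
}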
Implementation

Since the interpolation is always the same, I decided to implement this in a generic way in gavl (gavl_image_transform_t). With this we can implement lots of different transforms by just defining the inverse coordinate transform as a C function and passing it to the interpolation engine. While the algorithm is simple in theory, the implementation has to take care of some nasty details:

1. Y'CbCr formats with subsampled chroma planes
In the video area, these are the rule rather than the exception. Many filters don't support them and use an RGB format instead (causing some conversion overhead). They can, however, be handled easily if you set up separate interpolation engines for each plane. For the subsampled planes, you do the following (see the sketch after the list):
  • Transform the coordinates of the chroma location to image coordinates (i.e. multiply by the subsampling factors and shift according to the chroma placement)
  • Call the function to get the source coordinates the usual way
  • Transform the source coordinates back to the coordinates of the chroma plane
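A sketch of such a wrapper around the user supplied inverse transform (it reuses the callback type from the sketch above; the offsets and factors are illustrative):

/* Sketch of the coordinate wrapper for a subsampled chroma plane:
   convert plane coordinates to image coordinates, call the actual
   inverse transform, convert back. Illustration only. */

typedef void (*inverse_transform_t)(void * priv,
                                    double x_dst, double y_dst,
                                    double * x_src, double * y_src);

typedef struct
{
  inverse_transform_t func;  /* the user supplied transform */
  void * priv;
  double sub_h, sub_v;       /* subsampling factors, e.g. 2.0, 2.0 for 4:2:0 */
  double off_x, off_y;       /* chroma sample location in image pixels       */
} chroma_transform_t;

static void chroma_inverse_transform(void * priv,
                                     double x_dst, double y_dst,
                                     double * x_src, double * y_src)
{
  chroma_transform_t * t = priv;
  double xi, yi;

  /* Chroma plane -> image coordinates */
  xi = x_dst * t->sub_h + t->off_x;
  yi = y_dst * t->sub_v + t->off_y;

  /* Get the source coordinates the usual way */
  t->func(t->priv, xi, yi, &xi, &yi);

  /* Image -> chroma plane coordinates */
  *x_src = (xi - t->off_x) / t->sub_h;
  *y_src = (yi - t->off_y) / t->sub_v;
}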
2. Destination pixels which are outside the source image
This is e.g. the case where an image is downscaled. Gavl handles these by not touching the destination pixel at all. Then you can fill the destination frame with a color before the transformation and this color will be the background color later on.

3. Destination pixel is inside the source image, but surrounding pixels (needed for interpolation) are not
Here, one can discuss a lot about what should be done. For gavl, I decided to assume that the "missing pixels" have the same color as the closest border pixel. The reason is that instead of handling all possible cases inside the conversion loop for each pixel (which would slow things down due to the additional branches), one can simply shift the source indices and modify the interpolation coefficients once during initialization. The following figure illustrates how this is done:

The start index n and the interpolation coefficients are saved in the interpolation table. After shifting the table, the interpolation routine works without branches (and without crashes). Due to the way the interpolation coefficients are modified we assume that the missing pixels at -2 and -1 are the same color as the border pixel at 0. Of course this is done for x and y directions and also for the case that indices are larger than the maximum one.

Usage

The image transformation is very easy to use, just get gavl from CVS and read the API documentation. There is also a gmerlin filter in CVS (fv_transform) which can be used as a reference. Some questions might however arise when using this:


1. How exactly are coordinates defined?
Gavl scaling and transformation routines work with subpixel precision internally. This is necessary if one wants to handle chroma placement correctly. To make everything correct, one should think a bit about how coordinates are exactly defined. This is an example for a 3x3 image:

The sample values for each pixel are taken from the pixel center. This means, the top-left pixel has a color value corresponding to the location (0.5, 0.5). For chroma planes, the exact sample locations are considered as described here.

2. Handling of nonsquare pixels
These must be handled by the coordinate transform routine provided by you. Basically, you have a "sample aspect ratio" (sar = pixel_width / pixel_height). In your transformation function, you do something like:

x_dst *= sar; /* Distorted -> undistorted */

/* Calculate source coordinate assuming undistorted image */

x_src /= sar; /* Undistorted -> distorted */

3. Image scaling
One is tempted to think that this all-in-one solution can be used for scaling as well. It is of course true, but it's a stupid thing to do. Scaling can be highly optimized in many ways. The gavl_video_scaler_t does this. It's thus many times faster than the generic transform.

4. Downsampling issues
The image transform makes no assumptions about the type of the transform. Especially not if the transform corresponds to downsampling or not. And this is where some issues arise. While for upsampling it's sufficient to just interpolate the destination pixels from the source pixels, for downsampling the image must be low-pass filtered (i.e. blurred) first. This is because otherwise the sampling theorem is violated and aliasing occurs. A very scary example for this is discussed here. One more reason to use the gavl_video_scaler_t wherever possible because it supports antialiasing filters for downscaling. The good news is that usual video material is already a bit blurry and aliasing artifacts are hardly visible.

Examples
After this much theory, finally some examples. These images were made with the gmerlin transform filter (the original photo was taken in the South Indian ruin city of Hampi).

Perspective effect. Coordinate transform ported from the Gimp.


Rotation

Whirl/pinch (Coordinate transform ported from cinelerra)

Lens distortion (Coordinate transform ported from EffecTV)


Generic affine (enter single matrix coefficients)

Music from everywhere - everywhere

Project goal was simple. Available audio sources are
  • Vinyl records
  • Analog tapes
  • CDs
  • Radio (analog and internet)
  • Music files in all gmerlin supported formats on 2 PCs
All these should be audible in stereo in all parts of the apartment including balcony and bathroom. Not everywhere at the same time though and not necessarily in high-end quality.

This had been on my wish list for many years, but I was too lazy to lay cables through the whole apartment (especially through doors, which should be lockable). And since there are lots of signal paths, the result would have been a bit messy. Dedicated wireless audio solutions didn't really convince me, since they are mostly proprietary technology. I never want to become a victim of Vendor lock-in (especially not at home).

When I first read about the WLAN-radios I immediately got the idea, that those are the key to the solution. After researching a lot I found one radio which has all features I wanted:
  • Stereo (if you buy a second speaker)
  • Custom URLs can be added through a web interface (many WLAN radios don't allow this!)
  • Ogg Vorbis support (so I can use Icecast2/ices2 out of the box)
The block diagram of the involved components is here:

Now I had to set up the streaming servers. The icecast server itself installs flawlessly on Ubuntu 8.04. It's started automatically during booting. For encoding and sending the stream to icecast, I use ices2 from the commandline. 2 tiny problems had to be solved:
  • Ubuntu 8.04 uses PulseAudio as the sound server, while ices2 only supports Alsa. Recording from an Alsa hardware device while PulseAudio is running doesn't work.
  • For grabbing the audio from a running gmerlin player the soundcard and driver need to support loopback (i.e. record what's played back). This is the case for the Audigy soundcard in the Media PC, but not for the onboard soundcard in the desktop machine.
Both problems can be solved by defining pulse devices in the ~/.asoundrc file:
pcm.pulse
{
  type pulse
}

pcm.pulsemon
{
  type pulse
  device alsa_output.pci_8086_293e_sound_card_0_alsa_playback_0.monitor
}

ctl.pulse
{
  type pulse
}
The Alsa devices to be written into the ices2 configuration files are called pulse and pulsemon. The device line is hardware dependent. Use the PulseAudio Manager (section Devices) to find out the corresponding name for your card. If your card and driver support loopback, the pulsemon device isn't necessary.

Some fine-tuning can be done regarding encoding bitrate, buffer sizes and timeouts. When I optimized everything for low latency, icecast considered the WLAN radio too slow and disconnected it. More conservative settings work better. Encoding quality is always 10 (maximum), the corresponding bitrate is around 500 kbit/s.

Mission accomplished