Chapter 1. Howto

Introduction
GStreamer Pipelines
Media components
MPEG encoding
LIBAV encoders
AVI format
OGG/OGM format
Matroska format
QuickTime format
MPEG PS muxing
LIBAV muxers
Subtitles
Recording
Multiple Images
Media processing
Comparison and Motivation

Introduction

In this context, transcoding can be loosely described as the decoding, optional processing and re-encoding of media data (video, audio, …), possibly from one container format to another. Recording is closely related, differing only in that the source material is taken from some live source rather than from an existing medium. What follows applies equally to either of these.

GStreamer is a multimedia (processing) framework that allows the construction of graphs of media-handling components, ranging from simple playback to complex audio (mixing) and video processing; more explanation can be found on its homepage. It is by no means the only (Un*x) software dealing with multimedia; e.g. other players include MPlayer, VideoLAN, …, and the list of codec processing libraries is equally extensive. It stands out, however, in that it can actually be considered a framework, and what follows will (not surprisingly) focus on a particular approach to transcoding using GStreamer. The primary reason is that there are not many alternatives (including the above) that allow for comfortably processing and, particularly, creating a fairly wide range of media files. Some more considerations can be found in “Comparison and Motivation”.

Of course, there are many more expositions on multimedia (processing); who knows what is out there for Google to find, or what explanations can be found on Wikipedia … Besides all that, there is some consolidated information in e.g. the format and codec sections of the MPlayer documentation. None of this needs to be recited here, so it only remains to consider a GStreamer approach and point of view.

GStreamer Pipelines

A pipeline is a central concept in GStreamer. It is a (usually simple) directed graph connecting media-handling components, typically demuxer, decoder, encoder and muxer or display elements (in this order). For the purposes of what follows, it should suffice to know and understand that:

  • a "media handling component" is usually referred to as element. There are many kinds of elements, for instance (but not exhaustive); demuxer (splits a container file into its constituent streams), decoder (decodes compressed data into raw format), and symmetrically encoder and muxer …. Do note that each type of container or compressed data has its own element, e.g. avidemux, vorbisenc, … Elements are typically made available through installed plugins, which are basically dynamically loaded libraries.

  • elements have pads, which are like ports or plugs that connect elements and through which data flows. For example, an encoder/decoder will basically have 1 input (sink) and output (source) pad each, whereas typically a demuxer has 1 sink pad and several source pads, and a muxer has several sink pads and 1 source pad. Evidently, an element's source pad connects to a downstream (succeeding) element's sink pad.

  • data that flows through pads is described by caps (short for capabilities), which is roughly a mime-type (e.g. audio/x-raw, video/x-raw) along with a set of descriptive/defining properties depending on the mime-type (e.g. width, height, depth). Each block/buffer of data typically holds an (un)compressed video frame, or some number of (un)compressed audio samples, or even just a part of a bitstream taken from file or network, …, typically along with some metadata (if applicable), e.g. indicating whether or not it is a keyframe, and the (media) time and duration of the contained data, …

  • there are elements that interface with the world outside a pipeline. Such elements have only 1 source pad (source elements) or only 1 sink pad (sink elements), and are consequently also simply collectively called sources or sinks respectively. For example, filesrc takes care of reading data from a file and feeding it into a pipeline, and filesink will happily save data a pipeline has been labouring to produce. The outside can be much more live as well; v4l2src will capture data from a video4linux compliant device (such as a TV capture card), whereas e.g. alsasink is equally instrumental on the downstream side.

  • gst-inspect(1) can be used to bring the above to life by providing concrete information on elements (see also the example following this list). Specifically, just running gst-inspect yields a listing of (a.o.) all elements that are available along with some terse description. Executing gst-inspect element for some element shows (again a.o.):

    • in the Pad Templates section; the pads the element has and the type of data to flow through them

    • in the Element Properties section; various settings determining the operation of the element, e.g. a bitrate for encoding, number of (top)lines to crop, volume scale factor to apply, …

  • many elements provide some additional documentation, which should normally be installed along with the plugins themselves (see your distribution system) and be accessible using Devhelp. Alternatively, it can also be found on-line in GStreamer's plugin documentation.

  • intermediate conversion between various (raw) (YUV or RGB) video and audio formats may be necessary in order to connect elements whose caps differ (only in raw format). videoconvert and audioconvert can take care of this, and may have to be inserted one or more times, depending on the various formats supported by elements in the pipeline. Like any element, these can detect the caps supported by their neighbours and will make a best effort to have linking and subsequent negotiation and dataflow succeed, in this case by performing conversion as needed.

  • having a connected pipeline (graph) is one (static) thing, making it do something is another. This can be loosely called running or activating the pipeline, which typically means that sources really get going (reading, capturing, …) and processing/playing takes off. In any case, this is a matter for an application to take care of, and similarly so for keeping an eye on the pipeline as it runs, providing notifications when appropriate, dealing with errors, and finally allowing for clean shutdown when the pipeline indicates it has had enough.
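
For instance, inspecting elements goes as follows; avimux merely serves as an arbitrary example element here, and any installed element name will do:

# list all available elements along with a terse description
gst-inspect
# show the pad templates, caps and properties of a particular element
gst-inspect avimux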

By now, it should be quite conceivable that all of the above (concepts) can nicely cooperate and be linked together to form a pipeline, for instance using a syntax as described in gst-launch(1). For even more details, one might boldly go into e.g. the introductory part(s) of the GStreamer Application Development Manual, although this actually targets a somewhat lower technical level.
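
As a first, minimal illustration (a sketch only; input.ogg and output.wav are placeholder filenames, and the input is assumed to contain a single Vorbis stream), the following launch line demuxes an Ogg file, decodes the audio and saves the result as a WAV file:

gst-launch filesrc location=input.ogg ! oggdemux ! vorbisdec ! \
  audioconvert ! wavenc ! filesink location=output.wav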

When dealing with GStreamer applications, one usually need not be aware of or concerned with all these things, as it is typically the very goal and merit of an application to construct and manage such a pipeline behind the scenes. Such is the case for instance with the media player totem, for which GStreamer can serve as a backend. However, for the goal pursued here, it is up to the reader/user to conduct transcoding by orchestrating a pipeline, by means of gst-launch(1) or (much) more comfortably entrans. After all, the latter provides (a.o.) a small application convenience layer, which alleviates part of the burden, boilerplate, minutiae and intricacy of pipeline building (hm, what fun is still left …).

To emphasize again, the above (possibly daunting) level of detail is not meant to add to the rumours that GStreamer is (way too) complicated. Rather, on the one hand, it intends to dispel these by unveiling some of the mystery through a plain and simple overview. On the other hand, as mentioned above, any GStreamer application can choose to go about its business (including transcoding) while taking care to (more or less) hide the above details. That is, however, simply not the approach that has been chosen here (see also e.g. “Comparison and Motivation”).

The general idea presented here, along with the additional explanations and specific, concrete examples in gst-launch(1) and entrans, should suffice to cover quite a number of transcoding cases. The next section(s) will provide some specific information and gotchas to consider in pipeline assembly and media handling, depending on the particular circumstances.

Media components

The purpose here is neither to re-iterate specific GStreamer related information on pads, caps, … of elements that can be obtained by means of gst-inspect(1), nor to lecture (at least not in great detail) on what is already ubiquitously available out there (see e.g. references in “Introduction”). Similarly, GStreamer (transcoding) is by no means limited to the types or formats mentioned below (again, use gst-inspect(1) to see all that is available). Rather, the endeavour here is to provide some minimal (technical) background that is pertinent to explaining or avoiding phenomena involving certain components that turn up as FAQs (on the mailing list, for example).

MPEG encoding

MPEG-1/2 compressed video can be obtained by means of avenc_mpeg1video, avenc_mpeg2video or mpeg2enc, the former two based on libav's avcodec library, and the latter on mjpegtools' mpeg2enc library.

Further documentation or usage shall not be discussed here. It is noted, however, that the nature of mpeg2enc leads to its output not having timestamp metadata (at least for not so recent versions). As this output is typically muxed by mplex (see also later in “MPEG PS muxing”), this is usually not a concern. If one were to put this content into e.g. AVI (which is possible, although why one would do so is another matter), then entransstamp can easily help out if needed, as in the fragment

mpeg2enc format=3 ! stamp ! avimux
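
For instance, a complete launch line along these lines might look as follows; this is only a sketch, using videotestsrc as a stand-in source and relying on the stamp element from the entrans plugin collection:

videotestsrc num-buffers=250 ! video/x-raw,framerate=25/1 ! \
  mpeg2enc format=3 ! stamp ! avimux ! filesink location=test.avi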

LIBAV encoders

gst-inspect(1) shows quite a list of libav (formerly ffmpeg) based encoders avenc_name, most of which are pretty straightforward to use. They may have limitations in supported video sizes or formats, the most common of which are exposed in recent releases [1], in which case the following pipeline fragment illustrates how to have this dealt with automatically:

videoscale ! videoconvert ! avenc_huffyuv
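
For instance, a (sketched) lossless re-encode, assuming input.avi contains only a video stream and using placeholder filenames, could then read:

filesrc location=input.avi ! decodebin ! videoscale ! videoconvert ! \
  avenc_huffyuv ! avimux ! filesink location=lossless.avi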

AVI format

AVI (Audio Video Interleaved), designed by Microsoft, is a plain and simple container format. Though it has various limitations (for example with respect to streaming), it handles many basic purposes well.

One of the limitations is that it does not store any time data; it only knows of framerate and bitrate. These are then used to deduce what belongs when, which works at least with CBR audio, and is why e.g. VBR audio is not supported (other than by some esoteric, not well-supported extensions).

OGG/OGM format

This is a file format from Xiph.Org (formerly Xiphophorus), typically (but not necessarily) containing Vorbis and/or Theora streams. It can be handled by means of oggdemux and oggmux, and e.g. vorbisdec, vorbisenc, theoradec and theoraenc will probably come along to play as well.

Although it is not quite subject to the limitations mentioned above (e.g. it does handle VBR audio), the format typically expects a particular (video) framerate (just as a specific audio samplerate always applies), and therefore correspondingly (time-)regular streams.
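
By way of illustration (a sketch only, assuming input.avi holds exactly one video and one audio stream; filenames and codec settings are merely placeholders), a transcode to Ogg could be assembled as:

filesrc location=input.avi ! decodebin name=dec \
oggmux name=mux ! filesink location=output.ogg \
dec. ! queue ! videoconvert ! theoraenc ! mux. \
dec. ! queue ! audioconvert ! vorbisenc ! mux.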

Some stand-alone tools that may help in manipulating Ogg include the vorbis-tools and the OGMtools.

Matroska format

Although Matroska is not related to AVI in either its origin or its bitstream structure, it is similar in that it basically stores chunks/blocks of data, preceded by a header and (optionally) followed by index-like data. A simple gst-inspection (of matroskamux or matroskademux) easily shows that it can contain quite a range of material, forming a superset of AVI capabilities in this regard. There are some more differences not evident at first sight:

  • it can hold (any number of) video, audio or subtitle streams, although the latter are just not yet supported by matroskamux (however, see also “Subtitles”) [whereas an AVI contains at most 1 video stream and a number of audio ones]

  • each datablock is recorded along with a (media)timestamp and (media)duration; in particular this caters for (audio) VBR [whereas AVI does not record such information and typically only relies on framerate or constant bitrate]

  • a table of references (index) is included which could (theoretically) refer to every datablock, but usually only does so to a (fairly dense) subset of keyframes [whereas an AVI holds an index to all datachunks]

  • only v2 Matroska files may hold detailed keyframe information (as only SimpleBlock indicates whether or not a block holds a keyframe; but this is not used in Matroska v1 files) [whereas keyframe info is fully available in any decent non-corrupt AVI]

Hence, some practical considerations follow:

  • matroskamux can basically be freely used instead of avimux. One need only (additionally) make sure that the timestamps going into the former make up a nice consecutive (perfect) stream, as they will be recorded for posterity and used in later playback. This could be achieved e.g. by means of identity (with single-segment true) or entransstamp inserted before an encoder element or muxer (some encoders like to compete with the muxer on the notion of time and also keep an eye on this); see also the sketch following this list.

  • Pass-through transcoding is possible with Matroska input, but may well need an element such as divxkey or mpeg4videoparse to ensure that the meta-information on keyframe-ness is properly marked. If not, i.e. lacking this metadata in the input, all data will be assumed to be keyframes by default.

  • If interested in a bit of history and some interoperability: some (not so much) older versions of MPlayer do not recognize the full set of possible bytetags in a Matroska file, and may consequently fail to find a valid videostream. This can be prevented by nudging tags in another direction by means of a capsfilter, an example fragment being

      stamp ! avenc_mpeg4 ! video/x-divx,divxversion=5 ! matroskamux
    

    Similarly, older MPlayer versions cannot handle v2 files.
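
As a concrete sketch of the first consideration above (assuming input.avi holds one video and one audio stream; filenames and codec choices are merely illustrative):

filesrc location=input.avi ! decodebin name=dec \
matroskamux name=mux ! filesink location=output.mkv \
dec. ! queue ! identity single-segment=true ! videoconvert ! avenc_mpeg4 ! mux. \
dec. ! queue ! identity single-segment=true ! audioconvert ! vorbisenc ! mux.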

If needed or desired, the stand-alone MKVToolnix tools can aid in managing Matroska files.

QuickTime format

Likewise, QuickTime, designed by Apple, is quite different in origin and bitstream, but alike in that it also basically stores variably sized blocks/atoms of data, along with metadata describing the streams/tracks contained in the file and providing index-like data. The very same syntactical format is also used as an ISO MPEG-4 container format (.m4a, .mp4), albeit with some variation in types of boxes considered (where ISO specs refer to the syntactical entity atom as a box, see e.g. the concise ISO MPEG-4 14496-12 base media specification in the publicly available ISO standards). Though hardly a practical concern, it might be noted that the metadata is typically stored at the end of a file following the actual media contents, unlike other formats considered (and somewhat contrary to colloquially referring to metadata as a header).

The comments and practical considerations of the previous section apply just as well in this case, except for the following replacement or additional notes:

  • QuickTime metadata indexes all datablocks (samples) and also holds full keyframe info (synchronization sample table).

  • avmux_mov and avmux_mp4 provide a working QuickTime and MP4 muxer respectively, with only some moderate variation in the resulting output and in the types of input stream accepted (depending on the specification involved). However, the related native qtmux, mp4mux, gppmux and mj2mux elements should be considered more reliable. Recall that the observations with respect to timestamps in the foregoing section apply here as well (a sketch follows below).
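
For instance (a sketch only, assuming input.avi contains just a single video stream; filenames and codec are placeholders):

filesrc location=input.avi ! decodebin ! identity single-segment=true ! \
  videoconvert ! avenc_mpeg4 ! mp4mux ! filesink location=output.mp4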

Some stand-alone helper and/or debug management utilities are provided by MPEG4IP or libquicktime.

MPEG PS muxing

Either avmux_mpeg or mplex[2] produces an MPEG program stream, the former based on libav's avformat library and the latter on mjpegtools' mplex. Though neither confirmed nor explained, it appears that the output of the former element may only (??) be handled (well ??) by ffmpeg-aware/related (software) players (notably MPlayer), a (possible ??) limitation not so affecting the latter's results.

Unlike any other (typical) muxer element, mplex totally disregards any provided time metadata in its input streams. Timing is instead (implicitly) determined from other information extracted from the (compressed) bitstream itself, such as framerate, samplerate, …
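
As an illustration (a sketch using test sources; avenc_mp2 is merely one possible MPEG audio encoder and may or may not be available in a given installation):

videotestsrc num-buffers=250 ! video/x-raw,framerate=25/1 ! mpeg2enc ! \
  mplex name=mux ! filesink location=test.mpg \
audiotestsrc num-buffers=430 ! audioconvert ! avenc_mp2 ! mux.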

LIBAV muxers

As already indicated by example in the previous section, various muxers of libav's avformat library are wrapped and provided by the avmux plugin. They should[3] produce results similar to those of the standalone ffmpeg binary, possibly with varying degrees of success.

Subtitles

Subtitles can come in a variety of formats, whether containing the actual text to be displayed (optionally with some sort of style information) or simply an (encoded) bitmap form of what the resulting text should look like on screen (as for DVD subtitles). In either case, they can also be supplied in a separate file.

There has been quite some work on subtitle support in GStreamer, and as such subtitles are quite well (though possibly not yet impeccably) supported in the recent 1.0 series. For instance, a DVD can be transcoded (including subtitles) into a Matroska container by means of the following pipeline:

dvdreadsrc title=2 ! dvddemux name=demux \
matroskamux name=mux ! filesink location=dvd.mkv \
demux. ! mpeg2dec ! queue  ! avenc_mpeg4 ! mux. \
demux.audio_00 ! a52dec ! queue ! audioconvert ! lamemp3enc ! mux. \
demux.subpicture_05 ! queue ! dvdsubparse ! mux.subtitle_0

Alternatively, entrans could be used in dynamic mode, with a decodebin based pipeline, which is much more convenient when it comes to selecting (several) desired streams:

entrans.py -i dvd://2 -o dvd.mkv --an 1,3 --on 6,16,19,21 -- \
--video avenc_mpeg4 \
--audio audioconvert ! lamemp3enc \
--other dvdsubparse

Of course, in either case, some (source and stream selecting) numbers may be different, and it can also be extended to several audio and subtitle streams with the corresponding evident modifications. Similarly, other codecs could easily be chosen and configured as desired.

Recording

In this case, the input is some live real-time source, data from which is to be recorded and kept for future use by means of (a.o.) one of the above muxers. Although this source can be any of a great many things, e.g. a network or nowadays often some type of DVB source, the intention here is to focus on a raw source, i.e. typically a video4linux device (such as an analogue TV capture card or webcam).

In many ways, there is nothing special about this, and one can record using a pipeline built according to basic principles as follows:

v4l2src queue-size=15 ! video/x-raw,framerate=25/1,width=384,height=576 ! \
  avenc_mpeg4 name=venc \
alsasrc ! audio/x-raw,rate=48000,channels=1 ! audioconvert ! lamemp3enc name=aenc \
avimux name=mux ! filesink location=rec.avi venc. ! mux. aenc. ! mux.

Of course, the particular image and audio capture settings may vary, albeit within some limitations typically inherent to the device. Also, this pipeline may elicit some complaints from the encoder regarding timestamps; see also further on.

Although the above will typically (somehow) work, it is nevertheless reasonable to be concerned or have questions about:

  • Buffering.  A lot of things may be going on in a multi-tasking system, which can keep (mainly) CPU and other resources temporarily occupied elsewhere, while the real-time data keeps coming in, and we want to miss as little of it as possible. Is there some buffering going on to (briefly) hold on to data for it to be picked up a bit later, as opportunity permits?

  • Synchronization.  Is there any system in place that (at least) tries to ensure that videoframes end up alongside the proper audio data (or vice versa)?

Buffering here occurs on a low (device) level; the driver will arrange for a number of buffers and configure the device to capture data into these buffers. As such, the device can always continue capturing in real time for at least a while, even if there are some delays elsewhere. In the case of e.g. v4l2src, the number of framebuffers is controlled by the queue-size property, although the actual resulting queue still depends on the particular device's capabilities. Similarly, alsasrc's buffer-time and latency-time properties control the total capacity of the buffer and the size of each individual buffer (respectively), again up to a best-effort approximation by the device (driver).
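
For instance, the following (partial) source fragments request 16 video capture buffers, and about 1 second of total audio buffering in chunks of 10 ms (both values given in microseconds); the device (driver) may well round these to whatever it actually supports:

v4l2src queue-size=16
alsasrc buffer-time=1000000 latency-time=10000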

A live audio source such as alsasrc will provide the pipeline with an (audio) clock, which (basically) advances according to the number of captured samples. All captured data is then stamped with the time (of capture) as perceived by this clock. Evidently, this already ensures that the captured audio stream makes up a very nice, time-perfect stream. As videoframes are also stamped this way, this aligns video data with audio data, for the most part … As indicated in previous sections, most formats (and muxers) expect video data to have a particular framerate, and the incoming data to abide by this. If not, strange things (notably loss of synchronization) will happen. Data may fail to abide due to:

  • device (mal)performance; a TV capture card usually manages to deliver frames at a regular pace at the proper framerate, but this need not be the case for all devices, e.g. webcams

  • delay effects introduced by buffering, that is, when a (video) buffer is taken out of a (device) queue and timestamped, there is a delay with respect to the real capture time. In particular, this can lead to irregular clusters of timestamps (even if actual capturing goes by the book). In fact, this type of clustering prompts the encoder to complain about timestamps, as indicated earlier.

So, the video stream needs to be tamed and regularized before encoding and muxing it, as a muxer may actually disregard the (nicely accurately computed) synchronized timestamp and rely on expected framerate only. Such taming could be done by inserting videorate (somewhere before the encoder), which would then drop and/or duplicate frames as needed to obtain a perfect and (still) synchronized stream. However, due to videorate's current technique, this may lead to more frame drops or duplicates than really needed, particularly due to the above timestamp clustering; a concentration of frames would call for dropping, a subsequent lack for duplication, … An alternative is to use the following fragment

v4l2src ! stamp sync-margin=2

entransstamp is aware of the desired framerate and also counts incoming frames. At regular intervals (set by sync-interval), it verifies whether the incoming videostream timestamp is off by more than sync-margin frames with respect to the expected timestamp. If so, proper dropping or duplication is performed to resync. Periodic checking against real time makes the process less sensitive to intermittent variations.

In conclusion, the following pipeline should cater for recording in most conditions and circumstances:

v4l2src queue-size=16 ! video/x-raw,framerate=25/1,width=384,height=576 ! \
  stamp sync-margin=2 sync-interval=5 ! queue ! avenc_mpeg4 name=venc  \
alsasrc buffer-time=1000000 ! audio/x-raw,rate=48000,channels=1 ! \
  queue ! audioconvert ! lamemp3enc name=aenc   \
avimux name=mux ! filesink location=test.avi aenc. ! mux. venc. ! mux.

In particular, it features resync'ing at a higher frequency (in the same thread as and immediately following capturing), and provides additional buffering (beyond that on driver level).

Alternatively, one might turn to a container format and muxer that does not expect its (video) input to have a particular framerate, notably the Matroska format. Such a format caters for faithful snapshot recording and subsequent playback (including original timestamps) without needing to regularize the input. In principle, there is then no need for entransstamp or videorate, though an encoder might still complain about the quality/regularity of timestamps, as mentioned earlier.

Note

v4l2src may produce buffers that contain mmap'ed driver buffers (only if always-copy is false in the latest version), and these may (depending on your mileage) wreak havoc with (separate thread) encoding performance. Presumably, the recycling of mmap'ed v4l2src buffers leads to additional (lock or driver) contention (particularly when such releasing occurs in a separate thread). The above example does not suffer from this, because entransstamp copies the data into a normal buffer. Alternatively, efence or a combination of 2 counter-acting videobox elements could also be (ab)used for this purpose.

Multiple Images

In this case, the input or output consists of a collection of (image)files.

The easiest native GStreamer way of handling these is for example

multifilesrc location=image-%05d.jpg num-buffers=25 ! image/jpeg,framerate=25/1 ! \
  jpegdec ! videoflip method=vertical-flip ! jpegenc ! \
  multifilesink location=image-out-%05d.jpg

(assuming that it concerns 25 input files image-00000.jpg, image-00001.jpg, …)

Another approach is to use external tools to merge/split the set of files to/from one file and deal with this file. One candidate is the multipart format used by multipartmux and multipartdemux. Each block of data in multipart files is preceded by (in printf format)

"\r\n--%s\r\nContent-Type: %s\r\nContent-Length: %u\r\n\r\n",
  boundary, mime-type, size

where boundary is configurable (ThisRandomString by default), and the Content-Length line is optional. Clearly, a fairly simple (shell) script or pipeline suffices to transform such a file from/to a set of files. Having such a tool at hand, the above example then becomes:

filesrc location=image.multipart ! multipartdemux ! \
  jpegdec ! videoflip method=vertical-flip ! jpegenc ! \
  multipartmux ! filesink location=image-out.multipart

A similar approach could use an intermediate single file in mjpegtools' YUV4MPEG2 format, which is really a sequence of uncompressed images; see the entrans plugin collection documentation for more details on this.
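
A (hypothetical) sketch of the latter, assuming a video-only input file and using the stock y4menc element (the entrans collection provides related elements of its own):

filesrc location=input.avi ! decodebin ! videoconvert ! y4menc ! \
  filesink location=sequence.y4m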

Media processing

Of course, transcoding is typically not limited to a transformation from one (codec or muxer) format to another, but usually also involves applying some (filter) operations. These can range from basic operations such as scaling (videoscale) or clipping (videobox) to various effects (e.g. effectv plugin) or any transformation one can envision. A gst-inspect(1) browse is bound to reveal many more examples.

Although the details of the operation performed by such filters are typically governed by their properties, some of these elements (videoscale, videorate, audiorate, …) must necessarily also consider the caps of connected elements. Conversely, their operation can be affected by applying a specific capsfilter, see e.g. the examples in “LIBAV encoders”.
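
By way of example (a sketch; source, sizes and codec are arbitrary), the following fragment crops 8 lines from the top and bottom, scales the result to a size forced by a capsfilter, and encodes it:

videotestsrc num-buffers=100 ! video/x-raw,width=384,height=288,framerate=25/1 ! \
  videobox top=8 bottom=8 ! videoscale ! video/x-raw,width=320,height=240 ! \
  videoconvert ! avenc_mpeg4 ! avimux ! filesink location=filtered.avi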

Last but not least, some additional filters are available in the entrans plugin collection.

Comparison and Motivation

As mentioned earlier, there is a great deal of software that handles multimedia playback. GStreamer, as a framework, makes it (relatively) easy to build a player; e.g. the GNOME player totem has a GStreamer backend, as does elisa, to name a few GStreamer applications.

When it comes to creating media files, e.g. by transcoding existing media, there is not such an abundance of choice. The following table compares some media processing and creation capabilities:

            Ogg   Matroska  AVI   MPG   Filters  Cutting
 mencoder   no    no        yes   yes   yes      yes
 libav      yes   yes       yes   yes   no       no
 transcode  no    no        yes   no    yes      yes
 avidemux   yes   no        yes   yes   yes      yes
 GStreamer  yes   yes       yes   yes   yes      yes

Quite possibly, other (demuxing, decoding) features that may be of interest have been disregarded here, but as it stands, a (tentative) conclusion suggests itself. Fortunately, some applications do actually make use of the above GStreamer capabilities, e.g. PiTiVi (a video editor), Thoggen (a DVD ripper), and of course entrans. So, a non-exhaustive overview of the application spectrum is:

                          Non-GStreamer        GStreamer                       Comments
 player                   many                 many
 cmd-line transcoding     mencoder, transcode  entrans
 GUI editing/transcoding  avidemux             PiTiVi, Thoggen, Transmageddon

It should again be noted that all these (GStreamer) applications make use of a common API and codebase, and as such can benefit from a single collective development, build and maintenance effort. Furthermore, due to its plugin system, GStreamer is not only free software (as many others are), but also offers the freedom to extend it (comfortably) at will to include more formats, filters, …, which then become available at once to all the above applications.

Finally, let us demonstrate the many players mentioned above, although not by a lengthy list of examples. A GStreamer player can indeed arise quite easily. In fact, simply running the following pipeline already yields an extremely basic player with some minimal navigation[4]. It requires only trivial modification to control which (if any) effects and filters to apply on-the-fly.

filesrc location=example-in.ogm ! decodebin name=bin \
bin. ! progressreport ! navseek seek-offset=10 ! timeoverlay ! xvimagesink \
bin. ! audioconvert ! alsasink



[1] As of release 0.10.5 of gst-ffmpeg.

[2] As of release 0.10.8 of gst-plugins-bad, and only operates with latest mjpegtools release candidates.

[3] The standalone ffmpeg's architecture provides the avformat code with full (intimate) access to a corresponding encoder AVCodecContext structure, which in avmux can only be built approximately by means of caps information. Results may or may not vary depending on particular format used and the way it is coded.

[4] Requires at least GStreamer core 0.10.13.