Excellent overview of the MPEG standards

Rohit Khare (rohit@uci.edu)
Sun, 28 Feb 1999 21:51:26 -0800


This fellow has just taken on the SDMI project, and plans to have draft
standards by June. This is a stunning commitment, but from his home
page, he may do it :-) He's been involved in MPEG for a very long
time, but how the luddite recording industry is going to take on a
reasonably scalable IPR metadata format is beyond me... RK

NYTimes:
At a seven-hour, closed-door meeting of 200 top executives in the
music and technology industries, Leonardo Chiariglione was named on
Friday to head the Secure Digital Music Initiative. The group is
seeking to create a technical format for the copyrighted sale and
digital delivery of music over the Internet.

Chiariglione, an Italian researcher, was instrumental in creating the
industry-standard formats for converting and compressing video and
audio information into digital form, known as MPEG.

The Secure Digital Music Initiative, announced in December, is an
attempt by the recording industry associations of North America,
Japan and Europe to create a standardized way to distribute songs and
albums on the Internet. The technical challenge is to do so in a way
that protects copyright holders and foils the many audio pirates who
copy and distribute digital music illegally.
Significantly, one of the MPEG formats that Chiariglione helped
create, MP3, has become the favorite way for Internet pirates to copy
and transmit music. So the music industry, in essence, has recruited
the man who opened the Pandora's box of Internet piracy to ask him to
close it.
After his selection at Friday's meeting, Chiariglione startled many
in the audience by announcing an ambitious timetable, one that may
confound the many Internet skeptics who have derided the secure-music
initiative as coming too late to turn back th

http://www.cselt.stet.it/ufv/leonardo/paper/mpeg-4/mpeg-4.htm

Tip: MP3 really refers to MPEG-1 Audio Layer III — the third audio
layer of MPEG-1, not a separate "MPEG-3" standard...

3. Features of the MPEG-4 standard

The title of MPEG-4 is still "very low bitrate audio-visual coding".
This reveals the project's original motivation: the need for a
standard supporting digital audio-visual applications on low-bitrate
channels, such as those provided by mobile or fixed telephony. While
the digital audio-visual landscape has undergone considerable changes
since work on MPEG-4 began five years ago (the Web barely existed at
that time), the original assumptions of the project have proved
right, and time has provided the opportunity to incorporate and
extend waves of other technologies found to be synergistic with the
MPEG-4 goal.
The following presents a list of the main features supported by the
MPEG-4 standard. Note that an MPEG-4 decoder is NOT required to
incorporate all of these features, even though profiles supporting
the inclusion of all tools exist. A summary description of profiles
will be given in the next section.

MPEG-4 Visual provides a natural video coding algorithm that is
capable of operating from 5 kbit/s at QCIF spatial resolution
(176x144 pixels). It is ITU-T H.263 compatible in the sense that an
H.263 bitstream is correctly decoded by an MPEG-4 Video decoder. The
algorithm scales up to higher bitrates, and optimisation has been
carried out at bitrates of 5 Mbit/s for ITU-R 601 resolution pictures
(720x576 at 50 Hz and 720x480 at 59.94 Hz).
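The bitrates quoted above imply very different compression ratios at the two ends of the range. A back-of-the-envelope sketch (the frame rates and 4:2:0 chroma subsampling here are illustrative assumptions, not figures from the article):

```python
# Rough compression ratios for the bitrates quoted above.
# Assumptions (not from the article): 4:2:0 chroma subsampling,
# i.e. 12 bits per pixel on average, and illustrative frame rates.

def raw_bitrate(width, height, fps, bits_per_pixel=12):
    """Uncompressed bitrate in bit/s for 4:2:0 video."""
    return width * height * bits_per_pixel * fps

# QCIF (176x144) at an assumed 10 frames/s, coded at 5 kbit/s
qcif_raw = raw_bitrate(176, 144, 10)      # ~3.0 Mbit/s uncompressed
qcif_ratio = qcif_raw / 5_000             # ~608:1

# ITU-R 601 (720x576) at 25 frames/s, coded at 5 Mbit/s
itu601_raw = raw_bitrate(720, 576, 25)    # ~124 Mbit/s uncompressed
itu601_ratio = itu601_raw / 5_000_000     # ~25:1

print(round(qcif_ratio), round(itu601_ratio))
```

The point of the arithmetic: the very-low-bitrate end demands compression ratios two orders of magnitude beyond what broadcast-quality coding needs.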
One important feature of MPEG-4 Visual is the ability to code not
just a rectangular array of pixels, but also objects in a scene. By
object, I mean a walking person or a running car etc. This means that
in addition to the traditional coding of rectangular arrays of
pixels, MPEG-4 Visual is capable of coding the individual objects in
a scene. The objects need not be natural; they can also be
synthetically generated. MPEG-4 Visual currently supports the
definition of a synthetic human face through 68 feature points, which
can then be animated. MPEG-4 Visual also addresses the coding of
texture, i.e. a natural or synthetic array of pixels that is
typically mapped onto a synthetic object. This provides a high number
of scalability levels with improved compression compared to known
coding algorithms.
The freedom of individually coded objects, however, entails the need
for a standard way to position objects in a scene; this applies to
both video and audio objects. Positioning is done by the author of
the scene and is the MPEG-4 equivalent of the role of a movie
director who instructs the scene setter to put a table here and a
chair there, asks one actor to stand and deliver a line, and another
to walk away.
Clearly the availability of a technology to encode an individual
object is opening up interesting possibilities in the area of digital
production and this is currently being pursued.
MPEG-4 Audio covers the full bitrate range of 2 to 64 kbit/s. Good
speech quality is obtained at as little as 2 kbit/s, and transparent
quality for monophonic music sampled at 48 kHz at 16 bits per sample
is obtained at 64 kbit/s. Appealing quality is obtained at 16 kbit/s
for stereo music.
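The audio figures above translate into simple compression ratios against uncompressed PCM (taking the article's own 48 kHz / 16-bit reference):

```python
# Compression ratios implied by the audio bitrates above,
# relative to uncompressed 16-bit PCM at 48 kHz.

def pcm_bitrate(sample_rate, bits_per_sample, channels=1):
    """Uncompressed PCM bitrate in bit/s."""
    return sample_rate * bits_per_sample * channels

mono_raw = pcm_bitrate(48_000, 16)                # 768 kbit/s
mono_ratio = mono_raw / 64_000                    # 12:1 at the quoted 64 kbit/s

stereo_raw = pcm_bitrate(48_000, 16, channels=2)  # 1536 kbit/s
stereo_ratio = stereo_raw / 16_000                # 96:1 at the quoted 16 kbit/s

print(mono_ratio, stereo_ratio)
```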
In the area of synthetic audio two important technologies have been
standardised. The first is a Text To Speech (TTS) interface, i.e. a
standard way to represent prosodic parameters, such as pitch contour,
phoneme duration, and so on. Typically these can be used in a
proprietary TTS system to improve the synthesised speech quality and
to create, with the synthetic face, a complete audio-visual talking
face. The TTS can also be synchronized with the facial expressions
of an animated talking head.
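The kind of prosodic mark-up such a TTS interface carries can be sketched as a small data structure. This is purely illustrative — the field names below are invented for the sketch and are not the actual MPEG-4 bitstream syntax:

```python
# Hypothetical sketch of TTS prosodic parameters: per-phoneme
# durations and a sampled pitch contour, the kind of information
# the article says the interface standardises. Names are
# illustrative, not the real MPEG-4 TTS syntax.

from dataclasses import dataclass, field

@dataclass
class Phoneme:
    symbol: str                                    # e.g. a SAMPA symbol
    duration_ms: int                               # phoneme duration
    pitch_hz: list = field(default_factory=list)   # pitch contour samples

@dataclass
class TTSUtterance:
    text: str
    phonemes: list

    def total_duration_ms(self):
        """Total utterance length, needed to sync with face animation."""
        return sum(p.duration_ms for p in self.phonemes)

hello = TTSUtterance("hi", [
    Phoneme("h", 60, [110, 115]),
    Phoneme("aI", 180, [120, 130, 125]),
])
print(hello.total_duration_ms())  # 240
```

Knowing the total and per-phoneme timing is what makes synchronisation with an animated talking head possible: the face-animation stream can key its mouth shapes to the phoneme boundaries.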
The second technology provides a rich toolset for creating synthetic
sounds and music, called "Structured Audio". Using newly developed
formats to describe synthesis methods and their control, any current
or future sound-synthesis method may be used to create sound in
MPEG-4. The sound is guaranteed to be exactly the same on every
MPEG-4 decoder.
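The key property here is that a synthesis *description* is transmitted rather than a waveform, and every decoder renders it identically. A toy illustration of that determinism (this is ordinary Python, not SAOL, MPEG-4's actual synthesis language):

```python
# Toy illustration of Structured Audio's determinism guarantee:
# the same synthesis description (here, sine-tone parameters)
# renders to bit-identical samples on any conforming "decoder".

import math

def render_sine(freq_hz, duration_s, sample_rate=8_000, amp=0.5):
    """Render a sine tone deterministically from its parameters."""
    n = int(duration_s * sample_rate)
    return [amp * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

# Two independent "decoders" rendering the same description
# produce exactly the same samples.
a = render_sine(440.0, 0.01)
b = render_sine(440.0, 0.01)
print(a == b)  # True
```

A real Structured Audio description is of course far richer — arbitrary synthesis algorithms, not one fixed oscillator — but the guarantee is the same: identical output everywhere.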

MPEG-4 Systems provides the object composition technology referred to
above. It is based on VRML but extends the original VRML
functionality by allowing the inclusion of streamed audio and video,
natural objects, generalised URLs, composition updates and, most
important, a very effective compression of VRML-type information.
This technology is called BIFS (Binary Format for Scene Description).
As VRML only addresses 3D worlds, MPEG-4 Systems defines all the 2D
nodes needed by 2D-only applications, following the same organisation
as the 3D nodes. As with MPEG-1 and MPEG-2, MPEG-4 Systems provides
precise synchronisation of audio and video objects. MPEG-4 Systems
supports both push and pull delivery of content. MPEG-4 also supports
Object Content Identification (OCI), so that searches in databases of
MPEG-4 objects become possible. To accommodate the needs of the
content industry, each MPEG-4 audio, visual or audiovisual object can
be identified by a registration number similar to the well-established
International Standard Recording Code (ISRC) now used to identify
pieces of music on Compact Disc Audio.
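Object composition in the BIFS spirit can be sketched as a scene tree whose leaves are media objects and whose interior nodes position them — the "director" authors the tree. The node names and the flat 2D-translation model below are invented for the sketch; they are not BIFS syntax:

```python
# Hypothetical sketch of a composed scene: a tree of nodes, each
# carrying a 2D translation, with media objects at the leaves.
# Illustrative only -- not actual BIFS nodes or semantics.

class Node:
    def __init__(self, name, children=None, transform=(0, 0)):
        self.name = name
        self.children = children or []
        self.transform = transform  # 2D translation (x, y)

    def flatten(self, origin=(0, 0)):
        """Yield (object name, absolute position) for every leaf."""
        x = origin[0] + self.transform[0]
        y = origin[1] + self.transform[1]
        if not self.children:
            yield (self.name, (x, y))
        for child in self.children:
            yield from child.flatten((x, y))

# The author places a background and a grouped talking head + audio.
scene = Node("scene", [
    Node("background_video", transform=(0, 0)),
    Node("group", transform=(100, 50), children=[
        Node("talking_head", transform=(10, 0)),
        Node("audio_object", transform=(0, 0)),
    ]),
])

print(dict(scene.flatten()))
```

Moving the "group" node moves both of its children at once — which is exactly the convenience that per-object coding plus a composition format buys the scene author.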
The last component of MPEG-4 is called DMIF (Delivery Multimedia
Integration Framework), and it provides three types of abstraction.
The first is abstraction from the transport protocol, which can be
RTP/UDP/IP, AAL5/ATM, MPEG-2 TS or others. The ability to identify
delivery systems with different quality of service (QoS) provides a
very effective way to charge for the cost of delivery depending on
the QoS required by the application or by the user. The second is the
abstraction of the application from the delivery type: interactive
(client-server), local or broadcast delivery are all seen through a
single interface. The third, in the interactive case, is an
abstraction from the signalling mechanisms of the delivery system.
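The first of these abstractions — the application asks for a channel with a given QoS and a transport-specific backend is chosen behind the interface — can be sketched as follows. The class and method names are invented for the sketch; this is not the actual DMIF API:

```python
# Hypothetical sketch of DMIF's transport abstraction: the
# application requests a channel by QoS and never sees whether
# RTP/UDP/IP, AAL5/ATM or MPEG-2 TS carries the data.
# Names are illustrative, not the real DMIF interface.

from abc import ABC, abstractmethod

class DeliveryBackend(ABC):
    @abstractmethod
    def send(self, data: bytes) -> int:
        """Deliver one chunk; return bytes accepted."""

class RTPBackend(DeliveryBackend):
    def send(self, data: bytes) -> int:
        return len(data)  # stand-in for real RTP packetisation

class MPEG2TSBackend(DeliveryBackend):
    def send(self, data: bytes) -> int:
        return len(data)  # stand-in for TS multiplexing

def open_channel(qos: str) -> DeliveryBackend:
    """Pick a transport for the requested QoS (toy policy)."""
    return RTPBackend() if qos == "best-effort" else MPEG2TSBackend()

# Application code is identical whatever transport sits underneath.
channel = open_channel("best-effort")
print(channel.send(b"access unit"))
```

This is also what makes QoS-based charging plausible: the request names the service level, and the framework — not the application — maps it to a concrete delivery system.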
For all parties involved in MPEG-4 the slogan "Develop once, play
everywhere from anywhere" applies!