VoxML: voice-driven interfaces to the Web

Rohit Khare (rohit@uci.edu)
Mon, 05 Oct 1998 23:53:40 -0700


Unfortunately, the Markoff article is unusually low on bit content today.
It's attached nonetheless. The press notice seems a bit premature for what's
just an intent-statement by an ad-hoc group of players outside of any
standards committee. Furthermore, I wonder at what point additional
form-markup for voice access merges into veosystems' Common Business
Language / WIDL / SOAP combined with XSL/speech bindings...

The speech trendline and the Web trendline do intersect neatly and are
mutually reinforcing, though: speech retrofitting needs a clean
'high-level/UI-level' API into typical daily applications; and ports to the
Web are already surfacing those access points to complex business systems,
reifying low-level API calls as HTML forms and XML documents.
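
To make that concrete, here's a toy sketch in Python -- my own illustration,
with invented function names that don't correspond to any of the products
mentioned above -- of why a form gateway and a speech gateway end up
targeting the same high-level entry point:

    # Hypothetical sketch: one back-end operation surfaced two ways.
    # check_flight_status, handle_form and handle_utterance are made-up
    # names, for illustration only.

    def check_flight_status(origin, destination, date):
        """Stand-in for a low-level business-system call."""
        return "Flights %s->%s on %s: 3 found." % (origin, destination, date)

    def handle_form(fields):
        """Web gateway: an HTML form POST is just named parameters."""
        return check_flight_status(fields["origin"], fields["destination"],
                                   fields["date"])

    def handle_utterance(slots):
        """Speech gateway: a recognized utterance fills the same slots."""
        return check_flight_status(slots["origin"], slots["destination"],
                                   slots["date"])

    query = {"origin": "SFO", "destination": "JFK", "date": "tomorrow"}
    print(handle_form(query))
    print(handle_utterance(query))

Once the business system is reified at that level, bolting on a speech front
end is a UI problem rather than a systems-integration problem.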

Speech is also critical to small devices, like munchkins. I've also attached
a curious press notice of 3-D chip modules in the Economist. Note the lead
user is Boeing's wearable manual tool. It's not quite a straight line from a
'pack of cigarettes' to a 'postage stamp', but not implausible. I wonder if
this technology could indeed be composed with organic (flexible) transistor
bases... Anyone for a lunchtime visit to ISC?

Rohit

PS. More on VoxML? http://VoxML.mot.com/ Here's the damning first FAQ:

Q. Why did you create a new markup language? Why don't you just use XML or
HTML with style sheets?

Existing markup languages (even with style sheets) aren't well suited for
developing voice dialogues. VoxML (which is based on XML) was designed to
support interactive dialogues while leveraging the technologies that have
made the Internet a simple and effective medium to distribute content.
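
For flavor only, here's a guess -- mine, not Motorola's -- at what an
XML-based dialogue description could look like. The element names (dialog,
step, prompt, input, option) are my own placeholders, not anything from
their spec, and the few lines of Python just show the document parses with a
stock XML library:

    # Hypothetical sketch of an XML dialogue description.  The element
    # names are my invention for illustration; they are NOT documented
    # VoxML syntax.
    import xml.etree.ElementTree as ET

    DIALOG = """
    <dialog>
      <step name="init">
        <prompt>Welcome. Say 'weather' or 'stocks'.</prompt>
        <input type="optionlist">
          <option next="#weather">weather</option>
          <option next="#stocks">stocks</option>
        </input>
      </step>
      <step name="weather">
        <prompt>Which city?</prompt>
        <input type="text" name="city"/>
      </step>
    </dialog>
    """

    root = ET.fromstring(DIALOG)
    # A voice browser would speak each prompt and listen for a matching
    # option; here we just walk the tree to show the shape of the thing.
    for step in root.findall("step"):
        print(step.get("name"), "->", step.find("prompt").text.strip())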

[No technical details are available publicly at this time. Here are some
lies:]

"Motorola is proposing the VoxML approach as a publicly available
specification for voice applications development...This effort will
collectively be the driving force in creating the next generation voice
applications."

[I'm not even sure a BNF qualifies as a markup language -- what data are
they marking up, anyway? I smell yet another effort to write specs with
angle brackets and claim network effects from the Web installed base...]

"Obviously, web pages were not designed for a voice interface. The VoxML
language allows web developers to use familiar programming skills to develop
voice interfaces quickly."

Q. Can you develop applications that require complex grammars using VoxML?

Yes. The VoxML language supports the use of context-free grammars written in
an extended BNF format.
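
[For a sense of what an 'extended BNF' voice grammar might amount to, here's
a generic illustration -- again mine, not VoxML's undocumented syntax. This
particular grammar happens to be regular, so a regex is enough to recognize
it in the sketch:]

    # Generic EBNF-style grammar for a flight query -- illustration only,
    # not VoxML's grammar format.
    #
    #   request ::= "flights from" city "to" city [ "today" | "tomorrow" ]
    #   city    ::= "san francisco" | "new york" | "boston"
    import re

    CITY = r"(san francisco|new york|boston)"
    REQUEST = re.compile(
        r"^flights from " + CITY + r" to " + CITY + r"( (today|tomorrow))?$")

    m = REQUEST.match("flights from san francisco to new york tomorrow")
    if m:
        print("origin:", m.group(1), "| destination:", m.group(2))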

[You have to fill out an annoying personal registration to delve deeper into
their developer site.]

"Thank you for your information. After we have reviewed and verified this
information, someone from our developer relations team will send you a login
and password for the developer site via the email address you've given
below."

[Moral: Why is it that anything touching the Bellhead universe, like wireless
or speech/telephony integration, is so screwed up standards-wise? These guys
want to reimplement every layer of the Web application stack, from crazy
link protocols to replacing HTTP, HTML, and CSS. Henrik?]

================================================
Silicon smarts

BRAINS and computers are very different things. Brains consist of a
trillion or so tiny elements, called neurons, which are individually dumb
but collectively, thanks to the thousand trillion connections between them,
very powerful. Most computers, on the other hand, depend on a single,
complex component, a microprocessor, to get things done. Even the most
advanced supercomputers, with hundreds or even thousands of connected
microprocessors, cannot match the compactness or connection density of the
human brain.

Two new chip-making techniques being developed at Irvine Sensors Corporation
(ISC) in Costa Mesa, California, could be significant steps in the
long-running effort to make more brain-like computers. Researchers at ISC
have found a way to pack silicon chips extremely tightly together and,
better still, to make large numbers of connections between them.

Their technique layers silicon chips on top of each other, cramming 50 chips
into the space normally occupied by just one. This is done by grinding away
the underside of the silicon wafer on which the chip circuitry is built (a
thick, non-functional platform that can be removed without affecting the
chip's operation). The result is a paper-thin but fully functional chip,
which can be stacked and bonded with other chips to form a single unit. The
chips are wired together via connectors along their edges, and the whole
sandwich is embedded in epoxy resin.

This space-saving technique is already being used commercially in a
four-layer memory chip that packs 128 megabits of data into an amazingly
small one-centimetre-square package. Earlier this year, ISC won a $1.3m
military contract from Boeing to build a wearable, voice-activated computer
the size of a pack of cards.

But while such stacking wizardry means computers can be smaller, it does not
make them more brain-like. At present, the nearest approximation to a
silicon brain involves making electronic circuits that behave like neurons,
and connecting them up in small networks. Such "artificial neural networks"
can be used for everything from image recognition to credit scoring, but
their size and complexity, and so their deductive power, are limited.

This is because it is only possible to fit a certain number of silicon
neurons on to a single chip; and there is a limit to the number of
connections that can be made between adjacent neural chips. Researchers
would like to be able to build networks that are larger and more densely
connected; in short, more brain-like. ISC's second technology should let them
do this, by allowing direct vertical connections to be made anywhere on the
adjoining surfaces of adjacent chips in a stack.

To achieve this, half of a special component called a three-dimensional
field-effect transistor, or 3DFET, is constructed at the site of each
connection, as part of the usual chip-making process. When the chips are
stacked, the two halves of each 3DFET fit together, allowing signals to pass
up and down from one chip to the other. A prototype 3DFET, developed with
financing from the US army's ballistic missile defence organisation, has
already been made and tested. With new funding, ISC hopes to make a
chip-stack connected using 3DFETs within 18 months.

After that, says ISC's chief technical officer, John Carson, the long-term
goal is to stack 1,000 neural chips in a single cube. This would involve
thinning each chip down to less than the thickness of a human hair. Already,
ISC has produced a prototype that is almost this thin.

If ISC can squeeze a million silicon neurons on to each chip, and pack a
thousand chips into a one-inch neural cube, the arithmetic starts to get
interesting. A thousand such cubes, which could fit in a shoebox, would
contain a trillion neurons, and a hundred trillion connections. That would
still not match the connectivity of human grey matter. But it would be the
most brain-like computer ever made.
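
[Their arithmetic checks out, for what it's worth. A quick back-of-the-
envelope in Python; the connections-per-neuron ratios in the comments are
implied by the article's totals, not stated in it:]

    # Back-of-the-envelope check of the article's figures.
    neurons_per_chip = 10**6    # "a million silicon neurons" per chip
    chips_per_cube = 10**3      # "a thousand chips" per one-inch cube
    cubes = 10**3               # "a thousand such cubes" in a shoebox

    neurons = neurons_per_chip * chips_per_cube * cubes
    print(neurons)              # 10**12 -- "a trillion neurons"
    print(neurons * 100)        # 10**14 -- the claimed "hundred trillion
                                #           connections" (~100 per neuron)
    print(10**12 * 10**3)       # 10**15 -- the brain's "thousand trillion"
                                #           connections (~1,000 per neuron)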

================================================
October 6, 1998

Operator? Give Me the World Wide Web and Make It Snappy
By JOHN MARKOFF

A group of communications, software and financial services companies plans
to announce an alliance today aimed at creating standards for speech
recognition that will push electronic commerce and the Internet beyond the
boundaries of the personal computer, potentially making it accessible to
anyone with a telephone.

The announcement, to be made at an industry trade show in New York, is
intended to expand the ability to perform Internet transactions or retrieve
data from the World Wide Web by using speech-recognition technology to
translate spoken words into data commands transmitted via the World Wide
Web.

The companies, which include Motorola Inc., SAP A.G., Visa International,
Broadvision Inc., and Nuance Communications, a unit of SRI International,
describe their effort as V-Commerce and argue that telephone access will
break down the last barrier remaining between the mass of consumers and the
Internet.

Currently access to the Internet is limited to those with personal computers
and Internet capability, but increasingly executives in both the computer
and communications industries envision a new Internet platform that will be
controlled by voice and will display information on a small portable screen.

In the last year speech recognition technologies have improved
significantly, making it possible for a number of companies to begin
deploying systems that answer questions for phone call and package routing,
stock trading, banking, airline reservations and other commercial
transactions. Now the companies are making an effort to integrate voice
systems with Internet and corporate data bases.

The new drive toward integration will make it possible for consumers to pose
questions via the telephone and then receive data in a variety of different
ways, including via the phone, portable devices such as pagers and cell
phones, or in the Web browser of their personal computers.

"We see voice recognition making tremendous advances," said Todd Chaffee,
executive vice president for corporate development and alliances at Visa.
"Voice technology is going to lead to a massive extension of electronic
commerce beyond those with PC's."

In the future, according to the alliance executives, it will be increasingly
simple to use a telephone to make a query such as "tell me the flights from
San Francisco to New York leaving tomorrow." The answer to that question
could then be immediately spoken via the phone, or displayed on a pager or
in a computer Web browser.

Currently, gaining access to the same information via a Web browser alone
can require as many as 15 mouse clicks and take up to three minutes.

Nuance is one of a small group of companies, including Applied Language
Technologies, Lernout & Hauspie Speech Products N.V., Dragon Systems and the
I.B.M. Corporation, that are now developing basic speech recognition
technologies that are speaker-independent and do not require training.

This is a significant advance and is permitting wider deployment of voice
applications, such as recent Nuance systems that permit call routing at
Sears, Roebuck or stock information at Schwab.

Both Visa, the credit card processor, and Motorola are investors in Nuance,
and both companies are moving to deploy new services and products based on
the new voice technologies.

Visa executives said the company had created prototypes of five voice-based
financial services including credit card activation, lost and stolen card
replacement, travel planning, voice banking and bill payment.

The services will be available commercially next year. Motorola executives
said the communications company was working to create new Internet standards
that would make speech extensions to standard Internet services simple to
implement.

The company recently introduced a new voice Internet standard called Voice
Markup Language, or VoxML, to allow software developers to add speech to
their Web applications. Because VoxML applications can be deployed on
standard Web servers, the company maintains it is straightforward to add the
technology.

"Motorola looks at this area as a fundamental element of our strategy," sai=
d
Maria Martinez, vice president and general manager of Motorola's Internet
division.

One of the strengths of the VoxML standard is that it would permit responses
to voice queries to be displayed on a variety of existing Web devices, from
pagers to browsers.

"So far, electronic commerce has been constrained by the PC," said Ron
Croen, Nuance's chief executive officer. "We're intent on making the
audience larger by an order of magnitude."