RE: Real-time distributed Web search (Gnutella knockoff?)


From: Nicolas Popp (nico@realnames.com)
Date: Tue Jun 27 2000 - 17:39:47 PDT


Somebody pointed out to me that this approach was doomed.

If you give destination sites control of the search query, many will lie.
Most commercial sites want to hijack traffic from search engines and appear
very high in a results list whether or not their site is actually relevant
to the query. That's a real-world fact, and it's why meta tags have failed
(and are ignored by search engines these days).

In other words, if you give control of the query to the Web sites that want
to be found and lose control of the relevance ranking, you will return
garbage...
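
To make the objection concrete, here is a minimal sketch in Python. It
assumes a system where each answering site reports its own relevance
score and the results page simply sorts by that number. The site names,
the self_score field, and the actually_relevant flag are all made up for
illustration; nothing here is taken from InfraSearch itself.

```python
# Each hit carries a relevance score chosen by the answering site itself.
# All sites and numbers below are hypothetical.
hits = [
    {"site": "camera-reviews.example", "self_score": 0.8, "actually_relevant": True},
    {"site": "spam-mall.example", "self_score": 1.0, "actually_relevant": False},
    {"site": "photo-forum.example", "self_score": 0.6, "actually_relevant": True},
]


def rank_by_self_report(hits):
    """Sort hits by the score each site claimed for itself, highest first."""
    return sorted(hits, key=lambda h: h["self_score"], reverse=True)


ranked = rank_by_self_report(hits)
print([h["site"] for h in ranked])
```

The site that claims a perfect score lands on top whether or not it is
relevant, which is the failure mode described above.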

-Nico

-----Original Message-----
From: Rohit Khare
To: FoRK@xent.com
Sent: 00/06/27 4:56
Subject: Real-time distributed Web search (Gnutella knockoff?)

I'm not really thrilled, but it is interesting to see that both
cachers (Dynamai) and searchers (here, Infrasearch) are beginning to
cope with dynamically-generated content. That said, the real promise
of real-time indexing is real-time notification (iPal, anyone?) --
Rohit

gag: http://elitist.xcfventures.com/window.jpg

briefly

InfraSearch is 100% Gnutella. The only thing not Gnutella is your web
browser.

The fully-distributed system comprises several major components, each
of which can be made redundant and load-balanced with extreme ease
through Gnutella technology.

It's entirely pay-as-you-go. No huge up-front expenditures. This
entire prototype runs on a Pentium III machine costing only a few
thousand dollars.

We don't believe a million-dollar search server is a good use of your
resources, so InfraSearch is designed to run on hardware that fits on
your child's credit card.

briefly

Just launched your online store? Congratulations: You've created
another black hole on the web.

Fortunately, there's InfraSearch. It unlocks the door for users
searching for what you offer. It lets them peer into your dynamic
content.

InfraSearch is not just another way to crawl and index. InfraSearch
changes search from a passive thing into an active thing. It changes
search from an HTML thing into a semantic thing. Best of all, it
changes search into a thing you, the information provider, control.

Search as infrastructure. Only InfraSearch does it.

                                                    InfraSearch          Search engines
search your database                                Yes                  No
allows you to respond to queries the way you want   Yes                  No
allows you to respond competitively                 Yes                  No
search method                                       Whatever you choose  Crawling
search data                                         Semantic             HTML
who controls the search                             You do               They do

InfraSearch can...

InfraSearch enables information providers to answer searches. After
all, who knows how to answer a question about news better than a news
specialist? Who can answer a question about red roses better than a
florist?

So that's the short of it: Let the people who know the answers answer
the way they want. Let the user experience begin at the search engine.

search as infrastructure

It's a strange idea to throw up a web site and hope the search
crawlers come around and index it. It doesn't make sense to leave
that to chance.

What makes sense is taking charge and driving traffic by controlling
the way searches are answered. You own the data, you should manage
the searching. It's the only part of the user experience that
companies don't even try to manage. It's just outsourced to search
providers which have little interest in indexing one site better than
the next. External visibility is the most overlooked part of any web
site.

You manage your database, you manage your web server farm, you manage
the content production... You manage everything...except your
external visibility. It's time to manage that too. Without it, your
site is just another black hole.

search engines can't...

To current-generation search engines, dynamic content (anything with a
question mark in the URL) is invisible. Your favorite online store,
your favorite online news source: invisible.

Search engines appear to take about four weeks to crawl a URL you
submit. Suppose war breaks out. Even if there is an obscure news site
out there that a search engine can index, you won't find anything
about that war for a month.

Search engines index the words on a page, not their meaning. So when
you search for "canon eos-3" on a search engine, you get a bunch of
hits about the camera. Suppose Epinions.com could answer. They could
tell you how much Epinions users liked it. Suppose Nikon could
answer. They could tell you about their F-100 camera at the same
price point.

Remember: it's not because the search engines don't want to do all
that. It's because they can't.

technology

Search engines work by "crawling". This technology has been perfected
over the past six years. A search engine starts at some URL (or some
set of URLs) and basically clicks on every link on every page that it
can click on.

The crawler stores every page it sees on a huge disk. The data is
then indexed. In short, web search engines try to download the entire
web and make sense of it. A lot has gone into the process to make it
efficient and yield the most fruit, but there are inherent
shortcomings.
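
The crawl-then-index loop described above can be sketched roughly as
follows. This is only an illustration: PAGES is a hypothetical in-memory
stand-in for the web (no network access), and crawl and build_index are
names invented here, not part of any real search engine.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.com/": (
        "welcome to the camera store",
        ["http://example.com/a", "http://example.com/b"],
    ),
    "http://example.com/a": ("canon eos-3 review", ["http://example.com/"]),
    "http://example.com/b": (
        "nikon f-100 review",
        ["http://example.com/a", "http://example.com/missing"],
    ),
}


def crawl(seeds):
    """Breadth-first crawl: follow every link reachable from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    store = {}  # the "huge disk": URL -> page text
    while queue:
        url = queue.popleft()
        if url not in PAGES:  # dead link, nothing to store
            continue
        text, links = PAGES[url]
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


def build_index(store):
    """Inverted index: word -> set of URLs whose text contains that word."""
    index = {}
    for url, text in store.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index


store = crawl(["http://example.com/"])
index = build_index(store)
print(sorted(index["review"]))
```

Note that the index only knows the words on each stored page, and only
as of the moment the page was fetched.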

Part of the reason is that no matter how much technology is thrown at
the effort to crawl the web, crawling is just too slow to keep up
with the web's pace of change.

More than that, crawling is outmoded because modern content providers
use a growing amount of dynamic content, and HTML pages with forms and
URLs with question marks aren't crawled. Unfortunately, sites' crown
jewels are increasingly stored in their databases, hidden behind those
strange URLs that crawlers are afraid to visit.
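
As a rough illustration of that filtering, a crawler that refuses
"question mark" URLs might do something like the following. The rule is
an assumption based on the description above, not a documented policy of
any particular search engine, and the URLs are invented.

```python
from urllib.parse import urlsplit


def is_dynamic(url):
    """True if the URL carries a query string (a '?' followed by parameters)."""
    return urlsplit(url).query != ""


def crawlable(urls):
    """Keep only the URLs a query-string-averse crawler would visit."""
    return [u for u in urls if not is_dynamic(u)]


# Hypothetical links found on a page.
links = [
    "http://store.example/index.html",
    "http://store.example/item?id=42",
    "http://news.example/story.cgi?date=20000627",
]
print(crawlable(links))
```

Only the static page survives; the store item and the news story, which
live behind query strings, never reach the index.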

InfraSearch works

InfraSearch fixes all that.

InfraSearch uses Gnutella distributed information search technology
to distribute searches to information sources. InfraSearch Agents
running at information providers provide an interface between
InfraSearch.com and information providers' databases, flat files,
HTML pages, or whatever. Information providers can make use of
whatever data they want to answer queries however they want.
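
A very rough sketch of the idea, assuming each agent is just a function
that either answers a query or declines. It does not model the Gnutella
protocol or any real InfraSearch interface; the agent names, the quote
table, and the florist URL are hypothetical.

```python
def broker_agent(query):
    """Answer ticker lookups from a hypothetical quote table."""
    quotes = {"MSFT": "MSFT 70 +0.25"}  # stand-in for a live database
    return quotes.get(query.upper())


def florist_agent(query):
    """Answer rose queries with a dynamic URL; decline anything else."""
    if "roses" in query.lower():
        return "Red roses, one dozen: http://florist.example/roses?color=red"
    return None


# Registry of information providers that have an agent running.
AGENTS = {"broker": broker_agent, "florist": florist_agent}


def search(query):
    """Send the query to every agent and collect whichever ones answer."""
    results = []
    for name, agent in AGENTS.items():
        answer = agent(query)
        if answer is not None:
            results.append((name, answer))
    return results


print(search("msft"))
print(search("red roses"))
```

Each provider decides for itself what counts as a match and what the
answer looks like; the central search function only forwards the query
and gathers the replies.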


------------------------------------------------------------------------

InfraSearch asks those who know and lets them answer as they want
------------------------------------------------------------------------

If an information provider can answer your question, it will answer
your question in its own special way.

Search for "Mercedes-Benz E55" and maybe you'll get a result from BMW
telling you about their new M5.

Search for "MSFT". If a broker answers, it might look something like
this:

    MSFT 70 +0.25 Quote is delayed at least 15 minutes.

------------------------------------------------------------------------

dynamic URLs in hits?!
------------------------------------------------------------------------

Another powerful thing InfraSearch can do is allow content providers
to answer searches with fully dynamic URLs. Search for "mustang drag
race" and get a customized link from Summit Racing to a page listing
all the parts to make your 5.0 a 9.0 second car.

------------------------------------------------------------------------

no dead links!
------------------------------------------------------------------------

And...since answers are based on current data: no dead links!

The information you want, up-to-date, and presented in a way which
makes it easy to find exactly the information you were searching for.

------------------------------------------------------------------------

prototype
------------------------------------------------------------------------

InfraSearch is a prototype at this stage. You know what that means.

It's the idea that counts. Real-time distributed search is the next
thing, and InfraSearch is the first demonstration that it can do
useful things.

------------------------------------------------------------------------


COPYRIGHT (c) 2000 XCF Ventures. ALL RIGHTS RESERVED.

All trademarks are the property of their respective owners.



This archive was generated by hypermail 2b29 : Tue Jun 27 2000 - 17:43:46 PDT