[FoRK] large scale dataset mailing list/resources?

Reza B'Far <reza at voicegenesis.com> on Wed Feb 27 22:39:23 PST 2008

Couple of brief points --

1)  I fully agree that the optimal "backing store" for ontologies varies
and is not necessarily an RDBMS.
2)  A very important thing is that if most of the predicates in your
ontology are "light" (simple binary relationships), then you probably
shouldn't use an ontology, period... you should use something else.  The
power of ontologies is in modeling "complex" relationships.
3)  For me, at least, it took a few years to understand that there really
isn't a good mapping from ontologies to RDBMS.  I'm denser than most :), but
the power of ontologies (and when they can totally kill RDBMS and SQL-like
approaches) is when you look at your reasoner and your ontology as one
cohesive unit.  The ontology, in and of itself, is a data store... to me
personally, that's of modest interest... but there are lots of stores... and
creating queries like "select this from that where something" is really not
THAT different whether you do it in SQL or RDQL (they're different, but it's
not a revolutionary difference).  The "revolutionary" difference comes in
when you have a reasoner that can say something like "x is related to y
through f(n,t)"... have a million of these things... then take a stochastic
reasoner and tell it to walk the graph... now, you have something
interesting... and it can get more interesting if your reasoner starts to
optimize itself by doing things like pattern recognition, linear
predictions, etc.
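
A rough sketch of the "walk the graph" idea, in Python.  Everything in it is
an assumption made for illustration: the node names, the fixed edge weights
(standing in for a relationship like f(n,t)), and the restart-on-dead-end
rule.  It is not any particular reasoner's algorithm, just a weighted random
walk that counts how often each node is reached.

```python
import random
from collections import Counter


def random_walk(graph, start, steps, rng):
    """Walk `graph` from `start`, choosing each next node with
    probability proportional to the (hypothetical) edge weight."""
    visits = Counter([start])
    node = start
    for _ in range(steps):
        edges = graph.get(node)
        if not edges:
            # Dead end: restart from the starting node (an arbitrary choice).
            node = start
        else:
            neighbors = list(edges)
            weights = [edges[n] for n in neighbors]
            node = rng.choices(neighbors, weights=weights, k=1)[0]
        visits[node] += 1
    return visits


# Hypothetical graph: node -> {neighbor: strength of the relationship}.
graph = {
    "x": {"y": 0.9, "z": 0.1},
    "y": {"x": 0.5, "z": 0.5},
    "z": {"x": 1.0},
}

visits = random_walk(graph, "x", 1000, random.Random(0))
print(visits.most_common())
```

The visit counts are the raw material a smarter reasoner could feed into
pattern recognition or prediction; this sketch stops at counting.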

hope that makes sense :)

-----Original Message-----
From: Stephen D. Williams [mailto:sdw at lig.net]
Sent: Monday, February 25, 2008 9:57 PM
To: reza at voicegenesis.com; Friends of Rohit Khare
Subject: Re: [FoRK] large scale dataset mailing list/resources?


Reza B'Far wrote:
> Hi Luis:
>
> I spent the past year and a half at Oracle (well we got bought by Oracle)
> solving a similar problem... Ken is quite right in that "Most straight RDF
> triplestores seem to hit the wall at millions of triples"... however, there
>
I'm very interested in this also.  I can't see any theoretical reason
why an RDF store, with sufficient automatic intelligence, couldn't do
relational work on RDF data as fast as an RDBMS does relational work on
fixed tuples.  The simplistic case has a 'blown to bits' problem, and
that is the very thing that makes it a be-all, end-all flexible data,
er... knowledge, model.

What makes an RDF store, fundamentally, more expensive than an RDBMS?
I think that all it comes down to is that:

o  Each pair of columns becomes a triple so that a single RDBMS tuple
becomes a bunch of triples.  (I.e. "blown to bits".)
o  The default case is that every value of each resulting triple goes
into an index.
o  Everyone seems to come to the conclusion that triples are best
managed as triples of integers, which seems elegant and efficient but
adds a layer of indirection.
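
To make the three points above concrete, here is a small Python sketch.  The
row, the subject name "person:1", and the helper names are all made up for
illustration; real triplestores differ in the details.  It explodes one
relational tuple into triples and then encodes every term as an integer
through a lookup dictionary, which is the extra layer of indirection.

```python
def row_to_triples(subject, row):
    """Explode one relational tuple into (subject, predicate, object) triples."""
    return [(subject, column, value) for column, value in row.items()]


class Dictionary:
    """Maps each distinct term to an integer ID and back."""

    def __init__(self):
        self.ids = {}
        self.terms = []

    def encode(self, term):
        if term not in self.ids:
            self.ids[term] = len(self.terms)
            self.terms.append(term)
        return self.ids[term]

    def decode(self, term_id):
        return self.terms[term_id]


# Hypothetical row with three columns.
row = {"name": "Ada", "city": "London", "age": 36}
triples = row_to_triples("person:1", row)

dictionary = Dictionary()
encoded = [tuple(dictionary.encode(term) for term in triple) for triple in triples]

print(len(triples))  # one row with 3 columns becomes 3 triples
print(encoded)       # every read now needs a decode() to get values back
```

One row turns into three triples, each term can land in an index, and reading
a value back costs an extra dictionary lookup.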

Clearly, with relatively obvious algorithms, you could recognize tuple
usage of graph space and tupleize actual storage and indexing.  This can
all be automatic and query driven.  At that point, for those queries,
performance should be the same, no?

That method gives new meaning to "query optimization" as the data would
actually be reorganized physically on the fly.
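
One way the tupleizing idea could look, as a Python sketch.  The function
name, the min_subjects threshold, and the sample data are hypothetical, and
it assumes each predicate has a single value per subject.  It groups subjects
that share the same set of predicates into row-like tables and leaves the
rest as plain triples.  A query-driven version would pick the groupings from
observed query patterns rather than from shape counts alone.

```python
from collections import defaultdict


def tupleize(triples, min_subjects=2):
    """Group subjects sharing a predicate set into row-like tables.

    Returns (tables, leftovers): tables maps a sorted predicate tuple to
    rows of (subject, value, ...); leftovers stay as plain triples.
    """
    by_subject = defaultdict(dict)
    for s, p, o in triples:
        by_subject[s][p] = o  # assumes single-valued predicates

    by_shape = defaultdict(list)
    for s, props in by_subject.items():
        by_shape[tuple(sorted(props))].append(s)

    tables, leftovers = {}, []
    for shape, subjects in by_shape.items():
        if len(subjects) >= min_subjects:
            tables[shape] = [
                (s,) + tuple(by_subject[s][p] for p in shape) for s in subjects
            ]
        else:
            for s in subjects:
                leftovers.extend((s, p, o) for p, o in by_subject[s].items())
    return tables, leftovers


# Hypothetical data: two subjects share a shape, one does not.
triples = [
    ("p1", "name", "Ada"), ("p1", "city", "London"),
    ("p2", "name", "Bob"), ("p2", "city", "Paris"),
    ("p3", "name", "Cy"),
]
tables, leftovers = tupleize(triples)
print(tables)
print(leftovers)
```

Queries that only touch the "city"/"name" shape could then run against the
row table directly, with no per-triple joins.
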
> is actually a pretty elegant solution to this that combines distributed
> ontology techniques with MapReduce-like technique... folks that know what
> these two things are can probably interpolate the solution fairly
> obviously...
>
Seems to make sense, depending on what you mean by "distributed ontology
techniques".
But isn't this mainly needed because you have a bits multiplier and
because you are doing more complex queries on graph space rather than
simple data joins?
> Another alternative to get to billions of triples is Oracle 11g RDF Store :)
> (that's a blatant plug)...
>

The problem with Oracle is that it long ago priced itself beyond
anyone's budget except well-established companies, or well-funded ones
with a rapid burn rate.  That's fine for Oracle, but isn't interesting for
anything that I'm likely to bootstrap, even when referring to fairly
large corporate or government projects.

sdw
> First solution, IMHO, is better... the second one is quicker.
>
>
> -----Original Message-----
> From: fork-bounces at xent.com [mailto:fork-bounces at xent.com]On Behalf Of
> Luis Villa
> Sent: Wednesday, February 20, 2008 12:32 PM
> To: Friends of Rohit Khare
> Subject: Re: [FoRK] large scale dataset mailing list/resources?
>
>
> On Wed, Feb 20, 2008 at 3:02 PM, Jeff Bone <jbone at place.org> wrote:
>
>>  On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:
>>
>>  > Hey, all-
>>  >
>>  > A friend is working on a fairly large-scale data project- will
>>  > probably top out in the neighborhood of 5M records (but potentially 25-50
>>  > times that if it really takes off), each of which is both a lot of text
>>  > to be analyzed (5-50K words, with link and potentially grammar
>>  > analysis) and an associated pdf (original source material.) Goal is to
>>  > do good search and probably eventual statistical analysis for
>>  > research. (No prizes for guessing what this is if you've been
>>  > following my blog ;)
>>
>>  This is big?
>>
>
> Big enough to make doing it on a single machine much less responsive
> than he'd like. Or to put it another way: his data sets are growing
> larger and harder to parse faster than his machines are growing bigger
> and faster at parsing, so things are getting more complicated.
>
>
>>  > Currently search is Apache Solr-powered; he's considering moving to an
>>  > RDF store
>>
>>  Yeah, good luck w/ that! ;-)
>>
>
> Yeah, I didn't want to tell him that flat out, since it isn't really
> my project on the technology side, but I'm hoping to nudge him away.
>
>
>>  Random and tangential, but anybody seen this:
>>
>>    http://blog.freebase.com/?p=108
>>
>
> Eeenteresting.
>
> Luis
> _______________________________________________
> FoRK mailing list
> http://xent.com/mailman/listinfo/fork
>

--
swilliams at hpti.com http://www.hpti.com Per: sdw at lig.net http://sdw.st
Stephen D. Williams 703-371-9362C 703-995-0407Fax 94043 AIM: sdw


