[FoRK] large scale dataset mailing list/resources?

Jeff Bone <jbone at place.org> on Wed Feb 20 12:02:34 PST 2008

On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:

> Hey, all-
>
> A friend is working on a fairly large-scale data project- will
> probably top out in the neighborhood of 5M records (but potentially
> 25-50 times that if it really takes off), each of which is both a lot of text
> to be analyzed (5-50K words, with link and potentially grammar
> analysis) and an associated pdf (original source material.) Goal is to
> do good search and probably eventual statistical analysis for
> research. (No prizes for guessing what this is if you've been
> following my blog ;)

This is big?

> Currently search is Apache Solr-powered; he's considering moving to an
> RDF store

Yeah, good luck w/ that! ;-)

Last time I checked (circa Deepfile) none of the RDF solutions scaled  
well at all.  This was a constant PITA for us.

You might be able to stitch something together with column-oriented
stores; this is all deeply dependent on whether attribute-space is
boolean or valued, and whether it's sparse or dense for any given
attribute. You might also have some luck with roll-your-own Bloom
filters as a kind of inexpensive initial check for the existence of a
given attribute for a given object, thus avoiding unnecessary table
scans.
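
For the curious, a rough sketch of the Bloom filter idea in Python. This
is only an illustration under my own assumptions: the class, the
make_key and lookup helpers, the filter sizing, and the list-of-tuples
"table" are all hypothetical stand-ins, not anything from a real store.
The point is just that a negative answer from the filter is definitive,
so the scan can be skipped; a positive answer may be a false positive
and still needs the real lookup.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are arbitrary illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from two 64-bit
        # slices of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def make_key(object_id, attribute):
    # Hypothetical key scheme: one filter entry per (object, attribute).
    return f"{object_id}\x00{attribute}"


def lookup(table, bloom, object_id, attribute):
    """Return the attribute value, or None; skip the scan on a sure miss."""
    if make_key(object_id, attribute) not in bloom:
        return None  # definitely absent, no scan needed
    for row_id, attr, value in table:  # stand-in for a real table scan
        if row_id == object_id and attr == attribute:
            return value
    return None  # the filter gave a false positive


# Toy data: (object id, attribute, value) rows.
table = [("doc1", "lang", "en"), ("doc2", "pages", 12)]
bloom = BloomFilter()
for row_id, attr, _ in table:
    bloom.add(make_key(row_id, attr))

print(lookup(table, bloom, "doc1", "lang"))
print(lookup(table, bloom, "doc2", "lang"))
```

Whether this buys anything depends on the sparse-vs-dense question
above: if most objects carry most attributes, the filter almost always
says "maybe" and you pay for the scan anyway.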

Random and tangential, but anybody seen this:

   http://blog.freebase.com/?p=108

...?

jb

