[FoRK] large scale dataset mailing list/resources?
Jeff Bone <jbone at place.org>
Wed Feb 20 12:02:34 PST 2008
On Feb 20, 2008, at 9:20 AM, Luis Villa wrote:
> Hey, all-
>
> A friend is working on a fairly large-scale data project- will
> probably top out in the neighborhood of 5M records (but potentially 25-50
> times that if it really takes off), each of which is both a lot of text
> to be analyzed (5-50K words, with link and potentially grammar
> analysis) and an associated pdf (original source material.) Goal is to
> do good search and probably eventual statistical analysis for
> research. (No prizes for guessing what this is if you've been
> following my blog ;)
This is big?
> Currently search is Apache Solr-powered; he's considering moving to an
> RDF store
Yeah, good luck w/ that! ;-)
Last time I checked (circa Deepfile) none of the RDF solutions scaled
well at all. This was a constant PITA for us.
You might be able to stitch something together with column-oriented
stores; this all depends heavily on whether the attribute space is
boolean or valued, and whether it's sparse or dense for any given
attribute. You might also have some luck with roll-your-own
Bloom filters as a kind of inexpensive initial check for existence of
a given attribute for a given object, thus avoiding unnecessary table
scans.
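
To make the Bloom filter idea concrete, here is a rough sketch in Python, not anything we actually ran. It assumes one filter entry per (object, attribute) pair; the names BloomFilter, attr_key, has_attribute, and the dict standing in for the real table are all made up for illustration. A "no" from the filter is definite, so the expensive lookup can be skipped; a "maybe" still has to be confirmed against the store.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def attr_key(object_id, attribute):
    # Hypothetical key scheme: one filter entry per (object, attribute) pair.
    return f"{object_id}\x00{attribute}"


def has_attribute(bf, table, object_id, attribute):
    # Cheap check first: a negative answer from the filter is definite.
    if not bf.might_contain(attr_key(object_id, attribute)):
        return False
    # "Maybe" from the filter: confirm against the real store
    # (a dict here, standing in for the expensive table scan).
    return attribute in table.get(object_id, set())


# Toy stand-in for the attribute table.
table = {"doc-1": {"has_pdf"}, "doc-2": {"cites"}}

bf = BloomFilter()
for obj, attrs in table.items():
    for attr in attrs:
        bf.add(attr_key(obj, attr))
```

Calling has_attribute(bf, table, "doc-1", "has_pdf") goes through to the table and confirms the hit, while a lookup for an attribute that was never added will almost always be rejected by the filter alone. The bit count and number of hashes above are arbitrary; they would need sizing against the real number of (object, attribute) pairs to keep the false-positive rate tolerable.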
Random and tangential, but anybody seen this:
http://blog.freebase.com/?p=108
...?
jb