[FoRK] large scale dataset mailing list/resources?

Stephen Williams <sdw at lig.net> on Wed Feb 20 12:23:26 PST 2008

Justin Mason wrote:
> FoRKer Aaron Swartz to the rescue! http://theinfo.org/
>
>   'This is a site for large data sets and the people who love them: the
>   scrapers and crawlers who collect them, the academics and geeks who
>   process them, the designers and artists who visualize them. It's a place
>   where they can exchange tips and tricks, develop and share tools
>   together, and begin to integrate their particular projects.'
>
> --j.
>   
Cool!
> Luis Villa writes:
>   
>> Hey, all-
>>
>> A friend is working on a fairly large-scale data project- will
>> probably top out in the neighborhood of 5M records (but potentially 25-50
>> times that if it really takes off), each of which is both a lot of text
>> to be analyzed (5-50K words, with link and potentially grammar
>> analysis) and an associated pdf (original source material.) Goal is to
>> do good search and probably eventual statistical analysis for
>> research. (No prizes for guessing what this is if you've been
>> following my blog ;)
>>
>> Currently search is Apache Solr-powered; he's considering moving to an
>> RDF store but I don't know the details there. The pre-processing is
>> becoming a total PITA- every time he improves the parser to get more
>> data out of the original sources, several days' worth of data
>> processing is needed and many, many gigs (not quite a terabyte yet but
>> getting there) of data are created and moved around. He's looking at moving
>> that to AWS to cheaply parallelize, but that only solves some problems
>> and creates others.
>>     
This is exactly the kind of data processing that I am targeting with 
Efficient XML Interchange (W3C, OpenEXI), EsXML (mine, being retooled 
as EXI+/ERI), and Efficient RDF Interchange (ERI, mine), with the aim of 
avoiding parsing / serialization.  OpenEXI will be available soon.  The 
spec for ERI should also be available soon, with code to follow.

In architecting solutions to these kinds of problems, you have to 
balance portable formats, common APIs, extensibility, flexibility, and 
processing models against getting real work done.  There are ways to 
factor out overhead while retaining the benefits of modern data 
interchange architecture.

sdw
>> This is an area with very sketchy resources: the only people who seem
>> to know how to do it well are deeply locked inside G/Y!/MS. More and
>> more people outside the big three are getting into it, but there
>> doesn't seem to yet be much documentation of the CW, best practices,
>> etc., which has frustrated my friend. (As he put it, 'I don't know
>> anything yet, but that hasn't stopped O'Reilly from asking me to help
>> write a book about it...')
>>
>> So... he's doing his best to teach himself this stuff on the fly, but
>> he asked me if I had any pointers to good resources/discussions/etc.
>> on this. I had no idea, but I said I'd ask around- does anyone have
>> pointers to good resources, places where people discuss these
>> problems, etc.?
>>
>> thanks in advance-
>> Luis
>> _______________________________________________
>> FoRK mailing list
>> http://xent.com/mailman/listinfo/fork
>>     

