An Open Repository of Web Data by Lisa Green, Common Crawl

Thursday 4 October 2012, 15h30-16h30, drinks afterwards
Cafe-restaurant Polder, Science Park 205, Amsterdam
Free admission, no registration required

Every year it becomes cheaper to compute, store, and manage information. Yet software developers and researchers continue to face limited access to open, transparent, trusted data in bulk. Open Data must be part of the Big Data ecosystem if we are to realize the full potential of the new tools and technologies.

The Web is the largest and most diverse collection of information in human history. It is crucial for our information-based society that Web data be openly accessible to anyone who desires to utilize it.

The Common Crawl Foundation produces and maintains a repository of Web crawl data that is openly accessible to everyone. The crawl currently covers over 8 billion pages and the repository includes valuable metadata. The crawl data can be accessed through SARA and through Amazon Web Services. Researchers can now access a large corpus of high-quality crawl data that was previously available only to large search engine corporations.

This talk will explain the benefits of an open and accessible repository of crawl data; give an overview of the Common Crawl data; show how users can access it; share examples of work that builds on it; and preview the foundation’s future goals.

Lisa Green is the Director of the Common Crawl Foundation, where she oversees the foundation’s mission of building, maintaining, and openly disseminating a comprehensive crawl of the web. Common Crawl’s 130TB corpus of over 8 billion web pages enables innovation in education, research, and business. Prior to Common Crawl, she was the Chief of Staff at Creative Commons. Lisa holds a PhD in physical chemistry from the University of California, Berkeley, lives in San Francisco, and is passionate about open systems and big data.

This colloquium is organized by SARA. 

e-Infrastructure colloquia are organized by BiG Grid, e-BioGrid, NBIC, SARA, Nikhef, EGI, NLeSC, UvA.