Google to Host Terabytes of Open-Source Science Data ~ Encyclopedia - Online Marketing With Google Yahoo MSN

Monday, January 21, 2008


Sources at Google have disclosed that the humble domain http://research.google.com will soon provide a home for terabytes of open-source scientific datasets. The storage will be free to scientists, and access to the data will be free for all. The project, known as Palimpsest and first previewed to the scientific community at the Science Foo Camp at the Googleplex last August, missed its original launch date this week but will debut soon.

Building on the company's acquisition of the data visualization technology Trendalyzer from the oft-lauded, TED-presenting Gapminder team, Google will also offer algorithms for examining and probing the data. The new site will have YouTube-style annotating and commenting features.

The storage would fill a major need for scientists who want to share their data openly, and would give citizen scientists access to an unprecedented amount of data to explore. For example, two planned datasets are the full 120 terabytes of Hubble Space Telescope data and the images from the Archimedes Palimpsest, the 10th-century manuscript that inspired the Google dataset storage project.

UPDATE (12:01pm): Attila Csordas of Pimm has a lot more details on the project, including a set of slides from a presentation that Jon Trowbridge of Google gave in Paris last year. WIRED's own Thomas Goetz also mentioned the project in his fantastic piece on freeing dark data.

One major issue with science's huge datasets is how to get them to Google. In this post by a SciFoo attendee over at business|bytes|genes|molecules, the collection plan was described:

(Google people) are providing a 3TB drive array (Linux RAID5). The array is provided in a “suitcase” and shipped to anyone who wants to send their data to Google. Anyone interested gives Google the file tree, and they SLURP the data off the drive. I believe they can extend this to a larger array (my memory says 20TB).
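
To get a feel for the scale involved, here is a rough back-of-the-envelope sketch in Python. It takes the figures quoted above (a 120-terabyte dataset, 3TB and 20TB arrays) and counts how many shipments each array size would require. The function names are hypothetical, the array sizes are treated as usable capacity, and the 100 Mbit/s sustained network link used for comparison is purely an assumption for illustration, not a figure from Google or from the post.

```python
import math

TB = 10**12  # decimal terabyte, in bytes


def arrays_needed(dataset_tb: float, array_tb: float) -> int:
    """Number of drive-array shipments needed to carry the whole dataset."""
    return math.ceil(dataset_tb / array_tb)


def network_days(dataset_tb: float, link_mbps: float) -> float:
    """Days needed to send the dataset over a link sustained at link_mbps (megabits/s)."""
    seconds = dataset_tb * TB * 8 / (link_mbps * 10**6)
    return seconds / 86400


if __name__ == "__main__":
    # Figures from the article: 120 TB of Hubble data, 3 TB and 20 TB arrays.
    print(arrays_needed(120, 3))    # shipments with the 3 TB array
    print(arrays_needed(120, 20))   # shipments with the 20 TB array
    # Assumed link speed (100 Mbit/s sustained); not a figure from the article.
    print(round(network_days(120, 100), 1))
```

Under those assumptions, the Hubble archive would take 40 trips with the 3TB array or 6 with the 20TB one, while the assumed network link would need well over three months of uninterrupted transfer.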

You can check out more details on why hard drives are the preferred distribution method at Pimm. And we hear that Google is hunting for cool datasets, so if you have one, it might pay to get in touch with them.


Source: http://blog.wired.com/wiredscience/2008/01/google-to-provi.html
