Friday, April 26, 2013

Brewster Kahle - For the love of media and metadata! Update

A lot of valuable broadcast material is just about to be thrown out in Hilversum
Important broadcast material documenting the 20th century, which has cost millions of Euros to make, is rotting in vaults. Worse still, as the economic crisis in Europe worsens, valuable collections are being destroyed. This is complete madness when, in other parts of the world, archivists like Brewster Kahle have come up with practical solutions. Of course, not everything can be preserved. But dumping the lot is destroying heritage which cannot be replaced.

The archivists on the West Coast of the US have a path to preserving heritage for the next generations. And I love people who not only have big ideas, but also execute on them. Leo Laporte interviewed Brewster this past week. It's a great interview because Brewster has refined his story of why this is all so important. And he is now getting the backing of many libraries who realise that the old models are breaking down. Libraries provide access to knowledge in coherent ways. And although we often think that "everything is on the Internet", that is just not true. The US Library of Congress has a collection of 26 million pieces. The Internet Archive is collecting about 1 billion items a week. They have around 10 Petabytes (10 thousand Terabytes) which they store inside a former church.

I see the BBC's been to the archive too. There is a clip from Alan Yentob on YouTube.

Scanning books to digitize the content @the Internet Archive

Petabytes of Servers - with backups in Amsterdam it seems

Starting to preserve the physical books...
Few people know that the Internet Archive has also been taping US TV channels off the air since 2009 and coming up with a great search engine to find quotes (like they do on the Daily Show).

West Coast of Europe Connection

Recently had the privilege of talking with Brewster on his visit to Amsterdam to talk with
Part of the Internet Archive is also backed up on Petabyte servers in Amsterdam
Brewster and the team on a recent visit to Amsterdam
Love the sticker on his MacBook. 10 Petabytes filed so far....

Call to action - Help and advice needed.

I have compiled and curated an archive of the Media Network programmes I made between 1980 and 2000. It is about 200 hours of audio material documenting the (international) broadcast media for the last half of the 20th century. We did quite a few documentaries on broadcasting before, during and after the Second World War. I am wondering how to transfer this on-line collection to the Internet Archive, together with the metadata that goes with it. I notice that, in general, people on the production side of things are awful at writing useful metadata so that others can find their great productions. That's why I rewrote the summaries. Without that material, it doesn't make sense. I don't get the 10 million downloads a month that the Internet Archive can boast, but having 5000 unique downloads a month of material which is 20 years old still shows it has some value. By all means explore. You can get in touch through or About me.

Update: interesting stats from this article in the Guardian today.

Kahle, a computer scientist who made a fortune in the 1990s with tech ventures, including Alexa Internet, dreamed of a Great Library of Alexandria 2.0 since he studied at MIT. The archive's first headquarters was in the nearby Presidio district. In 2009 it moved into a former Christian Science church on Funston Avenue; its pillars and facade evoke antiquity.
About 50 staff work here and another 100 work elsewhere in the bay area and in 32 scanning centres, usually in libraries, around the world. The centres digitise books, microfilm and regular film. Automation proved imprecise so it is done manually, each worker processing 800 to 1000 pages per hour. This labour means material such as Boston's John Adams Library, the Hoover archive and the 1930 US census are now online and free. Institutions such as government agencies, libraries and universities, many outside the US, pay modest fees for special requests.
The archive has also stored 750,000 actual books at a nearby climate-controlled storage unit, a literary equivalent of the Svalbard global seed vault. There is space for another 780,000.
Engineers "crawl" the world's top million websites, capturing and storing pages which link to other pages which are captured and stored. Every three months they start over, because the list of top million sites constantly changes. An average web page lasts 75 days. In 2009, they raced against the clock to save as much as they could of the web-hosting service GeoCities, before Yahoo shut it down. If the owner of a defunct website prefers that the pages remain dead, he or she can ask the archive to remove them, requests that are almost always granted.
Engineers also collect news from more than 60 TV stations worldwide and YouTube videos, selecting the latter according to Twitter mentions. "It's not perfect but tweets give us an idea of what people consider important," said Alexis Rossi, the web collections manager. She estimated that the 10bn URLs saved every three-month cycle represented – very, very roughly – about a 10th of the internet's output:
