A solution to the Wikipedia problem.

I just came up with a solution to the Wikipedia problem. Every year Wikipedia goes on about the millions of dollars it needs to keep running. Wikipedia is a volunteer effort, just like the local volunteer fire department: the labor is free, but the equipment and resources are what cost money. Most people don’t have fire equipment to donate to the local fire department, but they do have computers that sit idle most of the time.

Wikipedia is the perfect system to run as a worldwide distributed application. People volunteer content; they can volunteer CPU and disk too. How big could all the Wikipedia data possibly be? It’s mostly text. You know there’ll be some geeks out there more than happy to host copies of the entire thing, and everybody else who contributes disk and CPU (just by running a little application on their PC) could host caches of sections of the whole database. Not outrageous to imagine, and given the state of peer-to-peer systems nowadays, not that hard to do. If Wikipedia started building that system and transferring all its current data to it, it’d never have to ask for money again.
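To make that concrete, here is a rough sketch in Python of how articles could be spread across volunteer machines. The peer names and replication factor are invented, and a real system would want a proper distributed hash table, but the core idea is just deterministic assignment:

    import hashlib

    # Made-up peer list and replication factor, purely for illustration.
    PEERS = ["peer-a.example", "peer-b.example", "peer-c.example",
             "peer-d.example", "peer-e.example"]
    REPLICAS = 3  # how many peers hold a copy of each article

    def peers_for(title):
        """Deterministically pick REPLICAS peers to hold an article."""
        digest = hashlib.sha256(title.encode("utf-8")).digest()
        start = int.from_bytes(digest[:4], "big") % len(PEERS)
        return [PEERS[(start + i) % len(PEERS)] for i in range(REPLICAS)]

    print(peers_for("Volunteer fire department"))

Because the mapping is computed from the title itself, any client can work out which peers should hold a page without asking a central index.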


— later comments —


Light O’Matic  –  Well, my first thought is that they would have to have some way of protecting the content from just anyone serving their own version of it. For example, JavaScript that checksums the page against a per-page hash that is either fetched from a trusted server or computed cryptographically with a master key from a trusted server. My second thought was that they’d have to either make the whole wiki editing system work distributed, or keep editing centralized. Then I realized there are actually a lot of systems out there that already at least partly solve these problems, and maybe one of them solves it completely…
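A minimal sketch of that first option in Python, with the page body and the trusted-hash table invented for the demo; the check is just a hash comparison against a value you only accept from a server you trust:

    import hashlib

    # Invented page body and trusted-hash table, just for the demo.
    page = b"<html>the real article text</html>"
    TRUSTED_HASHES = {"Example article": hashlib.sha256(page).hexdigest()}

    def page_is_genuine(title, body):
        """True only if the body matches the hash the trusted server publishes."""
        return hashlib.sha256(body).hexdigest() == TRUSTED_HASHES.get(title)

    print(page_is_genuine("Example article", page))                      # True
    print(page_is_genuine("Example article", b"<html>tampered</html>"))  # False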

Stu M  –  Well, first realize that there wouldn’t be much point in putting up fake copies of your section of the database, because you can just edit the real thing; the effect is the same. But yeah, you could make it easier with trusted servers. What happens now? There are people who scour the change-history list and go edit, validate, remove, and stop flame wars. The same thing would happen, but the changes would have to propagate around instead of all being in one place. Not trivial, but in the case of Wikipedia, I think it’s a lot easier than, say, bank records.

Light O’Matic  –  They could distribute it with git… But maybe it would be simpler to just distribute reads and keep writes centralized; more of a caching scenario. The problem with people being able to modify their copies of pages is that, I’m assuming, any given page can be served from a lot of different places, so if one or some of them have tainted versions, it might take a while to even notice. Then you’d need a system for removing that bad data. Whereas now, if you edit a page, everyone sees it, and it’s very clear what happened. If I can serve any data I want and pretend it’s from Wikipedia, I could serve a worm or a virus in otherwise totally legit-looking pages. So there has to be protection.
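Here’s a toy version of that reads-distributed, writes-centralized setup, with plain dicts standing in for the peer cache and the central server:

    import hashlib

    # Plain dicts stand in for the central server and a peer's cache.
    CENTRAL = {"Example": b"current revision of the page"}
    CENTRAL_HASHES = {t: hashlib.sha256(b).hexdigest() for t, b in CENTRAL.items()}
    PEER_CACHE = {"Example": b"current revision of the page"}

    def read_page(title):
        """Serve from a peer only if its copy matches the central hash."""
        cached = PEER_CACHE.get(title)
        if cached is not None:
            if hashlib.sha256(cached).hexdigest() == CENTRAL_HASHES[title]:
                return cached              # the peer copy checks out
            del PEER_CACHE[title]          # tainted copy: evict it
        return CENTRAL[title]              # fall back to the central server

    print(read_page("Example"))
    PEER_CACHE["Example"] = b"tampered"    # a bad actor edits their local copy
    print(read_page("Example"))            # falls back to the central server

Writes stay central, so the hash table is the single source of truth; a tainted peer copy can waste a fetch but can never be served as genuine.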

Stu M  –  I suppose you could go with the ‘signed by one of the trusted authorities’ type of thing, which would mean certificate-like data included with every change. The trusted part would come from a top-down delegated authority: the root ‘certificate’ would be signed by Mr. Wikipedia himself, and everybody in the chain would be trusted either by him or by the person above them in the chain.
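A rough sketch of that delegation chain in Python, using Ed25519 signatures from the cryptography package (my choice of scheme; the comment doesn’t name one). The root key signs an editor’s public key, the editor signs each change, and verification walks the chain back to the root:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey, Ed25519PublicKey)
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    root = Ed25519PrivateKey.generate()      # the root key ("Mr. Wikipedia")
    editor = Ed25519PrivateKey.generate()    # a delegated editor's key

    # The root vouches for the editor by signing the editor's raw public key.
    editor_pub = editor.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
    delegation = root.sign(editor_pub)

    # The editor signs an individual change.
    change = b"edit: fix typo in [[Example]]"
    change_sig = editor.sign(change)

    def change_is_trusted(change, change_sig, editor_pub, delegation, root_pub):
        """Accept a change only if its signer is vouched for by the root."""
        try:
            root_pub.verify(delegation, editor_pub)  # is the editor delegated?
            key = Ed25519PublicKey.from_public_bytes(editor_pub)
            key.verify(change_sig, change)           # did the editor sign this?
            return True
        except InvalidSignature:
            return False

    print(change_is_trusted(change, change_sig, editor_pub, delegation,
                            root.public_key()))      # True

A longer chain works the same way: each key signs the next key down, and a verifier checks every link until it reaches the root it already trusts.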

