You undoubtedly know what a CAPTCHA is, even if you don't know the name. It's the distorted or otherwise obscured text that some websites use to prevent bots from spamming them. It's an annoying but unfortunately necessary tool, and now someone has come up with a way to make all that CAPTCHA-interpreting frustration useful.
A team at Carnegie Mellon University in Pittsburgh is involved in digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive, and uses Optical Character Recognition (OCR) software to examine scanned images of texts and turn them into digital text files which can be stored and searched by computers.
But the OCR software is unable to read about one in 10 words, due to the poor quality of the original documents.
The only reliable way to decode them is for a human to examine them individually - a mammoth task since CMU processes thousands of pages of text every month.
To solve this problem the team takes images of the words which the OCR software can't read, and uses them as CAPTCHAs.
These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs.
Thanks to the adoption of reCAPTCHAs by popular websites like Facebook, Twitter and StumbleUpon, the system is helping to decipher about one million words every day for CMU's book archiving project, according to [CMU Professor Luis] von Ahn.
Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man hours a day spent deciphering words that CMU's computers find illegible.
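The arithmetic behind that figure is easy to check. Here is a quick sketch using only the two numbers quoted in the article (one million words a day, about 10 seconds per word); the variable names are just illustrative.

```python
# Figures quoted in the article; the names are illustrative only.
WORDS_PER_DAY = 1_000_000
SECONDS_PER_WORD = 10

total_seconds = WORDS_PER_DAY * SECONDS_PER_WORD  # 10,000,000 seconds
person_hours = total_seconds / 3600               # 3600 seconds in an hour

print(f"{person_hours:,.0f} person-hours per day")  # prints "2,778 person-hours per day"
```

That comes to roughly 2,778 hours, which matches the "almost three thousand" in the article.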
A handy extra benefit of this system is that reCAPTCHAs are particularly good at foiling bots while remaining legible to people.
"Firstly, we are starting with words that we know our computers can't read," says von Ahn. "These words have also been distorted naturally over time, and the number of ways they have been distorted is very large."
Very, very cool!
(Via Michele Grant at Letters to the Management.)
Luis von Ahn is like the coolest dude ever, he has a Google Tech Talk you can watch on the internet that's really good. He does these awesome internet games that trick people into labeling images & stuff.
Posted by: Cyn | October 03, 2007 at 02:01 PM
I vaguely recall that one of the search engines - Google, probably - was trying to get certain users to rank the results in terms of whether it was what they were looking for. Was that his idea?
Posted by: Mithras | October 03, 2007 at 03:17 PM
that is really cool, but i don't get how it could work.
say, for example, facebook uses the scanned word "portion" from the graphic in this post as its CAPTCHA. some dude tries to comment on facebook and is flashed the image of the scanned word. he recognizes the word and types "portion" as a response.
how does the computer know that his response is correct? the key to CAPTCHAs is that it already knows that the "SKWCZ" graphic at the top of this post says SKWCZ. so when a user types it correctly, it knows the user is a real person but not a bot. it seems like the reCAPTCHA folks have it backwards. their computer doesn't know what the text days, so how can it verify that the user got it right?
Posted by: upyernoz | October 04, 2007 at 11:22 AM
err, i mean "what the text says" not "days"
Posted by: upyernoz | October 04, 2007 at 12:37 PM
Did you RTFA? It says it gives a known captcha along with an unknown recaptcha. If the user renders the captcha correctly, then it assumes provisionally that the recaptcha is correctly typed, too. Then it compares those provisional "correct" responses from different users to each other to see if they match.
Posted by: Mithras | October 04, 2007 at 12:50 PM
err, i guess that does make sense.
and no, i didn't RTFA.
i guess reCAPTCHA is pretty clever after all.
Posted by: upyernoz | October 04, 2007 at 03:59 PM
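For anyone else puzzling over the same question, the scheme Mithras describes in the comments (a known CAPTCHA shown alongside an unknown word, with answers compared across users) might look something like the sketch below. This is only an illustration under stated assumptions, not reCAPTCHA's actual code: the function names, the case-insensitive matching, and the threshold of three matching answers are all made up for the example.

```python
from collections import Counter

# Hypothetical threshold: how many matching answers are required before a
# transcription is accepted. The article does not give the real value.
MIN_AGREEMENT = 3


def check_response(control_answer, control_typed, unknown_typed, votes):
    """Return True if the user passes; if so, record a vote for the unknown word."""
    if control_typed.strip().lower() != control_answer.lower():
        return False  # failed the word we can verify, so ignore the other guess
    votes[unknown_typed.strip().lower()] += 1
    return True


def accepted_transcription(votes):
    """Return the agreed reading once enough users match, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= MIN_AGREEMENT else None


votes = Counter()
# Each pair is (typed for the known word, typed for the unknown word).
submissions = [
    ("skwcz", "portion"),
    ("skwcz", "portion"),
    ("xxxxx", "potion"),   # fails the control word, so "potion" is not counted
    ("SKWCZ", "Portion"),
]
results = [check_response("SKWCZ", known, unknown, votes)
           for known, unknown in submissions]

print(results)                        # [True, True, False, True]
print(accepted_transcription(votes))  # portion
```

The point is that the computer never needs to know the unknown word in advance. It only trusts a guess from someone who got the known word right, and only accepts a reading once several such people agree.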