
Captcha seminar report


CAPTCHA

A SEMINAR REPORT

Submitted by

RISHABH AGARWAL

(1313310118)

Submitted to

MR. SURYA PRAKASH SHARMA

MOHD JAWED KHAN

in the partial fulfilment for the award of degree

of

Bachelor of Technology

in

COMPUTER SCIENCE AND ENGINEERING

at

Department of Computer Science and Engineering

Noida Institute of Engineering & Technology, Greater Noida

Dr. A.P.J. Abdul Kalam Technical University,

Uttar Pradesh,Lucknow

2015-16


CERTIFICATE

This is to certify that Rishabh Agarwal of VI Semester, B. Tech (Computer Science &

Engineering) 2015-16, has presented a seminar titled “CAPTCHA” in partial fulfilment for the

award of the degree of Bachelor of Technology under our supervision.

The report is submitted to the Noida Institute of Engineering & Technology, Gr. Noida as a part

of syllabus prescribed by Dr A.P.J Abdul Kalam Technical University, Uttar Pradesh, Lucknow,

for the degree of Bachelor of Technology during the academic year 2015-16. It is certified that all

the corrections/suggestions indicated have been incorporated in the report deposited in the

department library. The seminar report has been approved as it satisfies the academic

requirements for the award of the degree.

We wish him the best in his endeavours.

Supervisor(s)

Mr Surya Prakash Sharma

Mohd Jawed Khan


ACKNOWLEDGEMENT

I take this opportunity to express my gratitude to all those who have supported me, directly or indirectly, during the completion of this seminar.

I take immense pleasure in thanking Dr. C.S. YADAV (Head of Department, Computer Science Engineering) for providing me invaluable guidance for the technical seminar.

I thank Mr Surya Prakash Sharma and Mohd Jawed Khan, who gave me guidance and direction during this seminar.

I also acknowledge my debt to those who contributed significantly to one or more steps.

Rishabh Agarwal

1313310118

B. Tech 3rd

Year

(Computer Science & Engineering)


ABSTRACT

A CAPTCHA (an acronym for "Completely Automated Public Turing test to tell Computers and

Humans Apart") is a type of challenge-response test used in computing to determine whether or

not the user is human.

The term was coined in 2003 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John

Langford. The most common type of CAPTCHA was first invented in 1997 by Mark D.

Lillibridge, Martin Abadi, Krishna Bharat and Andrei Z. Broder. This form of CAPTCHA

requires that the user type the letters of a distorted image, sometimes with the addition of an

obscured sequence of letters or digits that appears on the screen. Because the test is administered

by a computer, in contrast to the standard Turing test that is administered by a human, a

CAPTCHA is sometimes described as a reverse Turing test.

In 1999, Slashdot published a poll that asked visitors to choose the graduate school that had the

best program in computer science. Students from two universities - Carnegie Mellon and MIT -

created automated programs called bots to vote repeatedly for their respective schools. While

those two schools received thousands of votes, the other schools only had a few hundred each. If

it's possible to create a program that can vote in a poll, how can we trust online poll results at all?

A CAPTCHA form can help prevent programmers from taking advantage of the polling system.

Registration forms on Web sites often use CAPTCHAs. For example, free Web-based e-mail

services allow people to create an e-mail account free of charge. Usually, users must provide

some personal information when creating an account, but the services typically don't verify this

information. They use CAPTCHAs to try to prevent spammers from using bots to generate

hundreds of spam mail accounts.

However, this user identification procedure has received many criticisms, especially from

disabled people, but also from other people who feel that their everyday work is slowed down by

distorted words that are sometimes illegible even to humans. It takes the average person approximately 10

seconds to solve a typical CAPTCHA.


CONTENT

Chapter Title
Certificate
Acknowledgement
Abstract
1 Introduction
1.1 Overview
1.2 Background & Motivation
1.3 CAPTCHAs and the Turing Test
2 Types of CAPTCHA
2.1 Text CAPTCHA
2.1.1 Gimpy
2.1.2 Ez-Gimpy
2.1.3 Baffle Text
2.2 Graphic CAPTCHA
2.2.1 Bongo
2.2.2 PIX
2.3 Audio CAPTCHA
2.4 Re-CAPTCHA and Book Digitization
3 Applications
4 Constructing CAPTCHA
4.1 Things to know
4.2 Implementation
4.3 Guidelines for CAPTCHA implementation
5 Breaking CAPTCHA
5.1 Breaking a Visual CAPTCHA
5.2 Breaking an Audio CAPTCHA
5.3 CAPTCHA Cracking as a Business
6 Issues with CAPTCHA
6.1 Usability issues with text based CAPTCHA
6.2 Usability of Audio CAPTCHA
7 Conclusion
8 Reference


LIST OF FIGURES

Figure Name, Figure No.
Gimpy CAPTCHA, 2.1.1
Yahoo’s Ez – Gimpy CAPTCHA, 2.1.2
Baffle Texts – CAPTCHA, 2.1.3
Bongo CAPTCHA, 2.2.1
PIX CAPTCHA, 2.2.2
Re-CAPTCHA, 2.4


Chapter 1

Introduction:

1.1 Overview:

CAPTCHA is short for Completely Automated Public Turing test to tell Computers and Humans Apart. The term "CAPTCHA" was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford.

CAPTCHAs are challenge-response tests to ensure that the users are indeed human. The purpose of a

CAPTCHA is to block form submissions from spam bots – automated scripts that harvest email

addresses from publicly available web forms. A common kind of CAPTCHA used on most

websites requires the users to enter the string of characters that appear in a distorted form on the

screen.

CAPTCHAs are used because of the fact that it is difficult for the computers to extract the text

from such a distorted image, whereas it is relatively easy for a human to understand the text

hidden behind the distortions. Therefore, the correct response to a CAPTCHA challenge is

assumed to come from a human and the user is permitted into the website.

The motivation to create a test that can tell humans and computers apart comes from the fact that

people are trying to game the system -- they want to exploit weaknesses in the computers running

the site. While these individuals probably make up a minority of all the people on the Internet,

their actions can affect millions of users and Web sites.

For example, a free e-mail service might find itself bombarded by account requests from an

automated program. The CAPTCHA test helps identify which users are real human beings and

which ones are computer programs. Spammers are constantly trying to build algorithms that read

the distorted text correctly. So strong CAPTCHAs have to be designed and built so that the efforts of the spammers are thwarted.

1.2 Background and Motivation:

The need for CAPTCHAs arose from the abuse of websites and search engines by bots. In 1997, AltaVista sought ways to block and discourage the automatic submission of URLs to its search engine. Its method was to generate random printed text that humans could read but machine readers could not. The approach was so effective that within a year “spam add-ons” were reduced by 95%.

In November 1999, slashdot.com released a poll to vote for the best CS College in the US.

Students from the Carnegie Mellon University and the Massachusetts Institute of Technology

created bots that repeatedly voted for their respective colleges. This incident created the urge to

use CAPTCHAs for such online polls to ensure that only human users are able to take part in the

polls.

The proliferation of the publicly available services on the Web is a boon for the community at

large. But unfortunately it has invited new and novel abuses. Programs (bots and spiders) are

being created to steal services and to conduct fraudulent transactions. Some examples:

Free online accounts are being registered automatically many times over and are being used to distribute stolen or copyrighted material.

Recommendation systems are vulnerable to artificial inflation or deflation of rankings. For example, eBay, a famous auction website, allows users to rate a product. Abusers can easily create bots that increase or decrease the rating of a specific product, possibly changing people's perception of the product.

Spammers register for free email accounts such as those provided by Gmail or Hotmail and use bots to send unsolicited mail to other users of that email service.

Online polls are attacked by bots and are susceptible to ballot stuffing. This gives an unfair advantage to those who benefit from it.

In light of the above listed abuses and much more, a need was felt for a facility that checks users

and allows access to services to only human users. It was in this direction that such a tool like

CAPTCHA was created.

1.3 CAPTCHAs and the Turing Test:

CAPTCHA technology has its foundation in an experiment called the Turing Test. Alan Turing,

sometimes called the father of modern computing, proposed the test as a way to examine whether

or not machines can think -- or appear to think -- like humans. The classic test is a game of

imitation. In this game, an interrogator asks two participants a series of questions. One of the

participants is a machine and the other is a human. The interrogator can't see or hear the

participants and has no way of knowing which is which. If the interrogator is unable to figure out

which participant is a machine based on the responses, the machine passes the Turing Test. Of

course, with a CAPTCHA, the goal is to create a test that humans can pass easily but machines

can't. It's also important that the CAPTCHA application is able to present different CAPTCHAs

to different users. If a visual CAPTCHA presented a static image that was the same for every

user, it wouldn't take long before a spammer spotted the form, deciphered the letters, and

programmed an application to type in the correct answer automatically. Most, but not all,

CAPTCHAs rely on a visual test. Computers lack the sophistication that human beings have

when it comes to processing visual data. We can look at an image and pick out patterns more

easily than a computer. But not all CAPTCHAs rely on visual patterns. In fact, it's important to

have an alternative to a visual CAPTCHA. Otherwise, the Web site administrator runs the risk of

disenfranchising any Web user who has a visual impairment. One alternative to a visual test is an

audible one. An audio CAPTCHA usually presents the user with a series of spoken letters or

numbers. It's not unusual for the program to distort the speaker's voice, and it's also common for

the program to include background noise in the recording. This helps thwart voice recognition

programs.


Chapter 2

Types of CAPTCHAs

CAPTCHAs are classified based on what is distorted and presented as a challenge to the user.

2.1 Text CAPTCHAs:

These are simple to implement. The simplest yet novel approach is to present the user with some

questions which only a human user can solve. Examples of such questions are:

1. What is twenty minus three?

2. What is the third letter in UNIVERSITY?

3. Which of Yellow, Thursday and Richard is a color?

4. If yesterday was a Sunday, what is today?
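A question-based challenge of this kind can be sketched in a few lines of Python. This is only an illustration: the question bank reuses the examples above, while the names `QUESTIONS`, `pick_question` and `check_answer`, and the choice to accept answers case-insensitively, are assumptions made for the sketch.

```python
import random

# Hypothetical question bank: (question, set of accepted answers in lower case).
QUESTIONS = [
    ("What is twenty minus three?", {"17", "seventeen"}),
    ("What is the third letter in UNIVERSITY?", {"i"}),
    ("Which of Yellow, Thursday and Richard is a color?", {"yellow"}),
    ("If yesterday was a Sunday, what is today?", {"monday"}),
]

def pick_question(rng=random):
    """Return a randomly chosen (question, accepted_answers) pair."""
    return rng.choice(QUESTIONS)

def check_answer(accepted_answers, response):
    """Compare the response case-insensitively, ignoring surrounding spaces."""
    return response.strip().lower() in accepted_answers
```

A small fixed bank like this one is easy to enumerate, so a real deployment would need a much larger or generated set of questions.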

Such questions are very easy for a human user to solve, but it's very difficult to program a computer to solve them. They are also friendly to people with visual disabilities. Other text CAPTCHAs involve text distortions, and the user is asked to identify the hidden text. The various implementations are:

2.1.1 Gimpy:

Gimpy is a very reliable text CAPTCHA built by CMU in collaboration with Yahoo for their

Messenger service. Gimpy is based on the human ability to read extremely distorted text and the

inability of computer programs to do the same. Gimpy works by choosing ten words randomly

from a dictionary, and displaying them in a distorted and overlapped manner. Gimpy then asks

the users to enter a subset of the words in the image. The human user is capable of identifying the

words correctly, whereas a computer program cannot.

Fig 2.1.1 Gimpy CAPTCHA [4]
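The selection and checking logic described for Gimpy can be sketched as follows. The distortion and overlapping of the rendered words are not shown. The word list and the function names are hypothetical, and the threshold of three correct words is taken from the description of Gimpy later in this report.

```python
import random

# Tiny stand-in dictionary; a real deployment would use a much larger list.
WORDS = ["apple", "river", "stone", "cloud", "tiger", "piano", "lemon",
         "chair", "green", "sugar", "world", "brick"]

def make_challenge(rng=random, count=10):
    """Pick `count` distinct words to be rendered (distorted) in one image."""
    return rng.sample(WORDS, count)

def passes(shown_words, typed_words, required=3):
    """Accept if at least `required` distinct typed words appear in the image."""
    shown = {w.lower() for w in shown_words}
    correct = {w.strip().lower() for w in typed_words} & shown
    return len(correct) >= required
```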

2.1.2 Ez – Gimpy:

This is a simplified version of the Gimpy CAPTCHA, adopted by Yahoo in their signup page. Ez

– Gimpy randomly picks a single word from a dictionary and applies distortion to the text. The

user is then asked to identify the text correctly.


Fig 2.1.2 Yahoo’s Ez – Gimpy CAPTCHA [5]

2.1.3 Baffle Text:

This was developed by Monica Chew at the University of California at Berkeley and Henry Baird at the Palo Alto Research Center (PARC). It is a variation of Gimpy. It does not use dictionary words; instead it picks random letters to create nonsense but pronounceable text. Distortions are then added to this text and the user is challenged to guess the right word.

This technique overcomes the drawback of Gimpy CAPTCHA because, Gimpy uses dictionary

words and hence, clever bots could be designed to check the dictionary for the matching word by

brute-force.

Fig 2.1.3 Baffle Texts – CAPTCHA [6]
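One simple way to produce nonsense but pronounceable text is to alternate consonants and vowels. The sketch below uses that shortcut purely for illustration; it is an assumption of this example and not necessarily how Baffle Text itself generates its words. The letter sets and the function name are hypothetical.

```python
import random

CONSONANTS = "bcdfghjklmnprstvw"
VOWELS = "aeiou"

def nonsense_word(length=6, rng=random, dictionary=frozenset()):
    """Build a pronounceable string by alternating consonants and vowels.
    Retry if the result happens to be a real word listed in `dictionary`."""
    while True:
        word = "".join(
            rng.choice(CONSONANTS if i % 2 == 0 else VOWELS)
            for i in range(length)
        )
        if word not in dictionary:
            return word
```

Because the result is not a dictionary word, an attacker cannot narrow down partially recognised letters by looking them up in a word list.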

2.2 Graphic CAPTCHAs:

Graphic CAPTCHAs are challenges that involve pictures or objects that have some sort of

similarity that the users have to guess. They are visual puzzles, similar to Mensa tests. The computer generates the puzzles and grades the answers, but is itself unable to solve them.

2.2.1 Bongo:

BONGO, named after M.M. Bongard, asks the user to solve a visual pattern recognition problem.

It displays two series of blocks, the left and the right. The blocks in the left series differ from

those in the right, and the user must find the characteristic that sets them apart.


Fig 2.2.1 Bongo CAPTCHA [7]

These two sets are different because everything on the left is drawn with thick lines, while everything on the right is drawn with thin lines. After seeing the two series, the user is presented with a set of four single blocks and is asked to determine which series each block belongs to. The user passes the test if s/he assigns the blocks to the correct series.

2.2.2 PIX:

PIX is a program that has a large database of labeled images. All of these images are pictures of

concrete objects (a horse, a table, a house, a flower). The program picks an object at random,

finds six images of that object from its database, presents them to the user and then asks the

question “what are these pictures of?” Current computer programs should not be able to answer this question from the pictures alone. However, a program with access to the database could simply look up the images and read off their labels. For PIX to be a true CAPTCHA, the images are randomly distorted before being presented to the user, so that computer programs cannot easily search the database for the undistorted image.

Fig 2.2.2 PIX CAPTCHA [8]
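The challenge-generation step of PIX can be sketched as below. The image distortion is deliberately left out, and the database contents, file names and function names are hypothetical placeholders rather than part of the real PIX program.

```python
import random

# Hypothetical labeled image database: label -> image file names.
IMAGE_DB = {
    "horse": [f"horse_{i}.jpg" for i in range(10)],
    "table": [f"table_{i}.jpg" for i in range(10)],
    "house": [f"house_{i}.jpg" for i in range(10)],
    "flower": [f"flower_{i}.jpg" for i in range(10)],
}

def make_pix_challenge(rng=random, images_per_challenge=6):
    """Pick one label at random and six of its images; return (label, images)."""
    label = rng.choice(sorted(IMAGE_DB))
    images = rng.sample(IMAGE_DB[label], images_per_challenge)
    return label, images

def check_pix_answer(label, response):
    """The user passes if the typed object name matches the hidden label."""
    return response.strip().lower() == label
```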

2.3 Audio CAPTCHAs:

Another approach to make CAPTCHAs is based on sound. The program picks a word or a sequence of numbers at random, renders the word or the numbers into a sound clip and distorts the sound clip; it then presents the distorted sound clip to the user and asks the user to enter its contents. This CAPTCHA is based on the difference in ability between humans and computers in recognizing spoken language.

The idea is that a human is able to efficiently disregard the distortion and interpret the characters

being read out while software would struggle with the distortion being applied, and need to be

effective at speech to text translation in order to be successful. This is a crude way to filter

humans and it is not so popular because the user has to understand the language and the accent in

which the sound clip is recorded.

2.4 Re-CAPTCHA and Book digitization:

To counter various drawbacks of the existing implementations, researchers developed a

redesigned CAPTCHA called the Re-CAPTCHA. About 200 million CAPTCHAs are solved by

humans around the world every day consuming more than 150,000 hours of work each day. What

if we could make positive use of this human effort?

Re-CAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into

"reading" books.

To archive human knowledge and to make information more accessible to the world, multiple

projects are currently digitizing physical books that were written before the computer age. The

book pages are being photographically scanned, and then transformed into text using "Optical

Character Recognition" (OCR). The problem is that OCR is not perfect.

Re-CAPTCHA improves the process of digitizing books by sending words that cannot be read by

computers to the Web in the form of CAPTCHAs for humans to decipher. But if a computer can't

read such a CAPTCHA, how does the system know the correct answer to the puzzle?

Each new word that cannot be read correctly by OCR is given to a user in conjunction with

another word for which the answer is already known. The user is then asked to read both words.

If they solve the one for which the answer is known, the system assumes their answer is correct

for the new one. The system then gives the new image to a number of other people to determine,

with higher confidence, whether the original answer was correct. Currently, Re-CAPTCHA is

employed in digitizing books as part of the Google Books Project.

Fig 2.4 Re-CAPTCHA [9]
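The control-word idea described above can be sketched as a small voting scheme. The class name, the identifiers for unknown words and the agreement threshold `votes_needed` are assumptions made for this illustration; the report does not say how many matching answers the real system requires.

```python
from collections import Counter, defaultdict

class WordPairVerifier:
    """Pairs a control word (answer known) with an unknown word (OCR failed)."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed       # assumed agreement threshold
        self.votes = defaultdict(Counter)      # unknown_id -> answer counts
        self.verified = {}                     # unknown_id -> accepted text

    def submit(self, control_answer, control_typed, unknown_id, unknown_typed):
        """Return True if the user passes; record the guess for the unknown word."""
        if control_typed.strip().lower() != control_answer.lower():
            return False                       # failed the known word: ignore the guess
        guess = unknown_typed.strip().lower()
        self.votes[unknown_id][guess] += 1
        if self.votes[unknown_id][guess] >= self.votes_needed:
            self.verified[unknown_id] = guess  # enough users agree on this reading
        return True
```

Only users who solve the known word contribute a vote, which is what lets the system trust answers for a word it cannot check directly.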


Chapter 3

Applications

CAPTCHAs are used in various Web applications to identify human users and to restrict access to humans only. Some of these applications are:

Online Polls: Bots can wreak havoc on any unprotected online poll. They can cast a large number of votes and so misrepresent which option is actually winning. This also reduces trust in such polls. CAPTCHAs can be used on websites with embedded polls to keep bots from voting, and hence improve the reliability of the polls.

Protecting Web Registration: Several companies offer free email and other services. Until

recently, these service providers suffered from a serious problem – bots. These bots would take

advantage of the service and would sign up for a large number of accounts. This often created

problems in account management and also increased the burden on their servers. CAPTCHAs can

effectively be used to filter out the bots and ensure that only human users are allowed to create

accounts.

Preventing comment spam: Most bloggers are familiar with programs that submit large numbers of automated comments, usually with the intention of increasing the search engine rank of a linked site. CAPTCHAs can be used before a post is submitted to ensure that only human users can create posts.

Search engine bots: It is sometimes desirable to keep web pages unindexed to prevent others

from finding them easily. There is an html tag to prevent search engine bots from reading web

pages. The tag, however, doesn't guarantee that bots won't read a web page; it only serves to say

"no bots, please." Search engine bots, since they usually belong to large companies, respect web

pages that don't want to allow them in. However, in order to truly guarantee that bots won't enter

a web site, CAPTCHAs are needed.

E-Ticketing: Ticket brokers like Ticketmaster also use CAPTCHA applications. These

applications help prevent ticket scalpers from bombarding the service with massive ticket

purchases for big events. Without some sort of filter, it's possible for a scalper to use a bot to

place hundreds or thousands of ticket orders in a matter of seconds. Legitimate customers become

victims as events sell out minutes after tickets become available. Scalpers then try to sell the

tickets above face value. While CAPTCHA applications don't prevent scalping; they do make it

more difficult to scalp tickets on a large scale.

Email spam: CAPTCHAs also present a plausible solution to the problem of spam emails. The idea is to use a CAPTCHA challenge to verify that a human has indeed sent the email.

Preventing Dictionary Attacks: CAPTCHAs can also be used to prevent dictionary attacks in

password systems. The idea is simple: prevent a computer from being able to iterate through the

entire space of passwords by requiring it to solve a CAPTCHA after a certain number of

unsuccessful logins. This is better than the classic approach of locking an account after a

sequence of unsuccessful logins, since doing so allows an attacker to lock accounts at will.
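The login policy described here can be sketched as follows. The threshold of three failed attempts, the in-memory counter and the function names are assumptions for illustration; the password and CAPTCHA checks themselves are passed in as booleans to keep the sketch short.

```python
from collections import defaultdict

MAX_FREE_ATTEMPTS = 3               # assumed threshold, not from the report
failed_attempts = defaultdict(int)  # username -> consecutive failed logins

def captcha_required(username):
    """A CAPTCHA is demanded once the failure count reaches the threshold."""
    return failed_attempts[username] >= MAX_FREE_ATTEMPTS

def try_login(username, password_ok, captcha_ok=False):
    """Return True on successful login. The account is never locked; the caller
    must simply solve a CAPTCHA as well once too many attempts have failed."""
    if captcha_required(username) and not captcha_ok:
        return False                # reject without even checking the password
    if password_ok:
        failed_attempts[username] = 0
        return True
    failed_attempts[username] += 1
    return False
```

Because the account stays usable for anyone who solves the CAPTCHA, an attacker cannot lock a legitimate user out simply by submitting wrong passwords.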

As a tool to verify digitized books: This is a way of increasing the value of CAPTCHA as an application. An application called Re-CAPTCHA harnesses users' responses in CAPTCHA fields to verify the contents of a scanned piece of paper. Because computers aren't always able to identify words from a digital scan, humans have to verify what a printed page says. Then it's possible for search engines to search and index the contents of a scanned document. This is how it works: The application shows the visitor two words, one of which it already recognizes. If the visitor types that word into a field correctly, the application assumes the second word the user types is also correct. That second word goes into a pool of words that the application will present to other users. As each user types in a word, the application compares the word to the original answer. Eventually, the application receives enough responses to verify the word with a high degree of certainty. That word can then go into the verified pool.

Improve Artificial Intelligence (AI) technology: Luis von Ahn of Carnegie Mellon University

is one of the inventors of CAPTCHA. In a 2006 lecture, von Ahn talked about the relationship

between things like CAPTCHA and the field of artificial intelligence (AI). Because CAPTCHA is

a barrier between spammers or hackers and their goal, these people have dedicated time and

energy toward breaking CAPTCHAs. Their successes mean that machines are getting more

sophisticated. Every time someone figures out how to teach a machine to defeat a CAPTCHA, we

move one step closer to artificial intelligence.

As people find new ways to get around CAPTCHA, computer scientists like Von Ahn develop

CAPTCHAs that address other challenges in the field of AI.

A step backward for CAPTCHA is still a step forward for AI – “Every defeat is also a victory”


Chapter 4

Constructing CAPTCHAs

4.1 Things to know:

The first step to create a CAPTCHA is to look at different ways humans and machines process information. Machines follow sets of instructions. If something falls outside the realm of those instructions, the machines aren't able to compensate. A CAPTCHA designer has to take this into account when creating a test.

For example, it's easy to build a program that looks at metadata – the information on the Web that's invisible to humans but machines can read. If you create a visual CAPTCHA and the images' metadata includes the solution, your CAPTCHA will be broken in no time.

Similarly, it's unwise to build a CAPTCHA that doesn't distort letters and numbers in some way. An undistorted series of characters isn't very secure. Many computer programs can scan an image and recognize simple shapes like letters and numbers.

One way to create a CAPTCHA is to pre-determine the images and solutions it will use. This

approach requires a database that includes all the CAPTCHA solutions, which can compromise

the reliability of the test. If a spammer managed to find a list of all CAPTCHA solutions, he or

she could create an application that bombards the CAPTCHA with every possible answer in a

brute-force attack. The database would need more than 10,000 possible CAPTCHAs to meet the

qualifications of a good CAPTCHA.

Using randomization makes a brute-force attack far less practical. The odds of a bot entering the correct series of random letters are very low. The longer the string of characters, the less likely a bot will get lucky. CAPTCHAs take different approaches to distorting words, from stretching to bending letters in odd ways. In the end, the goal is the same: to make it really hard for a computer to figure out what's in the CAPTCHA.
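The effect of string length on a blind guess can be made concrete with a short calculation. The sketch assumes a 36-symbol alphabet (lower-case letters and digits) and characters chosen uniformly at random; both are assumptions of this example, not figures from the report.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits   # 36 symbols (an assumption)

def guess_probability(length, alphabet_size=len(ALPHABET)):
    """Chance that one uniformly random guess matches a random string."""
    return 1 / alphabet_size ** length

# Each extra character divides the odds of a lucky guess by the alphabet size.
for n in (4, 6, 8):
    print(n, guess_probability(n))
```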

Designers can also create puzzles or problems that are easy for humans to solve. Some

CAPTCHAs rely on pattern recognition and extrapolation. For example, a CAPTCHA might

include series of shapes and ask the user which shape among several choices would logically

come next. The problem with this approach is that not all humans are good with these kinds of

problems and the success rate for a human user can go below 80 percent.

4.2 Implementation:

Embeddable CAPTCHAs: The easiest implementation of a CAPTCHA to a Website would be

to insert a few lines of CAPTCHA code into the Website's HTML code, from an open source

CAPTCHA builder, which will provide the authentication services remotely.

Custom CAPTCHAs: These are less popular because of the extra work needed to create a secure

implementation.

There are advantages in building custom CAPTCHAs:

1. A custom CAPTCHA can fit exactly into the design and theme of your site. It will not look like

some alien element that does not belong there.

2. A custom CAPTCHA can be designed to feel convenient for the user, taking away the perception of a CAPTCHA as an annoyance.

3. A custom CAPTCHA, unlike the major CAPTCHA mechanisms, makes you a less obvious target for spammers. Spammers have little interest in cracking a niche implementation.


Chapter 5

Breaking CAPTCHAs

The challenge in breaking a CAPTCHA isn't figuring out what a message says -- after all,

humans should have at least an 80 percent success rate. The really hard task is teaching a

computer how to process information in a way similar to how humans think. In many cases,

people who break CAPTCHAs concentrate not on making computers smarter, but reducing the

complexity of the problem posed by the CAPTCHA.

Let's assume you've protected an online form using a CAPTCHA that displays English words.

The application warps the font slightly, stretching and bending the letters in unpredictable ways.

In addition, the CAPTCHA includes a randomly generated background behind the word. A

programmer wishing to break this CAPTCHA could approach the problem in phases. He or she

would need to write an algorithm -- a set of instructions that directs a machine to follow a certain

series of steps. In this scenario, one step might be to convert the image to grayscale. That means

the application removes all the color from the image, taking away one of the levels of obfuscation

the CAPTCHA employs.

Next, the algorithm might tell the computer to detect patterns in the black and white image. The

program compares each pattern to a normal letter, looking for matches. If the program can only

match a few of the letters, it might cross reference those letters with a database of English words.

Then it would plug in likely candidates into the submit field. This approach can be surprisingly

effective. It might not work 100 percent of the time, but it can work often enough to be

worthwhile to spammers.
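Two of the phases described above, removing colour and cross-referencing partly recognised letters with a word list, can be sketched in plain Python. The letter-matching step itself is not shown. Images are represented as nested lists of RGB tuples, the luminance weights are the common ITU-R BT.601 values, and the threshold of 128 and the '?' placeholder are assumptions of this sketch.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) pixels to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def to_black_and_white(gray_image, threshold=128):
    """Pixels darker than the threshold become 1 (ink), the rest 0 (background)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_image]

def candidate_words(partial, dictionary):
    """`partial` uses '?' for letters the matcher could not recognise."""
    return [w for w in dictionary
            if len(w) == len(partial)
            and all(p in ("?", c) for p, c in zip(partial, w))]
```

For instance, if only the first and last letters of a three-letter word were recognised, `candidate_words("c?t", words)` would return every three-letter entry that starts with "c" and ends with "t", and the bot could try each in turn.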

What about more complex CAPTCHAs? The Gimpy CAPTCHA displays 10 English words with

warped fonts across an irregular background. The CAPTCHA arranges the words in pairs and the

words of each pair overlap one another. Users have to type in three correct words in order to

move forward. How reliable is this approach? As it turns out, with the right CAPTCHA-cracking

algorithm, it's not terribly reliable. Greg Mori and Jitendra Malik published a paper detailing their

approach to cracking the Gimpy version of CAPTCHA. One thing that helped them was that the

Gimpy approach uses actual words rather than random strings of letters and numbers. With this in

mind, Mori and Malik designed an algorithm that tried to identify words by examining the

beginning and end of the string of letters. They also used Gimpy's 500-word dictionary. Mori

and Malik ran a series of tests using their algorithm. They found that their algorithm could

correctly identify the words in a Gimpy CAPTCHA 33 percent of the time [source: Mori and

Malik]. While that's far from perfect, it's also significant. Spammers can afford to have only one-

third of their attempts succeed if they set bots to break CAPTCHAs several hundred times every

minute.

Another vulnerability that most CAPTCHA scripts have lies in their use of sessions; if we're

on an insecure shared server, any user on that server may have access to everyone else's session

files, so even if our site is totally secure, a vulnerability on any other website hosted on that

machine can lead to a compromise of the session data, and hence, the CAPTCHA script. One

workaround is by storing only a hash of the CAPTCHA word in the session, thus even if someone

can read the session files, they can't find out what the CAPTCHA word is.
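The hash-in-session workaround might look like the following sketch. The session is modelled as a plain dictionary, and the key names, the use of a per-challenge salt with SHA-256, and the one-attempt-per-challenge rule are assumptions of this example rather than details given in the report.

```python
import hashlib
import hmac
import secrets

def store_captcha(session, word):
    """Keep only a salted SHA-256 digest of the CAPTCHA word in the session."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + word.lower()).encode("utf-8")).hexdigest()
    session["captcha_salt"] = salt
    session["captcha_hash"] = digest

def check_captcha(session, response):
    """Hash the response the same way and compare in constant time."""
    salt = session.pop("captcha_salt", None)
    expected = session.pop("captcha_hash", None)   # one attempt per challenge
    if salt is None or expected is None:
        return False
    digest = hashlib.sha256(
        (salt + response.strip().lower()).encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, expected)
```

Note that a hash only hides the word from casual inspection: if CAPTCHA words are short or come from a small dictionary, someone who can read the session file could still try every candidate against the stored digest.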

5.1 Breaking a visual CAPTCHA:

Greg Mori and Jitendra Malik of the University of California at Berkeley's Computer Vision Group evaluated image-based CAPTCHAs for reliability. They tested whether the CAPTCHA can withstand bots that masquerade as humans.

Approach: The fundamental ideas behind their approach to solving Gimpy are the same as those they use to solve generic object recognition problems. Their solution to the Gimpy CAPTCHA is an application of a general framework that they have used to compare images of everyday objects and even to find and track people in video sequences. The essence of these problems is similar. Finding the letters "T", "A", "M", "E" in an image and connecting them to read the word "TAME" is akin to finding hands, feet, elbows, and faces and connecting them up to find a human. Real images of people and objects contain large amounts of clutter. Learning to deal with the adversarial clutter present in Gimpy has helped them in understanding generic object recognition problems.

Breaking an EZ-Gimpy CAPTCHA: Our algorithm for breaking EZ-Gimpy consists of 3 main steps:

1. Locate possible (candidate) letters at various locations: The first step is to hypothesize a set
of candidate letters in the image. This is done using our shape matching techniques. The method
essentially samples points in the image at random and compares these points to points
on each of the 26 letters. The comparison is done in a way that is very robust to background
clutter and deformation of the letters. The process usually results in 3-5 candidate letters per
actual letter in the image. In the example shown in Fig 5.1, the "p" of "profit" matches well to both
an "o" and a "p", the border between the "p" and the "r" looks a bit like a "u", and so forth. At this
stage we keep many candidates, to be sure we don't miss anything for later steps.

2. Construct graph of consistent letters: Next, we analyze pairs of letters to see whether or not
they are "consistent", that is, whether they can be used consecutively to form a word.

3. Look for plausible words in the graph: There are many possible paths through the graph of
letters constructed in the previous step. However, most of them do not form real words. We select
the real words in the graph and assign scores to them based on how well their individual
letters match the image.
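The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the cited work: the Candidate record, the gap-based consistent() test, the average-score ranking and the toy dictionary are all assumptions made for the example, and the candidate letters are supplied by hand rather than produced by shape matching.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    letter: str
    x: float      # horizontal position in the image (hypothetical units)
    score: float  # shape-match quality in [0, 1]; higher is better (assumed scale)

def consistent(a, b, min_gap=5.0, max_gap=20.0):
    """Assumed pairwise test: b may follow a if it lies a plausible distance to the right."""
    return min_gap <= b.x - a.x <= max_gap

def find_words(candidates, dictionary):
    """Walk the graph of consistent letters and rank the dictionary words found."""
    ordered = sorted(candidates, key=lambda c: c.x)
    # Every prefix of every dictionary word, used to prune hopeless paths early.
    prefixes = {w[:i] for w in dictionary for i in range(1, len(w) + 1)}
    results = {}

    def extend(path):
        text = "".join(c.letter for c in path)
        if text in dictionary:
            avg = sum(c.score for c in path) / len(path)
            results[text] = max(results.get(text, 0.0), avg)
        for nxt in ordered:
            if consistent(path[-1], nxt) and text + nxt.letter in prefixes:
                extend(path + [nxt])

    for c in ordered:
        if c.letter in prefixes:
            extend([c])
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

# Step 1 output (hand-made here): several candidate letters per true letter.
candidates = [
    Candidate("t", 0, 0.9), Candidate("l", 1, 0.4),
    Candidate("a", 10, 0.8), Candidate("o", 11, 0.6),
    Candidate("m", 20, 0.9), Candidate("n", 21, 0.5),
    Candidate("e", 30, 0.7), Candidate("c", 31, 0.3),
]
dictionary = {"tame", "tone", "lame"}

ranked = find_words(candidates, dictionary)
for word, score in ranked:
    print(word, round(score, 3))
```

Keeping many weak candidates in step 1 is affordable because the dictionary prefix check discards most paths through the graph before they grow long.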

5.3 Breaking an audio CAPTCHA:

Recent research suggests that Google's audio CAPTCHA is the latest in a string of CAPTCHAs
to have been defeated by software. It has been theorized that one cost-effective means of breaking
audio and image CAPTCHAs for which no automated solver has yet been developed is to use a
"mechanical turk" approach: pay humans a low per-CAPTCHA rate for reading them, or provide
another form of motivation, such as access to popular sites, in exchange for reading the CAPTCHA.

However, this approach has always required a significant level of resources. The development of
software to automatically interpret CAPTCHAs brings up a number of problems for site
operators. The problem, as discovered by Wintercore Labs and published at the start of March, is
that there are repeatable patterns evident in the audio file. By applying a set of complex but
straightforward processes, a library can be built of the basic signal for each possible character
that can appear in the CAPTCHA. Wintercore points to other audio CAPTCHAs that could be
easily reversed using this technique, including the one for Facebook. The wider impact of this
work might take some time to appear, but it provides an interesting proof that audio
CAPTCHAs can be broken.
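The idea of matching audio against a library of per-character signals can be illustrated with a toy sketch. It is not Wintercore's actual method, whose details are not given here: the pure tones standing in for spoken characters, the sample rate, the noise level and the use of normalized correlation as the similarity measure are all assumptions chosen to keep the example small.

```python
import math
import random

def normalized_correlation(a, b):
    """Correlation in [-1, 1] between two sample sequences (truncated to equal length)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def classify_segment(segment, library):
    """Return the library character whose reference signal best matches the segment."""
    return max(library, key=lambda ch: normalized_correlation(segment, library[ch]))

def make_tone(freq_hz, n_samples=200, rate_hz=8000):
    """Hypothetical stand-in for the recorded 'basic signal' of one character."""
    return [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(n_samples)]

# Library of reference signals, one per character (two toy entries).
library = {"3": make_tone(440), "7": make_tone(880)}

# A noisy copy of the "3" signal, as an audio CAPTCHA might present it.
rng = random.Random(0)
noisy_segment = [s + rng.uniform(-0.3, 0.3) for s in make_tone(440)]

print(classify_segment(noisy_segment, library))
```

Because the underlying pattern repeats from one challenge to the next, added background noise lowers the correlation score but does not change which library entry matches best.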

At the least, it shows that both of Google's CAPTCHA tools have now been defeated by software,
and it should only be a matter of time until the same can be said for Microsoft's and Yahoo!'s
offerings. Even with an effectiveness of only 90%, any failed CAPTCHA can easily be reloaded
for a second try.
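The effect of reloading can be made concrete with a short calculation. The sketch below assumes that each attempt succeeds independently with the same probability, which is a simplification; the function name is chosen for this example only.

```python
def success_within(p_single, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts

# With the 90% per-attempt effectiveness quoted above:
for k in (1, 2, 3):
    print(k, round(success_within(0.9, k), 4))
```

Under this assumption a solver that is right 90% of the time fails two attempts in a row only about 1% of the time, so allowing reloads without a limit gives a site little extra protection.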

5.5 CAPTCHA cracking as a business:

No CAPTCHA can survive a human who is receiving financial incentives for solving it.
CAPTCHAs are cracked by firms posing as data processing firms. They usually charge $2 for
1000 CAPTCHAs successfully solved. They advertise their business as “Using the advertisement
in blogs, social networks, etc. significantly increases the efficiency of the business. Many services
use pictures called CAPTCHAs in order to prevent automated use of these services. Solve
CAPTCHAs with the help of this portal; increase your business efficiency now!” Such firms help
spammers in beating the first line of defence for a website, i.e., CAPTCHAs.

Chapter 6

Issues with CAPTCHAs

There are many issues with CAPTCHAs, primarily because they distort text and images in such a
way that it sometimes becomes difficult even for humans to read them. Even a simple but effective
CAPTCHA, such as the arithmetic question “What is the sum of three and five?”, can be difficult
for people with cognitive disabilities.

6.1 Usability issues with text based CAPTCHAs:

Are text CAPTCHAs like Gimpy user-friendly? Sometimes the text is distorted to such an
extent that even humans have difficulty in understanding it. Some of the issues are listed in
Table 6.1.

Distortion becomes a problem when it is done in a very haphazard way. Some characters are
easily confused once distorted: 'd' can be mistaken for 'cl', and 'm' for 'rn'. The text should
also be easily understandable to those who are unfamiliar with the language.

Content is an issue when the string becomes too long or when the string is not a dictionary
word. Care should be taken not to include offensive words.

Presentation should not confuse the users. The font and color chosen should be user-friendly.
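The distortion and content guidelines above can be turned into a simple filter over a word list. This is a minimal sketch under stated assumptions: the confusable patterns are taken from the examples in the text, while the length limit of 8, the sample word list and the blocklist entry are hypothetical placeholders.

```python
import random

# Patterns from the text: 'd' vs 'cl' and 'm' vs 'rn' are easy to confuse when distorted.
CONFUSABLE_PATTERNS = ("cl", "d", "rn", "m")

def is_usable(word, max_length=8, blocklist=frozenset()):
    """Accept a dictionary word only if it is short, inoffensive and unambiguous."""
    word = word.lower()
    if len(word) > max_length:          # string too long (limit is an assumption)
        return False
    if word in blocklist:               # offensive or otherwise unwanted words
        return False
    return not any(p in word for p in CONFUSABLE_PATTERNS)

def pick_challenge(words, rng, max_length=8, blocklist=frozenset()):
    """Pick one usable word at random; raise if the list has none."""
    usable = [w for w in words if is_usable(w, max_length, blocklist)]
    if not usable:
        raise ValueError("no usable words in the list")
    return rng.choice(usable)

# Hypothetical word list and blocklist, for illustration only.
words = ["profit", "tame", "clock", "burn", "dog", "extraordinary", "ghost"]
blocklist = frozenset({"ghost"})

usable = [w for w in words if is_usable(w, blocklist=blocklist)]
print(usable)
print(pick_challenge(words, random.Random(1), blocklist=blocklist))
```

Filtering before distortion is applied keeps the challenge readable for humans without weakening the distortion itself.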

6.2 Usability of audio CAPTCHAs:

In audio CAPTCHAs, letters are read aloud instead of being displayed in an image. Typically,
noise is deliberately added to prevent such audio schemes from being broken by current speech
recognition technologies.

Distortion: Background noises effectively distort sounds in audio CAPTCHAs. There is no
rigorous study of what kind of background noise introduces an acceptable level of sound distortion.
However, it is clear that distortion methods and levels, just as in text-based CAPTCHAs, can have
a significant impact on the usability of audio CAPTCHAs. For example, an early test in 2003
showed that the distorted sound in an audio CAPTCHA deployed at Microsoft's Hotmail
service was unintelligible to all four journalists tested, all of whom had good hearing. Due to
sound distortion, confusing characters can also occur in audio CAPTCHAs. For example, we
observed that it is hard to tell apart 'p' and 'b', 'g' and 'j', and 'a' and '8'. Whether a scheme is
friendly to non-native speakers is another usability concern for audio CAPTCHAs.

Content: Content materials used in audio CAPTCHAs are typically language specific. Digits and
letters read in a language are often not understandable to people who do not speak the language.
Therefore, unlike text-based schemes, audio CAPTCHAs face localisation as a major issue.

Presentation: The use of colour is not an issue for audio CAPTCHAs, but the integration with
web pages is still a concern. For example, there is no standard graphical symbol for representing
an audio CAPTCHA on a web page, although many schemes, such as those of Microsoft and
reCAPTCHA, use a speaker symbol. More importantly, what really matters for visually impaired
users is that the HTML image alternative text attached to any such symbol should clearly
indicate the need to solve an audio CAPTCHA.

When embedded in web pages, audio CAPTCHAs can also cause compatibility issues. For
example, many such schemes require JavaScript to be enabled. However, some users might prefer
to disable JavaScript in their browsers. Some other schemes can be even worse. For example, we
found that one audio scheme requires Adobe Flash support. With this scheme, vision-impaired
users will not even notice that such a CAPTCHA challenge exists in the page unless Flash is
installed on their computers; apparently, no text alternative is attached to the speaker-like Flash
object either.

CONCLUSION

We believe that the fields of cryptography and artificial intelligence have much to contribute to
one another. CAPTCHAs represent a small example of this possible symbiosis. Reductions, as
they are used in cryptography, can be extremely useful for the progress of algorithmic
development, and CAPTCHAs are crucial to preventing bot attacks. We encourage security
researchers to create CAPTCHAs based on different AI problems; hopefully they will become
more user-friendly to people with disabilities (visual/mental).

CAPTCHAs are mainly implemented using Asynchronous JavaScript and XML (AJAX) together
with a bit of PHP (Hypertext Preprocessor) technology; various algorithms are available.

Bots, and the damage they cause, are not the fault or responsibility of individual users, and
it is unfair to expect them to take the responsibility. They are not the fault of site owners
either, but like it or not they are our responsibility: it is we who suffer from them, we who
benefit from their eradication, and therefore we who should shoulder the burden. Using
interactive authentication systems such as CAPTCHA effectively encourages and motivates
both us and our users.

Developers will try to come up with new and better tests, and spammers will continue to find
ways of cracking them; it's very much a vicious circle. Perhaps, at some point in the future,
somebody will come up with a test that is truly reliable and uncrackable, something that
identifies humans in a way that cannot be faked. Maybe biometric data such as fingerprints or
retina scans could factor into that somewhere; perhaps we'll have direct neural interfaces that
identify the presence of brain activity.

REFERENCES

1. Wikipedia : CAPTCHA

2. Luis von Ahn, Manuel Blum, Nicholas J. Hopper and John Langford. The CAPTCHA

URL: http://www.CAPTCHA.net

3. Nicholas J. Hopper, John Langford and Luis von Ahn. Provably Secure Steganography.
In Advances in Cryptology, CRYPTO '02, volume 2442 of Lecture Notes in Computer
Science, pages 77-92. Santa Barbara, CA, 2002.

4. Greg Mori and Jitendra Malik. Breaking a Visual CAPTCHA.

URL: http://www.cs.berkeley.edu/~mori/gimpy/gimpy.pdf

5. http://www.scottaaronson.com/writings/captcha.html

6. https://www.researchgate.net/figure/285110169/

7. Baffle Texts – CAPTCHA

8. Bongo CAPTCHA

9. http://www.sitepoint.com/better-captcha/

10. http://www.cyclifier.org/project/recaptcha/
