Deep Web and Digital Investigations
Damir Delija, Milano 2014
What we will talk about
• Web and the “Deep Web”
• Web and documents
• Definitions
• Technical issues
• Forensic issues
• I’m not an expert on the deep or dark web
• Discussion based on many sources and references
Inaccessible Web
• Deep Web is a name for data on the Internet that is inaccessible to regular search engines
• “Deep Web” sounds much better than “inaccessible web”
• The searchable / accessible web is also called the surface web
• The dark web is the part of the WWW with illegal or immoral content
• The dark web is not the Deep Web; it is partly inside it, but dark pages exist on the surface web too
• Inaccessible resources
  – they exist, but we don’t know about them or their location
  – we can’t use them
• It is an old problem – you have it even in your own room
• Is there any solution?
  – an idea from the Gopher days: Veronica
  – it works well with static pages and data
  – abandoned in the web era; indexing became a source of tremendous power and wealth for search engines
Inaccessible Resources
Web, the Internet, and Documents
• The WWW is not the Internet ☺
  – the full data or document space of each networked computer is also not part of the Internet
• The WWW is a hypertext, document-based structure
  – we have links among documents
  – a document is not necessarily a web page
  – documents must have a presentation layer to be visible through the web interface (a transcription layer, often dynamically generated)
  – links, web pages and documents can be static or dynamically generated
  – dynamic documents exist because of the volume of data (it cannot be organised into static pages)

Definitions are crucial in understanding the deep and surface web
Volume of Data
• For each document there are on average 11 copies in the system – enterprise measurements, pre-SAN calculations
• This shows how the document space expands rapidly
• Even a simple mail can cause data avalanches
• From the surface web point of view? Mostly invisible
• From the Deep Web point of view? Data and document copies are probably floating around, inaccessible to us
Web and Search Engines
• A crawler can only access material that is referenced by a link and is not access protected
• Today we mostly assume that search engine coverage equals the web and the Internet
• To be effective, search engines must have pre-organised data to answer queries (see the crawler sketch below)
• Enormous, constantly changing volume of collected data, plus propagation lag (see http://en.wikipedia.org/wiki/List_of_search_engines)
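A minimal sketch of what a link-space crawler actually does, to make the limitation concrete: it can only enqueue URLs it finds in anchor tags of pages it can fetch without credentials, so anything behind a form, a login or client-side scripting never enters its view. This is illustrative only; the seed URL is a placeholder, and real search engine crawlers add robots.txt handling, politeness, deduplication and ranking.

```python
# Minimal illustration of link-space crawling: the crawler only ever sees
# pages that are (a) reachable by following <a href> links from its seeds and
# (b) served without authentication. Everything else stays "deep".
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=20):
    seen, queue = set(), deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # protected or unreachable -> invisible to this crawler
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))
    return seen

# crawl("http://example.com/")  # placeholder seed URL
```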
Deep Resources
• The Deep Web depends on how search engines acquire and store data
• The web can be crawled or explored as a link space
• Hints are caches, proxies and protocol traffic
• There is no clear boundary between deep resources and surface resources
Uncollectible Resources: Deep Web Resources
• Dynamic web pages – returned in response to a query, or accessible only through a form (see the sketch after this list)
• Unlinked content – pages without any backlinks
• Private web – sites requiring registration and login (password-protected resources)
• Limited-access web – sites with CAPTCHAs or no-cache pragma HTTP headers
• Scripted pages – pages produced by JavaScript, Flash, AJAX, etc.
• Non-HTML content – multimedia files, e.g. images or videos
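To make the first category concrete, here is a hedged sketch of why form-backed pages stay deep: the result page only materialises after a query is submitted, so there is no static link for a crawler to follow. The endpoint URL and the field name `q` are hypothetical placeholders.

```python
# Why form-backed ("dynamic") pages stay deep: the result page only exists
# once a query is submitted, so there is no static link for a crawler to follow.
# The endpoint and field name below are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import urlopen

def query_hidden_database(term):
    # A human (or a form-aware crawler) must supply a value; a plain link-space
    # crawler never issues this POST and therefore never indexes the result.
    data = urlencode({"q": term}).encode("ascii")
    with urlopen("http://catalogue.example.org/search", data=data, timeout=10) as resp:
        return resp.read().decode("utf-8", "replace")

# print(query_hidden_database("deep web")[:200])
```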
Uncollectible Resources: Documents and Disk Space
• This comes close to the e-discovery field
• Is this part of the Deep Web?
• Documents not in the web tree
  – accessible only by direct filesystem access
  – or by a dedicated scripting effort
• Files on web-server and non-web-server machines, accessible only by direct filesystem access (see the sketch below)
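A rough sketch of how such “documents not in the web tree” might be enumerated in an e-discovery style sweep, assuming direct access to the server’s document root: walk the filesystem and compare against the set of URL paths a crawl actually reached. The docroot path and the crawled-path set are placeholders.

```python
# Sketch: find files that live under a web server's document root but were
# never reached by a crawl of the site's link space -- "documents not in the
# web tree". The docroot path and the crawled-path set are placeholders.
import os

def unlinked_files(docroot, crawled_paths):
    """Return files on disk whose URL path never appeared in the crawl."""
    orphans = []
    for dirpath, _dirs, files in os.walk(docroot):
        for name in files:
            full = os.path.join(dirpath, name)
            url_path = "/" + os.path.relpath(full, docroot).replace(os.sep, "/")
            if url_path not in crawled_paths:
                orphans.append(full)
    return orphans

# unlinked_files("/var/www/html", {"/index.html", "/about.html"})
```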
Forgotten Data
• From the security aspect, forgotten data is a very interesting part of the Deep Web
• What is forgotten data – maybe data without a custodian?
• Verizon’s 2008 data breach report found that unknown data was part of the breach in 66% of incidents
Data Lifecycle
• Data creation and circulation
• How to find data and correlate it
• Search engines
• Proxies
• Metadata, logs, feeds
• Very interesting ideas in “Programming Collective Intelligence” by Toby Segaran, O'Reilly Media, August 16, 2007
Hidden Data in the Surface Web?
• The web handles data available through HTML and its extensions
• What about metadata and embedded data that are not accessible to search engines?
Surface Web and Deep Issues
• “Hidden Data in Internet Published Documents”
  – deep forensic impact
• Specific data formats can contain embedded elements that are not visible to search engines (see the sketch below)
  – thumbnail views embedded in pictures
  – EXIF data in images
  – metadata in documents
  – steganography
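As a small illustration of embedded data that ordinary indexing ignores, the following sketch dumps EXIF tags from an image using the Pillow library (which must be installed separately); the file name is a placeholder.

```python
# Sketch: pull EXIF metadata out of an image file -- the kind of embedded data
# that stays invisible to ordinary search-engine indexing but matters forensically.
# Requires the Pillow library (pip install Pillow); the file path is a placeholder.
from PIL import Image, ExifTags

def dump_exif(path):
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        tag_name = ExifTags.TAGS.get(tag_id, hex(tag_id))
        print(f"{tag_name}: {value}")

# dump_exif("photo.jpg")  # may reveal camera model, timestamps, GPS data
```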
Idea of Treasure Island
• What is not on the map is unknown
• Hidden, like a treasure island
• The idea of the unexplored and uncharted, with big gains...
• Because of its size, also the idea of an iceberg
Why Does the Deep Web Exist?
• Why do search engines fail? – Technology
• Most web data is behind dynamically generated pages (web gateways)
  – web crawlers cannot reach them, or the data is not announced
  – the data can only be obtained if we have access to the system containing the information
  – forms have to be populated with values
  – it requires understanding the semantics of the web gateway and the data behind it
Measuring the Deep Web
• How to measure? Estimates are based on known examples
• Try to generate pages based on known home pages and explore the link space, based on hop distances
• First attempt: Bergman (2000)
  – the size of the surface web is around 19 TB
  – the size of the Deep Web is around 7,500 TB
  – the Deep Web is nearly 400 times larger than the surface web (7,500 / 19 ≈ 395)
• In 2004 Mitesh classified the Deep Web more accurately
  – most of the HTML forms are within two hops of the home page
Deep Web Size: Current Estimates (2014)
• Deep Web: about 7,500 terabytes
• Surface web: about 19 terabytes
• The Deep Web has between 400 and 550 times more public information than the surface web
• 95% of the Deep Web is publicly accessible
• More than 200,000 Deep Web sites currently exist
• 550 billion documents on the Deep Web
• 1 billion documents on the surface web
History of Deep Web
• Start: static HTML pages that web crawlers can easily reach, only a few CGI scripts
• In the mid-90s: introduction of dynamic pages, generated as the result of a query or link access
• In 1994: Jill Ellsworth used the term “Invisible Web” to refer to these websites
• In 2001: Bergman coined the term “Deep Web”
• The dark web grew in parallel as crime started to spread over the Internet
Rough Timeline
• 2001: Raghavan et al. -> Hidden Web Exposure
  – a domain-specific, human-assisted crawler
• 2002: StumbleUpon used human crawlers
  – human crawlers can find relevant links that algorithmic crawlers miss
• 2003: Bergman introduced LexiBot
  – used for quantifying the Deep Web
• 2004: Yahoo! Content Acquisition Program
  – paid inclusion for webmasters
• 2005: Yahoo! Subscriptions
  – Yahoo! started searching subscription-only sites
• 2005: Ntoulas et al. -> Hidden Web Crawler
  – automatically generated meaningful queries to issue against search forms
• 2005: Google Sitemaps
  – allows webmasters to inform search engines about URLs on their websites that are available for crawling
• Then: Web 2.0 infrastructure
• Today: mobile devices and the Internet of Things
  – each gadget can have (and has) a web server for configuration
Forensic Issues
From the Digital Forensic Viewpoint
• Is there a way to carry out forensically sound actions on the Deep Web?
• Can we apply standard digital forensic procedures and best practices?
• In both cases, yes
  – we are always limited in digital forensics, but that does not prevent reliable results
Web and Digital Forensics
• The web is the web ☺
• Web artifacts are web artifacts
• The type of investigation determines how we handle web data
  – the key element is the legal framework
• Many possible scenarios and situations
  – follow the forensic principles and best practices as in any other situation
  – use the scientific method
  – test and experiment to prove the method
Deep Web and Forensic Tasks
• How to prove access to Deep Web resources
  – the same as for ordinary resources, because access mostly goes through browsers
  – an advantage over blind Deep Web access, since history, cache and log artifacts show which Deep Web resource was accessed (see the sketch after this list)
• Deep Web artifacts
  – mostly like any other web artifacts
  – hidden data in Internet-published documents
  – the dark web as a specific subrange
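A hedged sketch of the kind of artifact work meant here: reading visited URLs from a copy of a Firefox profile's places.sqlite database. It assumes the standard moz_places schema (url, title, last_visit_date in microseconds since the epoch) and should always be run against a forensic copy, never the live profile.

```python
# Sketch: list visited URLs from a copy of Firefox's places.sqlite,
# the kind of artifact that shows which (deep) web resource was accessed.
# Schema assumption: moz_places(url, title, last_visit_date in microseconds).
import sqlite3
from datetime import datetime, timezone

def firefox_history(places_copy):
    con = sqlite3.connect(places_copy)
    rows = con.execute(
        "SELECT url, title, last_visit_date FROM moz_places "
        "WHERE last_visit_date IS NOT NULL ORDER BY last_visit_date DESC"
    ).fetchall()
    con.close()
    history = []
    for url, title, last_visit in rows:
        when = datetime.fromtimestamp(last_visit / 1_000_000, tz=timezone.utc)
        history.append((when.isoformat(), url, title))
    return history

# for when, url, title in firefox_history("evidence/places.sqlite"):
#     print(when, url, title)
```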
Forensic Tools Issues
• Forensics of specialised browsers and access tools
  – Tor / .onion
  – unusual browsers and access tools: links, lynx, wget
  – other networks: I2P, Freenet (see the triage sketch below)
• Key question: does our forensic framework support such tools?
  – Internet Evidence Finder, EnCase, FTK
  – if not, how do we handle the artifacts and data?
• What about mobile devices?
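When the main forensic suite has no dedicated Tor or I2P artifact support, a quick triage step is still possible on whatever URL lists other tools recover. The sketch below simply flags hidden-service hostnames (.onion, .i2p); it is an assumption-laden helper, not a substitute for proper dark-web artifact parsing.

```python
# Sketch: quick triage of recovered URLs for hidden-service addresses (.onion, .i2p)
# when the main forensic suite has no dedicated Tor/I2P artifact support.
from urllib.parse import urlparse

HIDDEN_SUFFIXES = (".onion", ".i2p")

def flag_hidden_services(urls):
    flagged = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.endswith(HIDDEN_SUFFIXES):
            flagged.append(url)
    return flagged

# flag_hidden_services(["http://example.com/", "http://abcdefghij.onion/page"])
```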
Conclusion and Questions
• A challenging field
• Size will grow with the IPv6 takeover and the "Internet of Things" concept
• The cloud concept is important (size, access, legal issues)
• Each new technology will add a new layer of invisibility, i.e. complexity
• The sheer size of available data simply forces the use of dynamic web pages
References
Too many links ...
• http://papergirls.wordpress.com/2008/10/07/timeline-deep-web
• http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii
• http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii
• http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
• http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/