Deep Web and Digital Investigations
Damir Delija, Milano 2014
What we will talk about
• Web and the “Deep Web”
• Web and documents
• Definitions
• Technical issues
• Forensic issues
• I’m not an expert on the deep or dark web
• Discussion based on many sources and references
Inaccessible Web
• Deep Web is a name for data on the Internet that is inaccessible to regular search engines
• “Deep Web” sounds much better than “inaccessible web”
• The searchable / accessible web is also called the surface web
• The dark web is the part of the WWW with illegal or immoral content
• The dark web is not the Deep Web; it is partly inside it, but dark pages exist on the surface web too
• Inaccessible resources
  – they exist, but we don’t know about them or their location
  – we can’t use them
• It is an old problem – you have it even in your own room
• Is there any solution?
  – an idea from the Gopher days: Veronica
  – it works well with static pages and data
  – abandoned in the web era; indexing became a source of tremendous power and wealth for search engines
Inaccessible Resources
Web, the Internet, and Documents
• The WWW is not the Internet ☺
  – the full data or document space of each networked computer is also not part of the Internet
• The WWW is a hypertext, document-based structure
  – we have links among documents
  – a document is not necessarily a web page
  – documents must have a presentation layer to be visible through the web interface (a transcription layer, often dynamically generated)
  – links, web pages and documents can be static or dynamically generated
  – dynamic documents exist because of the volume of data (it cannot be organised into static pages)

Definitions are crucial in understanding the deep and surface web
Volume of Data
• For each document there are on average 11 copies in the system – enterprise measurements, pre-SAN calculations
• This shows how the document space expands rapidly
• Even a simple mail can cause data avalanches
• From the surface web point of view? Mostly invisible
• From the Deep Web point of view? Data and document copies are probably floating around, inaccessible to us
Web and Search Engines
• A crawler can only access material that is referenced by a link and is not access protected
• Today we mostly assume that search engine coverage equals the web and the Internet
• To be effective, search engines must have pre-organised data to answer queries (see the crawler sketch below)
• Enormous, constantly changing volume of collected data, plus propagation lag (see http://en.wikipedia.org/wiki/List_of_search_engines)
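A minimal sketch of what a link-space crawler actually does, to make the limitation concrete: it can only enqueue URLs it finds in anchor tags of pages it can fetch without credentials, so anything behind a form, a login or client-side scripting never enters its view. This is illustrative only; the seed URL is a placeholder, and real search engine crawlers add robots.txt handling, politeness, deduplication and ranking.

```python
# Minimal illustration of link-space crawling: the crawler only ever sees
# pages that are (a) reachable by following <a href> links from its seeds and
# (b) served without authentication. Everything else stays "deep".
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=20):
    seen, queue = set(), deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # protected or unreachable -> invisible to this crawler
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))
    return seen

# crawl("http://example.com/")  # placeholder seed URL
```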
Deep Resources
• The Deep Web depends on how search engines acquire and store data
• The web can be crawled or explored as a link space
• Hints are caches, proxies and protocol traffic
• There is no clear boundary between deep resources and surface resources
Uncollectible Resources: Deep Web Resources
• Dynamic web pages – returned in response to a query, or accessible only through a form (see the sketch after this list)
• Unlinked content – pages without any backlinks
• Private web – sites requiring registration and login (password-protected resources)
• Limited-access web – sites with CAPTCHAs or no-cache pragma HTTP headers
• Scripted pages – pages produced by JavaScript, Flash, AJAX, etc.
• Non-HTML content – multimedia files, e.g. images or videos
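To make the first category concrete, here is a hedged sketch of why form-backed pages stay deep: the result page only materialises after a query is submitted, so there is no static link for a crawler to follow. The endpoint URL and the field name `q` are hypothetical placeholders.

```python
# Why form-backed ("dynamic") pages stay deep: the result page only exists
# once a query is submitted, so there is no static link for a crawler to follow.
# The endpoint and field name below are hypothetical placeholders.
from urllib.parse import urlencode
from urllib.request import urlopen

def query_hidden_database(term):
    # A human (or a form-aware crawler) must supply a value; a plain link-space
    # crawler never issues this POST and therefore never indexes the result.
    data = urlencode({"q": term}).encode("ascii")
    with urlopen("http://catalogue.example.org/search", data=data, timeout=10) as resp:
        return resp.read().decode("utf-8", "replace")

# print(query_hidden_database("deep web")[:200])
```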
Uncollectible Resources: Documents and Disk Space
• This comes close to the e-discovery field
• Is this part of the Deep Web?
• Documents not in the web tree
  – accessible only by direct filesystem access
  – or by a dedicated scripting effort
• Files on web-server and non-web-server machines, accessible only by direct filesystem access (see the sketch below)
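A rough sketch of how such “documents not in the web tree” might be enumerated in an e-discovery style sweep, assuming direct access to the server’s document root: walk the filesystem and compare against the set of URL paths a crawl actually reached. The docroot path and the crawled-path set are placeholders.

```python
# Sketch: find files that live under a web server's document root but were
# never reached by a crawl of the site's link space -- "documents not in the
# web tree". The docroot path and the crawled-path set are placeholders.
import os

def unlinked_files(docroot, crawled_paths):
    """Return files on disk whose URL path never appeared in the crawl."""
    orphans = []
    for dirpath, _dirs, files in os.walk(docroot):
        for name in files:
            full = os.path.join(dirpath, name)
            url_path = "/" + os.path.relpath(full, docroot).replace(os.sep, "/")
            if url_path not in crawled_paths:
                orphans.append(full)
    return orphans

# unlinked_files("/var/www/html", {"/index.html", "/about.html"})
```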
Forgotten Data
• From the security aspect, forgotten data is a very interesting part of the Deep Web
• What is forgotten data – maybe data without a custodian?
• Verizon’s 2008 data breach report found that unknown data was part of the breach in 66% of incidents
Data Lifecycle
• Data creation and circulation
• How to find data and correlate it
• Search engines
• Proxies
• Metadata, logs, feeds
• Very interesting ideas in “Programming Collective Intelligence” by Toby Segaran, O'Reilly Media, August 16, 2007
Hidden Data in the Surface Web?
• The web handles data available through HTML and its extensions
• What about metadata and embedded data that are not accessible to search engines?
Surface Web and Deep Issues
• “Hidden Data in Internet Published Documents”
  – deep forensic impact
• Specific data formats can contain embedded elements that are not visible to search engines (see the sketch below)
  – thumbnail views embedded in pictures
  – EXIF data in images
  – metadata in documents
  – steganography
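As a small illustration of embedded data that ordinary indexing ignores, the following sketch dumps EXIF tags from an image using the Pillow library (which must be installed separately); the file name is a placeholder.

```python
# Sketch: pull EXIF metadata out of an image file -- the kind of embedded data
# that stays invisible to ordinary search-engine indexing but matters forensically.
# Requires the Pillow library (pip install Pillow); the file path is a placeholder.
from PIL import Image, ExifTags

def dump_exif(path):
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        tag_name = ExifTags.TAGS.get(tag_id, hex(tag_id))
        print(f"{tag_name}: {value}")

# dump_exif("photo.jpg")  # may reveal camera model, timestamps, GPS data
```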
Idea of Treasure Island
• What is not on the map is unknown
• Hidden, like a treasure island
• The idea of the unexplored and uncharted, with big gains...
• Because of its size, also the idea of an iceberg
Why Does the Deep Web Exist?
• Why do search engines fail? – Technology
• Most web data is behind dynamically generated pages (web gateways)
  – web crawlers cannot reach them, or the data is not announced
  – the data can only be obtained if we have access to the system containing the information
  – forms have to be populated with values
  – it requires understanding the semantics of the web gateway and the data behind it
Measuring the Deep Web
• How to measure? Estimates are based on known examples
• Try to generate pages based on known home pages and explore the link space, based on hop distances
• First attempt: Bergman (2000)
  – the size of the surface web is around 19 TB
  – the size of the Deep Web is around 7,500 TB
  – the Deep Web is nearly 400 times larger than the surface web (7,500 / 19 ≈ 395)
• In 2004 Mitesh classified the Deep Web more accurately
  – most of the HTML forms are within two hops of the home page
Deep Web Size: Current Estimates (2014)
• Deep Web: about 7,500 terabytes
• Surface web: about 19 terabytes
• The Deep Web has between 400 and 550 times more public information than the surface web
• 95% of the Deep Web is publicly accessible
• More than 200,000 Deep Web sites currently exist
• 550 billion documents on the Deep Web
• 1 billion documents on the surface web
History of Deep Web
• Start: static HTML pages that web crawlers can easily reach, only a few CGI scripts
• In the mid-90s: introduction of dynamic pages, generated as the result of a query or link access
• In 1994: Jill Ellsworth used the term “Invisible Web” to refer to these websites
• In 2001: Bergman coined the term “Deep Web”
• The dark web grew in parallel as crime started to spread over the Internet
Rough Timeline
• 2001: Raghavan et al. -> Hidden Web Exposure
  – a domain-specific, human-assisted crawler
• 2002: StumbleUpon used human crawlers
  – human crawlers can find relevant links that algorithmic crawlers miss
• 2003: Bergman introduced LexiBot
  – used for quantifying the Deep Web
• 2004: Yahoo! Content Acquisition Program
  – paid inclusion for webmasters
• 2005: Yahoo! Subscriptions
  – Yahoo! started searching subscription-only sites
• 2005: Ntoulas et al. -> Hidden Web Crawler
  – automatically generated meaningful queries to issue against search forms
• 2005: Google Sitemaps
  – allows webmasters to inform search engines about URLs on their websites that are available for crawling
• Then: Web 2.0 infrastructure
• Today: mobile devices and the Internet of Things
  – each gadget can have (and has) a web server for configuration
Forensic Issues
From the Digital Forensic Viewpoint
• Is there a way to carry out forensically sound actions on the Deep Web?
• Can we apply standard digital forensic procedures and best practices?
• In both cases, yes
  – we are always limited in digital forensics, but that does not prevent reliable results
Web and Digital Forensics
• The web is the web ☺
• Web artifacts are web artifacts
• The type of investigation determines how we handle web data
  – the key element is the legal framework
• Many possible scenarios and situations
  – follow the forensic principles and best practices as in any other situation
  – use the scientific method
  – test and experiment to prove the method
Deep Web and Forensic Tasks
• How to prove access to Deep Web resources
  – the same as for ordinary resources, because access mostly goes through browsers
  – an advantage over blind Deep Web access, since history, cache and log artifacts show which Deep Web resource was accessed (see the sketch after this list)
• Deep Web artifacts
  – mostly like any other web artifacts
  – hidden data in Internet-published documents
  – the dark web as a specific subrange
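A hedged sketch of the kind of artifact work meant here: reading visited URLs from a copy of a Firefox profile's places.sqlite database. It assumes the standard moz_places schema (url, title, last_visit_date in microseconds since the epoch) and should always be run against a forensic copy, never the live profile.

```python
# Sketch: list visited URLs from a copy of Firefox's places.sqlite,
# the kind of artifact that shows which (deep) web resource was accessed.
# Schema assumption: moz_places(url, title, last_visit_date in microseconds).
import sqlite3
from datetime import datetime, timezone

def firefox_history(places_copy):
    con = sqlite3.connect(places_copy)
    rows = con.execute(
        "SELECT url, title, last_visit_date FROM moz_places "
        "WHERE last_visit_date IS NOT NULL ORDER BY last_visit_date DESC"
    ).fetchall()
    con.close()
    history = []
    for url, title, last_visit in rows:
        when = datetime.fromtimestamp(last_visit / 1_000_000, tz=timezone.utc)
        history.append((when.isoformat(), url, title))
    return history

# for when, url, title in firefox_history("evidence/places.sqlite"):
#     print(when, url, title)
```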
Forensic Tools Issues
• Forensics of specialised browsers and access tools
  – Tor / .onion
  – unusual browsers and access tools: links, lynx, wget
  – other networks: I2P, Freenet (see the triage sketch below)
• Key question: does our forensic framework support such tools?
  – Internet Evidence Finder, EnCase, FTK
  – if not, how do we handle the artifacts and data?
• What about mobile devices?
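When the main forensic suite has no dedicated Tor or I2P artifact support, a quick triage step is still possible on whatever URL lists other tools recover. The sketch below simply flags hidden-service hostnames (.onion, .i2p); it is an assumption-laden helper, not a substitute for proper dark-web artifact parsing.

```python
# Sketch: quick triage of recovered URLs for hidden-service addresses (.onion, .i2p)
# when the main forensic suite has no dedicated Tor/I2P artifact support.
from urllib.parse import urlparse

HIDDEN_SUFFIXES = (".onion", ".i2p")

def flag_hidden_services(urls):
    flagged = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.endswith(HIDDEN_SUFFIXES):
            flagged.append(url)
    return flagged

# flag_hidden_services(["http://example.com/", "http://abcdefghij.onion/page"])
```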
Conclusion and Questions
• A challenging field
• Size will grow with the IPv6 takeover and the "Internet of Things" concept
• The cloud concept is important (size, access, legal issues)
• Each new technology will add a new layer of invisibility, i.e. complexity
• The sheer size of available data simply forces the use of dynamic web pages
References
Too many links ...
• http://papergirls.wordpress.com/2008/10/07/timeline-deep-web
• http://deepwebtechblog.com/federated-search-finds-content-that-google-can’t-reach-part-i-of-iii
• http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii
• http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
• http://www.online-college-blog.com/features/100-useful-tips-and-tools-to-research-the-deep-web/