Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
IBM Labs in Haifa © 2004 IBM Corporation
Search and Storage - Coping with One Billion File Filesystems
Benny Mandler, Naama Kraus, Alain Azagury, Michael Factor
IBM Labs in Haifa
© 2004 IBM Corporation
The Problem
� Easier to find a file on the Web written by some kid on the other side of the world than to find a file on one's own computer� This “feels” wrong
� File system directory structure based upon file cabinet metaphor� Each file exists in one place in a fixed hierarchy� To find a file must remember where it was placed
� Metaphor has not scaled with the growth in the number of files� Modern scalable file systems aim for storing a billion files
� Not possible to search through all files to find a specific file� Growth in number of files dictates a paradigm shift from fixed hierarchies to a more
flexible mechanism� Data explosion drives new information organization, retrieval, analysis
and storage paradigms
IBM Labs in Haifa
© 2004 IBM Corporation
� Easier to find a file on the Web written by some kid on the other side of the world than to find a file on one's own computer� This “feels” wrong
� File system directory structure based upon file cabinet metaphor� Each file exists in one place in a fixed hierarchy� To find a file must remember where it was placed
� Metaphor has not scaled with the growth in the number of files� Modern scalable file systems aim for storing a billion files
� Not possible to search through all files to find a specific file� Growth in number of files dictates a paradigm shift from fixed hierarchies to a more
flexible mechanism
� Data explosion drives new information organization, retrieval, analysis and storage paradigms
The Problem
IBM Labs in Haifa
© 2004 IBM Corporation
Searching in a File System?
� Need an additional metaphor� Web metaphor is “wise old librarian”
� Give a search engine information about the items of interest and get back the desired documents
� Known to scale to a billions of files� Two types of search
� “DB-style”� Deterministic, answer exactly what the user asks, all answers returned
� A query for “Enron” will give documents about the company and e-mail from “Joe Enron”
� “Web-style” (“IR-style”)� Heuristic, answer what user wants, ranked results
� A query on “lift” from England would give a greater ranking to document for Otis Elevator than the same query from the USA
� Probably need both types for a file system� Web-style is likely the more significant game changer
IBM Labs in Haifa
© 2004 IBM Corporation
Why is This Not a Solved Problem?
� Several approaches have been proposed as solutions� Content management systems
� Can only provide search for data it manages� What if multiple content managers are used?� What if content is not managed?
� Requires accessing the content through a specific interface� Not a vanilla file system
� Similar arguments apply to enterprise search facilities� “grep”
� Not scaleable� Only “DB-style”
IBM Labs in Haifa
© 2004 IBM Corporation
Our Solution
� Complement file system scalability with a scalable ability to locate information� Augment physical directory structure with a semantic data access mechanism
� Provide a semantic view, via ‘virtual directories’ based upon the content, structure and the metadata of the files
� Support both a search and a browse paradigm� Search
� Context sensitive, free text search, ranked results� Exact match (DB-like) and ranked results (web-like)
� Navigation (Browsing)� Guided ‘search’ - at each point in the hierarchy present all
valid manners in which to continue the navigation� Augments the existing fixed physical hierarchy� Respect file system semantics, in particular, security� Keep fairly consistent with file system activity
IBM Labs in Haifa
© 2004 IBM Corporation
Our Proposed Solution (cont.)
� Accessible via three interfaces: file system, programmatic, and web based� Manifest a multi-dimensional indexing and ranking� Portability
� Architect in a manner that will be easily applicable to many file systems and search engines
� In principle applicable to any file type which index can parse� Can also index any file metadata (has a natural mapping to XML)� Current implementations support XML and Text
IBM Labs in Haifa
© 2004 IBM Corporation
Semantic Access File System (SAFS) Characteristics
� Adds a semantic view of the entire file system� The traditional physical view remains� Realized via a ‘virtual directories’ mechanism� Query language: an XPath variation
� File System and SAFS are loosely coupled via a thin interface� Indexing work done at the background� Access control information is integrated into the indexing system
� Index is a stand alone component� Based upon existing free-text search engines
� Make as easy as Web or DB search� The index is a multi-dimensional index, thus the semantic view is a hierarchical
view which dynamically changes as files are added or removed
IBM Labs in Haifa
© 2004 IBM Corporation
Semantic Access File System (SAFS) on Storage Tank (ST)
� SAFS adds a semantic view of the entire file system�The traditional physical view
remains� ST and SAFS are loosely
coupled via a thin interface�Indexing work done at the
server level�Access control information is
integrated into the indexing system
� Index is a stand alone component�Based upon Juru (part of
Trevi) SAFS
SAN
MD
IPST server
cluster
IBM Labs in Haifa
© 2004 IBM Corporation
SAN
File Update - Flow
SAFS
MDS
MDS
IBM Labs in Haifa
© 2004 IBM Corporation
Readdir - Flow
SAN
SAFS
MDS
MDS
IBM Labs in Haifa
© 2004 IBM Corporation
Security Challenge
� Problem: Need to filter information based on user credentials since virtual directories expose files content
� Solution: Security information is embedded into the index� Indexing phase: Each file is indexed with the security information
associated with it� Query phase: along with a query, client passes credentials of the
user performing the query to SAFS
IBM Labs in Haifa
© 2004 IBM Corporation
Status
� Working prototype as an enhanced Storage Tank File System & an NFS Server, as well as a stand-alone indexing engine are available� Support a wide variety of query types� Automatic and immediate content based indexing� Index based upon
� Content-based hierarchical attribute - value pairs for known file types� Free-text for other file types� File metadata for all file types
� Support both browse and search capabilities� Support XPath-like queries
IBM Labs in Haifa
© 2004 IBM Corporation
Demo – Semantic Access File System on Storage Tank
� Have also a running prototype on an NFS server� A traditional static directory structure� A semantic directory structure based on file’s content
� A dynamic view, created on the fly� Semantic file system navigation
� Walk through the virtual directories� A virtual directory contains all valid ways to continue the search
� Locate a file� A file is located in a virtual directory� That file satisfies the query represented by that directory� No need to remember file’s physical location� Same file may exist in different virtual locations
� Add a new file to the file system� Semantic view is dynamically updated