IBM Labs in Haifa © 2004 IBM Corporation Search and Storage - Coping with One Billion File Filesystems Benny Mandler, Naama Kraus, Alain Azagury, Michael Factor




The Problem

• Easier to find a file on the Web written by some kid on the other side of the world than to find a file on one's own computer
  • This “feels” wrong
• File system directory structure based upon file cabinet metaphor
  • Each file exists in one place in a fixed hierarchy
  • To find a file, one must remember where it was placed
• Metaphor has not scaled with the growth in the number of files
  • Modern scalable file systems aim for storing a billion files
  • Not possible to search through all files to find a specific file
• Growth in number of files dictates a paradigm shift from fixed hierarchies to a more flexible mechanism
• Data explosion drives new information organization, retrieval, analysis and storage paradigms



Searching in a File System?

• Need an additional metaphor
  • Web metaphor is “wise old librarian”
    • Give a search engine information about the items of interest and get back the desired documents
    • Known to scale to billions of files
• Two types of search
  • “DB-style”
    • Deterministic, answer exactly what the user asks, all answers returned
    • A query for “Enron” will give documents about the company and e-mail from “Joe Enron”
  • “Web-style” (“IR-style”)
    • Heuristic, answer what the user wants, ranked results
    • A query on “lift” from England would give a greater ranking to a document for Otis Elevator than the same query from the USA
• Probably need both types for a file system
  • Web-style is likely the more significant game changer
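The difference between the two search types can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the function names `db_style_search` and `web_style_search`, and the term-frequency-plus-boost scoring are invented here and are not taken from SAFS or any real search engine.

```python
from collections import Counter

# Toy corpus (hypothetical paths and contents, for illustration only).
FILES = {
    "/docs/enron_report.txt": "enron quarterly report",
    "/mail/joe.txt": "mail from joe enron about lunch",
    "/docs/otis.txt": "otis elevator lift maintenance manual",
    "/notes/gym.txt": "lift weights then lift more and lift again",
}


def db_style_search(term):
    """Deterministic: return every file containing the term, unranked."""
    return sorted(path for path, text in FILES.items() if term in text.split())


def web_style_search(term, boost=None):
    """Heuristic: rank matches by term frequency plus optional context boosts.

    `boost` maps a context word to extra score, standing in for signals
    such as the locale of the user issuing the query.
    """
    boost = boost or {}
    scored = []
    for path, text in FILES.items():
        words = text.split()
        tf = Counter(words)[term]
        if tf == 0:
            continue  # a file that does not mention the term is never returned
        score = tf + sum(w for word, w in boost.items() if word in words)
        scored.append((score, path))
    # Highest score first; ties broken by path for a stable order.
    return [path for score, path in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    print(db_style_search("enron"))
    print(web_style_search("lift"))
    print(web_style_search("lift", boost={"elevator": 5}))
```

The DB-style call returns both the company report and the mail from “Joe Enron”, because it answers exactly what was asked. The Web-style call changes its ordering once a context boost is supplied, which mimics the “lift” example: the same query ranks the elevator document higher for one user than for another.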


Why is This Not a Solved Problem?

• Several approaches have been proposed as solutions
• Content management systems
  • Can only provide search for data they manage
    • What if multiple content managers are used?
    • What if content is not managed?
  • Require accessing the content through a specific interface
    • Not a vanilla file system
  • Similar arguments apply to enterprise search facilities
• “grep”
  • Not scalable
  • Only “DB-style”


Our Solution

• Complement file system scalability with a scalable ability to locate information
• Augment physical directory structure with a semantic data access mechanism
  • Provide a semantic view, via ‘virtual directories’, based upon the content, structure and metadata of the files
• Support both a search and a browse paradigm
  • Search
    • Context sensitive, free text search, ranked results
    • Exact match (DB-like) and ranked results (web-like)
  • Navigation (Browsing)
    • Guided ‘search’: at each point in the hierarchy, present all valid ways to continue the navigation
    • Augments the existing fixed physical hierarchy
• Respect file system semantics, in particular, security
• Keep fairly consistent with file system activity
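Guided navigation through virtual directories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, the attribute names, and the helpers `matching_files` and `list_virtual_dir` are hypothetical, and a virtual directory is modelled simply as a list of attribute=value constraints rather than in whatever form SAFS uses.

```python
# Hypothetical metadata for three files (names and attributes are invented).
FILES = {
    "report.xml": {"author": "alice", "type": "xml", "year": "2004"},
    "notes.txt": {"author": "bob", "type": "text", "year": "2004"},
    "plan.xml": {"author": "bob", "type": "xml", "year": "2003"},
}


def matching_files(constraints):
    """Files whose metadata satisfies every (attribute, value) constraint."""
    return sorted(
        name
        for name, attrs in FILES.items()
        if all(attrs.get(key) == value for key, value in constraints)
    )


def list_virtual_dir(constraints):
    """List a virtual directory: valid refinements plus the matching files.

    A refinement is offered only if at least one matching file carries it,
    so every listed sub-directory is guaranteed to be non-empty.
    """
    files = matching_files(constraints)
    used = {key for key, _ in constraints}
    refinements = sorted(
        {
            f"{key}={value}"
            for name in files
            for key, value in FILES[name].items()
            if key not in used
        }
    )
    return refinements, files


if __name__ == "__main__":
    print(list_virtual_dir([]))
    print(list_virtual_dir([("author", "bob")]))
    print(list_virtual_dir([("author", "bob"), ("type", "xml")]))
```

Each listing is computed from the current set of files, so adding or removing a file changes what the next listing shows without any directory being stored. The same file also appears under several constraint paths, for example under both `author=bob` and `type=xml`.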


Our Proposed Solution (cont.)

• Accessible via three interfaces: file system, programmatic, and web based
• Manifest a multi-dimensional indexing and ranking
• Portability
  • Architect in a manner that will be easily applicable to many file systems and search engines
  • In principle applicable to any file type which the index can parse
  • Can also index any file metadata (has a natural mapping to XML)
  • Current implementations support XML and text


Semantic Access File System (SAFS) Characteristics

• Adds a semantic view of the entire file system
  • The traditional physical view remains
  • Realized via a ‘virtual directories’ mechanism
  • Query language: an XPath variation
• File system and SAFS are loosely coupled via a thin interface
  • Indexing work done in the background
  • Access control information is integrated into the indexing system
• Index is a stand-alone component
  • Based upon existing free-text search engines
• Make file search as easy as Web or DB search
• The index is a multi-dimensional index, thus the semantic view is a hierarchical view which dynamically changes as files are added or removed


Semantic Access File System (SAFS) on Storage Tank (ST)

• SAFS adds a semantic view of the entire file system
  • The traditional physical view remains
• ST and SAFS are loosely coupled via a thin interface
  • Indexing work done at the server level
  • Access control information is integrated into the indexing system
• Index is a stand-alone component
  • Based upon Juru (part of Trevi)

[Diagram labels: SAFS, SAN, MD, IP, ST server cluster]


File Update - Flow

[Diagram labels: SAN, SAFS, MDS, MDS]


Readdir - Flow

[Diagram labels: SAN, SAFS, MDS, MDS]


Security Challenge

• Problem: need to filter information based on user credentials, since virtual directories expose file content
• Solution: security information is embedded into the index
  • Indexing phase: each file is indexed with the security information associated with it
  • Query phase: along with a query, the client passes the credentials of the user performing the query to SAFS
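The two phases can be sketched as follows. This is a minimal illustration, not the SAFS implementation: the `SecureIndex` class, its method names, and the model of security information as a set of principals allowed to read a file are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    path: str
    terms: frozenset    # words found in the file content
    readers: frozenset  # principals (users or groups) allowed to read the file


class SecureIndex:
    """Toy index that stores access information next to each entry."""

    def __init__(self):
        self._entries = []

    def add(self, path, text, readers):
        # Indexing phase: content and security information are stored together.
        self._entries.append(
            IndexEntry(path, frozenset(text.lower().split()), frozenset(readers))
        )

    def query(self, term, credentials):
        # Query phase: the caller's credentials travel with the query, and
        # entries the caller may not read are filtered out of the result.
        creds = set(credentials)
        return sorted(
            entry.path
            for entry in self._entries
            if term.lower() in entry.terms and entry.readers & creds
        )


if __name__ == "__main__":
    index = SecureIndex()
    index.add("/hr/salaries.txt", "salary budget", {"hr"})
    index.add("/pub/budget.txt", "budget overview", {"hr", "staff"})
    print(index.query("budget", {"staff"}))
    print(index.query("budget", {"hr"}))
```

Filtering inside the index means a restricted file never appears in a result set or a virtual directory listing for a user who could not open it, so the existence and content of the file are not leaked through the semantic view.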


Status

• Working prototypes are available as an enhanced Storage Tank file system and an NFS server, as well as a stand-alone indexing engine
• Support a wide variety of query types
• Automatic and immediate content-based indexing
• Index based upon
  • Content-based hierarchical attribute-value pairs for known file types
  • Free text for other file types
  • File metadata for all file types
• Support both browse and search capabilities
• Support XPath-like queries
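One way to picture hierarchical attribute-value pairs is to flatten an XML file into (path, value) pairs and index those. The sketch below uses Python's standard `xml.etree.ElementTree`; the `extract_pairs` and `PairIndex` names, the slash-separated path form, and the sample documents are assumptions for illustration and do not reproduce the actual SAFS query syntax.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_pairs(xml_text):
    """Flatten an XML document into (hierarchical path, value) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        text = (elem.text or "").strip()
        if text:
            pairs.append((path, text))
        for child in elem:
            walk(child, path)

    walk(root, "")
    return pairs


class PairIndex:
    """Toy inverted index from (path, value) pairs to file names."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, name, xml_text):
        for pair in extract_pairs(xml_text):
            self._postings[pair].add(name)

    def lookup(self, path, value):
        # Hypothetical path-plus-value query, loosely in the spirit of XPath.
        return sorted(self._postings.get((path, value), set()))


if __name__ == "__main__":
    index = PairIndex()
    index.add_file("a.xml", "<memo><author>alice</author><topic>storage</topic></memo>")
    index.add_file("b.xml", "<memo><author>bob</author><topic>storage</topic></memo>")
    print(index.lookup("/memo/topic", "storage"))
    print(index.lookup("/memo/author", "alice"))
```

Because the path is part of the key, the same word can be distinguished by where it occurs in the document structure, which plain free-text indexing cannot do.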


Demo – Semantic Access File System on Storage Tank

• A running prototype also exists on an NFS server
• A traditional static directory structure
• A semantic directory structure based on file content
  • A dynamic view, created on the fly
• Semantic file system navigation
  • Walk through the virtual directories
  • A virtual directory contains all valid ways to continue the search
• Locate a file
  • A file is located in a virtual directory
  • That file satisfies the query represented by that directory
  • No need to remember the file's physical location
  • Same file may exist in different virtual locations
• Add a new file to the file system
  • Semantic view is dynamically updated