Parallel Distributed Data Retrieval

Parallel and Distributed Methods for Image Retrieval with Dynamic Feature Extraction on Cluster Architectures

Odej Kao Department of Computer Science, Technical University of Clausthal

Julius-Albert-Strasse 4, D-38678 Clausthal-Zellerfeld, Germany [email protected]

Abstract

The ongoing production of Petabytes of multimedia data per year creates an urgent need for the organisation, management, and retrieval of multimedia information. Related memory, bandwidth, and computational requirements often surpass the capabilities of traditional database systems and computer architectures. Moreover, improved retrieval techniques allow a manual selection of regions of interest, which are subsequently searched in all media in the database by using dynamically extracted features.

This paper presents techniques for parallel multimedia retrieval by considering an image database as an example. The discussed cluster architecture depicts one possible solution for the performance problem. The distribution of the image data over a large number of nodes enables a parallel processing of the compute-intensive operations for dynamic image retrieval. Thus, the partitioning of the data and the applied strategies for workload balancing have a decisive impact on the performance, efficiency, and the usability of such image databases.

1. Introduction

Image management systems are used to organise, manage, and retrieve different classes of images. Pattern recognition systems work with a homogeneous set of images, e.g. images of work pieces on a production line, fingerprints in police files, etc. These are compared to a manually compiled set of patterns in order to check the quality or to identify a person. Thus, the goal is to find one particular image.

Image databases manage large, general sets with pictorial information and allow searches for a number of images that are similar to a given sample image or which satisfy user-defined conditions. The main focus is to restrict the large set in the database to a few suitable images. Subsequently, these results can be used to refine the initial search.

Pattern recognition systems have been used for a long time, e.g. specialised information systems were developed to evaluate medical images, as well as to manage, organise, and retrieve patient data [5]. The importance of image databases has risen enormously in recent years. One of the reasons is the spread of digital technology and multimedia applications producing Petabytes of pictorial material per year. Document libraries offer their multimedia stock all over the world. This is also true for art galleries, museums, research institutions, photo and press agencies, document management systems, trademark databases, and civil services managing many current and archived images. Furthermore, systems are created in combination with applied image processing, where the image database is only part of a more complex system. Some well-known research systems are QBIC [2], SURFIMAGE [9] and VISUALSEEK [13]. A survey of existing image retrieval systems is provided e.g. by VENTERS and COOPER [14].

An image is a complex data structure containing syntactic and semantic information, which can be divided into the following categories:

Raw image data, which is stored in a certain format.

Technical information describes the image resolution, number of used colours, format, etc.

Information derived by image processing comprises extracted features, such as objects and regions, statistical characteristics, topological data, etc.

Knowledge-based information describes the relationship between the image elements and the real world entities, e.g. who or what is shown on the image.

World-oriented information includes attributes like acquisition date, photographer, and manually inserted keywords describing the subjective understanding and impression of a particular viewer.

The raw image data, the technical, and the world-oriented information can be represented by standard data structures and stored in existing databases. The user can either browse or search the database by entering keywords. Examples of such databases with images [11] are widely available on the Internet as a part of entertainment sites, web presentations of museums, art galleries, etc.

1529-4188/01 $10.00 © 2001 IEEE

However, these storage and retrieval methods require laborious construction of appropriate keyword combinations and manual annotation of the images. An image database should therefore support a content-based similarity search on a general set with pictorial information.

2. Image retrieval

The state-of-the-art approach for the creation and retrieval of image databases is based on the comparison of a priori defined features, which can be directly derived from the image raw data when updating the database. The extracted features can be combined and weighted in different ways and describe characteristic content properties.

At query time the user creates a sample sketch or loads an image, which is subsequently processed in the same manner as the images stored in the database. The similarity degree of a query image and the target images is then determined by calculation of a multidimensional distance between the corresponding features using adapted and weighted versions of well-known metrics and functions. Acceptable system response times are achieved, because no further processing of the image raw data is necessary during the retrieval process. The straightforward integration in existing database systems is a further advantage of this approach.
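The feature comparison described above can be illustrated with a minimal sketch. It assumes that every image is already represented by a short numeric feature vector and that similarity is measured with a weighted Euclidean distance; the source does not name the metric, and the function names, weights, and sample vectors are hypothetical.

```python
from math import sqrt

def weighted_distance(query, target, weights):
    """Weighted Euclidean distance between two equally long feature vectors."""
    if not (len(query) == len(target) == len(weights)):
        raise ValueError("feature vectors and weights must have equal length")
    return sqrt(sum(w * (q - t) ** 2 for q, t, w in zip(query, target, weights)))

def rank_images(query, database, weights):
    """Return image ids sorted from most to least similar to the query."""
    return sorted(
        database,
        key=lambda image_id: weighted_distance(query, database[image_id], weights),
    )

# Hypothetical a priori extracted features, one vector per stored image.
features = {
    "img_a": [0.9, 0.1, 0.3],
    "img_b": [0.2, 0.8, 0.5],
    "img_c": [0.85, 0.15, 0.35],
}
ranking = rank_images([0.9, 0.1, 0.3], features, weights=[1.0, 1.0, 0.5])
```

Because only the stored vectors are compared, no raw image data is touched at query time, which is the reason for the short response times mentioned above.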

Extraction of simple features results in a disadvantageous reduction of the image content. Important details like objects, topological information etc. are not sufficiently considered in the retrieval process, making a precise detail search difficult. Furthermore, it is not clear whether the known, relatively simple features can be correctly combined for the retrieval of all kinds of images. Therefore, image retrieval with dynamically extracted features is necessary.

Dynamic retrieval is the process of analysis, extraction, and description of any manually selected image elements, which are subsequently compared to all image sections in the database. Other regions of the query image and the object background are not regarded, so that a detail search can be performed. The selected region of interest is described by a number of features. For example, the following methods can be used:

Template Matching: The region is represented by a minimal bounding rectangle and correlated with all images in the database, e.g. by subtraction of the corresponding colour values. These are added to a sum, which describes the similarity of the examined sections. Distortions caused by rotation and deviations regarding the size, colours, etc. have to be considered.

Wavelet Coefficients: The region is transformed using a wavelet decomposition and described by a vector containing m coefficients, currently m = 64. These are e.g. the largest approximation coefficients, which are then compared to the corresponding coefficients of every analysed section. The sum of all differences gives the similarity degree.

Gabor-Wavelets: These features are often used in the object representation, e.g. in order to track faces in video sequences [8]. A set of manually selected points determines the important image elements and serves as a basis for the comparison.
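The template matching item in the list above can be sketched as a sliding-window comparison. The sketch assumes single-channel images given as nested lists of grey values and uses the sum of absolute differences as the score; it does not handle rotation, scaling, or colour deviations, and all names are hypothetical.

```python
def sad(region, image, top, left):
    """Sum of absolute differences between the region and the image section at (top, left)."""
    return sum(
        abs(region[r][c] - image[top + r][left + c])
        for r in range(len(region))
        for c in range(len(region[0]))
    )

def best_match(region, image):
    """Slide the region over the image; return (score, top, left) of the best section."""
    rows, cols = len(region), len(region[0])
    candidates = [
        (sad(region, image, top, left), top, left)
        for top in range(len(image) - rows + 1)
        for left in range(len(image[0]) - cols + 1)
    ]
    return min(candidates)  # raises ValueError if the region is larger than the image

image = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 9, 0],
    [0, 0, 0, 0],
]
region = [[9, 8], [7, 9]]
match = best_match(region, image)
```

Every image section has to be visited, so the cost grows with the number of pixels. This is the kind of workload that motivates distributing the images over several nodes.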

3. CAIRO - Cluster architecture for image retrieval and organisation

CAIRO is an image database combining standard methods for image description and retrieval with efficient processing on a cluster architecture. The graphical user interface offers various tools for formulating database queries and for visualising the resulting hits such as browsing, SQL interface, image montage, and query-by-example-image / sketch, where the user loads a sample image similar to the one looked for, or sketches a new one. Figure 1 displays the user interface.

Figure 1. Graphical user interface: Sketching tools and browser for the retrieval results


A relational database system manages the technical, world-oriented, and a part of the information extracted from the images, as well as the existing algorithms for feature extraction and comparison. Details on the image size, number of pixels, format, etc. belong to the first group. The information on the size is vital for the approximation of the processing time and thus for the image partitioning and dynamic re-distribution in case of workload balancing. A part of the features is modelled with conventional data structures and stored in the database. Other a priori extracted features are stored as BLOBs, so that only the final storage position is referred to in the database. This holds true for the raw data as well. These are also stored in a downscaled dimension as thumbnails and are used for visualising the query results.

Next to the image information, the existing procedures are managed by the database, too. It is noted which procedures are available for which image types, whether the features are extracted dynamically or a priori, and which programs correspond to them. Further, each operator is assigned a comparison metric that transforms the results of the analysis into a retrieval ranking. To accelerate the evaluation of a priori extracted features, different index structures - like VP trees [4] and VA files [15] - are usable. These remain invisible to the user, however.

The analysis of all image sections in the database requires enormous processing and communication resources. Therefore, the utilisation of parallel computer architectures is necessary for the solution of the performance problem. Clusters have the advantage that each node has its own I/O subsystem, thus the transfer effort is shared by a number of nodes. The reasonable price per node allows the creation of systems with a large number of processing elements.

PFISTER [10] defines a cluster as a parallel or distributed system consisting of a collection of interconnected stand-alone computers and used as a single, unified computing resource. The best-known cluster platform is Beowulf, a trivially reproducible multi-computer architecture built using commodity software and hardware components. A master node controls the cluster and serves files to the client nodes. It is also the console and gateway to the outside world [12].

A Beowulf cluster consisting of symmetric multiprocessors is used for the implementation of the CAIRO system for image retrieval with dynamically extracted features. Based on their functionality the cluster nodes are divided into:

Query stations host the graphical user interface and provide a web-based access to the image database.

Master node controls the cluster, receives the query requests, and broadcasts the algorithms, parameters, and the sample image to the computing nodes. Furthermore, it acts as a redundant storage server and unifies the intermediate results of the computing nodes.

Computing nodes perform the image processing and comparisons. Each of these nodes contains a disjoint subset of the existing images and executes all operations with the data stored on the local devices. The sub results are sent to the master node.

Figure 2 shows a simplified schematic of the implemented cluster architecture.

Figure 2. Schematic of the CAIRO cluster architecture [diagram labels: query access; a priori extracted features; dynamic image retrieval; CPUs; slave node N]

The distribution of the image set across the individual cluster nodes is decisive for the retrieval efficiency. Similar storage sizes of the partitions and thus an even distribution should minimise the time-intensive communication of large image blocks between the cluster nodes.

A partition can consist of multiple image classes, the elements of which differ significantly from other partitions. On the other hand, the images should be characterisable by a shared feature, like images showing landscapes, portraits, etc. A reliable, content-based partitioning of the image set into independent subsets is, however, currently not realisable. This is in particular the case when a general image stock is used. An unsuitable assignment can lead to some images being unfindable, since they are not even considered during the corresponding queries.

This is the reason why the initial CAIRO partitioning of the image set $B$ uses a content-independent, size-based strategy that leads to a set $P = \{P_1, P_2, \ldots, P_n\}$ of partitions with $\forall P_i, P_j \subset B: P_i \cap P_j = \emptyset$, $i \neq j$ and $\mathrm{size}(P_i) \approx \mathrm{size}(P_j)$, $i, j = 1, \ldots, n$. The processing of a partition $P_i$ with an operator $p$ is executed per image, thus the individual operations are independent of one another and the order of execution is irrelevant.
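One way to obtain disjoint partitions of similar storage size is a greedy assignment, sketched below. The source does not specify how the size-based partitioning is computed, so the heuristic (largest image first, always into the currently smallest partition) and all names are assumptions made for illustration.

```python
import heapq

def partition_by_size(image_sizes, n):
    """Split {image_id: size} into n disjoint partitions of similar total size."""
    heap = [(0, index) for index in range(n)]  # (total size so far, partition index)
    heapq.heapify(heap)
    partitions = [[] for _ in range(n)]
    # Largest images first; ties are broken by id to keep the result deterministic.
    for image_id, size in sorted(image_sizes.items(), key=lambda item: (-item[1], item[0])):
        total, index = heapq.heappop(heap)  # currently smallest partition
        partitions[index].append(image_id)
        heapq.heappush(heap, (total + size, index))
    return partitions

sizes = {"a": 40, "b": 30, "c": 30, "d": 20, "e": 20, "f": 10}
parts = partition_by_size(sizes, 3)
totals = [sum(sizes[image_id] for image_id in part) for part in parts]
```

Each image lands in exactly one partition, so the partitions are pairwise disjoint, and the totals are only approximately equal in general.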


4. Parallel execution of retrieval operations

The distribution of the data across a number of nodes enables a retrieval parallelism by executing the same operations on all nodes and only considering the local image subset. Components called transaction, distribution, computation, and result manager are necessary to implement this approach. They are based on the well-known parallel libraries PVM and MPI, which are used for distributed and parallel computations on a network of workstations. An alternative approach is provided among many others by CARINO and STERLING (Teradata multimedia database [3]). A general overview of techniques for parallel databases is given by ABDELGUERFI and WONG [1].
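The idea of running the same operation on every node against its local subset can be imitated on a single machine. In the sketch below, threads from the Python standard library stand in for cluster nodes; this is not PVM or MPI code. Each image is reduced to one hypothetical number so that the comparison stays trivial.

```python
from concurrent.futures import ThreadPoolExecutor

def node_task(local_images, query_value):
    """Work done by one node: score only the images in its local partition."""
    return [(abs(value - query_value), image_id) for image_id, value in local_images.items()]

def parallel_retrieval(partitions, query_value):
    """Run the same operation on every partition; threads stand in for cluster nodes."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        sub_results = list(pool.map(lambda part: node_task(part, query_value), partitions))
    # The master unifies the sub results; a smaller score means a more similar image.
    return sorted(entry for sub in sub_results for entry in sub)

partitions = [
    {"img1": 10, "img2": 55},
    {"img3": 42, "img4": 90},
    {"img5": 47},
]
ranking = parallel_retrieval(partitions, query_value=45)
```

On a real cluster the partitions would stay on the local disks of the nodes and only the query and the sub results would cross the network.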

The functionality of the transaction manager encompasses the analysis of the transformations to be executed and determining the order of the operations. In contrast to a conventional database management system, the data is usually only read, so that no read and write conflicts need to be resolved. The order of operations should be set in a way that the time for the processing and the presenting of the system response is minimised and all suitable images have been considered. A query usually consists of a combination of a priori and dynamically extracted features, which can be composed as follows:

1. A priori extracted features are evaluated in the first phase and a list of all potential hits is constructed. This list is forwarded - together with the extraction algorithms - to the distribution manager, which causes the procedures to be only applied on these images.

2. Inverting the order of operations of (1) leads to the case where the list of potential hits is determined according to the dynamically extracted features, which is then further narrowed down by considering a priori extracted features.

3. Both processing streams can initially be regarded as independent of each other and be executed in parallel. The resulting intermediate lists are transformed into a final ranking by a comparison process.

Each of these possibilities has certain advantages and disadvantages regarding speed of execution and precision. The combination a priori/dynamically extracted features limits the set of images to be analysed dynamically so much that the fastest system processing time is achieved with this method. On the other hand, suitable images could be removed from the result set by imprecise comparisons with the a priori extracted features and are not considered anymore in the second step. This disadvantage is eliminated in the other two approaches, but the necessary processing time clearly grows, as every image needs to be analysed dynamically for each query.
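The trade-off between the first composition and a purely dynamic analysis can be made concrete with a toy example. Each image carries one stored (a priori) value and one raw value that stands in for the data analysed dynamically. The threshold, field names, and numbers are hypothetical.

```python
def a_priori_filter(images, query_feature, threshold):
    """Phase 1: cheap comparison on stored features; returns ids of potential hits."""
    return [
        image_id
        for image_id, img in images.items()
        if abs(img["stored"] - query_feature) <= threshold
    ]

def dynamic_score(img, query_region):
    """Phase 2 stand-in for the expensive analysis of the raw image data."""
    return abs(img["raw"] - query_region)

def query_a_priori_first(images, query_feature, query_region, threshold):
    """Composition 1: dynamic analysis only on the hits of the a priori phase."""
    candidates = a_priori_filter(images, query_feature, threshold)
    ranked = sorted(candidates, key=lambda i: dynamic_score(images[i], query_region))
    return ranked, len(candidates)  # ranking and number of dynamic analyses

def query_dynamic_on_all(images, query_region):
    """Dynamic analysis of every image, as required by the other two compositions."""
    ranked = sorted(images, key=lambda i: dynamic_score(images[i], query_region))
    return ranked, len(images)

images = {
    "x": {"stored": 0.50, "raw": 7},
    "y": {"stored": 0.90, "raw": 5},  # imprecise stored feature, best dynamic match
    "z": {"stored": 0.45, "raw": 9},
}
fast = query_a_priori_first(images, query_feature=0.5, query_region=5, threshold=0.1)
full = query_dynamic_on_all(images, query_region=5)
```

In this example the first composition analyses two images instead of three but never sees image "y", which would have been the best dynamic match.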

The transaction manager also controls the module for dynamic re-distribution of images across the nodes. If only a selection of images needs to be processed, the list is handed to the scheduler, which returns a re-distribution plan for balancing the workload across the cluster.

The distribution manager receives a list of images and the identifier of the extraction and comparison algorithms to be executed as input and generates the program calls for the image analysis and comparison. They are composed according to the PVM and MPI syntax and are sent to all nodes via the communication routines of the active virtual machine.

The computing manager controls the execution of the extraction algorithms with the local data. This process runs on each cluster node and takes care of the communication between the cluster node and the master node. The partitioning of the image data into disjoint sets results in each node composing a ranking of hits; these rankings need to be unified by the result manager in the next step. All features have to be visible for this component. A large communication overhead is generated if the raw data needs to be compared as well, drastically reducing the advantages of the parallel processing.
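Unifying the per-node rankings is a k-way merge of sorted lists. The sketch below uses `heapq.merge` from the Python standard library and assumes that every node delivers its hits as (distance, image id) pairs in ascending order of distance; the data is made up.

```python
import heapq
from itertools import islice

def unify_rankings(node_rankings, k):
    """Merge per-node rankings (each sorted by ascending distance) into a global top k."""
    return list(islice(heapq.merge(*node_rankings), k))

node_rankings = [
    [(0.10, "img7"), (0.40, "img2")],
    [(0.05, "img9"), (0.90, "img4")],
    [(0.30, "img1")],
]
top3 = unify_rankings(node_rankings, 3)
```

Only distances and identifiers travel to the master in this scheme, which avoids the communication overhead that arises when raw data has to be compared centrally.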

The update manager realises the insertion of images: First, the raw image data is transformed into a uniform format and is tagged with a unique identifier. All existing procedures for a priori feature extraction are then applied to this image. Furthermore, the technical and, if existent, world-oriented data is determined and extended by a set of user-defined keywords. All information is composed into a given data structure and stored in the relational database.

The next phase determines the cluster node, on whose hard disk the raw image data is to be stored. In the case of an even data distribution, the image is sent to the node with the smallest data volume. It may be necessary to re-distribute the data to achieve a balanced storage load, if larger images are used. In the last phase the exact image position is stored and the index structures are updated.
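The placement rule for new images, sending the raw data to the node with the smallest data volume, fits in a few lines. Node names, sizes, and the dictionary used as a stand-in for the position stored in the relational database are hypothetical.

```python
def choose_node(volumes):
    """Return the node that currently stores the smallest data volume."""
    return min(volumes, key=lambda node: (volumes[node], node))

def insert_image(volumes, locations, image_id, size):
    """Place the raw data on the least loaded node and record its position."""
    node = choose_node(volumes)
    volumes[node] += size
    locations[image_id] = node
    return node

volumes = {"node1": 500, "node2": 320, "node3": 410}
locations = {}
first = insert_image(volumes, locations, "img_new", 100)
second = insert_image(volumes, locations, "img_big", 300)
```

After the second insertion node3 holds noticeably more data than the others, which illustrates why a re-distribution may become necessary when larger images are added.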

5. Strategies for workload balancing

As already described, the combination of a priori and dynamic feature extraction distorts the initially even distribution of the images over the nodes. In the worst case all images to be processed dynamically are located on a single cluster node, thus no parallel processing can be done. A permanent or temporary migration of images within the cluster in order to balance the workload and to equalise the processing times is required. It can be proven that this problem is NP-complete [7] and thus no exact polynomial algorithm for optimisation exists (unless P = NP).

For all available operators the average processing time per 1000 pixels is determined empirically and stored in the database. With this value the system processing time $t_s$, the optimal processing time $t_{opt}$, and the minimal processing time $t_{min}$ can be calculated for a given image distribution after the evaluation with a priori extracted features.
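The three processing times are not defined formally here. The following sketch therefore rests on assumptions: the system processing time is taken as the load of the slowest node, the optimal time as the average load of a perfectly balanced cluster, and the minimal time as the load of the node that finishes first. The cost of 2.0 ms per 1000 pixels and all names are hypothetical.

```python
def node_times(distribution, pixels, ms_per_1000_pixels):
    """Estimated processing time per node from image sizes and the empirical operator cost."""
    return {
        node: sum(pixels[image_id] for image_id in images) / 1000 * ms_per_1000_pixels
        for node, images in distribution.items()
    }

def processing_times(times):
    """Return (t_s, t_opt, t_min) under the assumptions stated in the lead-in."""
    values = list(times.values())
    t_s = max(values)                  # assumed: the cluster finishes with its slowest node
    t_opt = sum(values) / len(values)  # assumed: perfectly balanced load
    t_min = min(values)                # assumed: first node to run out of local work
    return t_s, t_opt, t_min

pixels = {"a": 400_000, "b": 100_000, "c": 200_000, "d": 100_000}
distribution = {"node1": ["a", "b"], "node2": ["c"], "node3": ["d"]}
times = node_times(distribution, pixels, ms_per_1000_pixels=2.0)
t_s, t_opt, t_min = processing_times(times)
```

The gap between the system time and the optimal time gives a rough measure of how much a re-distribution could gain in this model.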

A heuristic strategy called LTF (Largest Task First) [6] is one possible solution for workload balancing within a cluster-based image database: Large images with long compute times are analysed first and are mainly processed on the local node. Thus, the images on the nodes have to be sorted in descending order according to their size. After $t_{min}$ at least one node idles and an image re-distribution is necessary in order to avoid unused cluster resources. During the communication phase selected images are moved from overloaded to idling or less loaded nodes. Subsequently, the image analysis continues. This approach reduces the transfer effort, as only small images are communicated between the nodes. Furthermore, the LTF strategy has a low complexity of O(n log n). Disadvantages result mainly from the image communication between all nodes at nearly the same time, as the network is usually overloaded, resulting in a high number of access conflicts and long latency.
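The following is a simplified planning sketch in the spirit of LTF, not the algorithm from [6]. Every node sorts its tasks in descending order, and the smallest task of the most loaded node is moved to the least loaded node for as long as this shortens the longest node time. Task times are assumed to be positive, transfer costs are ignored, and all names are hypothetical.

```python
def ltf_rebalance(loads):
    """loads: {node: [task times]}. Returns (balanced loads, list of (task, source, target))."""
    # Largest Task First: every node orders its local tasks by descending time.
    loads = {node: sorted(tasks, reverse=True) for node, tasks in loads.items()}
    moves = []
    while True:
        totals = {node: sum(tasks) for node, tasks in loads.items()}
        source = max(totals, key=lambda n: (totals[n], n))  # most loaded node
        target = min(totals, key=lambda n: (totals[n], n))  # least loaded node
        if not loads[source]:
            break
        task = loads[source][-1]  # smallest task: cheapest to transfer
        # Stop when the move would not leave the target below the source's old load.
        if totals[target] + task >= totals[source]:
            break
        loads[source].pop()
        loads[target].append(task)
        loads[target].sort(reverse=True)
        moves.append((task, source, target))
    return loads, moves

balanced, moves = ltf_rebalance({"n1": [5, 3, 2, 1], "n2": [4], "n3": []})
final_totals = {node: sum(tasks) for node, tasks in balanced.items()}
```

In the example only the small tasks leave their original node, while the largest task on each node stays local, which mirrors the motivation given above.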

The second strategy named RBS (Randomised Block Size) distributes the communication over a longer period of time in order to reduce the signalling conflicts and the network overload. It uses an asynchronous communication, where the transferred images are buffered until the addressed node has finished all processing tasks with the local data and is ready to receive additional images. The execution list $L_i$ of the node $i$ created with the LTF strategy is fragmented into processing, sending, and receiving tasks. These three different types of tasks are combined in order to distribute the communication over the interval $[0, t_{min}]$ and minimise the access conflicts. For this purpose an additional parameter - the so-called Block Size $BS$ - is introduced, which can be adapted to the capabilities of the current system, e.g. network bandwidth, average size of images, etc. This parameter defines how many send tasks may be grouped and executed successively without an intermediate processing task. The communication blocks are subsequently distributed randomly over the processing tasks in the interval $[0, t_{min}]$. The execution lists of the receiving nodes are analogously completed by receiving tasks after the last processing task before $t_{min}$ is reached.
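How send tasks can be grouped into blocks and scattered over the processing tasks of a sending node is sketched below. The sketch assumes that the given processing list contains only tasks scheduled before the minimal processing time, places every block at a distinct random position so that at least one processing task separates two blocks, and leaves out receiving tasks and buffering. All names are hypothetical and the details of the published strategy may differ.

```python
import random

def rbs_schedule(processing, sends, block_size, seed=None):
    """Interleave blocks of at most block_size send tasks at random points among processing tasks."""
    rng = random.Random(seed)
    blocks = [sends[i:i + block_size] for i in range(0, len(sends), block_size)]
    # One distinct slot per block: slot s means "after s processing tasks".
    # rng.sample raises ValueError if there are more blocks than available slots.
    slots = sorted(rng.sample(range(len(processing) + 1), len(blocks)))
    schedule, next_block = [], 0
    for done in range(len(processing) + 1):
        if next_block < len(blocks) and slots[next_block] == done:
            schedule.extend(("send", image) for image in blocks[next_block])
            next_block += 1
        if done < len(processing):
            schedule.append(("process", processing[done]))
    return schedule

schedule = rbs_schedule(["p1", "p2", "p3", "p4"], ["s1", "s2", "s3"], block_size=2, seed=1)
```

A larger block size means fewer but longer communication phases, so the value would have to be tuned to the network bandwidth and the typical image size, as noted above.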

Measurements with the implemented prototype showed a significant improvement of the system processing time. A discussion of these results is presented in [6].

6. Conclusions

This paper gives an overview of different techniques for parallel and distributed multimedia databases like features and retrieval operations, data partitioning and distribution, parallel execution of image analysis and workload balancing by considering the cluster-based image database CAIRO as an example.

Future work includes the development of content-based methods for image partitioning, which could reduce the total number of images to be analysed. Moreover, the workload balancing strategies have to be adapted to heterogeneous architectures and multiple, simultaneous queries. Quality of service, scalability, and reliability aspects of the distributed image databases are further challenges in the research project.

References

[1] M. Abdelguerfi and K.-F. Wong. Parallel Database Techniques. IEEE Computer Society Press, 1998.

[2] J. Ashley. Automatic and semi-automatic methods for image annotation and retrieval in QBIC. In Proceedings of Storage and Retrieval for Image and Video Databases III, volume 2420, pages 24-35, SPIE, 1995.

[3] F. Carino and W. Sterling. Parallel strategies and new concepts for a petabyte multimedia database computer. In Parallel Database Techniques, pages 139-164, IEEE Computer Society, 1998.

[4] T. Chiueh. Content-based image indexing. In Proceedings of the 20th VLDB Conference, pages 582-593, 1994.

[5] H. K. Huang, M. Shiu, and F. R. Suarez. Anatomical cross-sectional geometry and density distribution data base. In S. K. Chang, K. S. Fu (Edts.): Pictorial Information Systems, pages 351-367. Springer, 1980.

[6] O. Kao, G. Steinert, and F. Drews. Scheduling aspects for image retrieval in cluster-based image databases. In Proceedings of IEEE/ACM Symposium on Cluster Computing and Grid (CCGrid 2001), 2001. To be published.

[7] R. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85-104. Plenum Press, 1972.

[8] V. Krueger and G. Sommer. Gabor wavelet networks for object representation. Technical Report 2002, University of Kiel, 2000.

[9] C. Nastar, M. Mitschke, C. Meilhac, and N. Boujemaa. Surfimage: A flexible content-based image retrieval system. In Proceedings of ACM Multimedia, pages 339-344, 1998.

[10] G. F. Pfister. In Search of Clusters. Prentice Hall, 1998.

[11] S. Santini and R. Jain. Image databases and not databases with images. In Proceedings of Conference on Image Analysis and Processing (ICIAP 97), pages 38-48, 1997.

[12] D. Savarese and T. Sterling. Beowulf. In R. Buyya (Edt.): High Performance Cluster Computing - Architectures and Systems, pages 625-645, Prentice Hall, 1999.

[13] J. Smith and S.-F. Chang. Visualseek: a fully automated content-based image query system. In Proceedings of ACM Multimedia, pages 87-98, 1996.

[14] C. Venters and M. Cooper. A review of content-based image retrieval systems. Technical Report jtap-054, University of Manchester, 2000.

[15] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the Conference on Very Large Databases, pages 194-205, 1998.
