
A Dynamic Workload Balancing Technique of a Text Matching Algorithm on a Cluster

ODYSSEAS EFREMIDES
University of Hertfordshire, Hatfield, UK,
in collaboration with IST Studies
Dept. of Computer Science
72 Pireos Str., 183 46, Moschato, Athens
GREECE
[email protected]

GEORGE IVANOV
University of Hertfordshire, Hatfield, UK,
in collaboration with IST Studies
Dept. of Computer Science
72 Pireos Str., 183 46, Moschato, Athens
GREECE
[email protected]

Abstract: A dynamic workload allocation model which utilizes a data pool manager is investigated herein. It aims at heterogeneous multicomputer environments and the implementation is written in C using the MPICH NT 1.2.5 message passing interface for Microsoft Windows based clusters. The algorithm utilized is the data parallel adaptation of the Brute Force exact pattern matching algorithm. Performance evaluation for both homogeneous and heterogeneous system configurations is experimentally investigated.

Key–Words: Parallel pattern matching, Heterogeneous clusters, Workload Allocation, Data pool model

1 Introduction
Applications designed for heterogeneous systems built on a cluster of workstations must implement a mechanism for dynamic workload allocation based on the processing capacity of each node. Nevertheless, for large scale systems consisting of a variety of workstations, predetermining a specific workload amount for each node based on its processing capabilities can become cumbersome.

An automated dynamic workload allocation mechanism that utilizes a data pool manager for the parallel exact string matching problem over a heterogeneous cluster of workstations is investigated herein. The pattern matching algorithm employed is a data parallel adaptation of the well-known sequential Brute Force [3] [4] algorithm. The implementation is tested and evaluated in homogeneous and heterogeneous environments with an adjustable number of workstations and is compared to a simple static workload allocation version [12].
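For reference, a minimal sketch of the sequential Brute Force scan that each node could apply to its assigned text is shown below; the function and variable names are illustrative and not taken from the original implementation.

/* Brute Force exact matching: check every alignment of the pattern   */
/* (length m) against the text (length n) and count the occurrences.  */
static long count_occurrences(const char *text, long n,
                              const char *pattern, long m)
{
    long i, j, count = 0;
    for (i = 0; i + m <= n; i++) {
        for (j = 0; j < m && text[i + j] == pattern[j]; j++)
            ;
        if (j == m)
            count++;            /* full match found at position i */
    }
    return count;
}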

2 Data Pool Workload Management
The dynamic workload allocation mechanism follows a different approach in the utilization of available data parallelism compared to the static relaxed-parallel implementation [12]. Instead of partitioning the data according to the number of cluster nodes and then assigning a part of roughly the same size to each node to be processed during the computation phase, the data are split into packets that form a pool covering the entire search file. Worker nodes cannot use this data without coordination by the pool manager. Note that all packets in the pool have the same size, with the exception of the last one, whose size may be smaller than the selected value.

Information concerning the data packets can only be found at the pool manager, whose role is undertaken by the root node. To manage this data, a dynamic packet availability table is built and stored locally, based on the size of the search file and the predefined packet size. Each table position acts as an identifier of a packet. If that specific packet is available for processing, the position is appropriately marked, while packets that have been scanned for occurrences by a node are marked as unavailable. Hence, the program returns the total number of pattern occurrences within the search file when all packets have been processed and the pool is therefore empty. Since this table is only available at the manager, the packet size should be large enough that it does not incur too much network traffic due to frequent requests for identifiers. At the same time it has to be small enough that the processing time for each packet is not excessive on slow nodes.
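A minimal sketch of how such a packet availability table might be built and queried at the pool manager is given below; the structure, names and marking scheme are assumptions for illustration, not code from the paper.

/* Hypothetical packet availability table kept by the pool manager.  */
#include <stdlib.h>

enum { PKT_AVAILABLE = 1, PKT_TAKEN = 0 };

typedef struct {
    long packet_size;    /* predefined packet size in bytes          */
    long num_packets;    /* ceil(file_size / packet_size)            */
    char *state;         /* one availability flag per packet         */
} packet_pool;

static void pool_init(packet_pool *p, long file_size, long packet_size)
{
    long i;
    p->packet_size = packet_size;
    p->num_packets = (file_size + packet_size - 1) / packet_size;
    p->state = malloc((size_t)p->num_packets);
    for (i = 0; i < p->num_packets; i++)
        p->state[i] = PKT_AVAILABLE;
}

/* Return the id of an unprocessed packet and mark it as taken,      */
/* or -1 when the pool is empty.                                     */
static long pool_next(packet_pool *p)
{
    long i;
    for (i = 0; i < p->num_packets; i++) {
        if (p->state[i] == PKT_AVAILABLE) {
            p->state[i] = PKT_TAKEN;
            return i;
        }
    }
    return -1;
}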

Partitioning cannot be performed by worker nodes and is part of the world initialization process at the root. This is because all nodes may have the entire search data stored locally or in a shared medium, but not the information on how to use it. This information is only obtained after a message exchange with the pool manager. Each communication concerns only one data packet. Each packet is appropriately overlapped locally by pattern length − 1 so that no occurrences are lost between packets. Finally, in order to handle packets of any size without problems due to physical memory restrictions, internal packet splitting during the read/load process is also performed. These internal packets are also appropriately overlapped, without requiring additional I/O activity, again ensuring that no occurrences are lost.
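One possible way to derive the byte range of a packet from its identifier, including the pattern length − 1 overlap described above, is sketched here; the helper name and parameters are assumptions for illustration.

/* Map a packet id to a byte range in the search file. The range is  */
/* extended by m - 1 bytes (m = pattern length) so that matches      */
/* straddling a packet boundary are not lost; the end is clamped to  */
/* the file size.                                                    */
static void packet_range(long id, long packet_size, long m,
                         long file_size, long *start, long *length)
{
    long end;
    *start = id * packet_size;
    end = *start + packet_size + (m - 1);   /* overlap into next packet */
    if (end > file_size)
        end = file_size;
    *length = end - *start;
}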

The entire process concerning task allocation is designed so that waiting time is extremely low: the time to process a single request consists of a scan of the availability table to locate an unprocessed packet, followed by a 4-byte communication of the packet's identifier (id). This id is used by the worker node to locate the specific packet in the search file, and the packet is then no longer available for processing.

Nodes that process packets faster will request packets more often than slower ones. As a result, the workload is automatically distributed to nodes based on their processing capabilities. Moreover, the execution time is not bounded by the slowest node, as the workload is appropriately balanced amongst nodes of different capabilities. Finally, the overhead from inefficient utilization of the cluster near the end of the scan process, when there are no longer enough packets to keep all nodes busy, is extremely small, because the packets are relatively small and require only tenths of a second of computation time.

2.1 Orchestration and Mapping
After the initialization phase, the pool manager waits for incoming requests from any node that wants to receive the identifier of a packet to process. At the same time, worker nodes immediately send a request message to the root. The pool manager processes requests one by one and sends each node one packet for scanning. More specifically, an integer identifier of the segment of the search file is sent and is used by the worker to resolve the data to be scanned. When a worker finishes processing the packet and becomes idle, a new request message is sent, and so on. When all packets in the pool have been scanned, the root sends a terminate message to all workers and a collective communication follows, so that the results of the computation are sent back to the root.
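A compact sketch of this request/reply protocol in MPI is given below. The message tags, the TERMINATE sentinel, and the helpers pool_next and scan_packet (reusing the hypothetical packet_pool from the earlier sketch) are illustrative assumptions, not code from the paper.

#include <mpi.h>

#define TAG_REQUEST  1
#define TAG_ASSIGN   2
#define TERMINATE   -1          /* sentinel id: no more packets */

long scan_packet(long id);      /* assumed helper: reads and scans one packet */

/* Pool manager (root): answer requests until the pool is empty, then */
/* reply TERMINATE to every worker and collect the occurrence counts. */
void run_manager(int nprocs, packet_pool *pool)
{
    long total = 0, zero = 0, id;
    int finished = 0, dummy, reply;
    MPI_Status st;

    while (finished < nprocs - 1) {
        MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                 MPI_COMM_WORLD, &st);
        id = pool_next(pool);                  /* -1 when pool is empty */
        reply = (id >= 0) ? (int)id : TERMINATE;
        MPI_Send(&reply, 1, MPI_INT, st.MPI_SOURCE, TAG_ASSIGN,
                 MPI_COMM_WORLD);
        if (reply == TERMINATE)
            finished++;
    }
    /* Collective communication: sum per-worker counts at the root.   */
    MPI_Reduce(&zero, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
}

/* Worker: request packet ids, scan each packet locally, and          */
/* contribute the local count to the final reduction.                 */
void run_worker(void)
{
    long local = 0, total;
    int dummy = 0, id;
    MPI_Status st;

    for (;;) {
        MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(&id, 1, MPI_INT, 0, TAG_ASSIGN, MPI_COMM_WORLD, &st);
        if (id == TERMINATE)
            break;
        local += scan_packet(id);  /* resolve range, read it, count    */
    }
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
}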

The overhead from the communication over the network should be relatively small in a 100Mbit/s Ethernet infrastructure. This happens because the request message is only 4 bytes in size and the identifier of the packet sent with each communication to the requesting node is also 4 bytes long. Even if these are encapsulated in a larger packet for communication reasons [5], the overhead should still be trivial.

Figure 1: Pool Manager Based Dynamic Workload Allocation Algorithm

2.2 The Algorithm
Figure 1 shows the algorithm of the pool manager based dynamic workload allocation model. Although this implementation depends on communication, its overall performance overhead should be minimal. Hence, the implementation should be efficient enough, with a good linear speedup on a heterogeneous or homogeneous cluster.

2.2.1 The Data Secure Adaptation
Two implementations were developed. The first one, called simple, implements the entire procedure as described above. The other, described herein, follows a simplified and more secure data management approach. Instead of having the dataset scattered amongst a large number of workstations, the search files and patterns are located on only one node, the data secure pool manager, and no other node has direct access to this information.

The main difference in this modified version is that it relies even more heavily on the available interconnection network and its speed. Specifically, it requires the transmission of the pattern that will be used in the search phase. Furthermore, during the computation process each communication between the pool manager and the worker nodes involves real data transfer instead of identifiers that help locate the data locally or on a file server. Since the data will only be read by one workstation, it also incurs higher I/O overhead.
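In this data secure variant, the manager's reply could carry the packet bytes themselves rather than a 4-byte identifier. A hedged sketch of such an exchange, reusing the hypothetical packet_pool, packet_range and TAG_ASSIGN from the earlier sketches and an assumed read_search_file helper, is shown below.

/* Data secure variant: the root reads the packet from its local copy */
/* of the search file and ships the raw bytes to the worker, so the   */
/* worker never accesses the dataset directly.                        */
void assign_packet_with_data(int worker, long id, packet_pool *pool,
                             long pattern_len, long file_size)
{
    long start, length;
    char *buf;

    packet_range(id, pool->packet_size, pattern_len, file_size,
                 &start, &length);
    buf = malloc((size_t)length);
    read_search_file(buf, start, length);      /* local I/O at the root */
    MPI_Send(buf, (int)length, MPI_CHAR, worker, TAG_ASSIGN,
             MPI_COMM_WORLD);
    free(buf);
}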

3 Experimental Results

3.1 The Cluster Configuration
The developed workload allocation model is evaluated using a cluster consisting of a maximum of 16 workstations with Intel Pentium 4 2.40GHz CPUs and 512MB of RAM and 16 workstations with Intel Celeron 2.20GHz CPUs and 256MB of RAM. Microsoft Windows 2000 Professional is the operating system installed on all workstations. Appropriate combinations achieve a heterogeneous or homogeneous computing environment, as desired and denoted where applicable. Note that the cluster is dedicated [6] during all conducted tests.

Nodes of the cluster communicate using a 100Mbit/s Ethernet interconnection and are actually part of a larger local area network.

3.2 Development Environment
The implementation is written in C, utilizing the Argonne National Laboratory MPICH NT Message Passing Interface (MPI) library, version 1.2.5 [8] [9]. Access to the search files is realized through the Win32 application programming interface in order to support very large files, based on the provided 64-bit addressing.
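A minimal sketch of a 64-bit positioned read through the Win32 API, of the kind this description alludes to, might look as follows; error handling is omitted and the function name is illustrative.

#include <windows.h>

/* Read `length` bytes starting at the 64-bit offset `start`, using   */
/* Win32 calls so that files larger than 4GB can be addressed.        */
static DWORD read_at(const char *path, LONGLONG start,
                     char *buf, DWORD length)
{
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER pos;
    DWORD got = 0;

    pos.QuadPart = start;
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);    /* 64-bit seek */
    ReadFile(h, buf, length, &got, NULL);
    CloseHandle(h);
    return got;
}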

3.3 The Dataset
The developed workload allocation model is tested using a number of files with sizes ranging from 161MB up to 3575MB. These text database files contain the Homo sapiens genome and are obtained from the public section of GenBank, the National Institutes of Health genetic sequence database [7], part of the International Nucleotide Sequence Database Collaboration. The utilized pattern sizes range from 3 bytes to 60 bytes.

3.4 Results Analysis
As found experimentally, pattern size does not noticeably affect performance, and only extreme variations in pattern size are expected to have a minimal impact [10]. For that reason, results from the use of only one pattern are shown.

As the search file size increases, the benefits from parallelization in this implementation become more noticeable, because the higher computation requirements overlap the communication time. Thus, there is a noticeable drop in the performance for the two smallest files as the system scales. As depicted in Figure 2, when all nodes are utilized the total execution time for all search files is roughly identical. This is because the communication overhead increases as the number of nodes increases and is almost equivalent for all search files, as seen in Figure 3. Hence, the communication overhead outweighs the gains from parallelization in this case, which is clearly visible for files that require less computation, compared to a sequential implementation.

Figure 2: Total execution time for different search files with a pattern of 3 bytes

Figure 3: Communication time for all search file scans using the 3-byte pattern

To evaluate and compare the performance of the parallel dynamic workload allocation implementation with the static one [12], supplementary tests are presented below. The main characteristic of the static implementation is that it (statically) assigns the same amount of work to all nodes, including the root, and ignores workstation performance characteristics. The static and dynamic workload allocation versions are tested in a heterogeneous environment utilizing four, eight and sixteen processors. In each case, only half of the workstations are of the same type. Since the static implementation is relaxed, the root node has no implementation-specific management duties and hence all nodes take part in the computation process. Consequently, for comparability reasons, one additional node is added in each of the dynamic workload allocation runs to assume the role of the pool manager, so that the same level of available data parallelism is exploited. Results from the comparison using the smallest and largest search files are presented in Figures 6 and 7.

Figure 4: Speedup based on the average of results using all patterns

Figure 5: Parallelization efficiency based on the average of results using all patterns

The outcomes of the comparison indicate a satisfactory performance of the dynamic implementation for the largest search file. This further supports the conclusion that the benefit from dynamic allocation is mostly visible for files that require more computation, in which case the percentage of the communication overhead in the total execution time is small. Still, as the utilized multicomputer scales and computation time is significantly reduced, the communication overhead becomes noticeable. Additionally, for very small files any advantages of this implementation are hidden by the communication time. Note, though, that the performance difference between the two versions is between 0.1 and 0.2 seconds for the small file, while the benefit from the dynamic adaptation for the large file is in the range of 1 to 5 seconds.

Figure 6: Static vs. Dynamic Performance Comparison (Search File: 161MB)

Figure 7: Static vs. Dynamic Performance Comparison (Search File: 3575MB)

In contrast to the satisfactory performance of the simple pool manager, the secure data pool manager showed extremely poor results, as expected. This is mostly due to the slow 100Mbit/s Ethernet interconnection of the networked workstations that was available, whose transfer rates cannot compete with local I/O read times, in conjunction with the relatively fast computation times that characterize the pattern matching problem on the assembled parallel system. Figures 8, 9 and 10 show the total execution time as well as the computation and communication times, respectively. Note that this implementation is not compared with the static workload allocation version, since such a comparison would have no meaning under the available test environment.

Although computation times decrease, indicating good speedup values, the total execution time is severely limited by communication (Figure 10) and thus the results are extremely poor. In fact, the time required to run the application is roughly equal to the time required to copy the search file from one node to another, based on the transfer rate of 10MB/s achieved at best on a 100Mbit/s Ethernet. The only noticeable improvement in the execution time appears in the transition from 2 to 4 nodes (Figure 8). This is due to the fact that when using two nodes there is only one worker, so the application does not take advantage of data parallelism at all. However, when the nodes become four, three of them are workers and thus parallelization is exploited. This also explains the threefold improvement of the computation time in the transition from two to four nodes, compared to the twofold improvements that appear as processor numbers double, and supports the theory in the case of a relaxed algorithm (Figure 9). In all other cases the benefits from parallelization are hidden by the communication time.

Figure 8: Execution time for different search files with a pattern of 3 bytes

Figure 9: Computation time for different search files with a pattern of 3 bytes

The impact of communication on system performance is evident in the speedups obtained separately from the average execution and computation times using all patterns. Speedup results based on computation time (Figure 12) show excellent implementation performance that supports the theory: speedup increases almost linearly with the number of processors. It should be noted that the root does not take part in the computation process, so the number of processors actually scanning the search file is one less than the number shown. In contrast, the speedup acquired from the total execution time (Figure 11), which includes communication time, is extremely poor, to such an extent that the use of more than four nodes is pointless when solving the parallel pattern matching problem over a slow network interconnection. This simply happens because the network is unable to sustain a fast enough data rate to each processor, making the communication overhead in the total execution time obvious.

Figure 10: Transmission time for various search file sizes with a pattern of 3 bytes

Figure 11: Speedup based on average execution times of all patterns for the four search files
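For reference, the speedup and parallelization efficiency reported in the figures are presumably computed with the standard definitions below; this is an assumption about the metrics, since the paper does not state them explicitly. Here $p$ is the number of processes, $T_1$ the single-node time and $T_p$ the measured time on $p$ processes; when speedup is based on computation time, the effective number of scanning processes is $p - 1$, because the root only manages the pool.

\[
S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}
\]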

4 Concluding Remarks
A dynamic workload allocation model based on a pool manager implementation was investigated herein. The implementation exhibits excellent performance for larger files that require more processing time. For such files the communication time with the pool manager is negligible compared to the total execution time. Hence, a significant performance increase appears in a heterogeneous system configuration, compared to the static workload allocation approach [12]. On the other hand, small files present slightly worse performance, mainly due to the overhead imposed by the pool manager and the interconnection infrastructure. Furthermore, as the cluster scales, the computation time becomes smaller while the communication overhead increases, leading to a further performance drop. It is expected that for larger cluster configurations utilizing 128 or even 256 workstations, the pool manager can become a significant performance bottleneck. In such configurations the role of the pool manager may have to be assigned to more than one node, following an appropriate load balancing approach [11].

Figure 12: Speedup based on average computation times of all patterns for the four search files

The data secure pool manager adaptation was motivated by both security and ease of data management. The approach suggests that all data should be stored on only one workstation, the secure root node. Consequently, performance can be severely limited by local I/O read speed and network transmission speed, which was the case in the test environment. This resulted in extremely low performance compared to the simple pool manager and the static workload allocation version, to a point where parallelization gains are completely lost due to communication. Still, it is expected that, given a faster network interconnection, system performance should be much better. Furthermore, other data parallel problems with a more complex computation phase could also perform better. In that case the benefits from having the dataset local to only one node, in conjunction with parallelization, should become apparent. Both of the above research questions are left open for future work.

References:

[1] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann Publishers, 2003.

[2] V. Donaldson, F. Berman and R. Paturi, Program Speedup in a Heterogeneous Computing Network, Journal of Parallel and Distributed Computing, vol. 21, no. 3, 1994, pp. 316–322.

[3] A. Aho and M. Corasick, Efficient String Matching: An Aid to Bibliographic Search, Communications of the ACM, vol. 18, no. 6, 1975, pp. 333–340.

[4] C. Charras and T. Lecroq, Handbook of Exact String Matching Algorithms, King's College Publications, 2004.

[5] A. Tanenbaum, Computer Networks, Fourth Edition, Prentice Hall, 2002.

[6] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall, 1998.

[7] GenBank: National Institutes of Health genetic sequence database. URL: http://www.ncbi.nih.gov/Genbank

[8] D. Ashton, W. Gropp and E. Lusk, Installation and User's Guide to MPICH, a Portable Implementation of MPI, Version 1.2.5. URL: http://www-unix.mcs.anl.gov/mpi/mpich/

[9] The Message Passing Interface Forum, MPI: A Message Passing Interface Standard. URL: http://www.mpi-forum.org/

[10] P. Michailidis and K. Margaritis, Parallel Text Searching Application on a Heterogeneous Cluster of Workstations, Proceedings of the 2001 International Conference on Parallel Processing Workshops (ICPPW'01), IEEE, 2001.

[11] R. Buyya, High Performance Cluster Computing: Architectures and Systems, Prentice Hall PTR, 1999.

[12] G. Ivanov, A Study of Parallel Programming Techniques on a Cluster of Workstations, IST Studies, University of Hertfordshire.
