Building an Integrated CBR-Big Data Oriented Architecture ...ceur-ws.org/Vol-2028/paper19.pdf · case base grows, the swamping utility problem can adversely a↵ect case retrieval

Building an Integrated CBR-Big Data OrientedArchitecture for Case-Based Reasoning Systems

Kareem Amin1,2

1 PhD Candidate, German Research Center for Artificial Intelligence, Smart Dataand Knowledge Services, Trippstadter Strae 122, 67663 Kaiserslautern, Germany,

2 Big Data Consultant, Sulzer GmbH, Frankfurter Ring 162, 80807, Munich,Germany

[email protected],[email protected]

1 Introduction

The growth of intensive data-driven decision-making is now being recognizedbroadly. Big data systems are mainstream and the demand for building sys-tems that able to process data streams is growing. Yet many decision supportsystems act like ”black boxes”, providing little or no transparency in the ratio-nale of their processes [1]. The ”black box” methodologies are not acceptablein crucial domains like health care, aviation, and maintenance. Experts preferto reason the decisions. Current big data strategies tend to process in-motiondata and o↵er many potential scenarios to work with. The big data term refersto dynamic, large, structured and unstructured volumes of data generated fromdi↵erent sources with di↵erent formats [2]. Therefore, it is a must for CBR sys-tems that tends to process the in-motion data to manage their sub-tasks, suchas collecting and formatting data, case base maintenance, cases retrieval, casesadaptation and retaining new cases [3]. In my research I will describe the idea ofspanning the gap between CBR and Big Data based on the SEASALT architec-ture [4] [5]. SEASALT is an application independent architecture to work withheterogeneous data repositories and modularizing knowledge. It was proposedbased on the CoMES approach to develop collaborative multi-expert systemsand provides an application-independent architecture that features knowledgeacquisition from a Web community, knowledge modularization, and agent-basedknowledge maintenance. Its first research prototype was developed for the travelmedicine application [4]. SEASALT aims to provide a coherent multi-agent CBRarchitecture that can define the outlines and interactions to develop multi-agentCBR systems.

2 CBR & Big Data

When CBR research has addressed increased data sizes, the primary focus hasbeen compression of existing data rather than scale-up. Considerable CBR re-search has focused on the e�ciency issues arising from case-base growth. As the

Copyright©2017forthispaperbyitsauthors.Copyingpermittedforprivateandacademicpurpose.InProceedingsoftheICCBR2017Workshops.Trondheim,Norway

189

case base grows, the swamping utility problem can adversely a↵ect case retrievaltimes, degrading system performance [6]. CBR and Big Data collaboration is anemerging topic, some researches have been carried out focusing mainly on casebase maintenance methods, aiming to reduce the case base size while preservingcompetence [8][7]. Few CBR projects have considered scales up to a million ofcases [10][9]. The ability of case-based reasoning to reason from individual ex-amples and its inertia-free learning makes it appear a natural approach to beapplied to big-data problems such as predicting from very large example sets [6].Likewise, if CBR systems had the capability to handle very large data sets, sucha capability could facilitate CBR research on very large data sources alreadyidentified as interesting to CBR, such as cases harvested from the ExperienceWeb [11], cases resulting from large-scale real-time capture of case data frominstrumented systems [12], or cases arising from case capture in trace-based rea-soning [13].

3 Research Focus

In my thesis I am going to concentrate on building a multi-agent CBR systemthat extends the SEASALT architecture. The proposed approach is designedto semi-automate the building of cases based on chunks of data coming fromdi↵erent streams, and being able to work with big number of historical casesstored in our case base. A real use case to elaborate the main goal of my modelwould be in manufacturing [14]. In manufacturing processes data comes fromdi↵erent machines and sensors. We need to detect any pattern that has led toa disqualified end product, and give a proactive solution to avoid or mitigatethe e↵ect of these kinds of patterns [15]. Hence, from the Big Data 4V’s, I willmainly focus on velocity and volume with lower exposure to variety. I need tocollect data from di↵erent sources and be able to detect patterns that matchour old cases in real time. To achieve the aforementioned goals, the followingobjectives have to be fulfilled:

1. Extend the original SEASALT architecture with a new layer ”KnowledgeStream Management”

2. Correlate and synchronize between the chunks of data that come from dif-ferent sources

3. Collect su�cient knowledge from domain experts that help in achieving point2

4. Develop a methodology to apply the new approach to existing multi-agentsystems as well as integrating it into the development of new multi-agentsystems

5. Evaluate the new approach and the methodology within an industrial usecase

6. Compare performance and accuracy with other existing techniques and sys-tems

The proposed approach is roughly described in details in the following sec-tions.

190

4 The Knowledge Stream Management

The original idea of the SEASALT architecture comes from Altho↵, Bach andReichle [5]. SEASALT consists mainly of five main layers, Knowledge Source,Knowledge Formalization, Knowledge Representation, Knowledge Provision, andIndividualized Knowledge. Every layer contains several software agents desig-nated for several tasks. Through my work, a new layer will be added: ”Knowl-edge Stream Management” (See Figure 1). The new layer has two main tasks,the first is processing the streams of data coming from Knowledge Sources inreal time, and the second is to give real time analysis to data patterns foundwithin the streams. The Knowledge Stream Management layer will contain soft-ware agents designated for the prescribed tasks. System nodes 3 would be theavailable processing power. The Knowledge Provision layer will be distributedacross several nodes, and hence each node contains Knowledge Provision agents.The Coordination Agent will act as the system manager who is aware of allthe system nodes and responsible for the whole system control. He will be thedata tap that uses the underlying framework to distribute the incoming requestsacross the system nodes. Normally, there are two kinds of nodes, one for Queriesprocessing to retrieve results and the second for New Cases processing. It is pos-sible to have up to N nodes in the system according to the volume of data thatshould be processed in real time. According to Big Data system architecture andsizing best practices provided from Hortonworks 4, for sustained throughput of50MB/sec and thousands of events per second, we need 1-2 nodes and 8+ coresper node (more is better), 6+ disks per node (SSD or Spinning) and 2 GB ofmemory per node and 1GB bonded. In every node, there would be a Classifi-cation Agent to classify the received data chunks and assign it to the intendedTopic Agent. Each Classification Agent is aware of the knowledge map gatheredfrom knowledge sources and classify the incoming data according to predefinedclasses (collected before from domain experts). Then, the Classification Agentassigns the request to the intended Topic Agent(s). The Topic Agent is perform-ing queries to retrieve the most similar cases. Since distributed nodes are beingused in the hardware cluster, the Case Base will be replicated to avoid dataintegrity problems using replication channels to replicate data between all CaseBase instances. The Case Factory agents will be centralized, and hence the CaseFactory will have only one instance that performs case maintenance on a singleCase Base. Afterwards, the results will be distributed to the whole system nodesusing the replication channels.

We assume that solving big data problems will require also manual knowl-edge modelling. CBR - standing with one foot in the area of Machine Learn-ing [automated knowledge generation] and with the other foot in the area ofKnowledge-Based Systems [manual and semi-automatic knowledge modelling] -

3 Every single node is a processing power4 Hortonworks is one of the biggest big data software companies founded inJune 2011 as an independent company based in Santa Clara, California.http://www.hortonworks.com

191

Fig. 1. Big Data Oriented SEASALT Architecture

is a natural candidate for finding a domain-and task-specific approach of inte-grating automated knowledge generation [using machine learning] with manualknowledge modeling [using knowledge-intensive CBR].

4.1 Potential Applications

1. Internet of Things (IoT) applications.

2. Ticketing systems and customer support applications.

3. Server logs anomaly detection applications.

4. Condition monitoring applications.

5 Current Progress & Future Directions

Currently I am shaping my PhD goals and approach. I intend to implementour approach and compare accuracy and speed performance with other casebase maintenance methods. I am currently working to learn the big data systemarchitectures and tools, that will help in the implementation phase. In the mean-while, I am trying to find a suitable industrial use case to apply the proposedapproach.

192

References

1. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining Collaborative Filtering Recom-mendations. In: Proceedings of the 2000 ACM Conference on Computer supportedcooperative work, ACM (2000).

2. A community white paper developed by leading researchers across the United States,”Challenges and Opportunities with Big Data,” Purdue University, USA, 2012.

3. Aitor Mata, ”A Survey of Distributed and Data Intensive CBR Systems,”SpringerVerlag Berlin Heidelberg, pp. 582-586, 2009.

4. Meike Reichle, Kerstin Bach and Klaus-Dieter. Altho↵, ”Knowledge engineeringwithin the application-independent architecture SEASALT” International JournalKnowledge Engineering and Data Mining, vol. 1.1, no. 3, pp. 202-215, 2011.

5. Kerstin Bach,Meike Reichle and Klaus-Dieter. Altho↵, ”A Domain IndependentSystem Architecture for Sharing Experience,” Proceedings of LWA 2007, WorkshopWissens- und Erfahrungsmanagement, September 2007, pp. 296-303

6. Vahid Jalali and David Leake, ”CBR Meets Big Data: A Case Study of Large-ScaleAdaptation Rule Generation,” Case-Based Reasoning Research and Development,pp. 181-196, 2015.

7. Smyth, B., Keane, M.: Remembering to forget: A competence-preserving case dele-tion policy for case-based reasoning systems. In: Proceedings of the Thirteenth Inter-national Joint Conference on Artificial Intelligence, San Mateo, Morgan Kaufmann(1995) 377382

8. Smyth, B., McKenna, E.: Building compact competent case-bases. In: Proceedingsof the Third International Conference on Case-Based Reasoning, Berlin, SpringerVerlag (1999),329342

9. Daengdej, J., Lukose, D., Tsui, E., Beinat, P., Prophet, L.: Dynamically creatingindices for two million cases: A real world problem. In: Advances in Case-BasedReasoning, Berlin, Springer (1996) 105119

10. Beaver, I., Dumoulin, J.: Applying mapreduce to learning user preferences innear realtime. In: Case-Based Reasoning Research and Development, ICCBR 2014,Berlin, Springer (2014) 1528

11. Plaza, E.: Semantics and experience in the future web. In: Proceedings of the NinthEuropean Conference on Case-Based Reasoning, Springer (2008) 4458

12. Ontanon, S., Lee, Y.C., Snodgrass, S., Bonfiglio, D., Winston, F., McDonald, C.,Gonzalez, A.: Case-based prediction of teen driver behavior and skill. In: Case-BasedReasoning Research and Development, ICCBR 2014, Berlin, Springer (2014) 375389

13. Cordier, A., Lefevre, M., Champin, P.A., Georgeon, O., Mille, A.: Trace-basedreasoning modeling interaction traces for reasoning on experiences. In: Proceedingsof the 2014 Florida AI Research Symposium, AAAI Press (2014) 363368

14. S. Windmann, A. Maier, O. Niggemann, C. Frey, A. Bernardi, Ying Gu, H. Pfrom-mer, T. Steckel, M. Kruger and R. Kraus, ”Big Data Analysis of ManufacturingProcesses,” in 12th European Workshop on Advanced Control and Diagnosis, 2015.

15. Mller, G., Bergmann, R.: Workflow Streams: A Means for Compositional Adap-tation in Process-Oriented CBR. In: Proceedings of ICCBR 2014. Cork, Ireland,2014

193

Documents

Building an Integrated CBR-Big Data Oriented Architecture ...ceur-ws.org/Vol-2028/paper19.pdf · case base grows, the swamping utility problem can adversely a↵ect case retrieval