24
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice One Graph is worth a Thousand Logs Uncovering Hidden Structures in Massive System Event Logs Gilad Barash Co-Authors: Ira Cohen, Michal Aharon, Eli Mordechai

One Graph is worth a Thousand Logs Uncovering Hidden Structures in Massive System Event Logs

  • Upload
    janna

  • View
    63

  • Download
    2

Embed Size (px)

DESCRIPTION

One Graph is worth a Thousand Logs Uncovering Hidden Structures in Massive System Event Logs. Gilad Barash Co-Authors: Ira Cohen, Michal Aharon, Eli Mordechai. Event Logs in the IT environment. MANY LOGS Millions of events Per DAY. IT. - PowerPoint PPT Presentation

Citation preview

  • ****Event Logs in the IT environmentITEach System component writes to its own log (events, errors)Logs are used to detect and troubleshoot problems in systems

  • ****Event Logs : The ProblemHuge volume of data is not amenable for human consumption

    Semi-Structured text data not meant for automated consumption

  • ****Roadmapfailed to get licenses for project session the session auth has failed. failed to get licenses for project session the session auth has failed. failed to get licenses for project session the session auth has failed. failed to get licenses for project session the session auth has failed. failed to get licenses for project session the session auth has failed. Template DiscoveryMachine-Ready DataRaw LogsHuman-readable DataPARIS

  • Logs in raw form

    12/1/2008 12:34:03 failed to retrieve the meta data of project null0 the session auth has failed.12/1/2008 12:35:03 failed to get licenses for project session the session auth has failed.12/1/2008 12:40:31 error processing request from 192.111.22.33 data starts with 0 \00000023\0 conststr12/1/2008 12:44:03 unexpected failure while trying to ping user session #44444 the session auth has failed 12/1/2008 12:50:03 failed to retrieve the meta data of project null1 the session authentication has failed. 12/1/2008 12:50:05 unexpected failure while trying to ping user session #33333 the session auth has failed 12/1/2008 12:50:23 failed to get licenses for project session the session auth has failed. 12/1/2008 12:55:09 failed to get licenses for project session the session auth has failed. 12/1/2008 12:56:22 error processing request from 192.222.22.55 data starts with 0 \00000014\0 conststr12/1/2008 12:56:56 Failed to retrieve the meta data of project null3 the session auth has failed.12/1/2008 12:57:03 error processing request from 193.111.26.33 data starts with 0 \00000512\0 conststr12/1/2008 12:57:25 error processing request from 192.111.22.43 data starts with 0 \00000014\0 conststr

  • Lets Rearrange the Messages**11 messages9 distinct messages4 templatesVariable WordVariable WordVariable WordVariable Word

    failed to retrieve the meta data of project null0 the session auth has failed.failed to retrieve the meta data of project null1 the session auth has failed. Failed to retrieve the meta data of project null3 the session auth has failed.

    failed to get licenses for project session the session auth has failed.failed to get licenses for project session the session auth has failed. failed to get licenses for project session the session auth has failed.

    error processing request from 192.111.22.33 data starts with 0 \00000023\0 conststrerror processing request from 192.222.22.55 data starts with 0 \00000512\0 conststrerror processing request from 192.111.22.43 data starts with 0 \00000014\0 conststr

    unexpected failure while trying to ping user session #44444 the session auth has failed unexpected failure while trying to ping user session #33333 the session auth has failed

  • Requirements for Template Discovery

    Online Produce immediate valueConsistentTemplate assignment of a message should remain consistent over timeEfficient Keep up with incoming message rates

  • Template Discovery Algorithm:

    Incremental Text Clustering

    Step 1: Rough clustering: Creating/Assigning events to root clusters

    Step 2: Cluster refinement:Splitting root clusters

    Output: Forest of clusters**

  • Step 1: Rough Clustering**Log MessageSequentiallycompare to ExistingClusterSimilar to Existing Cluster?Assign to Existing ClusterCreate new ClusterCheck for Splityesno

  • Step 2: Cluster Refinement**Compute Entropy ofEach Word PositionSplit Clusters based onWord position EntropyDo nothing

    MIN(entropy) < thresholdfor all entropy > NoYes No word occurs too frequently (e.g., Pkj = 0.99)threshold At least one word occurs frequently enough (e.g., Pkj =0.10)

  • **Template discovery algorithm

    m1: B C D F A Bm2: B C D F A B Jm3: A C D F E Km4: B C D F E BSimilarity Threshold: 0.8m1,m2,m4m3Clustering example:1000 appearances of m4800 appearances of BCDFAB_=0.91 =0.5 =0.83E

  • Entropy Calculation for Split**m4: B C D F E Bmx: B C D F A B *1000 *800 *Entropy:00000.1500.45

  • **Template discovery algorithm

    m1: B C D F A Bm2: B C D F A B Jm3: A C D F E Km4: B C D F E BSimilarity Threshold: 0.8m1,m2,m4m3Clustering example:m3m1,m2m4m4=0.91 =0.5 =0.83Em5: B C D F E D

  • Discovering System Process PatternsJDBC3 getGeneratedKeys(): disabled

    (3) Connection release mode: auto

    (5) getConnectionURLs=tcp://websiteURL:2507(6) Query translator: hql.ast.ASTQueryTranslatorFactory

    (9) create connection. connectId (10) mercury_db_loader_DB_Loader user=;pwd=;(2) SH remote was null. Exported object monitor.

    (4) Add task Main Flow

    (7) Register provider class dataentry.loader.LoaderMain(8) Service manager started

    (3) Connection release mode: auto(11) mercury_db_loader is up and running

    Same messageIn different processOptional messagesPARIS (Principal Atoms Recognition In Sets)Identifies sets of events that tend to occur together with noInnate knowledge of messagesProvides enhanced view of system behavior dynamics

    Database Connection StartupService Manager startup(1) JDBC3 getGeneratedKeys(): disabled(3) Connection release mode: auto(5) getConnectionURLs=tcp://websiteURL:2507(6) Query translator: hql.ast.ASTQueryTranslatorFactory(9) create connection. connectId (10) mercury_db_loader_DB_Loader user=;pwd=;(2) SH remote was null. Exported object monitor.(4) Add task Main Flow(7) Register provider class dataentry.loader.LoaderMain(8) Service manager started(3) Connection release mode: auto(11) mercury_db_loader is up and running

  • PARISGets as input a large number of sets, that are assumed to have some mutual characterization.Detects principal sets of elements that tend to appear together in the data.Overcomes non-exact repetitionsIgnores additional noiseUses:Analysis CompressionAnomaly Detection ElementsSetsRepresentationAtoms

  • PARISApplicable for:Analysis & understandingCompressionDetection of irregularities / errors in the dataSynthesisElementsSetsRepresentationAtoms

  • PARISElementsSetsRepresentationAtomsRepresentation error must be small, but not necessarily zero.Representation should serve some sense of compression of the data (sparsity).Minimal number of atoms (K).

  • PARIS Cost FunctionMinimize the representation error of the data.Minimize the size of the representation (compression).Minimize the number of principal atoms.Representation error must be small, but not necessarily zero.Representation should serve some sense of compression of the data (sparsity).Minimal number of atoms (K).

  • ResultsDatasets

    SourceNumber of eventsNumber of distinct eventsBusiness App 14,210,513153,619Printer Press11,2045,631Windows Events66,10225,340Business App 2483,76870,102

  • ResultsTemplate identificationRepresentationAccuracy: 95%

    SourceNumber of eventsNumber of distinct eventsNumber of clusters(templates)Business App 14,210,513153,6194,193Printer Press11,2045,631204Windows Events66,10225,340476Business App 2483,76870,1021,115

  • ResultsCompression

    SourceNumber of eventsNumber of distinct eventsNumber of clusters(templates)Index size reduction (number of words saved)Business App 14,210,513153,6194,19390%Printer Press11,2045,63120450%Windows Events66,10225,34047640%Business App 2483,76870,1021,11590%

  • Visualizing the logs: Business App 2 Event Timeline70,000 distinct messagesAppear in one graph viewBehavioral patterns become Evident in this viewX axis: TimelineY axis: msg IDOne graph is worth a thousand logs!!!

  • ****PARIS Result: Correct Process IdentificationBuss App 2 LogsAtom ID: 12 890 Failed creating SiS sample 924 Failed processing http request: report_ss_samples, from remoteHost : Failed to acquire lock for publishing sample 1183 Failed processing http request: report_transaction, from remoteHost : Failed to acquire lock for publishing sample

    Atom ID: 27734User operation - stop nanny748message_broker STOPPED753Input main(String[] args:754Going to call WrapperManager.start(new Main(), args)755Initializing Spring files757Path for spring files is E:\HPBAC\conf\supervisor\spring759Loading spring file764NannyConfigurationRepository initialization completed765Will load JMX security info from E:\HPBAC\conf\jmxsecurity.txt766No security file: E:\HPBAC\conf\jmxsecurity.txt - JMX will not be secured767Registering beans for JMX exposure on startup768Autodetecting user-defined JMX MBeans769Bean with name 'nannyManager' has been autodetected for JMX exposure770HTTP adapter port is 11021771Succeeded adding html adapter772manager thread loop started.773Verifying time diff between cpp (local machine) and Java.774Log file of time diff is: E:\HPBAC\tools\TimeDiff\time_diff.log776Run java time diff780Trying to initialize Properties Manager792Config server check passed793Prerequisites have been met794start() Nanny Manager795Nanny Manager need to start all services?:true796Going to start all services. Service Restart

  • SummarySummarize Event Logs: Template CreationLossless Reduction in size of dataMachine-readableProcess identification: PARISStrategic importance in managing IT environmentHuman-readable**

  • **Q&A

    Minimizing the sum balances tradeoffs between representation and compressionThe last two expressions are regularizors of the equationThis automatically finds the number of atoms (not the case in the paper)Organizing into unique messages and saving those reduces the amount of data to store drasticallyUsing the dictionary creation algorithm to create a dictionary of templates reduces the volume of data to look at even more!Index size reduction: by comparing the number of words we need to keep with distinct messages vs. clusters + variable words to reconstruct the messages. Can speed up search and data manipulation.Since Windows events already have ids, we could compare and measure accuracy. 95% is due to missing and duplicate ids in Windows events!!PARIS identified the messages and mapped them to known processes