Understanding system crashes and collecting appropriate data
Nitin Vig - JTAC (ERX)
Target Audience
This training is intended for engineers who support the E-series platform in the field.
Basic familiarity with the product and good network troubleshooting skills are expected.
Agenda
Why does an E-series router crash?
What happens after a crash?
What can I do once it crashes?
What information does JTAC need?
Why does the router crash?
Crashes can happen on the SRP or the Line Module
Can be due to a software or hardware problem
Crashes can be a good thing
Why does the router crash?
A crash happens when:
– We do not know what to do:
  – Received a packet that the software does not know how to handle.
  – A misbehaving piece of hardware causes undesirable operation.
– Someone does not listen to us:
  – An application is in a deadlock and does not release resources.
  – A line module does not respond to the SRP.
Types of crashes
Software panics
– Software not designed to handle a specific condition.
Processor exceptions
– Processor on the SRP or LM hits a violation while processing data.
Detector crashes
– Recovery and detection mechanism implemented to address forwarding fault conditions.
Software panics (example)
time of reset: THU NOV 01 2007 00:57:26 CDT
run state: primary
image type: application
location: slot (6)
build date: 0x46392b64 THU MAY 03 2007 00:23:00 UTC
reset type: panic
task: cliActor
file: osSemaphore.cc
line: 153
arg: 38516744
last errno: 0x3d0001
pc: 0x1c480f74: fatalPanic__Fv +0x8
lr: 0x1c532eec: take__11OsSemaphore +0x1c0
<output truncated>
– SRP crash seen when clearing an improperly terminated SSH session using the 'clear line vty' command from another SSH session.
– The fix involved changes to the SSH application behavior when clearing a VTY session.
– KB 31413
Software panics (example)
time of reset: Thu Aug 2 01:17:18 2007
run state: unknown (0)
image type: application
image id: 0x464c3086
build date: 0x464c3086 Thu May 17 2007 10:37:58 GMT
location: internal slot (1), processor 0, boardId 0x33, boardRev 0x3
reset type: panic
file: dhcpDemux.cc
line: 221
task: scheduler
last errno: 0x30065
pc: 0x3a1f5c -> fatalPanic(void) offset: 0x8
lr: 0x709c4 -> DhcpDemux::receivePacket(Uid, Uid, OsBuffer &, unsigned char)
<output truncated>
– Crash noticed on line modules running the DHCP Relay proxy application after an SRP failover.
– DHCP application on the LM received a DHCP packet before it was ready
– Fix involved discarding DHCP packets until the DHCP application is ready
– KB 29531
Processor Exceptions (example)
time of reset: Tue Mar 19 18:35:45 200
location: slot 9 (a), processor 0, boardId 0x19, boardRev 0x0
image type: application
build date: 0x3c7608b1 (Fri Feb 22 09:00:33 2002)
reset type: processor exception 0x200 (machine check)
task: icc
pc: 0x37f2ccc -> memPartAlignedAlloc offset: 0x108
lr: 0x37f2c9c -> memPartAlignedAlloc offset: 0xd8
dar: 0x00000000 cr: 0x24000080 xer: 0x00000000 fpcsr: 0x0369beec
srr1: 0x0010b030 dsisr: 0x00000000 ctr: 0x00000000
<output truncated>
– SRP crash due to L2 cache memory parity error
– The SRP CPU encounters a parity error when reading data from the L2 data cache
– No software fix available for the problem. Historically the crash never recurs on the same SRP
– KB 2443
Processor Exceptions (example)
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>
– SRP crash due to a data access violation.
– IPSM application wrote data to an incorrect location. The SRP CPU does not expect data in that location, hence it crashes while trying to read from it
– IPSM application was fixed to ensure that it does not write to the invalid location
– KB 29909
Detector crashes
– Internal mechanism developed by Juniper to detect and recover from forwarding faults
– The objective is to minimize the forwarding impact on the router
– The system initially tries to recover from the fault without any external impact. If that is not possible, a crash is performed.
– Also records information about these faults for troubleshooting purposes.
– KB 16800: Enhanced PFTE support on E-series
Detector crashes
– 2 basic mechanisms:
  – Run by SRP
    – Commonly known as “PIMTE” (Ping/ICC Monitoring Threshold Exceeded) or “PFTE” (Ping Failure Threshold Exceeded)
    – Frequently polls the line modules to check their health (aka “ping”)
    – Thresholds are defined for applications interacting between SRP and LM
    – If thresholds are exceeded, the SRP decides what action should be taken.
    – Additional information is written to a file with extension “.tsa”
    – TSA file generation does not always mean there was a crash!
    – These crashes have a generic crash signature.
    – Crash can occur on the standby SRP or Line Module
    – Reboot.hty, coredump and TSA file (if present) are required in each case to analyse the root cause.
Detector crashes: PIMTE (example)
time of reset: Fri Nov 10 00:13:00 2006
run state: unknown (0)
image type: application
image id: 0x4550dd8b
build date: 0x4550dd8b Tue Nov 07 2006 19:24:59 GMT
location: internal slot (4), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 775
task: scheduler
last errno: 0x110001
pc: 0x9e4228 -> fatalPanic(void) offset: 0x8
lr: 0x122088 -> Hw2Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0x1308
<output truncated>

time of reset: Tue Aug 29 11:21:41 2006
run state: standby
image type: application
image id: 0x44ed842e
build date: 0x44ed842e Thu Aug 24 2006 10:49:18 Eastern Standard Time
location: internal slot (9), processor 0, boardId 0x3e, boardRev 0
reset type: unknown software error signature (0xfadead4), msg "Ping/ICC monitoring threshold exceeded"
file: ontrolNetwork.cc
line: 752
task: cbusSlave
last errno: 0x380003
pc: 0x4cc98f10 -> fatalPanic(void) offset: 0x8
lr: 0x4ce26e7c -> Hw1SrpSlaveControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &) offset: 0x1548
<output truncated>
Detector crashes: PFTE (example)
time of reset: Mon Apr 24 12:44:33 2006
run state: unknown (0)
image type: boot
image id: 0x4390bdc4
build date: 0x4390bdc4 Fri Dec 02 2005 21:33:56 GMT
location: internal slot (5), processor 0, boardId 0x33, boardRev 0x3
reset type: panic, msg "ping failure threshold exceeded"
file: ontrolNetwork.cc
line: 1182
task: scheduler
last errno: 0
pc: 0x19235d4 -> fatalPanic(void) offset: 0x8
lr: 0x1999d90 -> Hw1Ic1ControlNetwork::DefaultSession::acceptCall(CbusMessage const &, CbusMessage &, CbusReplyDoneAction *&) offset: 0xbf8
<output truncated>
Detector crashes
– 2 basic mechanisms:
  – Run by the Line module
    – Commonly known as “ic1Detector” crashes
    – Various components on the line module inform the IC (line module CPU) about any forwarding faults.
    – Based on the severity of the problem, the line module decides what action should be taken
    – The IC (line module CPU) initially attempts to recover the particular component.
    – If it cannot be recovered or if the problem recurs, a crash is taken
    – The “ic1Detector” crashes are seen only on Line modules.
    – They have a generic crash signature.
    – Reboot.hty and coredump are required in each case to analyse the root cause.
Detector crashes: ic1Detector (example)
time of reset: Mon Aug 28 14:55:44 2006
run state: unknown (0)
image type: application
image id: 0x44a294a3
build date: 0x44a294a3 Wed Jun 28 2006 14:39:31 GMT
location: internal slot (5), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 718
task: scheduler
last errno: 0x110001
pc: 0x9b528c -> fatalPanic(void) offset: 0x8
lr: 0x1286544 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa1c
<output truncated>

time of reset: Tue Jan 22 11:32:39 2008
run state: unknown (0)
image type: application
image id: 0x46d571bf
build date: 0x46d571bf Wed Aug 29 2007 13:16:47 GMT
location: internal slot (15), processor 0, boardId 0xff, boardRev 0xff
reset type: panic, msg "Ic1Detector::requestRecovery executed forced IC crash"
file: ic1Detector.cc
line: 738
task: scheduler
last errno: 0x110001
pc: 0xa6b740 -> fatalPanic(void) offset: 0x8
lr: 0x147dc24 -> Ic1Detector::requestRecovery(Ic1Detector *, unsigned long, int, unsigned short, bool, bool, char const *, bool) offset: 0xa54
<output truncated>
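The three crash types shown in the examples can be told apart from the "reset type" field alone. The following is a minimal sketch, not a JTAC tool: the function name `classify_reset` and the marker strings are assumptions taken only from the example records in this training, so other releases may word their messages differently.

```python
# Hypothetical helper: the marker strings below are copied from the example
# crash records in this training, not from any official list.
DETECTOR_MARKERS = (
    "ping/icc monitoring threshold exceeded",  # PIMTE
    "ping failure threshold exceeded",         # PFTE
    "ic1detector",                             # line module detector
)


def classify_reset(reset_type: str) -> str:
    """Map the 'reset type' value of a crash record to a crash category."""
    text = reset_type.lower()
    if "processor exception" in text:
        return "processor exception"
    # Check detector messages before the plain panic test, because detector
    # crashes are also reported as 'panic, msg "..."'.
    if any(marker in text for marker in DETECTOR_MARKERS):
        return "detector crash"
    if text.startswith("panic"):
        return "software panic"
    return "other"
```

Detector messages are tested before the plain panic check because, as the examples show, a detector crash is itself recorded as a panic with a message attached.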
Agenda
Why does an E-series router crash?
What happens after a crash?
What can I do once it crashes?
What information does JTAC need?
What happens after a crash?
After an SRP or Line module crashes, it goes through the boot process
During this time it writes entries to the reboot.hty file and generates a coredump (if enabled)
What is the reboot.hty file? What is a coredump?
What is the reboot.hty file?
Reboot.hty is a file maintained on the SRP’s flash which keeps a history of the reboots that happen on the router.
These include regular reboots performed by the user and unexpected crashes that may happen on the router.
Each SRP maintains its own copy of the reboot.hty file. This file is NOT synchronized.
Each SRP keeps a record of its own reboots
Additionally, when a line module reboots, its entries are written to the primary SRP’s reboot.hty
What is the reboot.hty file?
How do I view the contents of the reboot.hty file?
– Primary SRP
  – Use the ‘show reboot-history’ command
– Standby SRP
  – Make a copy of the Standby SRP’s reboot.hty file on the primary SRP’s flash:
      copy standby:reboot.hty <filename>.hty
  – Use the ‘show reboot-history <filename>.hty’ command
How do I copy the reboot.hty file to an FTP server?
– Primary SRP:
      copy reboot.hty <FTPserver>:<path>/<filename>.hty
– Standby SRP:
      copy standby:reboot.hty <FTPserver>:<path>/<filename>.hty
What is the reboot.hty file?
What is the format of a crash record in the reboot.hty file?
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
lr: 0x490518f4 -> IpSubscriberMgr::SubscriberEntry::~SubscriberEntry(void) offset: 0x24
<output truncated>
– time of reset: Identifies the time the reset took place
– run state: Relevant only for the SRP. Identifies if the SRP was in ‘primary’ or ‘standby’ state when the reset occurred. Set to ‘unknown’ for line modules.
– image type: Identifies the type of image the SRP or line module was running when it reloaded. This can be boot, diag or application image.
– image id: Internal ID used by Juniper to identify the release
– build date: Identifies the date when the release on this router was built.
What is the reboot.hty file?
What is the format of a crash record in the reboot.hty file?
– location:
  – Identifies the slot in which the SRP or line module was present when it reloaded.
  – The physical slot in which the SRP or line module resides can differ from the “internal slot” reported in the reboot.hty file.
  – Mapping of the internal slot to physical slot:
    – ERX:
        Physical Slot    Internal Slot
        0                0
        1                1
        2                2
        3                3
        4                4
        5                5
        6                7
        7                9
        8                10
        9                11
        10               12
        11               13
        12               14
        13               15
    – E320: KB 26791: E320 Internal Slot Numbering
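The ERX mapping above can be kept as a small lookup when reading crash records. This is only a sketch: the dictionary values are transcribed from the table in this training, the names `ERX_PHYSICAL_TO_INTERNAL` and `erx_physical_slot` are made up for illustration, and the table does not apply to the E320 (see KB 26791).

```python
from typing import Optional

# Physical slot -> internal slot, transcribed from the ERX table above.
ERX_PHYSICAL_TO_INTERNAL = {
    0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5,
    6: 7, 7: 9, 8: 10, 9: 11, 10: 12, 11: 13, 12: 14, 13: 15,
}

# Reverse view: internal slot (as printed in reboot.hty) -> physical slot.
ERX_INTERNAL_TO_PHYSICAL = {
    internal: physical for physical, internal in ERX_PHYSICAL_TO_INTERNAL.items()
}


def erx_physical_slot(internal_slot: int) -> Optional[int]:
    """Return the physical slot for an ERX internal slot, or None if unmapped."""
    return ERX_INTERNAL_TO_PHYSICAL.get(internal_slot)
```

For example, a record reporting "internal slot (9)" maps to physical slot 7, while internal slots 6 and 8 do not appear in the table and return None.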
What is the reboot.hty file?
What is the format of a crash record in the reboot.hty file?
– reset type: Identifies the type of reset. There can be quite a few different reset types such as ‘processor exception’, ‘panic’, ‘user reboot’
– task: Identifies the task that was running on the SRP or line module when it was reset. On the line module this will always be set to ‘scheduler’ as that is the only task that runs on the line module.
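Because every field of a crash record is a "name: value" line, a record can be split into fields with very little code. The sketch below is illustrative only: `parse_crash_record` and `decode_build_date` are hypothetical helper names, and reading the hexadecimal word in "build date" as seconds since the Unix epoch is an inference from the sample records, where 0x462e5408 decodes to the same date that is printed beside it.

```python
from datetime import datetime, timezone

# Sample record copied from the processor exception example in this training.
SAMPLE = """\
time of reset: Thu Aug 23 20:48:09 2007
run state: primary
image type: application
image id: 0x462e5408
build date: 0x462e5408 Tue Apr 24 2007 19:01:28 GMT
location: internal slot (7), processor 0, boardId 0x3e, boardRev 0
reset type: processor exception 0x300 (data access: protection violation (read attempt))
task: IpSubscriberMana
pc: 0x4d0b8724 -> OsPooledObject::operator delete(void *) offset: 0x74
<output truncated>
"""


def parse_crash_record(text: str) -> dict:
    """Split a crash record into {field name: value} using the first colon."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():  # skip lines such as '<output truncated>'
            record[key.strip()] = value.strip()
    return record


def decode_build_date(value: str) -> datetime:
    """Interpret the leading hex word of 'build date' as Unix epoch seconds (assumption)."""
    hex_word = value.split()[0]
    return datetime.fromtimestamp(int(hex_word, 16), tz=timezone.utc)


record = parse_crash_record(SAMPLE)
print(record["reset type"])
print(decode_build_date(record["build date"]).isoformat())
```

Splitting on the first colon only keeps values such as the `pc` line intact, even though they contain further colons.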
What is the core dump file?
A core dump is a snapshot of the memory at the time when the crash occurred.
It is an important tool for JTAC and engineering teams to identify the root cause of a crash
The size of the core dump varies based on the amount of memory on the SRP or Line module.
The core dump is generated during the boot process after the crash
What is the core dump file?
How do I check if a core dump was generated?
– Coredumps are stored on the SRP’s flash
– Use the ‘dir’ output to check
Can core dumps be disabled?
– Yes.
– Disabling line card coredumps:
      E120(config)#exception dump srp-only
– Disabling SRP coredumps:
      E120(config)#exception dump except-srp
How do I check if core dumps are enabled?
– Use the ‘show exception dump’ command
How do I copy a core dump to an FTP server?
      copy <filename>.dmp <FTPserver>:<path>/<filename>.dmp
What is the core dump file?
Why do customers sometimes disable core dumps?
– A coredump takes a few minutes to complete
– This adds to the boot time of the SRP/line module
Should I be disabling them?
– It is not recommended to disable core dumps
– In certain cases, when repetitive crashes are seen on a router, this may be considered in consultation with JTAC.
Why does a crash sometimes not generate a core dump?
– Some of the possible causes could be:
  – Not enough space on flash to store the core dump
  – Core dumps were disabled
  – Power glitches
  – Conditions where the SRP/Line module could not take a dump of the memory.
Agenda
Why does an E-series router crash?
What happens after a crash?
What can I do once it crashes?
What information does JTAC need?
What can I do once it crashes?
Ensure services are restored
– A brief checklist:
  – Are the SRP and line modules in online state?
  – Did all the routing protocols converge?
  – Have the subscribers started reconnecting?
  – Are the traffic levels restored?
  – Is the CPU utilization normal after some time?
What can I do once it crashes?
Assess the impact of the crash
– A brief checklist:
  – What crashed?
  – Is the SRP/Line module stable after the crash?
  – Did any customer applications suffer an impact?
  – How many subscribers were impacted? For how long?
  – For an SRP crash
    – Was it the primary or standby SRP?
    – Was high availability enabled?
    – Did the standby SRP take over?
  – For a Line module crash
    – Was the line module in a redundancy group?
    – If yes, did the redundant line module take over?
    – Was it subscriber-facing or core-facing?
What can I do once it crashes?
Research the cause of the crash
– Ask yourself (or your customer):
  – Any recent changes to configuration?
  – Any recent changes to the load on the router?
  – Any recent changes in the network?
– Search the knowledge base
  – All defects found at a customer site have a knowledge base article associated with them.
  – Use the knowledge base effectively
  – If you find a match, always double-check with JTAC
– Contact JTAC
  – It is recommended to contact JTAC whenever there is a crash on the router
Tips on searching the KB
Some tips on searching the knowledge base:
– The crash record in the reboot.hty is a good starting point
– Look for information in the stack trace which seems unique
  – Filenames, line numbers, reset type
– Search for these in the knowledge base
– Remember:
  – Some crashes have very generic crash records (e.g. detector crashes)
    – In such cases, a match in the KB does NOT necessarily mean you are hitting the same problem
  – Some crash records may match closely but not exactly
    – In some cases this may be the same problem showing up in a different form
    – In some cases, it may be an entirely different issue
  – Read the problem description and solution fields carefully
    – Sometimes these are good pointers to confirm if you are hitting the same issue
  – When in doubt, consult JTAC
http://www.juniper.net/kb
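The tips above can be sketched as a small helper that pulls the suggested search terms out of a parsed crash record and flags signatures that are too generic to trust on their own. This is a hedged illustration: the function names are hypothetical, the record is assumed to be a plain dictionary of field names to values, and the set of "generic" file names is taken only from the detector crash examples in this training.

```python
# File names as they appear in the detector crash examples of this training;
# a KB match on these alone does not confirm the same problem.
GENERIC_SIGNATURE_FILES = {"ontrolNetwork.cc", "ic1Detector.cc"}


def kb_search_terms(record: dict) -> list:
    """Collect the fields the tips suggest searching for, skipping absent ones."""
    terms = []
    for field in ("file", "line", "reset type"):
        value = record.get(field)
        if value:
            terms.append(value)
    return terms


def is_generic_signature(record: dict) -> bool:
    """True when the record looks like a detector crash with a generic signature."""
    return record.get("file") in GENERIC_SIGNATURE_FILES


panic = {"file": "dhcpDemux.cc", "line": "221", "reset type": "panic"}
print(kb_search_terms(panic), is_generic_signature(panic))
```

When `is_generic_signature` returns True, the reboot.hty record, coredump and any TSA file still need to go to JTAC, since the search result alone cannot identify the root cause.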
Agenda
Why does an E-series router crash?
What happens after a crash?
What can I do once it crashes?
What information does JTAC need?
What information does JTAC need?
Files to be collected
– Collect the following files from the router:
  – reboot.hty
    – It is usually a good idea to collect the file from both primary and standby SRPs
  – Core dumps (if any)
  – Files with extension “.tsa” (if any)
  – system.log file
  – Copy of the router configuration in CNF and SCR format
What information does JTAC need?
Outputs
– Collect the following outputs:
      sh version
      sh hardware
      sh env all
      dir
      sh log data nv-file
      sh log data severity debug
      sh redundancy
– Depending upon the problem, there may be other outputs that JTAC may require.
What information does JTAC need?
Other information
– The following information is very useful to JTAC when troubleshooting crash cases:
  – Services deployed on the router
  – Logical diagram of the network
  – Information about devices connected to the router
  – Number of subscribers connected to the router.
  – The amount of traffic (in Mbps) on the router.
  – Information about any changes to the router configuration or deployment scenario
  – Information about any changes in the external network
What information does JTAC need?
When all else fails
– Some crashes are elusive
– Crashes are seen on customer routers, but the core dumps do not provide us enough information
– And the crashes are not reproducible in JTAC labs worldwide
– In such cases there may be a need to collect additional data from the customer’s router
– Some of the techniques used in the past:
  • Installing a debug image on the customer router
  • Enabling memory debugging
  • Enabling assertions
– This is required in special cases only and JTAC will provide all necessary information in such cases.
Summary
Crashes can be a good thing
Assess the severity of the crash and its impact in the network
Work closely with JTAC to analyze the root cause
A good understanding of the E-series behavior helps build customer confidence
Questions?
Copyright © 2006 Juniper Networks, Inc. Proprietary and Confidential www.juniper.net