Observability
2. Taking tracing for a ride
Jaeger provides an example app called HotROD
Illustrates some standard instrumentation plus custom instrumentation patterns
Implemented in Go, as is Jaeger
Backend depends upon Masterminds/glide for the build
Node dependency for compiling the front end
For production, needs Cassandra or Elasticsearch for persistence
All-in-one option available which provides persistence etc. all in a single process, ideal for dev etc.
Prebuilt binaries are available on GitHub
Prebuilt Docker images available as well
Jaeger supports the illustration of dependencies using a force-directed graph
Jaeger also provides a Directed Acyclic Graph (DAG) view, which should be an easier read
Jaeger provides the means to search the traces and a means to view the nested calls
Difference between a span tag and a span log
Log timestamps will be within the period of the span
Both provide annotation
Tags apply to the entire span
Logs represent specific events within the span
Jaeger indexes both for search
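The tag/log distinction can be made concrete with a toy data structure. This is a minimal sketch, not the OpenTracing or Jaeger API: the Span class, its fields and the sample tag and log values are all hypothetical, chosen only to show that a tag has no timestamp while each log entry carries one that falls inside the span.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span: tags describe the whole span, logs are timestamped events."""
    operation_name: str
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def set_tag(self, key, value):
        # A tag has no timestamp of its own; it applies to the entire span.
        self.tags[key] = value
        return self

    def log_kv(self, fields):
        # A log records one event at a specific moment within the span.
        self.logs.append((time.time(), dict(fields)))
        return self

    def finish(self):
        self.end_time = time.time()


span = Span("get-driver")
span.set_tag("http.method", "GET")
span.log_kv({"event": "cache-miss", "key": "driver:42"})
span.finish()
```

A backend that indexes both can then answer "all spans with http.method=GET" from the tags and "where did the cache miss happen in this span" from the logs.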
Baggage is intended as a general key-value store associated with the context
Useful to help with tenancy linkages to the information
Can associate attribute information that may help understand why some executions are quick or not
This means we can calculate and attribute compute effort to information in the baggage, such as the tenant ID
Classic Jaeger use cases
Distributed transaction monitoring
Performance and latency optimisation
Root cause analysis
Service dependency analysis
Distributed context propagation
Distributed Tracing Fundamentals
Request correlation
Anatomy of distributed tracing
Special trace points can be at the edge of microservices
Inject trace point
Extract trace point
These special points also handle metadata movement across processes
These points capture and send the data to a backend
Sampling
Preserving causality
Trace models
Event model
Trace points are recorded as events
Assuming the happens-before information is captured, a Directed Acyclic Graph can be constructed
Span model
Shared spans
Multi-server spans
Concept of parent and child spans
Single host with client spans
Original span model used by Dapper
Clock skew adjustment
Even with NTP, keeping server clocks in lock step tighter than 1 millisecond is impossible
When spans reside within the same server, it’s fair to assume their timestamps are accurate in relation to each other.
Can compensate using knowledge of synchronous calls: the client span can’t end before the server call it is waiting on
By analysing multiple calls across the same servers, the timing differential can help determine the likely skew value
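The compensation idea above can be sketched as arithmetic. This is an illustration under stated assumptions, not Jaeger's actual adjuster: it assumes a synchronous call, assumes the request and response legs take equal time on the network, and uses hypothetical function names and millisecond timestamps.

```python
import statistics


def estimate_skew(client_start, client_end, server_start, server_end):
    """Return the offset to add to the server timestamps (same unit as inputs)."""
    client_duration = client_end - client_start
    server_duration = server_end - server_start
    if server_duration > client_duration:
        # The server span cannot fit inside the client span; just align the starts.
        return client_start - server_start
    # Assumption: request and response legs take equal time on the network.
    one_way_latency = (client_duration - server_duration) / 2
    return (client_start + one_way_latency) - server_start


def adjust(server_start, server_end, skew):
    """Shift a server span by the estimated skew."""
    return server_start + skew, server_end + skew


# One call: client saw 1000..1100 ms, server clock reported 1150..1210 ms.
skew = estimate_skew(1000, 1100, 1150, 1210)
adjusted = adjust(1150, 1210, skew)

# Several calls between the same two hosts: take the median as the likely skew.
likely_skew = statistics.median([
    estimate_skew(1000, 1100, 1150, 1210),
    estimate_skew(2000, 2080, 2140, 2200),
    estimate_skew(3000, 3120, 3160, 3220),
])
```

In the single-call case the server span moves to 1020..1080 ms, which sits inside the client span as a synchronous call requires.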
Trace analysis
Instrumentation basics with OpenTracing
Primary entities ...
Tracer
Singleton for creating spans
Exposes methods for transferring context across processes and components
Span
Interface for generating a trace point
Span represents a unit of work within a solution
Causal links to predecessors
Startspan()
Finish()
Spans can be annotated
Span provides access to baggage
OpenTracing is just the API, therefore we need to use a concrete implementation
For recording information with a span you have tags and logs
Tags - key-value pairs
Logs - much like conventional logs, and can be used to record events, particularly where we don’t want to create a span
Only the act of creating the tracer is vendor specific
Applications should only need a single tracer
Service meshes often provide the mechanisms for this
Tracer can create a global instance that can be addressed
Some solutions will provide a dependency injection means to address it
Jaeger specifics ...
Example 2 (span and nested span)
Java Tracer config ...
Each created span is given an operation name in OpenTracing
The operation name is used for correlation and analysis
Create a span
Should always finish the span in a finally block, so it is closed even when the code throws
Use the span to record relevant information just as you would with logging
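The "finish in a finally block" advice can be sketched as follows. The notes refer to Java examples that are not reproduced here; this is a stand-in under assumptions, not the OpenTracing API. The start_span helper, the dict-based span and the finished_spans list are hypothetical.

```python
import time
from contextlib import contextmanager

finished_spans = []  # hypothetical stand-in for a reporter sending spans to a backend


@contextmanager
def start_span(operation_name):
    span = {"operation": operation_name, "start": time.time(), "tags": {}, "logs": []}
    try:
        yield span
    except Exception as exc:
        # Record the failure on the span before letting the error propagate.
        span["tags"]["error"] = True
        span["logs"].append({"event": "error", "message": str(exc)})
        raise
    finally:
        # Runs on success and on failure, so the span is never left open.
        span["finish"] = time.time()
        finished_spans.append(span)


def say_hello(name):
    with start_span("say-hello") as span:
        span["tags"]["hello-to"] = name
        greeting = f"Hello, {name}!"
        span["logs"].append({"event": "string-format", "value": greeting})
        return greeting


greeting = say_hello("Bryan")
```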
Span being annotated ...
Tracing an individual function as a child of the parent span
This approach does have the issue of sharing the span
In-process context propagation
Example 3 (scopes)
With a scope, a nested span would look like ...
Working with a scope manager, we would create a span and tell the scope manager
Example 4 - RPC
Each service instantiates its own tracer with unique naming
Need to change the server port through the configuration
Java can simplify the process by using the base class TracedController ...
Incorporating tag management
Example 5 using baggage
Retrieving baggage
Example 6 - auto-instrumentation
Span references can either be ...
Scopes are handled by ...
Tracing solutions may not provide all the capabilities provided by things like an ELK stack
Recommend every span log has a key-value pair with key = “event” that describes the log
In-process context propagation is difficult to solve and different languages can solve it in different ways
Crossing processes means we need to introduce operations to pass the context
Inject
Extract
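The inject/extract pair can be sketched with a plain dictionary standing in for HTTP headers. The header names, the context layout and the helper functions here are hypothetical, made up for illustration; real tracers use their own wire formats.

```python
import secrets

# Hypothetical header names, for illustration only.
TRACE_HEADER = "x-example-trace-id"
SPAN_HEADER = "x-example-span-id"


def new_context(parent=None):
    """Create a span context, continuing the parent's trace if there is one."""
    return {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
    }


def inject(context, headers):
    """Client side: write the context into the outgoing request headers."""
    headers[TRACE_HEADER] = context["trace_id"]
    headers[SPAN_HEADER] = context["span_id"]


def extract(headers):
    """Server side: rebuild the caller's context, or None if it is absent."""
    if TRACE_HEADER not in headers or SPAN_HEADER not in headers:
        return None
    return {"trace_id": headers[TRACE_HEADER],
            "span_id": headers[SPAN_HEADER],
            "parent_id": None}


client_span = new_context()
outgoing_headers = {}
inject(client_span, outgoing_headers)

# ... the request crosses the process boundary ...
server_span = new_context(parent=extract(outgoing_headers))
```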
The means to pass context have a number of challenges ...
It is customary to start new spans for HTTP calls
OpenTracing recommended tags
span.kind - the role of the service in an RPC request; typical values are client, server, producer and consumer (the last two when messaging systems are involved)
http.url - record the URL requested by the client or served by the server
http.method - GET, POST etc.
Typically these are populated through the get method in the TracedController
Baggage
The term was originally coined by Prof. Rodrigo Fonseca, one of the authors of the X-Trace system
The Jaeger instrumentation libraries recognize a special HTTP header that can look like this: jaeger-baggage: k1=v1, k2=v2, .... It is useful for manually providing some baggage items for testing
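To show what a header value of that shape carries, here is a small parser for the k1=v1, k2=v2 form. It is a sketch written for these notes, not the Jaeger client's own parsing code, and parse_baggage_header is a hypothetical name.

```python
def parse_baggage_header(value):
    """Turn 'k1=v1, k2=v2' into {'k1': 'v1', 'k2': 'v2'}, skipping malformed items."""
    baggage = {}
    for item in value.split(","):
        key, separator, val = item.strip().partition("=")
        if separator and key.strip():
            baggage[key.strip()] = val.strip()
    return baggage


baggage = parse_baggage_header("k1=v1, k2=v2")
```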
Instrumentation can be simplified through auto-instrumentation in a vendor-neutral manner
http://github.com/opentracing-contrib/meta replaced by https://opentracing.io/registry/
Spring provides simple instrumentation by just adding a jar
This means no coding needed except response tags, baggage etc.
Span names are generic
The TracerResolver extension creates the tracer and tells OpenTracing about the global tracer. However, some consider this an anti-pattern
Spring provides an instrumentation capability if the appropriate bean is included
TracerResolver can instantiate the OpenTracing implementation https://github.com/opentracing-contrib/java-tracerresolver
Kafka has OpenTracing support through Spring, although it utilizes JSON serialisation rather than Avro
Instrumentation of Asynchronous Applications
Currently Jaeger can’t show the type of tracing going on, e.g. message, HTTP etc.
The consumer (receiver) span is always a follows-from span, as there could always be multiple receivers but the consumer will not know this
Having spans that run from the moment the producer creates the event to the consumer consuming it is at odds with OpenTracing principles
Each span should only be associated with a single process, so starting the consumer span on the event generation would be at odds
You would lose the ability to model the time impact of events waiting to be consumed
How would multiple consumers get represented?
Ability to support async, e.g.
Node.js
Java: Futures, Executors
Tracing Standards and Ecosystem
The manual instrumentation approaches aren’t practical at scale
Most instrumentation trace points are next to process boundaries
These boundaries are often handled through frameworks
Therefore focus on instrumenting around the frameworks
Agent based
Zero-touch approach
Uses an approach sometimes called monkey-patching
Dynamically modifies the code, wrapping actions that would require spans etc.
Java can do this with the command line option -javaagent, which then loads a library that works with the instrument feature (java.lang.instrument)
Monkey-patching approaches can be difficult to maintain
Some frameworks provide extensibility support
Agent model providers include ...
Datadog, Elastic, AppDynamics, New Relic, Apache SkyWalking
Agent models are often linked to a specific backend
github.com/opentracing-contrib/java-specialagent/
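Monkey-patching is easiest to see in a dynamic language. The sketch below wraps a method of a class we pretend not to own, so every call produces a span-like record without touching the class source. HttpClient, traced and recorded_spans are hypothetical names; a Java agent achieves a comparable effect through bytecode instrumentation rather than attribute assignment.

```python
import functools
import time

recorded_spans = []  # hypothetical stand-in for a tracer's reporter


def traced(func):
    """Wrap a callable so each call records an operation name and duration."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            recorded_spans.append({
                "operation": func.__qualname__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


class HttpClient:
    """Stand-in for a third-party library class we cannot edit."""

    def get(self, url):
        return f"response from {url}"


# The monkey patch: swap the method on the class at runtime.
HttpClient.get = traced(HttpClient.get)

body = HttpClient().get("http://example.invalid/route")
```

The maintenance risk noted above shows up here too: if the library renames or changes the signature of get, the patch silently stops matching.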
Requirements of an instrumentation API
Other frameworks
AWS X-Ray
Google Stackdriver
When a solution is distributed, or uses PaaS elements, you may experience the issue of not getting a cohesive solution
There have been attempts to define an industry-wide standard tracing format for wire-level communication, but none truly exists yet
Zipkin (Twitter)
Its naming using B3 has become a de facto standard
B3 comes from the naming convention of systems named after birds: Big Brother Bird (aka B3)
Tracing can often be used to refer to one or more different dimensions
Ben Sigelman suggested these could be
Analyzing
Recording
Transaction description
Federating
Could also be presented as
Tracing and its view points
This all points to knowing who is involved in the discussion
Standards work
Product notes
Dapper - Google
Zipkin - origins at Twitter
Jaeger - came from Uber
TChannel - RPC framework - Uber
Under the hood
Host your own
Customise and integrate
Bandwidth costs
Own the data
Emerging standards
Use OpenTracing to abstract, so you only need to instrument once
B3 header option common ...
OpenCensus
W3C Trace Context format
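The two header conventions named above can be compared side by side. This sketch uses the multi-header B3 names (X-B3-TraceId, X-B3-SpanId, X-B3-Sampled) and the W3C traceparent layout of version, trace ID, parent ID and flags. The helper function names are hypothetical, and the validation is simplified (for example, all-zero IDs are not rejected).

```python
import re

# traceparent: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def to_b3_headers(trace_id, span_id, sampled):
    """Multi-header B3 representation of a span context."""
    return {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if sampled else "0",
    }


def to_traceparent(trace_id, span_id, sampled):
    """Single-header W3C Trace Context representation (version 00)."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value):
    match = TRACEPARENT_RE.match(value)
    if not match:
        return None
    _version, trace_id, span_id, flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": bool(int(flags, 16) & 0x01)}


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
span_id = "00f067aa0ba902b7"
b3 = to_b3_headers(trace_id, span_id, sampled=True)
traceparent = to_traceparent(trace_id, span_id, sampled=True)
parsed = parse_traceparent(traceparent)
```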
Architecture and deployment modes
Basic model
Streamlined model
Components
Client
Client - library embedded in the application, aggregating calls and passing batches on
Client typically allows a feedback/control flow so that tracer config can be changed
Client commonly sends spans over UDP to a local agent, so it doesn’t need the IP of the collector
Agent
Jaeger implements the sidecar pattern
Supports communication to collectors
Includes supporting load balancing and discovery
Agents allow client logic to be kept simple
Agents can be deployed as either
Agent on bare metal
Kubernetes DaemonSet
Sidecar to the business app, e.g. in the same pod
Collector
Receives span data as JSON, Thrift or Protobuf
Using HTTP, TChannel or gRPC
Converts data to a normalised internal data model
Sends data to configured/pluggable data store
Provides adaptive sampling logic
Memory queueing to smooth out load spikes
Query service and UI
Search and retrieve traces, used by
Jaeger UI
Or another solution conversant with the API
Data mining
Post-processing such as Spark applied
Use tags so we can attribute spans to processes, and therefore charge based on backend usage
Implementing in a large organization
Why is it hard?
Reducing barriers to adoption
Standard frameworks
In-house adaptors and tooling
Jumpstart / accelerators
Preconfigured setups etc.
Trace by default
Monolithic repos
Single repos
Easier to manage, locate source code
Easier to implement code analytics to support implementation
Monorepo increases chances of common framework adoption
Integration with existing infrastructure
Where to start
Many microservice solutions are broad rather than deep, so instrumenting the gateway and the first level or two can yield a lot of insight - 80/20
Incremental tracing rollout can accelerate ROI and shorten problem investigation
Successes will drive peer pressure to adopt
Creating culture
Communicate value
Incorporation into developer flows
Trace quality measurement
As a part of a wider code quality analytics set
Needs to be more than binary (applied or not), and account for correct application etc.
Dimensions ...
Completeness
Has spans
Has client spans
Minimum client version check - which Jaeger version is being used
Quality
Meaningful endpoint name
Unique ID
Other
Provide implementation and troubleshooting guide
Insights via data mining
Integration with Metrics and Logs
Integration with metrics
Standard metrics via tracing instrumentation
Adding context to metrics
Context-aware metrics APIs
Integration with logs
Semi-structured logs, e.g. log4j, vs highly structured logs, e.g. JSON
The better the structure, the more efficient the indexing can be
SLF4J doesn’t support strongly structured logs, but when combined with a structured formatter for Logstash more structure can be applied
Resources/Logstash-spring.xml
Correlating logs with trace context
Scope manager is pluggable so can be extended using a decorator pattern
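Log correlation amounts to stamping each log record with the current trace ID. The notes describe doing this by decorating the scope manager; the sketch below shows the same outcome with a logging filter that reads a context variable. The logger name, the format string and the trace ID value are hypothetical.

```python
import contextvars
import io
import logging

# Holds the trace ID of whatever span is currently active ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # captured here; a real setup would write to a file or stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("hello-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# In a real service the scope manager would set this when a span is activated.
current_trace_id.set("4bf92f3577b34da6")
logger.info("formatting greeting")
```

With the trace ID as a structured field, a log search tool can pull all log lines for one trace across services.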
Distributed Context Propagation
Turning the lights on
8. Sampling
Trade-off with logging on performance and cost of generating info
Consider tracing backend capacity
Tracing can easily generate more data than the business process
Sampling cuts down the tracing info to be processed at source
Dapper without sampling caused a 1.5% throughput and 16% latency overhead in the workload. Reducing the workload via sampling at 0.01% reduced the figures to 0.06% and 0.20% respectively
Head-based sampling
Decide once per trace at the trace start
Is an all-or-nothing model
Heavily used in production
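Head-based sampling can be sketched in a few lines: the root of the trace makes the decision and every downstream span inherits it. The function names and the dict-based context are hypothetical, and the random source is injectable only so the behaviour can be checked deterministically.

```python
import random


def start_trace(sampling_rate, rng=random.random):
    """Root span: decide once, at the head of the trace."""
    return {"sampled": rng() < sampling_rate}


def start_child(parent_context):
    """Child span: never re-decides, it inherits the root's decision."""
    return {"sampled": parent_context["sampled"]}


# With a 10% rate, a draw of 0.05 is sampled and a draw of 0.50 is not.
kept = start_trace(0.10, rng=lambda: 0.05)
dropped = start_trace(0.10, rng=lambda: 0.50)
kept_child = start_child(kept)
dropped_child = start_child(dropped)
```

Because children inherit the flag, a trace is either recorded in full or not at all, which is the all-or-nothing property.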
Rate-limit-based sampling
Uses a leaky bucket algorithm, aka reservoir sampling
Good when workloads are erratic
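A rate-limiting sampler can be sketched as a credit balance that refills over time and is spent one credit per sampled trace. This is one common way to build such a limiter, not necessarily what any particular tracer ships. The class name is hypothetical and the clock is injectable so the example is deterministic.

```python
import time


class RateLimitingSampler:
    """Sample at most `traces_per_second` traces, allowing a small burst."""

    def __init__(self, traces_per_second, clock=time.monotonic):
        self.rate = traces_per_second
        self.max_balance = max(traces_per_second, 1.0)
        self.balance = self.max_balance
        self.clock = clock
        self.last_tick = clock()

    def is_sampled(self):
        now = self.clock()
        # Refill credits for the elapsed time, capped at the bucket size.
        elapsed = now - self.last_tick
        self.balance = min(self.max_balance, self.balance + elapsed * self.rate)
        self.last_tick = now
        if self.balance >= 1.0:
            self.balance -= 1.0
            return True
        return False


fake_now = [0.0]
sampler = RateLimitingSampler(2.0, clock=lambda: fake_now[0])

# A burst of five trace starts at t=0: only two credits are available.
decisions = [sampler.is_sampled() for _ in range(5)]

# One second later the bucket has refilled.
fake_now[0] = 1.0
after_refill = sampler.is_sampled()
```

However bursty the incoming traffic is, the backend sees a bounded number of traces per second, which is why this suits erratic workloads.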
Adaptive sampling
Can overcome load surges for the backend by using Kafka for the events, so they are consumed more steadily
Sampling considerations
Jaeger provides the option to shed traffic when the DB is overloaded
Tracing with Service Meshes
Rather than using an ESB as a hub, microservices leverage the sidecar pattern to abstract the central services
Sidecar implemented as a lightweight process or container in its own right
Sidecar benefits
Can be implemented in its own language
Collocates with the application, meaning limited latency
Each service instance has its own sidecar, so any failing sidecar does not disrupt the entire service
Sidecar can be used to compensate for features missing from the core service
Sidecar lifecycle and identity aligned to the service
Made up of 2 key components
Sidecars can emit uniformly named metrics about traffic in/out, latency and error rates etc.
RED - Rate, Error, Duration
Envoy can handle network traffic not only for gRPC and HTTP but also MySQL, Redis and others; as a result concise and rich trace data can be generated
Spring Boot OpenTracing tracing needs Jaeger dependencies
Sidecar recognises tracing and can action new spans
However, the result is a lot more spans
Jaeger configuration passed as values from the Dockerfile using env vars
Envoy doesn’t understand Jaeger’s default wire representation, but it does understand Zipkin’s, aka B3
Jaeger port forwarding is needed in a Docker environment
Istio
Tracing without the microservice using Spring Sleuth will result in the outbound call not being auto-instrumented with the context; the result is a new span being generated
Linkerd and Envoy require the app to propagate context
Context propagation is the most challenging consideration
White box tracing implementation recommended, because ...
More control over data collection
Ability to tag the span with key event values
Application logic does not need to know
Understanding which headers relate to tracing for propagation can be complex; white box hides this
Istio can create a service graph without needing tracing
Graph visualisation provided by Istio ...
Force-directed graph ... istio/force/force graph.html
Graphviz /dotvis
Why distributed tracing
Microservices and cloud native apps
Characteristics of microservices/cloud native solutions
Componentization via (micro)services
Smart endpoints and dumb pipes
Organized around business capabilities
Decentralized governance
Decentralized data management
Infrastructure automation
Design for failure
Evolutionary design
In 2015, the Cloud Native Computing Foundation (CNCF) was created as a vendor-neutral home for many emerging open source projects
CNCF charter: Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach. These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.
Monitoring tools under CNCF
Prometheus
Fluentd
OpenTracing
Jaeger
What is observability?
Control theory states that a system is observable if the internal states of the system and, accordingly, its behavior, can be determined by only looking at its inputs and outputs
However, this is not practical in software engineering terms
YouTube https://youtu.be/U4E0QxzswQc
Sometimes linked more widely with the idea of monitoring, metrics, logs and traces
The Oxford Dictionaries definition of the verb “monitor” is “to observe and check the progress or quality of (something) over a period of time; keep under systematic review.”
3 pillars of observability
Metrics
Logs
Traces
Observability challenge of microservices
Whilst microservices yield benefits they also have some challenges
Vijay Gill, Senior VP of Engineering at Databricks, goes as far as saying that the only good reason to adopt microservices is to be able to scale your engineering organization and to “ship the org chart”
Not a popular / common view
2018 “Global Microservices Trends” study [6] by Dimensional Research® found that over 91% of interviewed professionals are using or have plans to use microservices in their system
2018 “Global Microservices Trends” study [6] - 73% find “troubleshooting is harder” in a microservices environment
Challenges
Orchestration of container deployment
Ability for microservices to locate each other
Reliability can actually drop with more components involved, e.g. multiple components each at 99.9% availability don’t total 99.9%
Risk of latency rise as each microservice takes time invoking the next - need to consider max time not min
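The reliability and latency points in the challenges above are simple arithmetic. The sketch below assumes services are called one after another and fail independently, which is a simplification. The function names and the latency figures are hypothetical.

```python
def serial_availability(per_service_availability, hops):
    """Availability of a request that must pass through `hops` services in series."""
    return per_service_availability ** hops


def worst_case_latency_ms(max_latencies_ms):
    """Sequential calls add up, so budget on each service's maximum, not its minimum."""
    return sum(max_latencies_ms)


# Ten services at 99.9% each give roughly 99.0% for the whole request path.
end_to_end = serial_availability(0.999, 10)

# Three sequential hops with worst-case latencies of 20, 35 and 50 ms.
latency_budget = worst_case_latency_ms([20, 35, 50])
```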
Questions that we need to solve
What services did a call go through?
What did each service involved do?
Where did the error happen?
How have things differed from normal?
New services in the mix? Or services removed?
What was performance like?
What is the critical path for the request?
Who should be called?
Traditional monitoring tools
Traditional tools have limitations in the microservice space
Metrics are helpful as they are concise, numerical truths. But they can be aggregated, removing the nuances
Logs only show us a single instance of a stream
There are multiple forms of concurrency to deal with ...
Ben Sigelman - KubeCon 2016
Concurrency where threads pick up and put down sessions means events can start on one thread and complete on another
Using timestamps to sequence across servers and logs is susceptible to clock skew
Distributed tracing
Bryan Cantrill. Visualizing Distributed Systems with Statemaps. Observability Practitioners Summit at KubeCon/CloudNativeCon NA 2018, December 10: https://sched.co/HfG2.
Ben Sigelman. Keynote: OpenTracing and Containers: Depth, Breadth, and the Future of Tracing. KubeCon/CloudNativeCon North America, 2016, Seattle: https://sched.co/8fRU.
Benjamin H. Sigelman, Luiz A. Barroso, Michael Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed system tracing infrastructure. Technical Report dapper-2010-1, Google, April 2010.