Page 1: Windows Azure SQL Database (WASD) Troubleshooting Bob Ward Principal Architect Escalation Engineer bobward@microsoft.com I will assume basic SQL Server

Windows Azure SQL Database (WASD) Troubleshooting

Bob Ward

Principal Architect Escalation Engineer

bobward@microsoft.com

I will assume basic SQL Server knowledge

Page 2

My Goals for You Today

Prepare

React

Prevent

Page 3

What Will We Cover Today

The Azure Troubleshooting Challenge

Troubleshooting Connectivity

WASD Errors

Query Performance

Practical Advice and Tips

Page 4

The Azure Troubleshooting Challenge

• WASD is a platform service (PaaS)
  • This is not a VM running the SQL Server “box” product (IaaS)
• Multi-tenant platform
  • You are sharing a SQL instance with other databases from other customers
• You are abstracted from the SQL Server instance, Windows, and computer server
  • Fewer admin tasks means lower TCO but also means less access
• You are isolated to a specific database
  • You have a logical server and a master but most things are done in your database
  • Most things are database scoped (Ex. DMVs)
• We make decisions to maximize all database availability
  • Application design may be required
• The service can be updated far quicker than the “box” product

Page 5

WASD Connectivity Errors

WASD specific errors
• Firewall blocked in Azure
• Windows authentication not supported
• Invalid login – Invalid account or password
• Denial of Service – After a large number of login failures

Network related errors
• “…Server not found”
• Connection Timeout Expired
• Msg 121 “.. The semaphore timeout period has expired”

You could lose connectivity
• Idle connections are terminated after 30 minutes (Msg 10053 and 10054)
• We may forcibly disconnect on failover, on some errors, or on a change to MAXSIZE
• You need to take retries into account
• Use a login timeout of at least 30 seconds
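The login timeout advice can be baked into whatever builds your connection string. The sketch below is illustrative only: it assembles an ADO.NET-style string in Python, and the function name, parameter names, and example server are hypothetical. It assumes the standard `Server`, `Database`, `User ID`, `Password`, `Encrypt`, and `Connection Timeout` keywords and the default SQL port 1433.

```python
def build_connection_string(server, database, login, password,
                            login_timeout_s=30):
    """Build an ADO.NET-style connection string for a WASD logical server.

    `server` is the short logical server name (hypothetical example:
    "myserver"); `login` is passed through verbatim, so include any
    "@server" suffix your client tool requires.
    """
    # The deck recommends a login timeout of at least 30 seconds.
    if login_timeout_s < 30:
        raise ValueError("use a login timeout of at least 30 seconds")
    parts = [
        # Force TCP and the default SQL port.
        f"Server=tcp:{server}.database.windows.net,1433",
        f"Database={database}",   # USE <db> is not supported, so set it here
        f"User ID={login}",
        f"Password={password}",
        "Encrypt=True",
        f"Connection Timeout={login_timeout_s}",
    ]
    return ";".join(parts) + ";"
```

Rejecting short timeouts at build time is a design choice for this sketch; a warning may suit your application better.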

Page 6

Example Connectivity Errors

40XXX errors unique to WASD

Be sure to give this to support

May see this after deleting a server

Network latency

After getting dropped on idle connection

Page 7

Troubleshooting Connectivity

Configuration issues
• WASD Firewall and your firewall
• Allow Windows Azure Service

Is it our service or your internet?
• Windows Azure Management Portal
• Windows Azure Service Dashboard
• Windows Azure SQL Database Connectivity Troubleshooting Guide

General Tools to use
• ping.exe, telnet.exe, tracert.exe
• SQL Server 2012 Management Studio – Free with SQL Server 2012 Express
• ostress.exe and sqlcmd.exe (username requires @<full server name>)
• SQL Database Management Portal – https://<servername>.database.windows.net

New System Views (Event Tables) – in master database
• sys.event_log
• sys.database_connection_stats
• History tables – not real time
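When telnet.exe is not installed, the same "can I reach the port at all?" check can be scripted. This is a minimal sketch rather than a replacement for the tools above: the function name is hypothetical, and it assumes the service listens on the default SQL port 1433. A successful TCP connect only shows that the network path and firewalls allow the connection; it says nothing about logins.

```python
import socket


def can_reach(host, port=1433, timeout_s=5.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False
```

For example, `can_reach("<servername>.database.windows.net")` returning False while other internet hosts are reachable points toward a firewall or service-side issue rather than your internet connection.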

Page 8

Demo

Tools for Connectivity

Page 9

WASD Errors

Failover

Governance and Quota

Throttling Limits

Engine Throttling

“Not supported”

Database copy

Federation

These can result in connection termination and possible future rejection of work

Many “box” errors still apply – Ex. 1205 = deadlock

Msg 40XXX range can be seen in sys.messages in SQL Server 2012

full list here

Page 10

Failover

• Your database, the instance, or the computer is “unhealthy”
• We may need to patch the instance and/or computer

We may decide to “move you” to a replica of your database on another server

What will you see?
• Msg 40197
• “..Server not available”

Implement retry logic in your application

Example message text:
• The partition is in transition and transactions are being terminated.
• SHUTDOWN is in progress.
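A minimal sketch of such retry logic follows. It is an illustration under stated assumptions, not the service's recommended implementation: `SqlError` is a hypothetical exception type standing in for whatever your database driver raises, and the set of retryable error numbers is simply the connection-terminating errors named in this deck (40197, 40501, 10928, 10929, 10053, 10054). The backoff values are arbitrary choices.

```python
import random
import time

# Error numbers this deck describes as terminating the connection.
TRANSIENT_ERRORS = {40197, 40501, 10928, 10929, 10053, 10054}


class SqlError(Exception):
    """Hypothetical exception carrying the SQL error number."""

    def __init__(self, number, message=""):
        super().__init__(f"Msg {number}: {message}")
        self.number = number


def run_with_retry(operation, max_attempts=5, base_delay_s=1.0,
                   sleep=time.sleep):
    """Call operation() and retry on transient errors with backoff.

    operation should open a fresh connection on every call, because the
    service terminates the old one when these errors occur.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlError as err:
            # Give up on non-transient errors or when out of attempts.
            if err.number not in TRANSIENT_ERRORS or attempt == max_attempts:
                raise
            # Exponential backoff plus jitter so clients do not retry in step.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists only so the wait can be stubbed out in tests. Errors such as quota exhaustion (Msg 40544) are deliberately not retried, since waiting does not fix them.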

Page 11

Governance

• Max number of concurrent worker threads (currently 180) per database
• Msg 10928 if you exceed the limit
  • Connection terminated. Retry when your concurrent work subsides
  • Check for blocking problems or inefficient queries
• Msg 10929 if the overall system has too many workers
  • You may get less than 180 max
  • Connection terminated. You can retry but it may take longer to stabilize
  • Still could be an application issue but a service issue could also be occurring

Resource ID : 1 = worker threads

Page 12

Quotas

• Quota errors for space used
• Msg 40544 when you run out of space for your max size for your db
  • Only reads and DELETE/DROP allowed until you free up space
  • Use sys.dm_db_partition_stats to find what is consuming space
• Solutions
  • Increase max size
  • Delete data or drop tables/indexes
  • Partition out database
• But…freeing up may not be immediately recognized

Changing MAXSIZE disconnects all users
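To turn the output of sys.dm_db_partition_stats into sizes, recall that SQL Server pages are 8 KB, so megabytes = `reserved_page_count` * 8 / 1024. The sketch below works on rows you have already fetched; the function names and the `(object_name, reserved_page_count)` row shape are assumptions made for illustration.

```python
PAGE_SIZE_KB = 8  # SQL Server data pages are 8 KB


def reserved_mb(reserved_page_count):
    """Convert a page count to megabytes."""
    return reserved_page_count * PAGE_SIZE_KB / 1024


def largest_objects(rows, top=3):
    """Rank objects by reserved space.

    rows: iterable of (object_name, reserved_page_count) pairs, one per
    partition or index, as fetched from sys.dm_db_partition_stats.
    Returns a list of (object_name, megabytes), largest first.
    """
    totals = {}
    for name, pages in rows:
        # An object has one row per index and partition, so sum them.
        totals[name] = totals.get(name, 0) + pages
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, reserved_mb(pages)) for name, pages in ranked[:top]]
```

For example, an object with rows of 128000 and 12800 reserved pages totals 140800 pages, which is 1100 MB.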

Page 13

Throttling Limits

We monitor all databases and look for conditions to prevent problems

• We have a service called a “Watchdog Service” querying the instance for “conditions” to terminate connections to prevent resource problems
• We also call these “Watchdog alerts”
• We will kill the session with a “reason”. The “reason” is the error message you get
• Application gets an error message (high severity) and connection terminated (KILL/ROLLBACK status)
• Sometimes retry works but these usually require some change on your part
• throttling_long_transaction in sys.event_log

Error   Condition
40549   Session blocking system task for long period of time (20 secs)
40550   Session is consuming too many locks (1 million)
40551   Session is consuming too much tempdb space (5Gb)
40552   Transaction consuming too much log space or active transaction preventing log truncation
40553   Session consuming memory (16Mb) and there are memory waits (20secs)

Rebuild index Online

Page 14

Engine Throttling

• This is more of a legacy monitoring method used to keep instances healthy
• Another external service monitors the health of the instance and computer
• Soft throttling – we have detected a resource issue so pick specific databases
• Hard throttling – entire instance at risk so all databases are affected

How it Works
• Existing requests run to completion
• New requests for existing connections and new connections may get Msg 40501 and connection terminated depending on type of request
• Reason code in Error has more details on soft vs hard, what will be rejected, and why
• throttling in sys.event_log

Decode reason codes

Another resource

0x8003
• x03 = RejectAll
• x80 = Hard Throttling on I/O
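The 0x8003 example can be split mechanically into its two parts. The sketch below is deliberately incomplete: it assumes, based only on this slide, that the low byte is the throttling mode and the next byte is the throttling type, and it knows just the two values shown here. The function and table names are hypothetical; the linked decoding documentation remains the authority for every other value.

```python
# Only the values shown on this slide; everything else is reported as unknown.
KNOWN_MODES = {0x03: "RejectAll"}
KNOWN_TYPES = {0x80: "Hard Throttling on I/O"}


def decode_reason_code(code):
    """Split a Msg 40501 reason code into throttling mode and type.

    Assumes (from the 0x8003 example) that the low byte is the mode and
    the bits above it identify the throttling type and resource.
    """
    mode = code & 0xFF   # low byte, e.g. 0x03
    kind = code >> 8     # remaining bits, e.g. 0x80
    return {
        "mode": KNOWN_MODES.get(mode, f"unknown (0x{mode:02X})"),
        "type": KNOWN_TYPES.get(kind, f"unknown (0x{kind:02X})"),
    }
```

Error messages usually print the reason code in decimal, so `decode_reason_code(32771)` decodes the same value as `decode_reason_code(0x8003)`.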

Page 15

“Not Supported” Errors

• USE <db> not supported – specify the database when connecting
• ALTER DATABASE supported minimally (Ex. Name, Edition, MAXSIZE, READ_ONLY)
• No DBCC commands are supported except DBCC SHOW_STATISTICS
• Database scoped DMVs supported
• Feature Support for Windows Azure SQL Database
• Unsupported Transact-SQL Statements (Windows Azure SQL Database)
• Partially Supported Transact-SQL Statements (Windows Azure SQL Database)

Page 16

Demo

Using Event Tables to Troubleshoot WASD Errors

Page 17

WASD and Query Performance

Stick to the basics…
• Running or waiting? Blocking or CPU?
• Is it your application, Windows Azure role, your computer, or queries?
• Is it network latency?
• Differences from when “good”? Did the query plan change?
• Proper indexes – Avoid scans, large sorts, …
• Auto create and Auto update stats on by default

There are methods to optimize performance specific to Azure
• Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
• Inevitably you may have to shard your data
• “Chatty” applications don’t usually perform well
• Avoid large result sets
• Application problems may show up earlier on this platform (Ex. Transaction keeping the log from being truncated)

Page 18

WASD Performance Scenarios

Interesting Performance Scenarios
• On-premise clients may see higher ASYNC_NETWORK_IO waits
• Small transactions may result in WRITELOG and SE_REPL* waits
• Deadlocks (Msg 1205) just like the “box” – Use sys.event_log to debug

Troubleshooting Query Timeouts
• Could just be blocking
• Trace your queries so you know which one timed out
• Examine query plan and tune the query/indexes

Page 19

Dynamic Management Views (DMV) for Performance

sys.dm_exec_requests
• Find out currently running requests in your database. Use this to detect blocking

sys.dm_exec_query_stats
• Find out the performance of queries that have run in your database. Look here for worst performing queries

sys.dm_exec_query_plan
• Display the query plan of a specific query

sys.dm_db_wait_stats
• Aggregation history of waits – Some new for WASD
• Only shows any wait_type with count > 0

“missing index DMVs”
• Could indexes help query performance?

Page 20

A look at WASD Wait Types

Page 21

Demo

Troubleshooting Query Performance on WASD

Page 22

Watch Out for These

Keep database copies for “user error”

Be careful dropping servers and databases in portal

DML may fail if no clustered index (temp tables excluded)

DMVs are database scoped

Databases have RCSI on by default – tables can be larger

DATETIME in all data centers is stored as UTC time

You may not have access to objects that appear in catalog views

Non-supported or partially supported commands/features

System Views Unique to WASD

Page 23

Before you contact support

Check the Azure forums: MSDN or stackoverflow

Check the service dashboard

Is it Windows Azure? On-premise problem?

Have exact error message(s) available

Have TracingID available

Do you know the query?

Do you have application retry logic?

Give us the date and time of issue with “observed” timezone

Is this happening now or in the past?

We can do RCA but it can take some time and we may not have enough history
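Because the service records times in UTC while you report the time you observed locally, it helps to send both. The sketch below shows the conversion with the Python standard library. The function name is hypothetical, and it takes a fixed UTC offset in hours as a simplifying assumption, so you must supply the offset that was in effect (including any daylight saving adjustment) at the time of the issue.

```python
from datetime import datetime, timedelta, timezone


def to_utc(observed_local, utc_offset_hours):
    """Convert a naive local datetime to an aware UTC datetime.

    utc_offset_hours is the offset of the observed timezone from UTC,
    for example -8 for US Pacific standard time.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    # Attach the observed offset, then shift to UTC.
    return observed_local.replace(tzinfo=local_tz).astimezone(timezone.utc)


# An issue observed at 14:30 local time at UTC-8 happened at 22:30 UTC.
issue_utc = to_utc(datetime(2013, 3, 5, 14, 30), -8)
print(issue_utc.isoformat())  # 2013-03-05T22:30:00+00:00
```

The UTC value is the one to compare against timestamps in sys.event_log.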

Page 24

References

Retry Logic for Transient Failures in Windows Azure SQL Database

Error Messages (Windows Azure SQL Database)

Windows Azure SQL Database Performance and Elasticity Guide

Windows Azure SQL Database Connection Management

sys.event_log documentation

CSS SQL Escalation Blog

Troubleshoot and Optimize Queries with Windows Azure SQL Database

Page 25

Questions?

Thank you!

http://sdrv.ms/Zqdkex

Page 26

The Troubleshooting Checklist

Does the Windows Azure Portal work and list your databases?

Is there a dashboard posting for an outage in your region?

Does the SQL Management Portal work?

Does SQL Server Management Studio work?

Is there an internet provider issue?

Is your firewall configuration correct?

Is the problem Windows Azure vs WASD?

Is there blocking?

Are your queries and indexes tuned?

Is this really an application retry issue?

Governance, quotas, limits, and throttling are “part of this platform”

Have you looked at Event Tables?