Friday, June 01, 2007

Exposing Windows 2003 SP2 Bugs..1 Blog at a time.

Some people might see this as a “system” or “resource” problem, but can’t figure out what resource we’re hitting?

The symptom is simple and often overlooked but it’s likely a cause of a Bug in Windows 2003 SP2 and was also inside Windows XP SP2 (which was already addressed).

We have a DFS Server that runs Windows 2003 and recently upgraded to Service Pack 2.

After you mount the DFS Share on your server and try to browse the share, you get this horrible lag (or it goes unresponsive) on Windows Explorer when browsing and/or mounting the DFS share. Sometimes this isn’t so surprising with Windows Explorer, but theirs usually an associated memory pressure or a CPU Spike that would follow along with it (which in this case, their wasn’t).

If you never mount DFS shares from your servers then you might be okay but if you do then your servers will become extremely slow to where you will see a major lag in response time. This particular problem is kind of hard to troubleshoot because the CPU usage does not increase, so simply running perfmon or looking at Task Manager in this instance did not help. I used “ProcessMonitor” to help find this bug, it was great at showing me the explorer process looping through the DFS servers like crazy.

Microsoft PSS is aware of this and there are private fixes available. At the time of this post, their is no public fix is for Win2k3 SP2 yet. The workaround is the same for XP SP2 Issue listed below:

To disable the Previous Versions tab
To disable the Previous Versions tab on individual client computers, follow these steps:

1. Click Start, click Run, type regedit in the Open box, and then click OK.

2. Locate and then click either of the following registry subkeys:
HKEY_LOCAL_MACHINE\Software\Microsoft\Windows\CurrentVersion\ExplorerHKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer

3. On the Edit menu, point to New, and then click DWORD Value.

4. Type NoPreviousVersionsPage, and then press ENTER.

5. Right-click NoPreviousVersionsPage, and then click Modify.

6. In the Value data box, type 1, and then click OK.

7. Quit Registry Editor.

Ref: Update to disable Volume Shadow Copy Service restore options on Terminal Server client computers that are running Windows Server 2003 or Windows XP
http://support.microsoft.com/kb/888603

Wednesday, January 31, 2007

Add Another Step To your Leak Analysis

We all seen applications leak, and our next move is to just pass it to the Vendor. Sometimes that’s easier said then done, so it’s good to know a few other counters to check out what could be the Problem.

My Specific situation dealt with a .NET application

I’m not sure how often this happens anymore but this was a specific problem to Garbage Collection.

So the Infrastructure was a 3 Teir system. Web Teir, App Teir and a Database Backend. And for some odd reason we would have failures in the App teir due to application hangs.
You restart the Service, an BOOM! The app is back up.

Pretty clear cut? Not when you have a USELESS vendor that must have a whole bunch of rookie outsourced engineers that are fresh dropouts.

As patches continued to get applied to our failing app, we needed a way to monitor what exactly what was failing. I ran a Perfmon and I see the App process consume Virtual Bytes at a steady incline. We then took a dump of the App and Ran it through DebugDiag and this is what we see:

Analysis Summary

Warning

mscorwks.dll (a known Windows memory manager) is responsible for 1.29 GBytes worth of outstanding allocations.

Virtual Memory Summary

Size of largest free VM block 23.81 Mbytes
Free memory fragmentation 95.93%
Free Memory 585.66 MBytes (28.60% of Total Memory)
Reserved Memory 395.63 MBytes (19.32% of Total Memory)
Committed Memory 1.04 GBytes (52.08% of Total Memory)
Total Memory 2.00 GBytes


Function details

Function
mscorwks!virtual_alloc+5b
Allocation type Virtual memory allocation(s)
Allocation Count 78 allocation(s)
Allocation Size 1.25 GBytes
Leak Probability 69%

So what we see is that MSCORWKS is doing some crazy VirtualAllocs and its conking out the app cause its hitting a threshold of 1.29GB.

So what can we do to Monitor it even More? If our VirtualAllocs are never coming back down, then that could be that we have a problem with garbage collection (GC).

We can track when the App consumes and release by looking at the Process Virtual Bytes and # Gen 0 Collections inside perfmon. This is where objects use to allocate using Virtual Address Space.

We start garbage collecting when 1 of 3 tasks are triggered:

1) Allocation exceeds the Gen0 threshold;
2) System.GC.Collect is called;
3) System is in low memory situation;

The Counters to watch that are the following:

Memory\Available Bytes

Process\Private Bytes
Process\Virtual Bytes

.NET CLR Memory\% Time in GC

This indicates the percentage of time spent in the GC since the end of the last collection. If you see a really high value (say 50 percent or more), then you should look at what is going on inside the managed heap. If, instead, % Time in GC does not go above 10 percent, it usually is not worth spending time trying to reduce the time the GC spends in collections because the benefits will be marginal.

.NET CLR Memory\# Gen 0 Collections
.NET CLR Memory\# Gen 1 Collections
.NET CLR Memory\# Gen 2 Collections

A high allocation rate can cause a lot of collection work and as a result might be reflected in a high % Time in GC. A mitigating factor is if objects usually die young, since they will typically be collected during a Gen 0 collection. To determine how object lifecycles impact collection, check the performance counters for Gen collections: # Gen 0 Collections, # Gen 1 Collections, and # Gen 2 Collections. These show the number of collections that have occurred for the respective generation since the process started. Gen 0 and Gen 1 collections are usually cheap, so they don't have a big impact on the application's performance. Gen 2 collectors however, can be very expensive.

In the end we found that in our situation, the objects we’re being held on longer then they should be so that’s why the GC never cleaned them out. Always good practice to let the app hang for a while to see if the objects eventually get cleaned up.

Ref: http://msdn.microsoft.com/msdnmag/issues/06/11/CLRInsideOut/default.aspx#S2

Wednesday, September 20, 2006

LSASS Leaking? No it’s just by design…

So I had a struggling case and I thought I should blog about it since I couldn’t find shit (nor could the USELESS Microsoft Engineer) give me a quick answer or solution.

So here’s the story….

We have Windows 2003 Enterprise Server with SP1 in Asia which is a DC (DC is the only role on this box). Since the upgrade to SP1 (or so claimed) the LSASS process consumes memory at a dangerous amount and causes alarms and alerts to be fired off to our “Command Center” prompting the server to be rebooting to avoid a hang. This has been going on for 6 months and they asked me to take a look. I took a Perfmon and some dumps of lsass.exe and here’s what I found.

So we see that our Virtual and Private bytes for LSASS.exe is hitting our Virtual Memory threshold (1.8GB).



Object: Process

lsass
Private Bytes 1,755,975,680
Virtual Bytes 1,866,379,264


Inside the userdump, we see that just under 60% of Virtual Bytes are being committed:

Virtual Memory Summary

Size of largest free VM block 167.14 MBytes
Free memory fragmentation 80.14%
Free Memory 841.67 MBytes (41.10% of Total Memory)
Reserved Memory 0 Bytes (0.00% of Total Memory)
Committed Memory 1.18 GBytes (58.90% of Total Memory)
Total Memory 2.00 GBytes
Largest free block at 0x00000000`4e81c000


But the Memory doesn’t really go anywhere (to what I can see). Sometimes this is a indicator of LSASS fragmentation so we implemented /3GB on the box but made no difference.

So what do I know so far? I know that LSASS itself works similar to store.exe and sqlservr.exe. The intentions of these DB applications are to continue to store as much in cache for performance aspects (of course Store and SQLSERVR are tunable).

So why did I originally think this is a leak and not something by design?

Well the way I understand how LSASS works (before the case was solved) is similar to how Store.exe and sqlservr.exe works. Hide everything in the Working set right away an then release back to the O/S.

Now what I'm seeing in this Server is that LSASS grows every 12 hours (Stair Step). I never, ever see a drop in memory even after a course of 3 days (you would figure O/S would drop in memory when backup job hits or something along those lines).

In this case I see Virtual Bytes and Private Bytes continuously grow.

So I understand that LSASS behavior is to consume memory and give back to the O/S, but I thought it would work as other DB Process’s work and Consume all the memory right away instead of accumulating it over time (simulating a leak).

So I opened a case with Microsoft, and that when the Horror began…..

Given I’ve worked in PSS and worked with a lot of great engineers at MS, this is in no way of shooting down PSS as a whole, but you always have those “select” engineers that really don’t cut it when it comes to support (well atleast in my books).

So given this is a Performance Issue with LSASS this will have to goto the Directory Services (DS) team. Problem is, our willing DS engineers aren’t exactly trained in reading performs or taking dumps (luckly I came from Perf group).

The case was Opened on July 7th, 2006 and finally got resolved (resolved meaning showing its by design) by August 21st, 2006. Pretty sad when I had the case opened with perfmons, MPS Reports and userdumps attached to the case.

So the engineer not only used the wrong troubleshooting template (troubleshoot LSASS CPU Spikes) but asked me to analyze the data! (Big NO NO!, never ask the customer to analyze the data, regardless of their background). If I was going to sit their analyze data that HE asked for, I wouldn’t’ have opened a case. Of course my TAM got a call after that.

Finally I turned to the only person I know who could tell me if its by design, an that’s my MAIN man from Pacific, Will Swanda!. Will spent a great deal of time in both Performance Team and Directory Services Team, so he has the perfect “hybrid” of knowledge to crack this open.

Here’s the answer:..

From: Will Swanda
Sent: Tuesday, August 15, 2006 3:53 PM
To: Dhillon, Billy

So I just confirmed with my contacts in CPR. LSASS will take as much memory
as it can, and it will appear like a stair-stepping memory leak. It is
supposed to free it up when needed by other apps/processes, and it should
also not bring the box to a point where the performance is degraded. If it
doesn't free up the memory, PSS should look into this, and if the
performance is degraded, PSS should look at it also.

I read the case notes and it looks like they're on the right track. I also
asked my CPR contacts to have a look at the case because they are working
another issue where they have a 12-hour stair-steps.

Will

Big thanks to Will who did the right think and just ask a Critical Problem Resolution engineer (CPR) right away (something our DS engineer could have done in the beginning in of the case since all the data was already collected).

Lo and Behold, after obtaining this information, the MS Engineer found an RFC and a KB Article explaining of the LSASS growing in Virtual Memory.

In the end there is no performance degradation, just a scary memory value consumed by LSASS.

Closing email is as followings:

PROBLEM:

LSASS process committing Virtual Bytes over time. Memory allocation appears to be occurring two per day, approximately every twelve hours.

CAUSE:

By design.

RESOLUTION:

After analysis of Dumps from domain controller, this occurrence correlates with the Windows RFC 53587, where memory is allocated to Domain Controllers upon execution of online defragmenting of Active Directory database. This memory is not released to the system until resources request this memory as no other memory is available. Following is some information pulled from the RFC. If the server should hang, Billy will grab a dump from this and send this in for further analysis, as there may be an application running to is causing additional issue that is not discoverable currently:

Memory usage by the Lsass.exe process on domain controllers that are running Windows Server 2003 or Windows 2000 Server
http://support.microsoft.com/?id=308356

LSASS memory usage on domain controllers has two major components: one fixed and one variable. The fixed component is made up of the code, the stacks, the heaps, and various fixed size data structures (for example, the schema cache). The amount of memory that Lsass uses may vary, depending on the load on the computer. As the number of running threads increases, so does the number of memory stacks. Lsass.exe usually uses 100 MB to 300 MB of memory. Lsass.exe uses the same amount of memory no matter how much RAM is installed in the computer. However, when a larger amount of RAM is installed, Lsass can use more RAM and less virtual memory. The variable component is the database buffer cache. The size of the cache can range from less than 1 MB to the size of the entire database. Because a larger cache improves performance, the database engine for AD (ESENT) attempts to keep the cache as large as possible. While the size of the cache varies with memory pressure in the computer, the maximum size of the cache is limited by both the amount of physical RAM installed in the computer and by the amount of available virtual address space (VA). AD uses only a portion of total VA space for the cache. The maximum amount of VA space that AD can use is determined by the following formula:
((totalVA - 1GB) / 2)

Note This formula only applies to Windows 2000. In Windows Server 2003, the memory model for LSASS is different and the amount of memory that is used by the cache is dynamic. Memory usage has grown as large as 2.6 GB, but this is based on the assumption that other processes in LSASS do not need the memory. This means that on an x86 machine without the /3GB switch, the cache size is limited either to 512 MB or to the amount of physical RAM, whichever is smaller. With the /3GB switch, the cache size is limited to either 1 GB or to the amount of physical RAM, whichever is smaller. Note that this means that the /3GB switch begins to help as soon as the amount of physical RAM is greater than approximately 600MB (500 MB for the cache, plus approximately 100 MB for the fixed component). On 64-bit systems, such as the IA64, cache size is effectively limited only by RAM, and Microsoft Development has test systems with over 9GB of cache in use.

Wednesday, June 21, 2006

DebugDiag the Ultimate Debugger!


The (very) few people that read my blog are probably familiar with debugging user/memory dumps. There are plenty of tools and debuggers out there that try to make debugging easy but I found that IIS Debug Diagnostics tops them all.

In my “many” adventures at CIBC, I found that debugging is quite useful in all the crazy escalated cases that come my way, In particular our mis-behaving ASP Page.

Guarded with my “Perf Skills”, we we’re able to find out what’s going on to a “point”.  But our Point was reached and that’s when I needed to engage Microsoft.

The following is my Opening case details to MS:

“We have Windows 2000 Server (SP4) running IIS5 hosting ASP web applications. This Site has numerous web app pages calling out many data calls, via one COM MTS data object to an Oracle Database hosted on a Unix Server. The 40 (or so) web applications within this one website use MSDTC to create COM Calls (using VB Runtime MSVCRT.DLL).

The Issue we’re seeing is that we reach a Maximum connection limit to our Database backend. This in turn stops our all applications from running within the Site.

Within Performance Monitor, we can see the “offending” DLLHOST spawning over 6,000 handles so we know it’s a case of a page in the website not cleaning up its unused connections. When we watch the Thread inside Process Explorer, we see that MSVCRT.DLL is making COM Calls using MSTDC, nothing much more then that.

Since the DLLHost houses many pages, we are unsure of which one is doing it. If we dump the leaking DLLHOST using Userdump (With a corresponding perfmon), is it possible that Microsoft can use the MSVCRT Symbols to tie the offending Process/Page/Code?  “

For those of you who don’t know what MSVCRT.dll is, this is the VB Runtime which is usually loaded with any application written in Visual Studio/Basic. Since our web applications are developed “in-house” its common to see this DLL. But unfortunately this doesn’t tell us which web app is causing the issue (Since MSVCRT symbols are not publicly available, it’s hard to interpret the stack since we stop at that DLL).

Moving on, we finally had a failure and dumped the offending DLLHOST. Trying to go through a debugger couldn’t really tell me much simply because #1 it’s a IIS Dump and #2 we stop again at MSVCRT.dll.

Back in the PSS Days, I heard that “DebugDiag” was supposed to be the tool to replace UMDH and LeakDiag. I was leaning to use 1 of those tools since this is a slow leak. I went ahead and did a Google search and found that DebugDiag was released publicly, so I decided to load it up.

It had a feature of analyzing Userdumps (Hangs/Crashes) so I decided to give it a shot. I loaded up the dmp file and WOW did it come back with some very Interesting Info. I was amazed on how nicely it parsed out the dump in a MHT file and the detail it gave (mostly in English).  MS Took a look at the dump and saw a 3rd party ISAPI Filter (yes yes, blame 3rd party). But when I took a look, I had a lot more different results.

Below is my Reply (Edited of course):

Hey Bob,

I was just doing some "amateur debugging" and I see some very strange things. I’m looping the local developer here since he had some suspicions on "XXXXX" webapp being the culprit.


Looking at the locks in the dump, we see LOTS of threads waiting on 76c (which is thread 19):

CritSec +2766bdc at 02766BDC
LockCount          1
RecursionCount     2
OwningThread       76c
EntryCount         0
ContentionCount    0
*** Locked


19  Id: 398.76c Suspend: 0 Teb: 7ffa0000 Unfrozen



Running the Hang inside DebugDiag, it shows some interesting Info. It shows Thread 19
is waiting on XXXX which is blocking all ASP Requests.



Thread 19 - System ID 1900

This thread is processing a client request and is/are not fully resolved. Further analysis of these threads is recommended in order to determine what may be blocking the request(s).



Function   Source
OLEAUT32!APP_DATA::FreeCachedMem+12    
OLEAUT32!SysFreeString+57    
OLEAUT32!VariantClear+72    
vbscript!DllCanUnloadNow+c062    
vbscript!DllCanUnloadNow+1f99b    



ASP request executing on thread   19


GET request for   /xxxxx/viewFails.asp
Request alive for   0 day(s) 00:53:07
QueryString   search
Request mapped to   D:\xxxxxx\VIEWFAILS.ASP
ASP Application   /LM/W3SVC/1/Root
ASP Template   D:\xxxxxxx\VIEWFAILS.ASP



Script call stack for thread 19

Function Scope   Line Of Code    Source File   Line Number
Global Scope           varSelected =  " selected "   D:\xxxxxxx\VIEWFAILS.ASP   434


Thanks,

Billy


I couldn’t get any better Info then this. DebugDiag told me the Actual failing ASP Site, along with the Line of Code and Line Number! I sent this off to the developer who was troubleshooting this issue for months now (since he inherited this code from a vendor with no documentation).

Now any crash/hang or Leak I get, I make sure DebugDiag is 1 of the many debuggers I pass the dump through (only userdumps). It not only does IIS related processes but other userdumps as well.

That’s only 1 of many great features it has. So grab a copy and check it out!

http://www.microsoft.com/windowsserver2003/iis/diagnostictools/default.mspx


Thursday, March 30, 2006

Citrix / Terminal Server + IBM Blades = Pause!

I’m not sure how many of you remember, but when MS first came out with Windows 2000 Server, their sense of session handling within Terminal Services wasn’t the “greatest”. Their have been numerous articles since then to handle the dreaded “Terminal Server Pause” The most famous being Q324446.

In my particular scenario, our pausing was a bit unique. Oddly, this was unique to Hardware rather then Windows architecture.

It all started when we brought in the wonderful IBM HS20 Blade Servers (8843). These servers we’re packed with 3 Gig’s of Ram, Dual HT 3.2Ghz Xeon Processor and Ultra320 SCSI Drives, all fitting in a tiny space. So whats the issue?

Back in December 05, when we put these Blades in our Citrix Metaframe XP environment for the very first time, we saw the dreaded TS Pause. These servers we’re Windows 2003 (No Service Pack) and pausing when anyone logs off. Typically this is the same symptoms of 324446, but that didn’t seem to fix the issue.

I took a Perfmon of the Pause and here’s what we saw:


Looking at the above screen shot we have we see our Avg. Disk Queue Length jumps up, our Disk Idle Time (% Idle Time) hits the floor and we have a “gap” within our %Processor Time (this is till a mystery to this day on why our Poc Time counter shows a gap with no other counters missing information in the log).

In a nutshell, we are saturating our Disk I/O with something. Later we find out this occurs during users logging off.

We ended up opening a Ticket with MS and we see how this works is that our Registry does a “critical” flush when we log off, This requires our registry to write to the disk an suspend everything until we’re done.

So how do we get Windows to handle this properly in a TS environment? By doing 2 things

Enable write caching on the disk AND turning on Power Protect on the Disks. 2 VERY important steps to dissipate our flushing to the disk.

So how does it work? Well no one can explain it better then my main man Jason Hall! (He made up a great little PPT on how Disk Caching an TS Pausing works)

When Write Caching and Power Protected options are turned on, the disk driver does not:

Issue the SYNCHRONIZE CACHE (instructs the disk to write all cached data to the disk immediately) command to the disk, or Set the ForceUnitAccess bit (instructs the disk to write the current packet of data to the disk immediately, bypassing the onboard write cache) when performing writes with FILE_FLAG_WRITE_THROUGH

Essentially (in the context of TS issues), we prevent our registry flushes from being immediately written to the disk, they are instead handled by the hardware cache just like any other file I/O.

Having the hardware cache handle the I/O normally prevents the blocking of other I/O calls and access to registry resources that are manifested as temporary “session freezes”. Setting the Power Protect write caching option increases server responsiveness during these registry flushes.


Sounds Simple enough. Let’s turn on Power Protect? And here’s where the disappointment came.
We opened a Ticket with IBM and we we’re sadly told the following:

IBM Blades do not have a Battery Pack Power Protect Option!

Sad as it is, the Giant of IBM Pc’s did not have the 1 an single important option that other competitors like HP comes standard with. You would think of Blades supporting Large Scale capacity servers (such as Citrix, SQL, Oracle, etc), we would add Write Cache along with Power Protect to prevent any data corruption and maximize performance.

After numerous troubleshooting steps and continuous conference calls to educate them on how Windows Architecture works with Registry flushes, it simply came down to the Blades can’t do what we want.

Although IBM claims to have supported many Citrix Farms that run IBM Blades, this is the “first” they’ve heard of our situation.

But checking through Brianmadden.com, it seems I’m not the only person out there showing Performance issues with blades and “gaps” in Perfmon:

W2K3 Terminal services freezes and Perfmon stops recording data http://www.brianmadden.com/forum/tm.aspx?m=11302

We had Citrix come in the mix only to complicate things more buy requesting to break our Mirror on the Blades since this could add extra I/O and our redundancy should rely on the Farm, not on the server (Server goes down, session gets redirected to another server in the farm). To our surprise, this is actually documented in a Citrix Article:

5.2.1 Disk configuration
In our blade server configuration, we used the onboard IDE controller with the 40 GB drives and without software mirroring. Note that other drive configuration options include SCSI with the expansion option or Fibre to an external disk.

So our final solution? Well Citrix and IBM are still willing to troubleshoot the issue and provide workarounds. Fortunately I think our director is pulling the plug on Blades for our Citrix Environment (Good Move! ( ).

Tuesday, February 28, 2006

Yet Another Server Hang?

We have a few servers in our environment that host our Sharepoint site. Our Sharepoint Admin’s notice their servers continuously start loosing performance when their running their stress testing tools (that simulate user logons).

We took a look at the server and we see the following “all too familiar” Event ID:

Event ID: 2020
Source: Srv

Description: The server was unable to allocate from the system paged pool because the pool was empty.

I’m sure some of you are wondering “What the Heck is Paged Pool”?. Paged Pool is a region of virtual memory in system space that can be paged in and out of the system. Theoretically, Windows 2003 Servers can have a pool of paged memory that can be up to 650 MB’s, but in most cases we usually see the average pool to be around 400 megs. So if you start exhausting that memory, your server will unable to allocate pages, and you end up with Poor performance which likely will result in a server hang.

So we “Break out” trusty o’l Perfmon and take alook at some common Values.

So we see in our “text” outputted Perfmon is that our Pool Paged Bytes is definitely exhausted at 374 megs. Clearly indicated on the resource we’re loosing is Pool memory.

Object: Memory

Available MBytes 2,670.615
Pool Nonpaged Bytes 59,496,686.446
Pool Paged Bytes 374,892,130.158

We’re not able to tie down an actual process consuming pool memory directly but we see that our Handle count for LSASS and W3WP are through the roof! So we now know that since LSASS is involved so our Virtual Server instance is doing some crazy authentication (which kinda makes sense since our stress test is logging users on).

Object: Process
lsass w3wp
Handle Count 243,794.619 243,997.000


We’ll since every Handle is tied to an object, we took a look at the Objects counter, but everything seems clean?....Hmm….that could only mean it’s a Kernel Mode object.

When we deal with Kernel Mode objects, you can watch them in 1 of 3 ways. 1st is Poolmon , second is Process Explorer and the 3rd is (if you want to get really fancy) is a Live Debug.
My weapon of choice? I used Process Explorer simply because it was available. So looking at the LSASS handles, we get tons and tons of authentication TOKEN objects (no big surprise here).

Interesting thing is, it was all the same user.

TOKEN Domain\user
TOKEN Domain\user
TOKEN Domain\user
TOKEN Domain\user
TOKEN Domain\user

So we see that someone is not cleaning up their logoff’s. Luckily our Sharepoint admin’s works on the same floor is as the developer and that was resolved quickly.

Unfortunately, we resulted in new problems.

The Performance degradation still occurred, but not with the O/S, it was simply with IIS at this point. Now back at PSS, I would have just “brushed my hands” and tossed this to an IIS engineer, but we’re not, so I decided to stay engaged.

Thinking back to my Debug Class, our Instructor showed us a dump of an IIS process hang. Even though its not an O/S component you can still technically look at any process requiring a resource, it’s the stack that’s the tricky part (Which is no big deal since the developers we’re available to look at the stack).

So I got brave and took a Adplus dump of the hung IIS process. Here’s the outcome:

Do a !locks and we see we have 3 locks, 1 showing a LockCount of 25 and a ContentionCount of 19.

0:000> !locks

CritSec +26a53c at 0026A53C
LockCount 1
RecursionCount 1
OwningThread b5c
EntryCount 1
ContentionCount 1
*** Locked

CritSec +e1ec8 at 000E1EC8
LockCount 25
RecursionCount 1
OwningThread 1528
EntryCount 19
ContentionCount 19
*** Locked

CritSec +f9560 at 000F9560
LockCount 1
RecursionCount 1
OwningThread 13a8
EntryCount 1
ContentionCount 1
*** Locked

We switch into the Owning thread context and we get the following stack

0:088> ~58s
eax=c0000022 ebx=050fe1b8 ecx=050fe16c edx=00000000 esi=050fe1c4 edi=7ffdf000
eip=7ffe0304 esp=050fe16c ebp=050fe214 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
SharedUserData!SystemCallStub+0x4:
7ffe0304 c3 ret

0:058> kv
ChildEBP RetAddr Args to Child
050fe168 77f4372d 77e41bfa 00000004 050fe1b8 SharedUserData!SystemCallStub+0x4 (FPO: [0,0,0])
050fe16c 77e41bfa 00000004 050fe1b8 00000001 ntdll!NtWaitForMultipleObjects+0xc (FPO: [5,0,0])
050fe214 77e4b0e4 00000004 050fe350 00000000 kernel32!WaitForMultipleObjectsEx+0x11a (FPO: [Non-Fpo])
050fe22c 79201583 00000004 050fe350 00000000 kernel32!WaitForMultipleObjects+0x17 (FPO: [4,0,0])
050fe464 792016c3 00000001 000f6458 00000000 mscorsvr!Thread::SysSuspendForGC+0x248 (FPO: [Non-Fpo])
050fe47c 79205b41 00000001 000f6458 00000000 mscorsvr!GCHeap::SuspendEE+0xcf (FPO: [Non-Fpo])
050fe498 79205c17 00000000 00000000 0000040c mscorsvr!GCHeap::GarbageCollectGeneration+0x13f (FPO: [Non-Fpo])
050fe4cc 791c0283 0b82eea8 0000040c 00000000 mscorsvr!gc_heap::allocate_more_space+0x17a (FPO: [Non-Fpo])
050fe6f4 791b6930 0b82eea8 0000040c 00000000 mscorsvr!GCHeap::Alloc+0x7b (FPO: [Non-Fpo])
050fe708 791b6faa 0000040c 00000000 00000000 mscorsvr!Alloc+0x3a (FPO: [Non-Fpo])
050fe728 791b6f4b 02552c3c 00000400 00000000 mscorsvr!FastAllocatePrimitiveArray+0x45 (FPO: [Non-Fpo])050fe7a8 799bb4dc 23b65d00 0000020a 79b9f914 mscorsvr!JIT_NewArr1+0xbb (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
050fe7c8 799bb64d 050fe84c 00000028 23b65ec8 mscorlib_79990000+0x2b4dc
00000000 00000000 00000000 00000000 00000000 mscorlib_79990000+0x2b64d

Sow what does this mean to me? Well it looks like our thread goes into a Wait State and waiting on Multiple Objects. But we can’t’ see the object its waiting on because this is a Userdump. We will need a full memory dump to identify the object(s) it’s waiting on (Which I wasn’t going to start doing).

But we can take a step back. Our LockCount showed 25. Meaning, that 25 threads are waiting on this single thread, an this single thread is waiting on a kernel mode Object. So lets dump 1 of the threads that hold a critsec to this thread.

0:058> ~88s

eax=00000fc9 ebx=00000000 ecx=00000000 edx=00000000 esi=000e1ec8 edi=7ff25000
eip=7ffe0304 esp=0b2cf654 ebp=0b2cf690 iopl=0 nv up ei pl zr na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
SharedUserData!SystemCallStub+0x4:
7ffe0304 c3 ret

0:088> kv
ChildEBP RetAddr Args to Child
0b2cf650 77f43741 77f5d64e 00001f14 00000000 SharedUserData!SystemCallStub+0x4 (FPO: [0,0,0])
0b2cf654 77f5d64e 00001f14 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
0b2cf690 77f42044 00001f14 791b3090 000e1ec8 ntdll!RtlpWaitForCriticalSection+0x126 (FPO: [Non-Fpo])
0b2cf698 791b3090 000e1ec8 0b2cf6cc 79276ed2 ntdll!RtlEnterCriticalSection+0x46 (FPO: [1,0,0])
0b2cf6a4 79276ed2 000e1ec8 00000000 0bef0920 mscorsvr!EE_EnterCriticalSection+0xc (FPO: [Non-Fpo])
0b2cf6b8 791db6b7 77e42bcb 0bef0920 77e419f7 mscorsvr!BaseCrst::Enter+0x8a (FPO: [0,0,0])
0b2cf6cc 7924db29 00000000 00000001 0bef0920 mscorsvr!ThreadStore::LockThreadStore+0x9d (FPO: [Non-Fpo])
0b2cf6e0 7924dfd3 0bef0920 00000000 77e4193e mscorsvr!ThreadStore::AddThread+0xd (FPO: [Non-Fpo])
0b2cf708 7920a1d3 02c788d0 0b2cf7c8 791b59e8 mscorsvr!SetupThread+0x102 (FPO: [Non-Fpo])
0b2cf714 791b59e8 00000000 02c788d0 02c788d0 mscorsvr!SetupThreadPoolThread+0x15 (FPO: [Non-Fpo])
0b2cf7c8 791b59ce 02c786e0 00000001 00000001 mscorsvr!AddTimerCallbackEx+0x16 (FPO: [Non-Fpo])
0b2cf7dc 791b59a5 02c786e0 00000001 791b5996 mscorsvr!AddTimerCallback+0x10 (FPO: [Non-Fpo])
0b2cf7f0 791b5802 02c788d0 0bb6dbe8 793ea988 mscorsvr!ThreadpoolMgr::AsyncTimerCallbackCompletion+0xf (FPO: [Non-Fpo])
0b2cf804 791b57c1 0bb6dbe8 00000000 791b5751 mscorsvr!ThreadpoolMgr::ExecuteWorkRequest+0x19 (FPO: [Non-Fpo])
0b2cf824 791dbe6e 00000000 00000000 00000000 mscorsvr!ThreadpoolMgr::WorkerThreadStart+0x129 (FPO: [Non-Fpo])
0b2cffb8 77e4a990 02caa100 00000000 00000000 mscorsvr!ThreadpoolMgr::intermediateThreadProc+0x44 (FPO: [Non-Fpo])
0b2cffec 00000000 791dbe2d 02caa100 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])


Dumping the Critical Section we see our famous lock that points to our faulting stack (that waiting on something in the kernel).

0:088> !critsec 000e1ec8

CritSec +e1ec8 at 000E1EC8
LockCount 25
RecursionCount 1
OwningThread 1528
EntryCount 19
ContentionCount 19
*** Locked

So we know this is 1 of 25 threads that are holding a critsec to our waiting thread. Looking at this stack (and others) is a good pointer to our developers that ended up fixing the issue.

Don’t you love happy endings? :)

Tuesday, January 24, 2006

Perf Skills Golden In the Real World!

Well I started a new job and coming in with a lengthy Performance background, I wasn't sure how "valuable" it was in the “real world”. Well I was wrong. Anyone who tells you that Perfmon/IE/User & Memory Dumps (and the 1000 other things PSS Perf supports) isn't used in the real world? Their Wrong!. Why? Because in my first month, I had a Terminal Server, DCOM, App Deployment and a Server Hang issue.

My very first assignment was to shadow another engineer to install a checking application. Our first task (usually common in most large environments) is to implement a Test server in our development environment. This will allow us to test the application and work out the logistics before implementing it in our Production environment.

Working with the Vendor we eventually installed the application and we started testing on this server.

Soon after, the Server started to show some issues. Officially I’m out of the loop at this point since my only task was to build this server and assist the vendor. But since I’m a nice guy, I did a few checks of the server (Event Logs, Task Manager etc) and thought all was fine since it just needed a reboot. After a few days, I found it needed a reboot because the server would actually hang.

Well, This looks like a Performance Problem (Yippie!). I set it up for Performance Monitoring and a Full Memory Dump.

The very next day, the Server hung again. I stopped the Perfmon and went over to the data center to force the dump.

Let's take a Look a look at the data. (Text Version of Perfmon Values)

The Memory and Processor looked fine, but taking a look at some other counters, we found an obvious leak.

Reported on \\XXXXXXXXXXXX
Date: 1/6/2006
Time: 11:30:30 AM
Value: Maximum
Data: X:\PerfLogs\PerfWiz - 30 second interval_XXXXXXX.blg
Start: 1/5/2006 4:59:35 PM
Stop: 1/6/2006 11:24:12 AM

Computer: \\XXXX-XX-XXXX
Object: Memory

Available MBytes 2,705.000

But our Non-Paged Pool showed a clear exhaustion (definitely the culprit to hang the server)

Pool Nonpaged Bytes 267,407,360.000

So I took a look at who's exhausting Non-Paged Pool. Looking at the Objects Counter we see that someone is Spawning over 300,000 Threads.


Object: Objects
Threads 342,356.000


Next I wanted to see who had a handle to all those threads? I brought up the Process Object and looked at the Handle Counter. Unsurprisingly, I found it was the Vendors application. As you can see he is holding close to the amount of Threads as his Handle count

(For legal reasons, I changed the actual process name to VendorApp)

Object: Process
VendorApp
Handle Count 341,470.000


So we identified our server is hanging because we exhausted our Non-Paged Pool. Our Non-Paged Pool was consumed by a Thread leak which our Vendor Application had handles open to.

So for "shits and giggles" I took a look at the dump. Taking a look at the stack for the VendorApp we see that we're doing some sort of Message Box then our thread goes into a wait state (KeWaitForSingleObject). Again, I’m not too sure what that means, I just know that if I send this stack to the developers, hopefully they'll know what it means better then me.

!THREAD 8984bc98 Cid 0148.014c Teb: 7ffde000 Win32Thread: e31b53d8 WAIT: (WrUserRequest) UserMode Non-Alertable
89f5fb18 SynchronizationEvent
Not impersonating
DeviceMap e1873e80
Owning Process 89fca568 Image: VendorApp.exe
Wait Start TickCount 3838771 Ticks: 244696 (0:01:03:43.375)
Context Switch Count 1654454 LargeStack
UserTime 00:00:00.0031
KernelTime 00:00:00.0031
Start Address 0x77e4f35f
Win32 Start Address 0x0041d987
Stack Init b8980000 Current b897fbc4 Base b8980000 Limit b897c000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
Kernel stack not resident.
ChildEBP RetAddr Args to Child
b897fbdc 804edb2b 8984bd38 8984bc98 89f5fb18 nt!KiSwapContext+0x26
b897fc04 804ed790 00000000 e31b53d8 00000000 nt!KiSwapThread+0x280
b897fc38 bf804d91 89f5fb18 0000000d 00000001 nt!KeWaitForSingleObject
b897fc94 bf803d18 000025ff 00000000 00000001 win32k!xxxSleepThread+0x1be
b897fd4c 804dfd24 0012ff1c 00000000 00000000 win32k!NtUserGetMessage+0x27

b897fd4c 7ffe0304 0012ff1c 00000000 00000000 nt!KiSystemService+0xd0
0012fefc 00000000 00000000 00000000 00000000 SharedUserData!SystemCallStub+0x4

PROCESS 89fca568 SessionId: 0 Cid: 0148 Peb: 7ffdf000 ParentCid: 00f8
DirBase: b2c6f000 ObjectTable: e18995b0 HandleCount: 341470.
Image: VendorApp.exe

All in all, I'm happy to say that my 3 years at Perf helped me to identity the cause of the Hang, and we didn't have to go and open a ticket with Microsoft to eventually find out the same answer.