A torn-down TCP session

We recently fixed a problem where a user in London connected to an Oracle database server in Atlanta using SQL Developer.  She ran a query and then let the connection sit idle for 12 minutes while she answered email.  When she went back to run another query, SQL Developer just spun, ultimately failing with a connection timeout.

We broke out our trusty copy of Wireshark, and started sniffing on her workstation, hoping to see who was cutting off the conversation: the client or the server.

As you can see in the screen capture below, we have a normal Oracle TNS conversation of PSH, ACK packets between her client at 172.25.4.29 and the server at 172.27.10.219.  At 72 seconds (just over a minute) into the trace, everything is fine.  However, at 716 seconds (roughly 12 minutes) into the capture, the client tries to send a new SQL query to the server and gets no response.  It retransmits its request several times, highlighted below as TCP Retransmissions, even marking one packet with the URG (urgent) flag to try to force the server to pay attention.

What is going on here?  Clearly the Oracle client still thinks it has a TCP-level connection with the Oracle server, but in reality the TCP session has already been torn down. We ultimately traced the problem back to a BlueCoat network appliance sitting between the user and the server that was disconnecting the session based on its own timeout value.
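One general mitigation for this class of problem (not necessarily the fix applied here) is to make sure the endpoints send something more often than the middlebox idle timeout, for example by enabling TCP keepalives on the client socket.  A minimal Python sketch of that idea, assuming a Linux client and the default Oracle listener port of 1521 (neither of which is stated in the capture above):

import socket

# Open a TCP socket and ask the OS to send keepalive probes on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning; the values below are hypothetical and only need to be
# shorter than the appliance's idle timeout (about 12 minutes in this case).
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)   # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # failed probes before giving up

sock.connect(("172.27.10.219", 1521))  # server from the trace; 1521 is an assumed listener port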

Debugging high CPU usage on ForeFront Security for SharePoint

A number of users were receiving a “request timeout” error message when using a SharePoint workspace:

After logging into the SharePoint web front-end, you could immediately feel the high CPU usage, as the Windows desktop felt sluggish and unresponsive.  A glance at Task Manager confirmed the high CPU usage from a w3wp process.

The event log recorded slightly more detail about the timeouts, giving each one a generic “ApplicationException” label with a message of “This operation returned because the timeout period expired. (Exception from HRESULT: 0x800705B4).”  Since neither of these items was particularly informative, I turned my attention to the top of the stack trace from the error:

Stack trace:    at System.Threading.ReaderWriterLock.AcquireReaderLockInternal(Int32 millisecondsTimeout)
at System.Threading.ReaderWriterLock.AcquireReaderLock(Int32 millisecondsTimeout)

The ReaderWriterLock line was interesting.  A reader-writer lock allows multiple threads to read data concurrently, but any thread that needs to write must acquire an exclusive lock; while a writer holds that lock, readers are blocked until the writer finishes.
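To make those semantics concrete, here is a minimal Python sketch of a reader-writer lock.  It is an illustration of the concept only, not the .NET ReaderWriterLock implementation the stack trace refers to:

import threading

class SimpleRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # readers currently holding the lock
        self._writer = False     # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writer:              # readers wait out any active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()      # let a waiting writer proceed

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:   # wait for exclusive access
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()          # wake waiting readers and writers

The key point for this investigation: if a writer acquires the lock and never releases it, every subsequent reader and writer simply queues up behind it.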

This error is often associated with a deadlock on the database server.  My next step was to enable deadlock tracing on the SQL Server via the “DBCC TRACEON” query, and watch the error logs to see if any deadlocks were recorded.  After a full day of normal usage, no deadlocks were found.  I turned my attention back to the web server…
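As an aside, the deadlock tracing mentioned above is typically enabled with trace flags 1204 or 1222, which write deadlock details to the SQL Server error log.  A rough sketch of issuing the command from Python, assuming the pyodbc driver and a placeholder connection string (neither appears in the original environment):

import pyodbc

# Placeholder connection string; the real server and credentials are not shown in this post.
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=sqlserver;Trusted_Connection=yes;", autocommit=True)

# Trace flag 1222 (or the older 1204) logs deadlock graphs to the SQL Server error log.
# The -1 argument makes the flag global rather than session-scoped.
conn.execute("DBCC TRACEON (1222, -1);")
conn.execute("DBCC TRACESTATUS (-1);")  # confirm the flag is active
conn.close()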

Using the “adplus” command, I took a dump of the w3wp process that was chewing up all the CPU cycles, then opened it in WinDbg with the sosex.dll extension loaded.  I ran the !rwlock command to list all the ReaderWriterLock instances:

0:022> !rwlock
Address    ReaderCount   WaitingReaderCount     WriterThread   WaitingWriterCount
-----------------------------------------------------------------------------------------
1437fcb0             0                    7            0x2e                     1

The command revealed that thread 46 (hex value 0x2e) held the writer lock, which meant no other thread could acquire a reader or writer lock.  You can also see that seven reader threads and one writer thread were queued up, waiting for this lock to clear.

Digging into the lock at address 1437fcb0 provided more detail about the threads waiting on it:

0:033> !rwlock 1437fcb0
WriterThread:           0x2e
WriterLevel:            1
WaitingWriterCount:     1
WriterEvent:            1e40
WaitingWriterThreadIds: 0x5
ReaderCount:            0
CurrentReaderThreadIds: None
WaitingReaderCount:     7
ReaderEvent:            2cb8
WaitingReaderThreadIds: 0xa,0x28,0x29,0xe,0x2d,0x30,0x31

I decided to have a deeper look at thread 46, and issued the “kb” command to print its call stack:

0:046> kb
ChildEBP RetAddr  Args to Child
30aefe9c 7c822114 77e6bb08 000022e8 00000000 ntdll!KiFastSystemCallRet
30aefea0 77e6bb08 000022e8 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc
30aeff10 77e6ba72 000022e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
30aeff24 36232fd0 000022e8 ffffffff 30aeff68 kernel32!WaitForSingleObject+0x12
30aeff34 36233707 ffffffff 6fde9fa0 00000000 STSWEL!VvirusCheckUploadStatus::addDocument+0xf1
30aeff68 35f9b408 00000000 00000000 30aeffb0 STSWEL!VvirusCheckUploadStatus::addDocument+0x828
30aeff78 781329bb 2c44cbe8 b4845c59 00000000 ONETUTIL!Vthread::kill+0x86
30aeffb0 78132a47 00000000 77e660b9 0663acb8 msvcr80!_endthreadex+0x3b
30aeffb8 77e660b9 0663acb8 00000000 00000000 msvcr80!_endthreadex+0xc7
30aeffec 00000000 781329e1 0663acb8 00000000 kernel32!BaseThreadStart+0x34

Note the presence of the “VvirusCheckUploadStatus::addDocument” method.  Based on its name, I guessed that it might be related to the ForeFront Server Security functionality that scans documents on upload to SharePoint.  I decided to have a look at thread 48 (hex value 0x30), which was listed as a WaitingReader thread:

48  Id: 2d4.1e90 Suspend: 1 Teb: 7ff0d000 Unfrozen
ChildEBP RetAddr  Args to Child
312efe9c 7c822114 77e6bb08 00002254 00000000 ntdll!KiFastSystemCallRet
312efea0 77e6bb08 00002254 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc
312eff10 77e6ba72 00002254 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
312eff24 36232fd0 00002254 ffffffff 312eff68 kernel32!WaitForSingleObject+0x12
312eff34 36233707 ffffffff 6e5e9fa0 00000000 STSWEL!VvirusCheckUploadStatus::addDocument+0xf1
312eff68 35f9b408 00000000 00000000 312effb0 STSWEL!VvirusCheckUploadStatus::addDocument+0x828
312eff78 781329bb 2c44d0b8 b5045c59 00000000 ONETUTIL!Vthread::kill+0x86
312effb0 78132a47 00000000 77e660b9 0663acb8 msvcr80!_endthreadex+0x3b
312effb8 77e660b9 0663acb8 00000000 00000000 msvcr80!_endthreadex+0xc7
312effec 00000000 781329e1 0663acb8 00000000 kernel32!BaseThreadStart+0x34

As you can see, it also had the “VvirusCheckUploadStatus::addDocument” method in its stack.  The other waiting threads looked the same.  Based on this data, I decided to disable upload scanning in SharePoint to see whether it affected CPU usage.

It did!  Over the next few business days, CPU usage returned to normal levels, and subsequent memory dumps of the w3wp process revealed no ReaderWriterLock contention.  However, since scanning documents on upload is the primary reason we use ForeFront, we had to figure out what was causing the problem.  We dug into the Security Administrator console and reviewed the “Report” section.  There we found that ForeFront was having repeated problems quarantining documents containing the “MSWord/Dropper.B!Camelot” virus.

We disabled quarantining of files for the Realtime Scan Job, which is a Microsoft-recommended best practice.  This allowed us to re-enable document scanning on upload while avoiding the high CPU problem.

Web application proxy settings in WSS

Configuring proxy settings for a web application in MOSS is very straightforward, as the default web.config file already contains a <defaultProxy> element.  Simply browse to the XML tag and make your edits as needed.

It's a bit different with Windows SharePoint Services (WSS).  The default web.config file has no <defaultProxy> entry, and adding it in the wrong place can get you an unwelcome YSOD.  The trick is to add the proxy block after the </system.web> closing element, like this:

</system.web>
<system.net>
  <defaultProxy>
    <proxy autoDetect="false" />
  </defaultProxy>
</system.net>

Save the file and your web application will restart, picking up your new proxy settings.

SharePoint database error: "Unexpected query execution failure, error code 266"

I recently discovered our SharePoint 2007 ULS log had multiple instances of this error:

“Unexpected query execution failure, error code 266. Additional error information from SQL Server is included below. ‘Transaction count after EXECUTE indicates that a COMMIT or ROLLBACK TRANSACTION statement is missing.’”

After reviewing several instances of this error, a pattern stood out: the names of several PDF documents were embedded in the error message itself.  The error seemed confined to a single file type (PDF), as other document types such as XLS were never mentioned.  This was a clue, given that PDF files are not indexed by default in SharePoint and require extra configuration.  I turned my attention to the Search service.

As this was a WSS server, I opened regedit and navigated to the registry key for the .PDF iFilter:  HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\12.0\Search\Setup\ContentIndexCommon\Filters\Extension

The value of this registry key was {4C904448-74A9-11D0-AF6E-00C04FD8DC02}, which is the GUID for Adobe PDF IFilter 6.  Since we only had the Adobe 9.x version installed, I changed this value to the GUID for the 9.x iFilter:  {E8978DA6-047F-4E3D-9C78-CDBE46041603}.  I then stopped the search service with the “net stop spsearch” command, and restarted it with “net start spsearch” to ensure the registry change was picked up.
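If you would rather script the change than click through regedit, a rough sketch along these lines could work.  It assumes the GUID lives in the default value of an Extension\.pdf subkey (the exact layout may differ on your server), and it should be run elevated after backing up the key:

import winreg

# Key path and GUIDs taken from this post; the trailing ".pdf" subkey is an assumption.
KEY_PATH = (r"SOFTWARE\Microsoft\Shared Tools\Web Server Extensions"
            r"\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.pdf")
NEW_GUID = "{E8978DA6-047F-4E3D-9C78-CDBE46041603}"  # Adobe PDF iFilter 9.x

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                    winreg.KEY_READ | winreg.KEY_SET_VALUE) as key:
    old_guid, _ = winreg.QueryValueEx(key, "")        # default value holds the iFilter GUID
    print("Current iFilter GUID:", old_guid)
    winreg.SetValueEx(key, "", 0, winreg.REG_SZ, NEW_GUID)

You still need to bounce the spsearch service afterwards, as above, so the Search service picks up the new value.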

After this change, the error disappeared from the ULS log, and the 9.x iFilter was able to successfully index the PDF content.