Database consistency errors
At this point, we have covered three major areas of data recovery: issues with accessing system databases, problems accessing user databases, and backup/restore failures. Now let's talk about detecting database consistency problems during the normal database operation. I classify database consistency detection (errors) into two types:
- Runtime error, errors detected by the engine during standard usage of the database (SELECT, UPDATE, background thread such as lazywrite, and so on)
- Errors detected by DBCC CHECKDB
Handling Database Consistency Runtime Errors
The SQL Server engine contains code to perform some validation of the data (and transaction) log as you execute commands such as SELECT, INSERT, and so on against a database. The following is not a complete list of these errors, but these are the ones I anticipate you may encounter the most.
Msg 823 and 824
The SQL Server engine has two methods to report problems reading database pages (and transaction log blocks):
- If the Windows API calls return an error when reading the page, Msg 823 is raised.
- If the Windows API call is successful, the engine performs a series of logical checks on the page (based on database options configured). If any of these checks fails, an Msg 824 is raised.
The following is an example of a Msg 823 error:
2005-09-07 10:51:05.16 spid18s Error: 823, Severity: 24, State: 6.
2005-09-07 10:51:05.16 spid18s The operating system returned
error 1117(The request could not be performed because of an I/O device error.)
to SQL Server during a read at offset 0000000000000000 in
file 'E:EdgeTachyonDatabaseFilesEdgePerfTachyonDB.mdf'.
Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
As you can see from the message, the SQL Server development team included the Windows error returned from the Windows API call to read the page from disk. Almost every 823 error is caused by a system problem. In this case, SQL Server reports only the error the operating system reported when reading from the file. Look at preceding the example. The text of the message even indicates there is a problem with a disk storage device. The one exception to this guideline that I've seen is error 6, "Invalid handle." In this situation, it is possible SQL Server passed an invalid file handle to the Windows API call, so this should be investigated with technical support.
If the Windows API call to read the page is successful, the following logical checks are applied in this order:
- Is the page torn (torn page detection)?
- Did a checksum fail when reading the page (checksum error)?
- Is the pageid on the page different from the page we expected to read from disk (bad pageid)?
The following is an example of a Msg 824 due to a database checksum validation check failure:
2006-08-25 14:38:39.34 spid53 Error: 824, Severity: 24, State: 2.
2006-08-25 14:38:39.34 spid53 SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x6cbf615; actual: 0x6dacdc1). It occurred during a read of page (1:177) in database ID 13 at offset 0x00000000162000 in file 'C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDATAtest_checksum.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Should you see an 824 error in the ERRORLOG, take immediate action, because this means the engine has detected a consistency problem in the database. The damage may be isolated to the page reported in the log, but it could go beyond that. Therefore, I recommend you run DBCC CHECKDB on the database as soon as possible. But which database? The server cannot provide you database context for this error, but it does provide you the filename associated with the physical filename it was reading for this page. If you can't recognize the database associated with the filename in the error, take the filename from the message and use it in the catalog view query:
select * from master.sys.master_files where physical_name = ''
At minimum, I expect DBCC CHECKDB to report an error for the page associated with Msg 824. At this point, if only a single page is damaged, you could restore from a backup (remember page-level restore here), or you can repair the database WITH REPAIR_ALLOW_DATA_LOSS. If more errors exist, your options may be more limited, or your data loss may be more severe. The point is, don't ignore 824 errors. It could be the tip of the iceberg of larger damage to your database.
Msg 825 (Read Retry)
While investigating a series of Exchange storage consistency problems, the Exchange development team discovered that if they were to retry failed read operations from disk, in some cases the second, third, or even fourth read attempt would succeed. So, instead of taking a beating from their customers for Exchange store corruption, they decided to add in logic to retry database read failures.
The SQL Server development team decided to adopt this same strategy for SQL Server 2005. Any read failure (823 or 824) will first be retried up to four times by the SQL Engine, before reporting the failure. If at any point the read succeeds, the following message is written to the ERRORLOG:
2005-04-27 16:06:35.98 spid346 A read of the file 'f:mssqlmssql.1mssqldatastressdb2.ndf'
at offset 0x00000000302000 succeeded after failing 1 time(s) with error:
incorrect pageid (expected 4:385; actual 0:0).
Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be
corrected. Complete a full database consistency check (DBCC CHECKDB).
This error can be caused by many factors; for more information, see SQL Server Books Online.
Remember that because the retry succeeded, the user will not see any error. The only difference will be a slight delay in performance (which may or may not truly be seen by the application user). But this does not mean you do not need to address this problem. Think of this message the same as you would an 823 or 824 error, but perhaps at a slightly less-urgent pace. The engine avoided a failure, but this doesn't mean this retry will work all the time. Some disk system issue caused at least one read to fail and therefore should be investigated.
Msg 5242 and 5243 (RecBase Error)
The SQL Server engine also has logical checks to detect consistency problems for a row within a database page. In previous versions of SQL Server, this check would result in an assertion (a fatal error that results in a stack dump) with the expression containing the word RecBase. RecBase is the name of a class used to logically read the elements of a row on a page. See Microsoft Knowledge Base article 828337 for more information.
In SQL Server 2005, two errors were introduced to report all row consistency errors detected while reading a row as part of a standard SELECT, UPDATE, and so on, Msg 5242 and 5243. The following is an example of an ERRORLOG containing a Msg 5242 error (with some sections omitted for brevity):
2006-02-18 19:33:09.40 spid51 ex_raise2: Exception raised, major=52, minor=42, state=1, severity=22, attempting to create symptom dump
2006-02-18 19:33:09.58 spid51 Using 'dbghelp.dll' version '4.0.5'
2006-02-18 19:33:09.59 spid51 **Dump thread - spid = 51, PSS = 0x048B9278,
EC = 0x048B9280
2006-02-18 19:33:09.59 spid51 *
2006-02-18 19:33:09.59 spid51 * User initiated stack dump. This is not a server exception dump.
2006-02-18 19:33:09.59 spid51 *
2006-02-18 19:33:09.61 spid51 ***Stack Dump being sent to C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLLOGSQLDump0001.txt
2006-02-18 19:33:09.61 spid51 * *******************************************************************************
2006-02-18 19:33:09.61 spid51 *
2006-02-18 19:33:09.61 spid51 * BEGIN STACK DUMP:
2006-02-18 19:33:09.61 spid51 * 02/18/06 19:33:09 spid 51
2006-02-18 19:33:09.61 spid51 *
2006-02-18 19:33:09.61 spid51 * ex_raise2: Exception raised, major=52, minor=42, state=1, severity=22
2006-02-18 19:33:09.61 spid51 *
*****************************************************************************
2006-02-18 19:33:09.61 spid51 * -------------------------------------------------------------------------------
2006-02-18 19:33:09.61 spid51 * Short Stack Dump
2006-02-18 19:33:09.62 spid51 77E55DEA Module(kernel32+00015DEA)
2006-02-18 19:33:09.62 spid51 02172CE4 Module(sqlservr+01172CE4)
2006-02-18 19:33:09.62 spid51 02176BA0 Module(sqlservr+01176BA0)
2006-02-18 19:33:09.62 spid51 0217674D Module(sqlservr+0117674D)
2006-02-18 19:33:09.62 spid51 01597EA2 Module(sqlservr+00597EA2)
2006-02-18 19:33:09.62 spid51 01095FFE Module(sqlservr+00095FFE)
2006-02-18 19:33:09.62 spid51 02389FB0 Module(sqlservr+01389FB0)
2006-02-18 19:33:09.62 spid51 0158E526 Module(sqlservr+0058E526)
2006-02-18 19:33:09.62 spid51 0100C833 Module(sqlservr+0000C833)
2006-02-18 19:33:09.62 spid51 0100BF0F Module(sqlservr+0000BF0F)
2006-02-18 19:33:09.62 spid51 010106EB Module(sqlservr+000106EB)
2006-02-18 19:33:09.62 spid51 01010B3A Module(sqlservr+00010B3A)
2006-02-18 19:33:09.62 spid51 01010932 Module(sqlservr+00010932)
2006-02-18 19:33:09.62 spid51 010C0E2B Module(sqlservr+000C0E2B)
2006-02-18 19:33:09.62 spid51 01210454 Module(sqlservr+00210454)
2006-02-18 19:33:09.62 spid51 013F65A8 Module(sqlservr+003F65A8)
2006-02-18 19:33:09.62 spid51 013B3E42 Module(sqlservr+003B3E42)
2006-02-18 19:33:09.62 spid51 013B35EC Module(sqlservr+003B35EC)
2006-02-18 19:33:09.62 spid51 012101C2 Module(sqlservr+002101C2)
2006-02-18 19:33:09.62 spid51 01188FC9 Module(sqlservr+00188FC9)
2006-02-18 19:33:09.62 spid51 01189021 Module(sqlservr+00189021)
2006-02-18 19:33:09.62 spid51 01330A25 Module(sqlservr+00330A25)
2006-02-18 19:33:09.62 spid51 01330421 Module(sqlservr+00330421)
2006-02-18 19:33:09.62 spid51 01332C55 Module(sqlservr+00332C55)
2006-02-18 19:33:09.62 spid51 0100889F Module(sqlservr+0000889F)
2006-02-18 19:33:09.62 spid51 010089C5 Module(sqlservr+000089C5)
2006-02-18 19:33:09.62 spid51 010086E7 Module(sqlservr+000086E7)
2006-02-18 19:33:09.62 spid51 010D764A Module(sqlservr+000D764A)
2006-02-18 19:33:09.62 spid51 010D7B71 Module(sqlservr+000D7B71)
2006-02-18 19:33:09.62 spid51 010D746E Module(sqlservr+000D746E)
2006-02-18 19:33:09.62 spid51 010D83F0 Module(sqlservr+000D83F0)
2006-02-18 19:33:09.62 spid51 781329AA Module(MSVCR80+000029AA)
2006-02-18 19:33:09.64 spid51 78132A36 Module(MSVCR80+00002A36)
2006-02-18 19:33:09.65 spid51 * -------------------------------------------------------------------------------
2006-02-18 19:33:09.65 spid51 Stack Signature for the dump is 0xE09D5555
2006-02-18 19:33:10.48 spid51 External dump process return code 0x20000001.
External dump process returned no errors.
2006-02-18 19:33:10.58 spid51 Error: 5242, Severity: 22, State: 1.
2006-02-18 19:33:10.58 spid51 An inconsistency was detected during an
internal operation in database 'master'(ID:1) on page (1:306). Please contact technical support. Reference number 3.
The error message says to contact technical support (I sure hate when our errors say that), but in reality you need to first run a DBCC CHECKDB in the database listed in the message. If it returns errors indicating row damage, you need to address those problems as a true database corruption problem. If CHECKDB does not report errors, it may be a good idea to contact technical support. Have the stack dump and ERRORLOG available for them to review.
Msg 605
Msg 605 is a legacy carryover from the original SYBASE code for SQL Server. In fact, it was pretty much the only way to know about a database consistency problem aside from DBCC CHECKDB. This error check still remains in the SQL Server 2005 code. You should treat this error much like the row-level consistency errors 5242 and 5243. If CHECKDB reports errors, treat the problem as database corruption. If not, consider contacting technical support if the errors occur frequently. It could indicate some problem in the SQL Server code that should be addressed.
Access Violations
For performance reasons, the SQL Server development team cannot put in validation checks for every bit on the database page at every place in the engine code. Therefore, it is possible the engine will run into an access violation because of database corruption. How can you recognize this case? Well, because an access violation could occur at so many different places due to corruption, there is one stack dump signature I could show you to say the problem is corruption. My advice is that if you are getting access violations frequently, you should always ensure DBCC CHECKDB is run for all databases before pursuing the cause of the access violation as a SQL Server bug.
Failed Rollback
Another example is online rollback recovery. If the rollback of a user transaction fails, this is a critical problem. So, the engine reacts by taking the user database offline and restarting it. This allows recovery to run on the database with the hopeful effect of rolling back the transaction that couldn't roll back online. This technique actually is quite successful, because in some cases the online rollback failure is an interim problem that won't reoccur when the transactions are rolled back from the log on disk. Again, in this situation, SQL Server attempts to isolate the failure. Only the database associated with the failed rollback is affected. The rest of the engine remains intact, and other databases are not affected. As with any reliability feature, there are always exceptions. Any failure to roll back a transaction in tempdb results in a server shutdown because there is no method to run recovery on tempdb, and it is a critical shared resource for all users.
Handling DBCC CHECKDB Errors
I've recommended that you run DBCC CHECKDB if you encounter some of the errors described here. You may also decide to run DBCC CHECKDB at any point in time to check the consistency of your database. I presented earlier in the chapter some nice enhancements to CHECKDB to help provide more consistency checks (for example, DATA_PURITY and CHECKCATALOG). Let's talk about a strategy to address errors that you can encounter with DBCC CHECKDB. In this section, I do not go over every error that CHECKDB can report. I talk later about how you can look for information on the web about each of the errors (documented by the authors of DBCC CHECKDB, Ryan Stonecipher and Paul Randal).
Look at Summaries and Recommendation Firs
When you run DBCC CHECKDB and it reports errors, I recommend you do two things first:
- Look at what I call the summary messages.
- Look at the recommendation messages.
Let's look at some example output that contains errors so that I can show you what I mean:
DBCC results for 'checkdb_test'.
Service Broker Msg 9675, State 1: Message Types analyzed: 14.
Service Broker Msg 9676, State 1: Service Contracts analyzed: 6.
Service Broker Msg 9667, State 1: Services analyzed: 3.
Service Broker Msg 9668, State 1: Service Queues analyzed: 3.
Service Broker Msg 9669, State 1: Conversation Endpoints analyzed: 0.
Service Broker Msg 9674, State 1: Conversation Groups analyzed: 0.
Service Broker Msg 9670, State 1: Remote Service Bindings analyzed: 0.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:153) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:154) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:155) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:156) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:157) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:158) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 6 consistency errors not associated with any single object.
DBCC results for 'sys.sysrowsetcolumns'.
There are 544 rows in 5 pages for object "sys.sysrowsetcolumns".
DBCC results for 'sys.sysrowsets'.
There are 81 rows in 1 pages for object "sys.sysrowsets".
DBCC results for 'sysallocunits'.
There are 92 rows in 1 pages for object "sysallocunits".
DBCC results for 'sys.sysfiles1'.
There are 2 rows in 1 pages for object "sys.sysfiles1".
DBCC results for 'sys.syshobtcolumns'.
There are 544 rows in 5 pages for object "sys.syshobtcolumns".
DBCC results for 'sys.syshobts'.
There are 81 rows in 1 pages for object "sys.syshobts".
DBCC results for 'sys.sysftinds'.
There are 0 rows in 0 pages for object "sys.sysftinds".
DBCC results for 'sys.sysserefs'.
There are 92 rows in 1 pages for object "sys.sysserefs".
DBCC results for 'sys.sysowners'.
There are 14 rows in 1 pages for object "sys.sysowners".
DBCC results for 'sys.sysprivs'.
There are 120 rows in 1 pages for object "sys.sysprivs".
DBCC results for 'sys.sysschobjs'.
There are 50 rows in 1 pages for object "sys.sysschobjs".
DBCC results for 'sys.syscolpars'.
There are 423 rows in 7 pages for object "sys.syscolpars".
DBCC results for 'sys.sysnsobjs'.
There are 1 rows in 1 pages for object "sys.sysnsobjs".
DBCC results for 'sys.syscerts'.
There are 0 rows in 0 pages for object "sys.syscerts".
DBCC results for 'sys.sysxprops'.
There are 0 rows in 0 pages for object "sys.sysxprops".
DBCC results for 'sys.sysscalartypes'.
There are 27 rows in 1 pages for object "sys.sysscalartypes".
DBCC results for 'sys.systypedsubobjs'.
There are 0 rows in 0 pages for object "sys.systypedsubobjs".
DBCC results for 'sys.sysidxstats'.
There are 106 rows in 1 pages for object "sys.sysidxstats".
DBCC results for 'sys.sysiscols'.
There are 218 rows in 1 pages for object "sys.sysiscols".
DBCC results for 'sys.sysbinobjs'.
There are 23 rows in 1 pages for object "sys.sysbinobjs".
DBCC results for 'sys.sysobjvalues'.
There are 104 rows in 16 pages for object "sys.sysobjvalues".
DBCC results for 'sys.sysclsobjs'.
There are 14 rows in 1 pages for object "sys.sysclsobjs".
DBCC results for 'sys.sysrowsetrefs'.
There are 0 rows in 0 pages for object "sys.sysrowsetrefs".
DBCC results for 'sys.sysremsvcbinds'.
There are 0 rows in 0 pages for object "sys.sysremsvcbinds".
DBCC results for 'sys.sysxmitqueue'.
There are 0 rows in 0 pages for object "sys.sysxmitqueue".
DBCC results for 'sys.sysrts'.
There are 1 rows in 1 pages for object "sys.sysrts".
DBCC results for 'sys.sysconvgroup'.
There are 0 rows in 0 pages for object "sys.sysconvgroup".
DBCC results for 'sys.sysdesend'.
There are 0 rows in 0 pages for object "sys.sysdesend".
DBCC results for 'sys.sysdercv'.
There are 0 rows in 0 pages for object "sys.sysdercv".
DBCC results for 'sys.syssingleobjrefs'.
There are 133 rows in 1 pages for object "sys.syssingleobjrefs".
DBCC results for 'sys.sysmultiobjrefs'.
There are 102 rows in 1 pages for object "sys.sysmultiobjrefs".
DBCC results for 'sys.sysdbfiles'.
There are 2 rows in 1 pages for object "sys.sysdbfiles".
DBCC results for 'sys.sysguidrefs'.
There are 0 rows in 0 pages for object "sys.sysguidrefs".
DBCC results for 'sys.sysqnames'.
There are 91 rows in 1 pages for object "sys.sysqnames".
DBCC results for 'sys.sysxmlcomponent'.
There are 93 rows in 1 pages for object "sys.sysxmlcomponent".
DBCC results for 'sys.sysxmlfacet'.
There are 97 rows in 1 pages for object "sys.sysxmlfacet".
DBCC results for 'sys.sysxmlplacement'.
There are 17 rows in 1 pages for object "sys.sysxmlplacement".
DBCC results for 'sys.sysobjkeycrypts'.
There are 0 rows in 0 pages for object "sys.sysobjkeycrypts".
DBCC results for 'sys.sysasymkeys'.
There are 0 rows in 0 pages for object "sys.sysasymkeys".
DBCC results for 'sys.syssqlguides'.
There are 0 rows in 0 pages for object "sys.syssqlguides".
DBCC results for 'sys.sysbinsubobjs'.
There are 0 rows in 0 pages for object "sys.sysbinsubobjs".
DBCC results for 'sys.queue_messages_1977058079'.
There are 0 rows in 0 pages for object "sys.queue_messages_1977058079".
DBCC results for 'sys.queue_messages_2009058193'.
There are 0 rows in 0 pages for object "sys.queue_messages_2009058193".
DBCC results for 'sys.queue_messages_2041058307'.
There are 0 rows in 0 pages for object "sys.queue_messages_2041058307".
DBCC results for 'table1'.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:153) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:154) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:155) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:156) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:157) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:158) could not be processed. See other errors for details.
There are 88 rows in 44 pages for object "table1".
CHECKDB found 0 allocation errors and 6 consistency errors in table 'table1' (object ID 2073058421).
DBCC results for 'table2'.
Msg 2511, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2, partition ID 72057594038452224, alloc unit ID 72057594042449920 (type In-row data). Keys out of order on page (1:210), slots 157 and 158.
There are 1000 rows in 3 pages for object "table2".
CHECKDB found 0 allocation errors and 1 consistency errors in table 'table2' (object ID 2089058478).
CHECKDB found 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (checkdb_test).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Go to the bottom of the output and look at the summary and recommendation messages for the database:
CHECKDB found 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (checkdb_test).
You know from this message that 13 errors related to the consistency of pages or indexes have been detected (as opposed to allocation errors that are specific to problems with allocation structures found by the CHECKALLOC phase). You also know that in order to repair all 13 errors, you must use the REPAIR_ALLOW_DATA_LOSS repair option. You could just move forward with this recommendation and check the results. However, if you want to know what the repair recommendation is for each table found to have problems, you could check each one. Which tables do I check? Look at the summary messages for each table:
CHECKDB found 0 allocation errors and 6 consistency errors not associated with any single object.
CHECKDB found 0 allocation errors and 6 consistency errors in table 'table1' (object ID 2073058421).
CHECKDB found 0 allocation errors and 1 consistency errors in table 'table2' (object ID 2089058478).
First, you should know that CHECKDB won't produce this summary message for a table unless it detects problems with that table. Second, notice the first summary message says "not associated with any single object." This message indicates that CHECKDB found errors associated with pages that don't appear to belong to a known table.
In this situation, what is happening is that CHECKDB is in a way "double-reporting" errors, but that is to your advantage in this situation. Let's look at the errors "not associated with any single object."
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:153) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
This message indicates a problem with the page ID as found on the page header. The page ID was expected to be 153, but on the page it was found to be 0:0. The routine that detects this problem is used throughout the CHECKDB code to read a page. In this routine, the table associated with the page is found by looking up the allocation unit ID on the page header. As you can see from the error message, that value is 0, and therefore CHECKDB doesn't know what table this page belongs to.
Now look at the errors associated with table1:
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:153) could not be processed. See other errors for details.
If you look closely, you will notice this error is for the same page as the one I just talked about for 8909. In fact, all the pages in errors 8909 and 8928 are the same. Why does the server report two different errors for the same page? I just mentioned that the 8909 error is reported in a general routine to read a page. This routine is self-contained and so relies on finding the object from the allocation unit id on the page. The code that raises the 8928 error is part of a routine that is checking "facts" about pages associated with an index. (In this case, it was actually just a heap for table1.) This code already "knows" what index it was checking (table1) and knows what page it was trying to read, so it can accurately report that it had a problem reading a page and what table it thought the page was associated with. You will notice this error says, "See other errors for details." What this message is saying is this: "I've detected a problem with this page, but there should be another error describing the actual problem." In this case, the other error is 8909, but it is not associated with table1 for the reasons I just described. Perhaps by now you can guess the problem in this scenario with some of these pages. A quick glance at page 1:153 using DBCC PAGE would tell you. (I didn't include the entire output because it is not important for this situation.)
dbcc traceon(3604)
go
dbcc page(10, 1, 153, 2)
go
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
PAGE: (0:0)
BUFFER:
BUF ((Content component not found.)) x02BF096C
bpage = 0x05052000 bhash = 0x00000000 bpageno = (1:153)
bdbid = 10 breferences = 1 bUse1 = 50447
bstat = 0xc00809 blog = 0x32159 bnext = 0x00000000
PAGE HEADER:
Page ((Content component not found.)) x05052000
m_pageId = (0:0) m_headerVersion = 0 m_type = 0
m_typeFlagBits = 0x0 m_level = 0 m_flagBits = 0x0
m_objId (AllocUnitId.idObj) = 0 m_indexId (AllocUnitId.idInd) = 0 Metadata: AllocUnitId = 0
Metadata: PartitionId = 0 Metadata: IndexId = -1 Metadata: ObjectId = 0
m_prevPage = (0:0) m_nextPage = (0:0) pminlen = 0
m_slotCnt = 0 m_freeCnt = 0 m_freeData = 0
m_reservedCnt = 0 m_lsn = (0:0:0) m_xactReserved = 0
m_xdesId = (0:0) m_ghostRecCnt = 0 m_tornBits = 0
Allocation Status
GAM (1:2) = ALLOCATED SGAM (1:3) = NOT ALLOCATED
PFS (1:1) = 0x64 MIXED_EXT ALLOCATED 100_PCT_FULL DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED
DATA:
Memory Dump ((Content component not found.)) x44C7C000
44C7C000: 00000000 00000000 00000000 00000000 †................
44C7C010: 00000000 00000000 00000000 00000000 †................
44C7C020: 00000000 00000000 00000000 00000000 †................
44C7C030: 00000000 00000000 00000000 00000000 †................
44C7C040: 00000000 00000000 00000000 00000000 †................
44C7C050: 00000000 00000000 00000000 00000000 †................
44C7C060: 1000a80f 02000000 4d595441 42202020 †........MYTAB
44C7C070: 20202020 20202020 20202020 20202020 †
44C7C080: 20202020 20202020 20202020 20202020 †
44C7C090: 20202020 20202020 20202020 20202020 †
44C7C0A0: 20202020 20202020 20202020 20202020 †
44C7C0B0: 20202020 20202020 20202020 20202020 †
44C7C0C0: 20202020 20202020 20202020 20202020 †
44C7C0D0: 20202020 20202020 20202020 20202020 †
44C7C0E0: 20202020 20202020 20202020 20202020 †
The first 96 bytes of any page is the page header. I think you can easily see that just about the entire header has all values of 0. If in this situation the page ID were the only problem with the page, Msg 8909 would have correctly shown up as a problem for table1. So, in this example, you can see that of the 13 errors, six of them are reported more than once. In reality, therefore, only seven pages have problems from this database.
Let's take a look at the errors for table2:
Msg 2511, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2, partition ID 72057594038452224, alloc unit ID 72057594042449920 (type In-row data). Keys out of order on page (1:210), slots 157 and 158.
In this case, there is only one error, and as you can see it is associated with a nonclustered index (index ID 2). You saw that the 8928 errors were associated with data pages (index ID 0). Can you guess what the recommended repair level for table2 would be? Because the output may not always be this simple, let's use DBCC CHECKTABLE on each table to know for sure.
Here are the results for CHECKTABLE for table1:
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:153) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:154) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:155) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:156) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:157) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:158) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKTABLE found 0 allocation errors and 6 consistency errors not associated with any single object.
DBCC results for 'table1'.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:153) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:154) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:155) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:156) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:157) could not be processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:158) could not be processed. See other errors for details.
There are 88 rows in 44 pages for object "table1".
CHECKTABLE found 0 allocation errors and 6 consistency errors in table 'table1' (object ID 2073058421).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (checkdb_test.dbo.table1).
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Notice the repair recommendation is still REPAIR_ALLOW_DATA_LOSS. Because these six pages are all data pages, a repair will actually result in lost data, because the pages will be de-allocated.
Here are the results of CHECKTABLE for table2:
DBCC results for 'table2'.
Msg 2511, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2, partition ID 72057594038452224, alloc unit ID 72057594042449920 (type In-row data). Keys out of order on page (1:210), slots 157 and 158.
There are 1000 rows in 3 pages for object "table2".
CHECKTABLE found 0 allocation errors and 1 consistency errors in table 'table2' (object ID 2089058478).
repair_rebuild is the minimum repair level for the errors found by DBCC
CHECKTABLE (checkdb_test.dbo.table2).
DBCC execution completed. If DBCC printed error messages, contact your system
administrator.
You can see that the repair recommendation is REPAIR_REBUILD, which means a rebuild of an index will correct any errors, and no data loss should occur.
Let's see what the output of CHECKDB looks like in this situation (because CHECKDB will report messages on what actions it took to repair any error it can fix):
DBCC results for 'checkdb_test'.
Service Broker Msg 9675, State 1: Message Types analyzed: 14.
Service Broker Msg 9676, State 1: Service Contracts analyzed: 6.
Service Broker Msg 9667, State 1: Services analyzed: 3.
Service Broker Msg 9668, State 1: Service Queues analyzed: 3.
Service Broker Msg 9669, State 1: Conversation Endpoints analyzed: 0.
Service Broker Msg 9674, State 1: Conversation Groups analyzed: 0.
Service Broker Msg 9670, State 1: Remote Service Bindings analyzed: 0.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:153) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:154) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:155) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:156) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:157) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:158) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
The error has been repaired.
CHECKDB found 0 allocation errors and 6 consistency errors not associated with any single object.
CHECKDB fixed 0 allocation errors and 6 consistency errors not associated with any single object.
DBCC results for 'sys.sysrowsetcolumns'.
There are 544 rows in 5 pages for object "sys.sysrowsetcolumns".
DBCC results for 'sys.sysrowsets'.
There are 81 rows in 1 pages for object "sys.sysrowsets".
DBCC results for 'sysallocunits'.
There are 92 rows in 1 pages for object "sysallocunits".
DBCC results for 'sys.sysfiles1'.
There are 2 rows in 1 pages for object "sys.sysfiles1".
DBCC results for 'sys.syshobtcolumns'.
There are 544 rows in 5 pages for object "sys.syshobtcolumns".
DBCC results for 'sys.syshobts'.
There are 81 rows in 1 pages for object "sys.syshobts".
DBCC results for 'sys.sysftinds'.
There are 0 rows in 0 pages for object "sys.sysftinds".
DBCC results for 'sys.sysserefs'.
There are 92 rows in 1 pages for object "sys.sysserefs".
DBCC results for 'sys.sysowners'.
There are 14 rows in 1 pages for object "sys.sysowners".
DBCC results for 'sys.sysprivs'.
There are 120 rows in 1 pages for object "sys.sysprivs".
DBCC results for 'sys.sysschobjs'.
There are 50 rows in 1 pages for object "sys.sysschobjs".
DBCC results for 'sys.syscolpars'.
There are 423 rows in 7 pages for object "sys.syscolpars".
DBCC results for 'sys.sysnsobjs'.
There are 1 rows in 1 pages for object "sys.sysnsobjs".
DBCC results for 'sys.syscerts'.
There are 0 rows in 0 pages for object "sys.syscerts".
DBCC results for 'sys.sysxprops'.
There are 0 rows in 0 pages for object "sys.sysxprops".
DBCC results for 'sys.sysscalartypes'.
There are 27 rows in 1 pages for object "sys.sysscalartypes".
DBCC results for 'sys.systypedsubobjs'.
There are 0 rows in 0 pages for object "sys.systypedsubobjs".
DBCC results for 'sys.sysidxstats'.
There are 116 rows in 2 pages for object "sys.sysidxstats".
DBCC results for 'sys.sysiscols'.
There are 228 rows in 1 pages for object "sys.sysiscols".
DBCC results for 'sys.sysbinobjs'.
There are 23 rows in 1 pages for object "sys.sysbinobjs".
DBCC results for 'sys.sysobjvalues'.
There are 114 rows in 17 pages for object "sys.sysobjvalues".
DBCC results for 'sys.sysclsobjs'.
There are 14 rows in 1 pages for object "sys.sysclsobjs".
DBCC results for 'sys.sysrowsetrefs'.
There are 0 rows in 0 pages for object "sys.sysrowsetrefs".
DBCC results for 'sys.sysremsvcbinds'.
There are 0 rows in 0 pages for object "sys.sysremsvcbinds".
DBCC results for 'sys.sysxmitqueue'.
There are 0 rows in 0 pages for object "sys.sysxmitqueue".
DBCC results for 'sys.sysrts'.
There are 1 rows in 1 pages for object "sys.sysrts".
DBCC results for 'sys.sysconvgroup'.
There are 0 rows in 0 pages for object "sys.sysconvgroup".
DBCC results for 'sys.sysdesend'.
There are 0 rows in 0 pages for object "sys.sysdesend".
DBCC results for 'sys.sysdercv'.
There are 0 rows in 0 pages for object "sys.sysdercv".
DBCC results for 'sys.syssingleobjrefs'.
There are 133 rows in 1 pages for object "sys.syssingleobjrefs".
DBCC results for 'sys.sysmultiobjrefs'.
There are 102 rows in 1 pages for object "sys.sysmultiobjrefs".
DBCC results for 'sys.sysdbfiles'.
There are 2 rows in 1 pages for object "sys.sysdbfiles".
DBCC results for 'sys.sysguidrefs'.
There are 0 rows in 0 pages for object "sys.sysguidrefs".
DBCC results for 'sys.sysqnames'.
There are 91 rows in 1 pages for object "sys.sysqnames".
DBCC results for 'sys.sysxmlcomponent'.
There are 93 rows in 1 pages for object "sys.sysxmlcomponent".
DBCC results for 'sys.sysxmlfacet'.
There are 97 rows in 1 pages for object "sys.sysxmlfacet".
DBCC results for 'sys.sysxmlplacement'.
There are 17 rows in 1 pages for object "sys.sysxmlplacement".
DBCC results for 'sys.sysobjkeycrypts'.
There are 0 rows in 0 pages for object "sys.sysobjkeycrypts".
DBCC results for 'sys.sysasymkeys'.
There are 0 rows in 0 pages for object "sys.sysasymkeys".
DBCC results for 'sys.syssqlguides'.
There are 0 rows in 0 pages for object "sys.syssqlguides".
DBCC results for 'sys.sysbinsubobjs'.
There are 0 rows in 0 pages for object "sys.sysbinsubobjs".
DBCC results for 'sys.queue_messages_1977058079'.
There are 0 rows in 0 pages for object "sys.queue_messages_1977058079".
DBCC results for 'sys.queue_messages_2009058193'.
There are 0 rows in 0 pages for object "sys.queue_messages_2009058193".
DBCC results for 'sys.queue_messages_2041058307'.
There are 0 rows in 0 pages for object "sys.queue_messages_2041058307".
DBCC results for 'table1'.
Repair: The page (1:153) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:154) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:155) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:156) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:157) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:158) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:153) could not be processed. See other errors for details.
The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:154) could not be processed. See other errors for details.
The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:155) could not be processed. See other errors for details.
The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:156) could not be processed. See other errors for details.
The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:157) could not be processed. See other errors for details.
The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data): Page (1:158) could not be processed. See other errors for details.
The error has been repaired.
There are 88 rows in 44 pages for object "table1".
CHECKDB found 0 allocation errors and 6 consistency errors in table 'table1' (object ID 2073058421).
CHECKDB fixed 0 allocation errors and 6 consistency errors in table 'table1' (object ID 2073058421).
DBCC results for 'table2'.
Repair: The Nonclustered index successfully rebuilt for the object "dbo.table2,
PK_ _table2_ _7D78A4E7" in database "checkdb_test".
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2 will be rebuilt.
The error has been repaired.
Msg 2511, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2, partition ID 72057594038452224, alloc unit ID 72057594042449920 (type In-row data). Keys out of order on page (1:210), slots 157 and 158.
The error has been repaired.
There are 1000 rows in 3 pages for object "table2".
CHECKDB found 0 allocation errors and 1 consistency errors in table 'table2' (object ID 2089058478).
CHECKDB fixed 0 allocation errors and 1 consistency errors in table 'table2' (object ID 2089058478).
CHECKDB found 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
CHECKDB fixed 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
In this case, I've highlighted the important messages that are added as part of repair:
Repair: The page (1:153) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:154) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:155) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:156) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:157) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
Repair: The page (1:158) has been deallocated from object ID 2073058421, index ID 0, partition ID 72057594038321152, alloc unit ID 72057594042318848 (type In-row data).
You can see from these messages that the method to repair these pages is to de-allocate them. There is no method to retrieve any data that could have been on these pages:
Repair: The Nonclustered index successfully rebuilt for the object "dbo.table2,
PK_ _table2_ _7D78A4E7" in database "checkdb_test".
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 2089058478, index ID 2 will be rebuilt.
This is the error for table2. You can see that the only method needed to repair the table is to rebuild the nonclustered index:
CHECKDB found 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
CHECKDB fixed 0 allocation errors and 13 consistency errors in database 'checkdb_test'.
This is the most important message to look for. Was the repair successful? If the number of messages fixed is less than those found, the repair couldn't fix all the problems.
Repair Versus Restore
How do you know whether you should restore from a database backup or use repair? One consideration is what I've described in the section "Best Practices": the primary feature for a recovery of data for the SQL Server product is RESTORE. Repair has been a great feature of DBCC CHECKDB, but you should use it only for emergency purposes. We talk shortly about what REPAIR_ALLOW_DATA_LOSS really means, but you should know that in most cases it does mean you will probably lose data. Here is the problem. Aside from knowing what table is affected by the data loss, you won't know what rows you lost. It is now your responsibility to understand what level of logical consistency you now have in your database.
If you have the proper backup strategy in place, you should almost never have to rely on repair. But the functionality is still there in case you need it. (Perhaps you couldn't predict that a disk that holds backups and is always reliable has an unexpected failure. Of course, I think you should have a plan for even this scenario.) You will find that if you call Microsoft PSS to talk about a situation where you have a damaged database and want to talk about whether repair will fix the problem, instead of just helping you run REPAIR, PSS will almost always talk to you about using a backup first.
What Does Each Error Mean?
Each error produced by DBCC CHECKDB is documented on the Events and Errors Message Center. Some of these errors that apply to both SQL Server 2000 and 2005 are documented under SQL Server 2000. Here is an example of the documentation of an error that can be encountered in CHECKDB:
| Details |
|
| Product: |
SQL Server |
| Event ID: |
2531 |
| Source: |
MSSQLServer |
| Version: |
9.00.1281.60 |
| Symbolic Name: |
DBCC_BTREE_SIBLING_LEVEL_MISMATCH |
| Message:: |
Table error: object ID %d, index ID %d, partition ID %I64d, alloc unit ID match level %d from the previous %S_PGID.
|
Explanation
Two pages are linked as immediate neighbors on a level of a B-tree. The level, LEVEL2, on the right page, P_ID2, does not match the level, LEVEL1, on the left page, P_ID1.
To determine which page is incorrect, examine the surrounding pages and the contents of the two pages in question. Also, look for MSSQLEngine_8931 errors, which indicate B-tree parent-child-level mismatches.
Possible Causes
This error can be caused by one of the following problems:
- A random page corruption.
- A bug in the B-tree manager.
- If LEVEL1 and LEVEL2 are 0 or 1 and the index is a clustered index, there might be a bug in the Access Methods code that determines page levels. In ssVersion2005, for a clustered index, page levels progress from 0, 1, 2 to X, but in ssVersion2000, for a clustered index, page levels progress from 0, 0, 1, 2 to X, where X is the maximum depth of the B-tree.
User Action
Run hardware diagnostics and correct any problems. Also examine the Microsoft Windows system and application logs and the SQL Server error log to see whether the error occurred as the result of hardware failure. Fix any hardware-related problems that are contained in the logs.
If you have persistent data corruption problems, try to swap out different hardware components to isolate the problem. Check to make sure that the system does not have write caching enabled on the disk controller. If you suspect write caching to be the problem, contact your hardware vendor.
Finally, you might find it useful to switch to a new hardware system. This switch may include reformatting the disk drives and reinstalling the operating system.
Results of Running Repair Options
Repair will rebuild the index.
This is the general format for each message:
- An explanation of what the error means with regard to what type of data is
damaged
- Possible causes of the problem
- What actions you can take
- What repair will do to correct the problem
The actions to take seem general (check the hardware and so forth), but you will see when I talk about finding the root cause of corruption that these actions make sense for many errors detected by CHECKDB.
What Does REPAIR_ALLOW_DATA_LOSS Really Mean?
This repair option means that CHECKDB has detected at least one error that could result in data loss. I've already talked about how this may not apply to every error detected. This is why the repair recommendation message says the minimum repair level:
repair_allow_data_loss is the minimum repair level…
The method for repair to correct a problem recommended by REPAIR_ALLOW_DATA_LOSS usually is to de-allocate the page or groups of pages (extent) based on the error. In some scenarios, a data row could be deleted rather than the entire page, but they are limited (for example, if the LOB pages associated with the row are damaged and must be de-allocated). If any part of the structure of the data row itself is damaged, the entire page must be de-allocated.
I know of two situations in which the recommendation is REPAIR_ALLOW_DATA_LOSS but repair actually can correct the problem without a loss of data.
Damaged index page
If DBCC CHECKDB encounters an error such as a checksum failure for a nonclustered or nonleaf page of a clustered index, it recommends REPAIR_ALLOW_DATA_LOSS. This is because it cannot trust the allocation unit ID information on the page to know for sure it is a nonclustered index page. So it must de-allocate the page. However, the repair logic in this case also rebuilds the index so that the result is simply to rebuild the nonclustered index.
PFS free space error
You may encounter the following error when running DBCC CHECKDB:
Msg 8914, Level 16, State 1, Line 1
Incorrect PFS free space information for page (1:128) in object ID 60, index ID 1, partition ID 281474980642816, alloc unit ID 71776119065149440 (type LOB data). Expected value 0_PCT_FULL, actual value 100_PCT_FULL.
In this case, the CHECKDB code recommends REPAIR_ALLOW_DATA_LOSS but repair simply fixes the free space information in the PFS page with no data loss occurring as a result.
Root Cause Analysis Before Recovering
It is perfectly understandable for you in situations where CHECKDB reports an error to want to correct the problem as soon as possible, either by restoring from a backup or by using repair. However, if you want to have the ability to determine the possible root cause of the problem, I recommend you take the following steps:
- Try to back up the damaged database, and keep it before restoring or repair.
You may not have time to do this. That is understandable in production situations. But if you have the time and disk space, please take this step. If you contact Microsoft technical support, this is one of the things they will want you to do.
- Take the time to investigate any possible hardware or system problems.
- Use some of the techniques I describe later in this chapter for root cause analysis to check your system.
What if Repair Doesn't Work?
There are some errors that CHECKDB cannot fix:
- Critical system table damage
- PFS page damage
- Data purity errors
So what if you run repair and it doesn't fix all errors? In previous versions, I would tell you to just try it again. There were some situations where repair had to be run more than once to clean up all errors. Those have been corrected for SQL Server 2005. If repair simply can't fix all of your errors, your only choice is to copy all data possible from the database using BCP or SELECT and move it to another database.
Copying Data Versus Repair
Are there situations where you should consider trying to copy data using BCP or SELECT before you try to use repair? The only example I know of is where you may want to try and copy valid rows from a page where only a single row or a few rows are damaged. Because a damaged row would cause repair to de-allocate the page, you may want to see whether you can copy valid rows from the page except for these damaged rows. Of course, this makes sense only if these specific rows are that crucial to your business.
The only method to do this is to force the use of a nonclustered index to select specific rows (using the index hint in the SELECT statement) based on values using the index to obtain rows "around" the damaged row(s). This assumes the nonclustered index itself is intact. If you don't have a valid nonclustered index, you cannot create one in this situation. This technique can work because when the engine finds a specific row based on using the nonclustered index (assuming you have the right WHERE criteria), it can avoid scanning unnecessary rows. I've included an exercise at the end of this chapter for you to walk through an example of how to do this.
Find the Root Cause of Corruption: The Checklist
So, let's say you encounter errors from DBCC CHECKDB, but this is the first time you have ever seen it. You decide you don't want to invest much time in finding the cause. A restore of the database resolves the problem, it didn't take long, and you want to move on to other problems you need to solve. That may be fine, but what if at some point a few weeks you run CHECKDB and the same database has errors again? What if a different database is now damaged?
I've put together here a checklist of what I as a PSS engineer over the years use when a customer wants me to investigate the root cause of a database corruption problem as evidence of reoccurring errors from CHECKDB (or perhaps checksum errors on various pages).
As you read through this list, keep in mind one key fact about root cause of database corruption: Almost all database corruption situations are caused by a problem with the underlying disk system (drivers, controllers, firmware, disk, and so on). I'm not just telling you this because I work for Microsoft. I give you this observation based on my experience of seeing many customer reports of database corruption through the years of supporting SQL Server 7.0, 2000, and now 2005. As I go through this list, I indicate whether a particular piece of information you collect could point to a system problem or perhaps some other possible cause.
You should also know that in my experience many corruption problems go unsolved. Typically this is because
- There may not be an obvious sign of why the corruption occurred (for example, no system event log entry for a hardware problem).
- Many corruption problems do not reoccur.
- The diagnostics and time required to find the cause can be expensive.
Time will tell, but I'm hopeful that the database checksum feature for SQL Server 2005 will help improve the first reason. This is because a checksum error (Msg 824) would indicate some alteration to a database page had occurred since the engine wrote the page to disk.
As you read through this checklist, keep in mind that just about any information I discuss for data collection can be obtained using the SQLDIAG.EXE utility. This is the primary data collection tool to ensure all information is captured at a consistent point in time.
Keep Track of Problem Details and History
The root cause of any problem is always difficult if you don't have the right information and, in some cases, the history behind the problem. For reoccurring corruption problems, this is important.
Review ERRORLOG and Event Log
I hope you know how important it is to have the SQL ERRORLOG and System/Application event logs when working on problems of this magnitude. The typical information I look for when reviewing an ERRORLOG for this type of problem is as follows:
- "Last known good" for CHECKDB of databases
- Summary message on errors and repair
- Any stack dumps or critical errors (such as Msg 824)
- The use of any extended procedures or sp_OA COM objects (memory scribblers)
The event log is important, too. First, perhaps your ERRORLOG files have wrapped, but important past information in the event log has not been cleared.
Next, the System event log may contain information about possible disk, hardware, or Windows system problems. Table 2-2 contains a few common event log entries seen by PSS in the context of corruption cases, with pointers on content to read about each type of error.
TABLE 2-2 Event log entries seen by PSS
| Source |
Error |
Notes |
| any |
The device, DeviceScsi.cpqcissm1, did not respond within the timeout period. |
See KB 259237 and 154690 |
| Disk |
The driver detected a controller error on DeviceHarddisk4DR4 |
See KB 259237 and 154690 |
| Disk |
The device, DeviceHarddisk14
DR14, is not ready for access yet.
. |
See KB 259237. |
| SaveDump |
The device, DeviceHarddisk14
DR14, is not ready for access yet. |
Engage Windows Support. |
| Disk |
An error was detected on device DeviceHarddisk3DR3 during a paging operation. |
This error indicates an I/O error during a hard page fault. Discussed in KBs 304415 and 305547 . |
| ClusSvc |
Cluster disk resource Disk J:: is corrupt. Running ChkDsk /F to repair problems. |
KB 259237: "can be the result of SCSI host adapter configuration issues" or a malfunctioning device. Also see 311081 and 259237. |
| ClusSvc |
The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume F: |
Note that there is at least one case (320866) where this error is erroneously raised |
| Ntfs |
The files system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume F. |
Note that there is at least one case (320866) where this error is erroneously raised |
| Disk |
Data was recovered using error correction code on device Device Harddisk5DR5. |
|
| EventLog |
The previous system shutdown at 9:45:36 AM on 9/5/2004 was unexpected |
Typically indicates a hard server cycle after a hang or a blue screen. Could also indicate something more mundane such as a power failure if the system isn't protected by UPS. |
| Ftdisk |
{Lost Delayed-Write Data} The system was attempting to transfer file data frombuffers to DeviceHarddiskVolume4. The write operation failed, and only some of the data may have been written to the file. |
Indicates a failed I/O request. Discussed in KBs 311081 and 304415. Could be anything from a firmware bug to faulty SCSI cables |
Perform Hardware Evaluation and Updates
As previously mentioned, many corruption cases turn out to be caused by fault disk systems. Yet for as many of these I've seen, I also find that many customers don't keep their disk hardware and system up-to-date. Some disk vendors provide basic diagnostics you can run regularly for the disk system, and many have updates for their drivers and firmware (much like Microsoft has for its software). So, include evaluation and updates for your disk system (plus all of your server hardware) as a regular part of your maintenance plan and strategy.
Install the Windows PAE Fix (KB 838765)
If you read anything in this section, pay attention to what I'm about to tell you. Any Windows 2000 or 2003 server using the physical addressing extensions (PAE) should make sure they have the necessary Windows fixes as described in Microsoft Knowledge Base article 8387765 applied. If you are running Windows 2003 Service Pack 1, you are covered. If not, read the article and get the fixes the article describes.
This problem in Windows can result in unpredictable behavior, including access violations and database corruption. The reason is that the bug can result in unexpected frames from Windows pages for other processes to be written on top of memory pages for the SQL Server process.
Note that on some systems, PAE is enabled even if you don't specify the /PAE switch in your boot.ini. My advice is to get the fixes for this problem installed on your server.
Run CHKDSK.EXE
This program is often forgotten when analyzing root cause of corruption problems. If the CHKDSK.EXE program from Windows shows damage to the NTFS file system for the drive where SQL Server database and log files exist, your search for a cause should stop there. Any damage to the file system could result in database or log corruption.
One common question for CHKDSK.EXE is "Can I run the repair option for SQL Server files?" The answer is yes, but you lose data if repair moves NTFS clusters where SQL Server data is stored. The actions to take before running CHKDSK repair depend on your available backups and the state of the disk system. If you believe the disk system is not a problem (perhaps you don't see any disk-related problems aside from NTFS errors in the event log) and you can restore from a backup, proceed with CHKDSK repair. If you cannot restore from a backup, however, you might want to consider copying data from the SQL database in question before you do anything else. This is because CHKDSK repair could move clusters of your SQL Server database file, and you might not know what is really damaged or available to copy. One thing you should not do is rely on DBCC CHECKDB repair when NTFS problems exist. This is the same advice I have provided for disk system problems. If the disk is damaged, running CHECKDB repair is not a wise decision. Why repair SQL Server pages based on a faulty disk system? Who knows whether repair will work or itself encounter corruption problems.
Evaluate CHECKDB Results
Saving the result of DBCC CHECKDB when errors are detected is the most important piece of information to analyze the possible root cause of corruption. If you don't know what errors were encountered, it is difficult to discuss the cause of corruption. My philosophy is that you should save any execution of DBCC CHECKDB whether it was successful or not. Keep the results of CHECKDB in the same LOG directory where ERRORLOG files are kept.
Inspect Damaged Databases
In some cases, just looking at the errors raised by DBCC CHECKDB is not enough to understand a possible root cause. It is important to look at the actual damage to a page to understand a possible cause. It may be important to also look at previous copies of the database (through older backups) to look for any special patterns of the damage.
This is one reason why it is important to back up a database before you repair it or at least dump the pages with DBCC PAGE.
Let's say you encounter a checksum error on a page like the following:
Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0xc3e99060; actual: 0x78dc1f61). It occurred during a read of page (1:152) in database ID 17 at offset 0x00000000130000 in file 'C:Program FilesMicrosoft SQL ServerMSSQL.1MSSQLDATAtest.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be
corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Let's say you look at this page with DBCC PAGE and get the following output (I included only a portion of the page):
DBCC PAGE(17, 1, 152, 2)
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
PAGE: (1:152)
BUFFER:
BUF ((Content component not found.)) x02BEA7C0
bpage = 0x04DC0000 bhash = 0x00000000 bpageno = (1:152)
bdbid = 17 breferences = 3
bUse1 = 25322
bstat = 0xc00809 blog = 0x999a2159
bnext = 0x00000000
PAGE HEADER:
Page ((Content component not found.)) x04DC0000
m_pageId = (1:152) m_headerVersion = 1
m_type = 1
m_typeFlagBits = 0x4 m_level = 0 m_flagBits = 0x8200
m_objId (AllocUnitId.idObj) = 67 m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057594042318848
Metadata: PartitionId = 72057594038321152 Metadata: IndexId = 0
Metadata: ObjectId = 2073058421 m_prevPage = (0:0) m_nextPage = (0:0)
pminlen = 12 m_slotCnt = 1 m_freeCnt = 8079
m_freeData = 111 m_reservedCnt = 0
m_lsn = (37:79:3)
m_xactReserved = 0 m_xdesId = (0:0) m_ghostRecCnt = 0
m_tornBits = -1008103328
Allocation Status
GAM (1:2) = ALLOCATED SGAM (1:3) = ALLOCATED
PFS (1:1) = 0x61 MIXED_EXT ALLOCATED 50_PCT_FULL
DIFF (1:6) = CHANGED
ML (1:7) = NOT MIN_LOGGED
DATA:
Memory Dump ((Content component not found.)) x4454C000
4454C000: 01010400 00820001 00000000 00000c00 †................
4454C010: 00000000 00000100 43000000 8f1f6f00 †........C.....o.
4454C020: 98000000 01000000 25000000 4f000000 †........%...O...
4454C030: 03000000 00000000 00000000 6090e9c3 †............`...
4454C040: 00000000 00000000 00000000 00000000 †................
4454C050: 00000000 00000000 00000000 00000000 †................
4454C060: 10000c00 01000000 01000000 0200fc00 †................
4454C070: 68656c6c 6f776f72 6c640000 00000000 †helloworld......
4454C080: 0000ba03 00000000 00000000 00000000 †................
4454C090: 00000000 00000000 00000000 00000000 †................
4454C0A0: c0000080 00000000 00000000 00000000 †................
4454D9A0: 00000000 00000000 00000000 00000000 †................
4454D9B0: 00000000 00000000 00000000 00000000 †................
4454D9C0: 00000000 00000000 00000000 00000000 †................
4454D9D0: 00000000 00000000 00000000 00000000 †................
4454D9E0: 00000000 00000000 00000000 00000000 †................
Let's say you know that this page is for a table that shouldn't contain any character columns. Notice the "helloworld" string stored on the page. This may be a hint to you that something wrote incorrectly the string helloworld on a database page in memory and then the page was written to disk. Now it is damaged. I've seen this exact type of problem occur when a buggy extended stored procedure wrote strings on database pages, causing corruption.
Use the Database/Log "Replay" Technique
Suppose you don't see any common disk system type of errors in the event log, SQLIOSTRESS doesn't encounter any errors, and CHKDSK doesn't report any problems. Now what? Well, first you should know that the underlying cause could still be the disk system. I've seen situations where the event log has no errors, CHDDSK is clean, and SQLIOSTRESS running for hours showed no errors. Yet when the customer replaced the SCSI adapter or disk drive, the problem went away. I don't have scientific explanations for these situations except to say that some problems could be timing-related and specific to IO patterns from the use of SQL Server in production. Existing tools or utilities don't expose the problem.
Having said that, I've also been in situations in which I suspected SQL Server could be the cause due to symptoms such as the following:
- The errors from CHECKDB are always with the same table and perhaps are even the same type of errors (say row damage to an index).
- A "replay" of a clean database full backup and a series of transaction log backups show the errors from CHECKDB.
I've seen situations where the first point is true, but the problem was still system-driven just because the index or table was a hotspot and was used frequently. I've even seen the second point being a system problem just because of the nature of what was recorded in the transaction log.
Having said that, if you can take a full database backup and restore a series of transaction log backups to replay errors from CHECKDB, you should engage Microsoft PSS. Be careful here. You must first ensure that the full database backup is clean. In other words, if you just restored the database backup, CHECKDB should report no errors.
The reason this technique is important is that it indicates some transaction log record can be redone or undone and cause an error to occur with CHECKDB. If you replay a backup and transaction logs and you don't find any errors from CHECKDB, the problem must have occurred on the original database or log file and is not baked into a record from the transaction log.
One possible explanation at this point is that a database page was damaged when it was written to disk or after being written to disk. Database checksum is the primary technique to discover if a page was damaged after SQL Server wrote it to disk. Let's look at another alternative and methods to detect damage to a page before it is written to disk.
Use SQL Server IO Audits
I've described in this chapter the new database checksum feature for SQL Server 2005. Now I show you some more advanced techniques used to detect certain types of damage to pages in addition to checksum. It is important to know that the use of some of these features, usually by trace flags, can have performance implications for your server. Database checksum was designed into the product and tested for performance (which is why it is the default option for a database in 2005). My recommendation is that you consult with Microsoft PSS when considering these options.
Stale Read (Trace Flag 818)
A stale read occurs when SQL Server writes a modified page to disk but the disk system returns a previous version of the page (perhaps from hardware cache). A lost write occurs when SQL Server modifies a page and writes it to disk, but the disk system never stores this modification to disk, so the previous version is returned when reading the page from disk. These differences are subtle, and the symptoms of the problem can appear to be the same: A modification to disk appears to be lost. One primary difference is that a stale read can result in correct data after something like a system reboot occurs to clear the cache.
It is important to know that this situation would not be detected by a database checksum. This is because the page itself is valid based on the checksum value; the hardware is simply giving back an unexpected version of the page.
In SQL Server 2000, the SQL Server development team added logic to detect this problem if an error for a page was encountered such as Msg 823 or 605. If you enabled trace flag 818, the engine would keep track of the latest LSN value for a database page in memory. When an error was encountered, the engine would compare the LSN value on the read page to the last known modified LSN in its "list of LSN values." If they didn't match, an error like the following was written to the ERRORLOG:
SQL Server has detected an unreported OS/hardware level read or write
problem on Page (1:75007) of database 12
LSN returned (63361:16876:181), LSN expected (63361:16876:500)
Contact the hardware vendor and consider disabling caching mechanisms to
correct the problem
In SQL Server 2000 Service Pack 4 and SQL Server 2005, the SQL Server development team enhanced the design of this trace flag to perform the LSN check on every read of a page and to store the LSN list in a more efficient hash table design. This auditing is not only by default and requires the trace flag to be enabled at server startup time. Again, you should use this trace flag only if you are having difficulty tracking down the cause of the corruption problem.
Eliminate Possible Memory Scribbler Software
If you think the problem is a damaged database page before it is written to disk, seriously consider eliminating any software running in the SQL Server process space that could cause unexpected memory damage. This includes user-written extended stored procedures, COM objects loaded by sp_OA, and linked server providers. I'm not saying this because it is a safe thing to do. I'm telling you this because I've seen things like user-written extended stored procedures have bugs that overwrite SQL Server database pages leading to corruption as they are written to disk.
If you can't disable these custom objects and procedures, considering moving them out of the process space for SQL Server. The SQL Server documentation provides information on how to run COM objects out of process with sp_OA and linked servers out of process when configuring the linked server. An extended procedure would be run out of process on the primary SQL Server by installing it on another SQL Server and executing it from the primary as a SQL remote stored procedure call.
Relocate Files to an Alternative Location
The last item on the root cause checklist may sound too simple, but I put this here because I've seen it work before. If you have tried all these options and are still scratching your head for a possible cause, it could still be your disk system. I've had customers simply move their database (or a system database such as tempdb) to another disk drive or disk system (such as a SAN or off of a SAN), and the problem has disappeared. To be fair and accurate, the move of the files might have changed the timing of the problem, and it could still be SQL Server. In many of these cases, however, I've seen customers then replace the original disk controller or disk system, and the problem will permanently go away. Don't discount this simple technique when you are thinking of all of these trace flags and tools to use.
Use the following table of contents to navigate to chapter excerpts, or click here to view Data corruption and recovery issues in its entirety.
| About the book: |
|
| In this book, bestselling author and SQL Server guru Ken Henderson has worked with the SQL Server Development team to write "the" troubleshooting SQL Server 2005 book. All of the content comes directly from the creators and developers of SQL Server. Ken has edited each chapter to ensure that the writing is clear and sound. Written by the people who know the product best - the people who built it - SQL Server 2005 Practical Troubleshooting teaches developers and dbas how to troubleshoot and diagnose problems they may encounter with SQL Server. Purchase the book from Addison-Wesley Publishing |
| ABOUT THE AUTHOR: |
|
| Ken Henderson has been a developer for more than 25 years. He has worked with SQL Server since 1990 and has built software for a number of firms throughout his career, including H&R Block, the Central Intelligence Agency, the U.S. Navy, the U.S. Air Force, Borland International, JP Morgan, and various others. He joined Microsoft in 2001 and is currently a developer in the Manageability Platform group within the SQL Server development team. He is the creator of SQL Server 2005's SQLDiag facility and spends his days working on SQL Server management tools and related technologies. He is the author of eight books on a variety of computing topics, including the popular Guru's Guide series of SQL Server books available from Addison-Wesley. He lives with his family in the Dallas area and may be reached via email at khen@khen.com. |