Software and Related Stuff: 04/01/2003

Breaking open the black box. Look beyond inputs and outputs to locate the cause of those "can't reproduce" production errors.

Recently I was involved in testing a J2EE application that failed with SQL errors when run in a simulated production environment. We tracked the problem down to a PreparedStatement that wasn't being closed, which eventually caused the DBMS to run out of cursors. The error didn't show up in test, not because production had a larger volume of transactions or a bigger database, but because of differences in garbage collection. The database resources associated with a PreparedStatement are freed when the object is garbage collected, and with the heap size used in QA, garbage collection was frequent enough that the number of open cursors never exceeded the DBMS's limit. With the much larger heap size in our simulated production environment, collections were too infrequent for this to happen, and the cursors associated with the PreparedStatements stayed open.
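The fix for a leak like this is to close the statement explicitly rather than leave it to the garbage collector. Here is a minimal sketch, assuming Java 7 or later (try-with-resources did not exist when this was written; a finally block does the same job on older JVMs). The class name StatementCleanup and the helper queryInt are made up for illustration and are not taken from the application described above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StatementCleanup {
    /**
     * Hypothetical helper: runs a one-parameter query and returns the first
     * column of the first row as an int, or 0 if there are no rows.
     * The statement and result set are closed when the try blocks exit,
     * whether normally or by exception, so the cursor is released
     * immediately instead of whenever the object happens to be collected.
     */
    public static int queryInt(Connection conn, String sql, int param) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, param);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
}
```

With this pattern the number of open cursors no longer depends on heap size or collection frequency, so QA and production behave the same way.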

I've seen similar "works in test but fails in production" problems in other systems, such as a C program that didn't close file descriptors (and didn't test the return value of open()). Underlying this is an inherent deficiency in most QA methodologies: testing normally considers only program inputs and outputs. This is fine for traditional applications that run for some finite amount of time, at the end of which their outputs can be compared to expected results, but it is insufficient for server applications that run indefinitely. QA needs to track a third parameter: internal system state.

The SQL cursor problem I described was identified by using Oracle dynamic views (system tables that expose the current state of the Oracle server) to obtain the SQL query that was causing the problem. The cause of the errant C program's problem was discovered when a system call trace revealed that open() was returning unusually high and steadily increasing file descriptor values (most UNIX systems support either truss or strace for tracing a process's system calls). Incorporate checks on these values into your normal QA process and you can stop those "unreproducible" production errors before they happen.
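A check on internal state can be as simple as sampling a resource count before and after a test run and failing if it grew. Below is a rough sketch for file descriptors in a Java server, assuming a HotSpot/OpenJDK-style JVM on UNIX that exposes com.sun.management.UnixOperatingSystemMXBean. The class FdWatch, its method names, and the tolerance parameter are illustrative assumptions, not part of any existing QA tool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdWatch {
    /**
     * Returns the number of file descriptors this JVM currently has open,
     * or -1 if the platform bean does not expose that figure
     * (for example, on a non-UNIX system).
     */
    public static long openFdCount() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1;
    }

    /**
     * Reports a suspected leak when both samples are valid and the count
     * grew by more than the allowed tolerance between them.
     */
    public static boolean looksLikeLeak(long before, long after, long tolerance) {
        return before >= 0 && after >= 0 && (after - before) > tolerance;
    }
}
```

A test harness would call openFdCount() before and after driving a batch of requests through the server, then fail the run if looksLikeLeak returns true. The same before/after comparison applies to open cursor counts read from the database's own state views.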

Sunday, April 13, 2003