Sunday, January 04, 2004

How cheaters show what's wrong with Computer Science education

Any online programming forum gets a large number of "do my homework for me" postings. What makes it so obvious that these are homework questions reveals a great deal about what's wrong with undergraduate computer science education.

The failure to update introductory curricula to reflect changes in programming languages is responsible for much of this obviousness. The traditional syllabus of implementing data structures such as lists and trees has little pedagogical value now that these are standard language components; while students still need to know what these are and how to use them, there's no point in having them code half a dozen variants of each. Data structures should be covered the same way operating systems are, as a broad survey of the subject area.

The other, more important reason a posting is obviously homework is the complete lack of design choices associated with the problem. The requirements are precisely defined and unambiguous, and usually there's one obvious choice for the program's data structures and algorithms. The challenges of real-world software engineering - incomplete requirements, architectural tradeoffs, etc. - are nowhere to be found. This deficiency isn't confined to introductory computer science courses, either - for example, the ACM Programming Contest problems are little more than "identify the algorithm implied by this task and implement it".

So what should a good second-semester programming assignment look like? A typical assignment should have:

  • High-level requirements that leave the design incompletely specified.
  • Multiple plausible (at first glance) implementations. There may be only one that actually works, but it should take some effort to discover it.
  • A design rationale document, submitted with the completed assignment, explaining the design choices and assumptions made by the student. In addition to forcing students to think through their design, this should improve the pathetically low writing skills common to computer science students.

Here's an example assignment:

  • Input data will be an XML document containing U.S. personal street addresses defined by the attached DTD. A sample data file is included, but the assignment will be graded using an input file of approximately 100,000 addresses.
  • Addresses are to be validated and loaded into a database. A valid address contains at least a last name, a street address or box number, a city, a valid 2-character state abbreviation, and a 5-digit numeric zip code.
  • The address database can be printed, in a format appropriate for mailing addresses, ordered by name or zip code as specified by the user.
  • Your assignment must execute correctly using no more than reasonable-limit MB of memory and in less than reasonable-limit minutes [the objective here is to rule out pathologically bad implementations].

So what's good about this assignment? There are lots of design decisions, from minor ones like whether to provide separate programs for each function to major ones such as how to parse the XML input; there are requirements that have to be researched, like determining the set of valid state abbreviations (there are more than the 50 states); and since the student doesn't get the data used to grade the assignment, he or she has to actually think about how the code should work instead of attempting to get the correct output by trial and error.
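
To show how many small decisions hide inside even the validation rule, here is a minimal Python sketch of it. The field names (last_name, street, box, city, state, zip) are hypothetical, since the DTD isn't reproduced here, and VALID_STATES is deliberately incomplete: researching the full set is part of the assignment. Trimming whitespace and accepting lower-case state codes are assumptions too, exactly the kind of choice the rationale document would have to defend.

```python
import re

# Assumption: a deliberately partial set for illustration only. The real list
# (states, DC, territories, military codes) still has to be researched.
VALID_STATES = {"CA", "NY", "TX", "DC", "PR", "GU", "AE"}

# Exactly five ASCII digits; ZIP+4 is rejected here, which is a design choice.
ZIP_RE = re.compile(r"[0-9]{5}")


def is_valid_address(addr):
    """Apply the assignment's validity rule to one address record (a dict).

    The keys used below are hypothetical stand-ins for whatever the DTD defines.
    """
    def text(field):
        # Treat a missing field, None, and whitespace-only text the same way.
        return (addr.get(field) or "").strip()

    return (
        bool(text("last_name"))
        and bool(text("street") or text("box"))   # street address OR box number
        and bool(text("city"))
        and text("state").upper() in VALID_STATES  # assumption: case-insensitive
        and ZIP_RE.fullmatch(text("zip")) is not None
    )
```

Even this much raises questions the assignment never answers: is "ca" a valid state, is a ZIP+4 code an error, and does a record with both a street and a box number count once or twice?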

The best thing about this assignment is that it's almost impossible to cheat on it using the usual "post to a forum and cut and paste the response" technique. Because of the design and requirement decisions implicit in this assignment, posting it to a forum is likely to produce almost as many solutions as there are responses: some good, most bad (probably 90% will try to parse the input into a DOM tree, which will fail for the complete input file), and a lot of heated debate about which solution is best. Even if the cheater picks a good solution from the responses, there's still the design rationale document he or she has to turn in. Someone who doesn't understand how to solve the assignment is unlikely to be able to describe why their plagiarized solution works.
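
For comparison with the DOM approach, here is a rough Python sketch of one alternative: reading the file as a stream and discarding each record once it has been handled, so memory use doesn't grow with the number of addresses. It uses the standard library's xml.etree.ElementTree.iterparse. The element names (addresses, address, last_name, city) and the iter_addresses helper are made up for illustration, since the DTD isn't reproduced here, and the sketch does no DTD validation. It is one plausible design, not the only one that works.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real element names come from the assignment's DTD.
SAMPLE = (
    b"<addresses>"
    b"<address><last_name>Smith</last_name><city>Reno</city></address>"
    b"<address><last_name>Jones</last_name><city>Ames</city></address>"
    b"</addresses>"
)


def iter_addresses(source, record_tag="address"):
    """Yield one dict per record element without keeping the whole tree.

    Assumes records are direct children of the root element and that each
    record's fields are simple child elements containing text.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {child.tag: (child.text or "").strip() for child in elem}
            # Drop the records parsed so far so they can be garbage collected.
            root.clear()


if __name__ == "__main__":
    for record in iter_addresses(io.BytesIO(SAMPLE)):
        print(record)
```

Each record could then go straight to validation and into the database. Sorting by name or zip code is left to the database, which is yet another decision for the rationale document.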