F*cking testdata, how do they work

What is test data?

Database-backed server applications have been my main focus for the last five years. I have tested them manually, written automated tests and performed exploratory testing on them through client applications, web interfaces and even using tools like SoapUI. The same problem has been the bane of my existence over all those years: f*cking test data.

When researching this post I ran into a small problem: very few people seemed to agree on what test data actually is. Some people only considered the inputs; others included database content but not configuration data. So not a lot of the available information has been really helpful. Some of it was even downright infuriating. Reading a post explaining that you only really need test data for the trivial cases, and advocating a simple database dump for that purpose, made me want to hulk-smash my iPad on the floor.

I'm going to define test data as all data that can be modified by the users of the system. That includes administrators and maintainers as well as normal users. Specifically, it includes all data in the database and other data sources, as well as everything in the configuration files. When testing a specific piece of functionality, being able to control the configuration and database content is invaluable and in some cases indispensable.

Finally, note that I'm following Michael Bolton's and James Bach's lead in referring to automated tests as automated checks. It helps clarify the purpose and capabilities of automated tests vs. manual, exploratory ones.

Problems with test data

Setting up for testing has proven to be a really difficult problem to solve. For trivial cases and certain happy-path cases, a small amount of data seeded into the database before each test run might be enough. For more complex cases, an anonymized subset of the production data might fit the bill. Those two methods share the problem of not letting you properly model all of the interesting combinations of data that become relevant when new functionality is added or business logic changes. They also have severe problems with test isolation and repeatability. I often find myself spending much more time figuring out how to get the data I need into the database than actually testing the system with that data. In the absence of direct database access that time can grow by an order of magnitude.

Relying on ad hoc queries and database dumps, or on running complex management tools, makes testing really slow and brittle. This means that testing important real-world scenarios becomes slow, painful and sometimes impossible. Context also matters a great deal: what is fine, even speedy, for exploratory testing can become intolerably slow in the context of automated checks.

Different approaches, their pros and cons

There are several different approaches to getting data into the database to work with. They all have their weaknesses and some might even have strengths.

  • Full system generation

    This involves using the system under test as well as its management tools to generate the data you need. It can involve adding static-ish data with the management tools and performing several actions via the SUT, both to put the system into the state you want and to have the necessary data to perform your tests. This can be an excruciatingly slow process, and often it is not possible to create "invalid" data this way. That limits your testing considerably and prevents you from automating a lot of useful and important checks.

  • Database dumps

    This involves taking a snapshot of the database after you have seeded it with data for your tests, often generated using the full system generation technique. Then, to perform your tests, you pipe in the database dump, restart your application, and Bob's your uncle. While useful for exploratory testing, this technique breaks down on two fronts. Maintaining the database dumps becomes a major headache as database schemas change and as new functionality is added to the application. More seriously, if you rely on automated checks, the checks have a tendency to become extremely tightly coupled to the database dumps and, by extension, to each other. This leads to highly unstable automated checks, makes them hard to re-run, and ultimately erodes trust in the checks.

  • Ad hoc SQL queries

    This technique involves using pre-prepared SQL statements to add and remove the required data from the database. It has the nice property of allowing for good check isolation, and making the checks rerunnable becomes a breeze. However, maintaining all these disparate snippets of SQL in the face of schema and logic changes quickly becomes a nightmare. For exploratory testing, though, this is still my go-to option once I've exhausted the capabilities of the management tools and my patience with them.

  • Data services

    I have not seen much written about this method. It basically involves exposing a set of services to the testers that allow them to programmatically add data to the application, no matter how badly that data might break the assumptions made about it in the rest of the application. This solves the problem of full system generation not allowing you to add invalid or inconsistent data. I am pretty excited about it, but as I have not tried it out personally I don't have any basis for making a recommendation. I will note that in theory it should help with maintenance, as the application's data model could be reused to some extent. Its main weakness is that the service calls might add an unacceptable amount of overhead to setting up the data for a lot of automated checks. It also has the disadvantage of being in the developers' domain, so budgeting for it and getting management buy-in might be difficult.

  • ORMish solution (+optional DSL)

    This is my current favored solution. It involves using an ORM layer to shield the code of the automated checks from the complexities of, and changes in, the database while allowing full control of the data inserted there. It can be lightning fast, and with a suitable API/DSL, adding test data is done at the speed of thought. The downside, of course, is that maintenance of the ORM and the API/DSL itself can be difficult and expensive. I find that cost to be vastly outweighed by the increased expressiveness in defining my test data and the vastly increased speed with which I can write my automated checks.
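As a concrete illustration of the ad hoc SQL technique, here is a minimal sketch in Python using sqlite3. The table and column names are hypothetical (loosely modeled on the examples further down); the point is that pairing every setup statement with a teardown statement, and always running the teardown, is what keeps checks isolated and rerunnable.

```python
import sqlite3

# Pre-prepared setup/teardown SQL for one check. Table and column
# names are hypothetical examples.
SETUP_SQL = [
    "INSERT INTO customer (id, name, email) "
    "VALUES (1001, 'John', 'John@example.com')",
    "INSERT INTO product (id, name, price) "
    "VALUES (2001, 'Default product', 39.95)",
]
TEARDOWN_SQL = [
    "DELETE FROM product WHERE id = 2001",
    "DELETE FROM customer WHERE id = 1001",
]

def run_check_with_adhoc_sql(conn, check):
    """Seed the data, run the check, then always clean up."""
    for stmt in SETUP_SQL:
        conn.execute(stmt)
    conn.commit()
    try:
        return check(conn)
    finally:
        # Teardown runs even when the check fails, so the next
        # run starts from a clean slate.
        for stmt in TEARDOWN_SQL:
            conn.execute(stmt)
        conn.commit()
```

The maintenance problem is plain to see: every schema change ripples through all of these string literals.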

So what's the big deal?

The ORM solution has improved my productivity when writing automated checks immensely. I have now come to dread testing the applications for which I have not implemented such a thing.

My first iteration involved implementing a hyper-simple ORM layer in Python and manually entering the table definitions. With a quick hack for linking foreign keys, and sensible default values for other fields, I can write surprisingly succinct descriptions of my data.
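To give an idea of how little machinery is needed, here is a stripped-down sketch of what such an OMObject/Link layer might look like. It is simplified: the real layer also resolves Links against the database and handles the actual persistence, and only the defaults-plus-overrides mechanism is shown here.

```python
# A stripped-down sketch of the hyper-simple ORM layer. Only the
# defaults-plus-overrides mechanism is shown; Link resolution and
# the actual database inserts are omitted.
class Link(object):
    """A symbolic reference such as "default.customer.id" that is
    resolved to a real foreign key value at persist time."""
    def __init__(self, target):
        self.target = target

class OMObject(object):
    defaults = {}

    def __init__(self, **overrides):
        # Start from the class-level defaults, then apply the
        # per-instance overrides given by the test.
        self.fields = dict(self.defaults)
        self.fields.update(overrides)

    def insert_sql(self):
        # The table name is simply the class name.
        cols = sorted(self.fields)
        placeholders = ", ".join("?" for _ in cols)
        return "INSERT INTO %s (%s) VALUES (%s)" % (
            type(self).__name__, ", ".join(cols), placeholders)
```

Each table definition then boils down to a subclass with a defaults dictionary, as in the examples below.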

For example with the definitions of a few tables:

class customer(OMObject):
    defaults = {
        "id" : Link("default.customer.id"),
        "name" : "John",
        "email" : "John@example.com",
        "username" : "jogobobo",
        "passhash" : get_default_hash("John"),
        "date_created" : datetime.datetime.now()
        "active" : True
    }

class payment_info(OMObject):
    defaults = {
        "id" : Link("default.payment_info.id"),
        "customer_f_id" : Link("default.customer.id"),  # Link to default customer.
        "payment_data" : SomeSensibleDataHere()
    }

class product(OMObject):
    defaults = {
        "id" : Link("default.product.id"),
        "name" : "Default product",
        "price" : 39.95,
        "key" : None
    }
    def __init__(self, *args, **kwargs):
        self.key = unique_str(16)  # Unique constraint on key
        super(product, self).__init__(*args, **kwargs)

class payment(OMObject):
    defaults = {
        "id" : Link("default.payment.id"),
        "product_f_id" : Link("default.product.id"),
        "payment_info_f_id" : Link("default.payment_info.id"),
        "customer_f_id" : Link("default.customer.id"),
        "status" : "SETTLED",
        "date_created" : None,
        "date_completed" : None
    }
    def __init__(self, *args, **kwargs):
        self.date_created = datetime.datetime.now()
        self.date_completed = datetime.datetime.now()
        super(payment, self).__init__(*args, **kwargs)

We can create a consistent data model that can be persisted before the check is run and cleaned up automatically after it. We can even reference the data in the check, almost completely removing the reliance on magic constants. The use of Python's context managers greatly helps in removing obfuscating, error-prone boilerplate code.

For example:

def test_purchase_history():
    with TestData(in_model=get_purchase_history_model()) as datamodel:
        login_to_account(datamodel.customers[0])
        purchase_history_list = get_purchase_history()
        expected_history = generate_purchase_history(datamodel)
        assert purchase_history_list == expected_history,\
            "Expected purchase history to be %s but got %s"\
            % (expected_history, purchase_history_list)

def get_purchase_history_model():
    return DataModel(
        customer(id=Link("cust.id")),
        product(id=Link("prod1.id")),
        product(id=Link("prod2.id")),
        payment_info(id=Link("pminfo1.id"),
                     customer_f_id=Link("cust.id")),  # use default payment data
        payment_info(id=Link("pminfo2.id"),
                     customer_f_id=Link("cust.id"),
                     payment_data=DefaultPaypalData()),
        payment(id=None,
                product_f_id=Link("prod1.id"),
                payment_info_f_id=Link("pminfo1.id"),
                customer_f_id=Link("cust1.id")),
        payment(id=None,
                product_f_id=Link("prod2.id"),
                payment_info_f_id=Link("pminfo1.id"),
                customer_f_id=Link("cust1.id")),
        payment(id=None,
                product_f_id=Link("prod1.id"),
                payment_info_f_id=Link("pminfo2.id"),
                customer_f_id=Link("cust1.id")),
        payment(id=None,
                product_f_id=Link("prod2.id"),
                payment_info_f_id=Link("pminfo2.id"),
                customer_f_id=Link("cust1.id")),
        payment(id=None,
                product_f_id=Link("prod1.id"),
                payment_info_f_id=Link("pminfo2.id"),
                customer_f_id=Link("cust1.id"))
    )
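The TestData context manager that makes this work can be sketched as follows. This is a simplification: the real one resolves Links and talks to the actual database, which is abstracted here behind a db object.

```python
# A simplified sketch of the TestData context manager: persist the
# model's rows on entry, delete them on exit, so every check starts
# and ends with a clean database. The db object stands in for the
# real persistence layer; the real implementation would get its
# connection from configuration.
class TestData(object):
    def __init__(self, in_model, db=None):
        self.model = in_model
        self.db = db
        self.persisted = []

    def __enter__(self):
        for obj in self.model.objects:
            self.db.insert(obj)          # resolve Links, INSERT the row
            self.persisted.append(obj)
        return self.model

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Clean up in reverse order so rows holding foreign keys are
        # removed before the rows they point at; this runs even when
        # the check itself raised.
        for obj in reversed(self.persisted):
            self.db.delete(obj)
        return False  # never swallow check failures
```

Because the cleanup lives in __exit__, a failing assertion in the check body cannot leave stale rows behind, which is what makes the checks rerunnable.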

More often than not, though, the more complex the test scenarios get, the bigger and more unwieldy the data definitions get. Furthermore, reusing the same data model for running checks concurrently quickly leads to massive, hard-to-read data definitions that make newcomers to the project weep. The code above is already pretty impenetrable at first glance.

I'm still trying to develop a satisfactory solution to the readability issue. Clearly the API demonstrated above is lacking in both readability and expressiveness. Wrapping the ORM with domain objects simplifies maintenance significantly in the face of minor schema changes, although, as always, when sweeping changes are made a lot of tests will need to be fixed or even rewritten.

Results so far

I have found that the models become much easier to read as one works with them more. Furthermore, the schema changes I have maintained an ORM against in this manner have so far not required any changes to the checks, only sensible choices of default values in the definitions of the object model. The verbosity of the models has encouraged me to explore more in my testing and to discard many test cases that I had written checks against as unnecessary for good regression coverage. I end up testing a much broader swathe of the services and functionality I am tasked with, while keeping only a few powerful automated checks for the regression suite. I feel I am enjoying the best of both worlds: excellent coverage, and a suite of automated checks that does not drive the developers and testers insane with its fragility and maintenance costs.

I have received excellent feedback from my coworkers about the readability of the ORM approach. I also feel the pain when revisiting check suites where I did not exercise greater restraint in which test cases to keep as automated checks. So my current plan is to implement an inner DSL for Python that lets me represent the data models more succinctly and eliminates the need for explicit foreign key references via the Link() hack. I might expand it to a fuller outer DSL at some point to allow for better error handling and detection, but for now an inner DSL allowing me to do the following would be quite sufficient.

def get_purchase_history_model():
    return DataModel(
        Customer(name="Emil")(
            PaymentInfo(payment_data=DefaultCreditCard())(
                Product(id=Link("prod1.id"))(
                    Payment()
                ),
                Product(id=Link("prod2.id"))(
                    Payment()
                )
            ),
            PaymentInfo(payment_data=DefaultPayPal())(
                Product(id=Link("prod1.id"))(
                    Payment(),
                    Payment()
                ),
                Product(id=Link("prod2.id"))(
                    Payment()
                )
            )
        )
    )

Which approach do you prefer? What has been your experience with it? Don't hesitate to chime in in the comments.