Testing any kind of software is difficult. The programmer is faced with a multitude of choices and trade-offs: what components to test and in what scenarios, how much time to spend writing test cases vs. developing new features, and so on. No matter how hard, many acknowledge that testing is paramount to producing a quality product. So important, in fact, that companies like Microsoft make software testing a full-time position.

And if testing most software is hard, testing statistical software is even harder. For most of my programming career I hardly had to deal with statistics and “big data” problems. As a result, when I embarked upon my first significant statistical project a year ago, it proved to be a major source of frustration. In this post I hope to share my experiences with testing statistical code. I will start by describing my initial sources of difficulty stemming from the differences between statistical and non-statistical programs. I will then present a number of testing guidelines which I have found to be particularly effective, and which avoid some of the pitfalls I would initially get myself into.

The initial discussion and guidelines are primarily about probabilistic software, which is just a subset of the possible statistical programs you could write. However, later sections deal with statistical software in more generality.

How Many Outcomes?!

Testing non-statistical software components often involves executing some action which can result in some fixed number of outcomes, and then verifying that the actual outcome is what we expected. For example, one might write the following code in Java:

     Set<String> hat = new HashSet<String>();

In the above case, there are two possible outcomes of an “insert” operation: either the set contains the added element, or not. Having inserted an element into the set, we check whether the outcome is what we want it to be. Of course, in real programs tests are more complicated and often involve a significantly larger number of possible outcomes. However, the outcomes still usually form a set of a manageable size.
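The insert-then-check pattern described above can be sketched as a complete program. The element name `rabbit`, the class name `HatTest`, and the helper `insertAndCheck` are illustrative assumptions, and a plain exception stands in for a test framework's assertion:

```java
import java.util.HashSet;
import java.util.Set;

class HatTest {
    // Inserts an element and reports whether the set contains it afterwards.
    static boolean insertAndCheck(Set<String> hat, String element) {
        hat.add(element);
        return hat.contains(element);
    }

    public static void main(String[] args) {
        Set<String> hat = new HashSet<String>();
        // Only two outcomes are possible: the element is in the set, or it is not.
        if (!insertAndCheck(hat, "rabbit")) {
            throw new AssertionError("expected the hat to contain the rabbit");
        }
        System.out.println("ok");
    }
}
```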

Statistical code differs in two significant ways:

  • there are several orders of magnitude more possible outcomes of an operation
  • it is often not obvious which outcomes are correct

I will try to illustrate the impact of the two points above with an example. Consider a Maximum Likelihood Estimator (MLE) – arguably one of the simplest statistical tools. An MLE counts the occurrences of an element in a sequence of observations, and uses that count to estimate the probability of the element. For example, if you observe three dogs and two cats, then an MLE would estimate the probability of each as 0.6 and 0.4, respectively. Notice how the outcome of an MLE is a number – a number that we cannot conveniently restrict to a small set of possible values. Not even to the range 0-1, since we cannot make correctness assumptions about the very thing we are testing. One could argue that, despite the large number of possible outcomes, it is straightforward to say which are correct. That is often not the case, however, as the dataset might be too large or the model too complex to estimate probabilities manually. And even in cases when such manual estimation is possible, it is all too easy to make mistakes.
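A minimal sketch of the counting estimator described above, using the dogs-and-cats example. The class and method names are hypothetical, chosen only for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MleSketch {
    // Estimates P(event) as count(event) / total number of observations.
    static Map<String, Double> estimate(List<String> observations) {
        Map<String, Integer> counts = new HashMap<>();
        for (String o : observations) {
            counts.merge(o, 1, Integer::sum);
        }
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) observations.size());
        }
        return probs;
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList("dog", "dog", "dog", "cat", "cat");
        Map<String, Double> p = estimate(seen);
        System.out.println("P(dog) = " + p.get("dog"));
        System.out.println("P(cat) = " + p.get("cat"));
    }
}
```

Even in this tiny case the return value is a floating-point number, so a test that compares against a hard-coded estimate needs a tolerance rather than exact equality.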

To make matters worse, the naive MLE is an oversimplified example. Consider an only slightly more sophisticated estimator, called an Absolute Discounting Estimator (ADE). An ADE subtracts a (small) probability mass from seen events, and distributes it across unseen events. This process of transferring probability mass, so-called “smoothing”, is of paramount importance in applications such as spam classification and statistical Natural Language Processing. Let’s first define an ADE mathematically: take a sequence of observations, where each observation belongs to a set of possible events, and a constant which controls the amount of smoothing. An ADE is then defined as:
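One common textbook form of absolute discounting, which may differ in notation and detail from the definition intended here, is sketched below. The symbols are assumptions of this sketch: $c(x)$ is the number of times event $x$ was observed, $N$ is the total number of observations, $E$ is the set of possible events, $S \subseteq E$ is the set of events seen at least once, and $d$ is the discount constant.

```latex
P_{\mathrm{ADE}}(x) =
\begin{cases}
\dfrac{c(x) - d}{N} & \text{if } c(x) > 0, \\[2ex]
\dfrac{d \, |S|}{N \, |E \setminus S|} & \text{if } c(x) = 0.
\end{cases}
```

In this form the total mass removed from seen events, $d\,|S|/N$, is exactly the mass shared among the unseen events, so the estimates sum to one whenever at least one event is unseen.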

While only slightly more complicated than an MLE, testing the ADE on large datasets can be a nightmare if you simply compare hard-coded estimates. You have to test that the ADE behaves correctly for different values of the smoothing constant, and possibly on different datasets. It would be a Herculean task to manually compute all those probabilities, and then test the ADE against these estimates. That’s what I used to do as a beginner, and calling my experiences frustrating would be an understatement.

Obey Probability, It’s The Law!

Luckily, when it comes to probability, there are a few simple guidelines that make testing much easier. They might sound obvious to many, but the obvious sometimes needs pointing out, especially if our mindset is from another, non-statistical domain.

A great thing about probability is its extensive mathematical foundation, and the fact that it obeys several rather simple laws. Instead of testing the returned values directly, test that your components behave properly with respect to the laws of probability. No matter how complicated your model, or how large the test dataset, these laws should always hold. Always. Some of the laws I have found particularly useful are outlined below.

Stay positive

Make sure that every probability produced by your code is non-negative. While this may sound trivial, in many cases it is not. Consider the ADE formula I showed earlier – it has a flaw that might not be immediately obvious. Namely, if the smoothing constant is too large, we will get negative probabilities for infrequent elements! This problem might be extremely hard to track down in production, since the wrong answer produced by an ADE will usually be propagated through a long chain of other mathematical operations.
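A non-negativity test can be sketched as follows. The `ade` method here is a hypothetical implementation that assumes a common textbook form of absolute discounting (subtract a discount `d` from each seen count, spread the removed mass evenly over unseen events); it is not necessarily the formula used in this article. It also assumes every observation belongs to the list of possible events.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NonNegativityCheck {
    // Hypothetical ADE: (count - d) / N for seen events,
    // d * seen / (N * unseen) for each unseen event.
    static Map<String, Double> ade(List<String> obs, List<String> events, double d) {
        Map<String, Integer> counts = new HashMap<>();
        for (String o : obs) {
            counts.merge(o, 1, Integer::sum);
        }
        int n = obs.size();
        int seen = counts.size();
        int unseen = events.size() - seen;
        Map<String, Double> probs = new HashMap<>();
        for (String e : events) {
            int c = counts.getOrDefault(e, 0);
            probs.put(e, c > 0 ? (c - d) / n : d * seen / (n * (double) unseen));
        }
        return probs;
    }

    // The law under test: no probability may be negative.
    static boolean allNonNegative(Map<String, Double> probs) {
        for (double p : probs.values()) {
            if (p < 0.0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("dog", "cat", "fish", "bird");
        List<String> obs = Arrays.asList("dog", "dog", "dog", "cat");
        System.out.println(allNonNegative(ade(obs, events, 0.5))); // true
        System.out.println(allNonNegative(ade(obs, events, 1.5))); // false: "cat" goes negative
    }
}
```

The check never needs to know what the correct estimates are, which is what makes it cheap to run across many discount values and datasets.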

Be normal(ized)

If your code defines a discrete probability distribution, as in the case of a classifier, make sure that the probabilities of all events add up to one. Since this test requires your code to produce probabilities for all possible arguments, it has a good chance of detecting miscalculations for any single one.
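As a rough sketch, a normalization check only needs to sum the returned probabilities and compare the total to one within a tolerance. The distribution below is made-up data, and the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

class NormalizationCheck {
    // True when the probabilities of all events sum to one, within a tolerance
    // that absorbs floating-point rounding.
    static boolean sumsToOne(Map<String, Double> distribution, double tolerance) {
        double total = 0.0;
        for (double p : distribution.values()) {
            total += p;
        }
        return Math.abs(total - 1.0) <= tolerance;
    }

    public static void main(String[] args) {
        Map<String, Double> good = new HashMap<>();
        good.put("spam", 0.7);
        good.put("ham", 0.3);

        Map<String, Double> bad = new HashMap<>(good);
        bad.put("ham", 0.2); // a single miscalculated entry

        System.out.println(sumsToOne(good, 1e-9)); // true
        System.out.println(sumsToOne(bad, 1e-9));  // false
    }
}
```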

Know your marginals

I have found marginalization to be extremely useful for testing classifiers, and it applies to other kinds of programs, too. It relies on the formula:

$$P(A) = \sum_{b} P(A, B = b)$$

In many cases, you know the value of $P(A)$ (for example, it might be your prior), and you know how to compute $P(A, B)$ with the component you are testing. And even if you do not know $P(A)$, you can often compute it using conditional probabilities:

$$P(A) = \sum_{b} P(A \mid B = b)\, P(B = b)$$

Similarly to normalization, marginal probabilities require you to compute probabilities for all possible values of an argument, and are good at catching any single miscalculation. They also have another advantage: you only have to sum over the possible values for one argument, which is computationally cheaper than summing over possible values for all arguments, as would be required by normalization.
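A marginalization test might look like the sketch below. It assumes the component under test can produce a joint probability for every pair of argument values, stored here in a made-up table, and that the marginal is known independently (for example, as a prior). All names and numbers are hypothetical.

```java
class MarginalCheck {
    // Sums the joint probabilities joint[a][b] over all values b of the second argument.
    static double marginal(double[][] joint, int a) {
        double total = 0.0;
        for (double p : joint[a]) {
            total += p;
        }
        return total;
    }

    // True when every computed marginal matches the independently known value.
    static boolean marginalsMatch(double[][] joint, double[] known, double tolerance) {
        for (int a = 0; a < known.length; a++) {
            if (Math.abs(marginal(joint, a) - known[a]) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical classifier output: rows are labels, columns are words.
        double[][] joint = {
            {0.10, 0.20, 0.10}, // label 0
            {0.30, 0.05, 0.25}  // label 1
        };
        double[] prior = {0.4, 0.6}; // known P(label)
        System.out.println(marginalsMatch(joint, prior, 1e-9)); // true
    }
}
```

Only one argument is summed over per check, which keeps the test cheap even when the full joint table is large.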

Of course, the above list of probability laws is by no means exhaustive, and it is always a good idea to open a probability textbook and search for formulas that apply directly to your problem. For example, there is a large body of theory behind Markov Chains, and if Markov Chains is what you are testing, the relevant formulas will probably be better at testing your code than the general laws outlined above. No matter what laws you end up using in your code, however, the message is the same: try to avoid testing exact values if possible, as they are too parameter-dependent and too hard to correctly calculate manually – and use probability laws instead.

All Your Baseline Are Belong To Us

A good way to test any statistical software is to compare its performance against some reasonable baseline. For example, suppose you are developing a very sophisticated reinforcement learning system, and initial tests show great performance. The problem is that “great” is relative, and you never truly know how good your method is until you compare it to something else. Last week I was working with my CSAIL colleagues on a paper for a large Machine Learning conference. We devised a clever method to apply Multi-armed Bandit techniques to online multi-objective tuning of computer programs. We gathered a lot of data, did statistical analysis and then plotted the average performance of a benchmark program as a function of time (lower is better):

[Figure: Runtime of sort]

We were excited since our approach seemed to converge very fast, and reduced the runtime of the benchmark by a significant amount. However, we then compared the performance of our method to a naive, “uniform random” baseline. Here is what we saw:

Uniform random clearly outperformed our method, regardless of how good the results initially looked. This allowed us to discover a major problem with our methodology – we did not correctly optimize hyperparameters. It also emphasized how important and useful it is to test your code against a baseline. Not only will it show how well your code performs, but also whether it performs worse than you expect it to. If it does perform worse, it might indicate a serious bug, as it did in our case.

You might argue that baseline testing is not really useful as a unit test, since it involves looking at plots and making inferences about behavior, but that is not necessarily true. There are many ways you can programmatically compare the performance of one method against another, such as Student’s t-test. It was in fact a t-test analysis of our benchmark data that allowed us to conclude that our buggy program was no better than a naive approach.
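A programmatic baseline comparison can be sketched with Welch's variant of the t statistic, computed by hand so the example has no dependencies. The runtimes are made-up numbers, the names are hypothetical, and the cut-off of 2.0 is a rough illustrative threshold rather than an exact critical value; a statistics library would normally convert the statistic into a p-value.

```java
class BaselineCheck {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / (xs.length - 1);
    }

    // Welch's t statistic for the difference in means of two samples.
    static double welchT(double[] a, double[] b) {
        double se = Math.sqrt(variance(a) / a.length + variance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    public static void main(String[] args) {
        // Made-up runtimes in seconds (lower is better).
        double[] baseline = {10.1, 9.8, 10.3, 10.0, 9.9, 10.2};
        double[] method = {8.1, 7.9, 8.3, 8.0, 8.2, 7.8};
        double t = welchT(baseline, method);
        // A large positive t means the baseline is slower than the method.
        System.out.println("t = " + t + ", beats baseline: " + (t > 2.0));
    }
}
```

A unit test built on this would fail whenever the method stops beating the baseline by a clear margin, which is the kind of regression a plot alone can hide.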

It’s All Magic!

My final point is not so much about writing unit tests, but about designing for testability. A few years ago, before I took any formal courses in probability and statistics, I would often use “magic”, ad-hoc scores to indicate, for example, the “likelihood” that one candidate is better than another. While in many cases such ad-hoc scores can seem straightforward and their use tempting, they are a nightmare to test. There is usually no sound theory to support them, so if you want to evaluate their performance or test them, you cannot rely on any laws or tried analysis tools. Because of this, try to resist the temptation to use magic scores and other such metrics, no matter how straightforward they are to implement. More often than not, as I learned all too well, they will only lead to problems later on. Whenever possible, try to re-formulate your problem in terms of probabilities and other well-founded metrics. I wish somebody had said this to a younger me – how many hours of frustration it would have saved!

Go Forth and Estimate

That’s it, folks. In this article I tried to describe why, in my opinion, statistical software is much harder to test than “ordinary”, non-statistical programs. I argued that this is due to the sheer number of possible outcomes that statistical components can produce, and to the fact that it is not always immediately obvious which of those outcomes are the correct ones. I gave a few guidelines that I personally found very useful when testing various kinds of statistical programs – both probabilistic and non-probabilistic. Finally, I made the case that probabilities should be preferred over magic, ad-hoc scores whenever possible.

I hope you found my entry useful – there’s surprisingly little about testing statistical code on the Internet. As always, please let me know your thoughts and comments. Now, go back to your statistical programs and estimate (hopefully correctly)!

Thanks to Daniel Firestone for reading drafts of this article.

43 responses


Testing for laws instead of specific instances is indeed a powerful mechanism (QuickCheck, for example, encourages this, though it randomly generates inputs and attempts to condense down results into a “yes” or “no” answer, which is frequently not reasonable for statistical applications). You gloss over, however, how you select input data, which I’m curious about: I guess you take sample data from the real world and use that to run tests?

February 18, 2011 3:04 am

Hi Edward,

That’s exactly what you do in most cases. You obtain a large dataset (such as the Penn Treebank if you do natural language parsing), and split it into a small training set and a larger test set. You then train on one and test on the other. However, if you’re only testing correctness and not performance/accuracy, you could probably get away with randomized input.

February 18, 2011 3:46 pm

This is a very well compiled list of things. I write statistical software as part of my job and ran into very similar issues, and came up with similar solutions.

I would just like to add one thing: any dataset on which you know that your algorithm overfits is a nice software testing dataset :-). For example, if one writes HMM code, then one can generate a synthetic dataset using just a single path in the HMM. Then, one can allow the HMM to train on this dataset using different training sizes and monitor how the entire HMM (along with its transition probabilities) behaves with differing training sizes. This allows us to catch some odd behaviour that can happen due to underflow (which can go unnoticed for a long time).

Doing the same for other algorithms is not very easy, but when it is possible, it comes in very handy to catch some subtle bugs.

July 31, 2012 11:33 pm
