Monday, July 26, 2010

Statistical Sampling and the Long Form Census

In the ongoing controversy over the federal government's intention to make the census long form optional there is an exceptionally high noise level over issues of governance, political interference in the civil service, tax burden, privacy, value of the long form to government and outside agencies, and much more. I will talk about none of that, and will thus casually sweep from the table all politically-related issues encompassed by the discussion. I do this so that I can focus on the technical aspects of the issue, in particular whether the long form census delivers reliable statistical data when it is mandatory, and when it is optional. The political questions are interesting of course, but these are not my focus in this post.

Stats Can has three broad options on how to go about the long form census:
  1. Mandatory
  2. Optional
  3. Scrap it entirely
From here it is possible to formulate a few simple questions about the entire census exercise:
  • Is the mandatory census, as conducted in 2006 and earlier, delivering valuable data? That is, are the results sufficiently reliable to be useful?
  • To what extent does making it optional lower the reliability, and will its results meet the test of being sufficiently reliable to be useful?
  • Are there good alternatives to the long form which deliver similar results with acceptably high reliability?
If the answer to the first question is negative, the utility of the long form census is highly questionable, raising the possibility of scrapping it entirely. In the case of the second question (assuming that the answer to the first question is positive), it is reasonable to assume that the reliability will be lower (based solely on sampling theory), but we must now establish just how much less reliable it is in order to decide whether it is worth doing at all. The third question is particularly interesting since some countries do manage to do this pretty well without a long form; however, the manner in which it is accomplished is important.

Let's start with the current long form and its reliability. There are two questions to be considered: is the sampling done in a way that meets the target statistical confidence in the conclusions derived from it, and is the data collected accurate? If this question interests you, I suggest that before continuing you read this article in the National Post, since it gives a flavour of the problem of collecting high-quality data.

As the article shows, whenever you involve humans in any survey there is an issue of quality. It is important to understand that this is entirely independent of the sample size and selection methodology. Humans can lie, be negligent, or simply be wrong no matter what they are being asked. The problem can be managed in part by making the census form easy to use, easy and quick for the responder to get right, and avoiding hot button issues that make people uncomfortable (e.g. tell us about your undiscovered criminal acts and your favourite kinky sex acts). The question about natural gas spending is a good example of a question that will often produce poor data.

The same difficulty is perhaps more obvious in political polls that ask which way you'd vote if an election were held today. These are carefully reported complete with statistical confidence levels, nationally, provincially, and by other responder characteristics (e.g. age and sex). This is all very mathematically rigorous but still totally wrong. The methodology used to calculate the confidence levels (e.g. plus or minus 3.8%, 19 times out of 20) assumes that responders are equivalent to coloured balls in a bag that are randomly drawn. If you have a bag of 100 balls and you draw 10 balls at "random" and you get 5 white and 5 red ones, there is a tried and true mathematical process to say something rigorous about the entire population of 100 balls. Not so with people. Some misunderstand the question, say whatever they believe the questioner wants to hear, lie in an attempt to skew the poll, or would never vote anyway.
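To make the "balls in a bag" arithmetic concrete, here is a minimal sketch of the usual normal-approximation margin of error for a proportion. The function name, the sample size of 665, and the 50/50 split are my own assumptions for illustration; they are not taken from any actual poll. With only 10 draws the normal approximation is crude, so treat the second number as a rough guide only.

```python
import math

def margin_of_error(p_hat, n, z=1.96, population=None):
    """Half-width of the normal-approximation interval for a proportion.

    z=1.96 corresponds to "19 times out of 20" (95% confidence).
    If a population size is given, apply the finite population correction.
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# Hypothetical poll: 665 responders, evenly split (assumed values).
poll_moe = margin_of_error(0.5, 665)

# The bag example: 10 balls drawn from 100, 5 of them white.
bag_moe = margin_of_error(0.5, 10, population=100)

print(f"poll: +/- {poll_moe:.1%}, bag of balls: +/- {bag_moe:.1%}")
```

Note what the formula takes as inputs: the observed proportion, the sample size, and optionally the population size. Nothing in it accounts for responders who misunderstand, shade their answers, or lie, which is exactly the point made above.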

Remember, we're still talking about a mandatory long form census. Stats Can spends a lot of effort to make the sample (who is sent the long form) both sufficiently random and suitable to the types of analysis to which the collected data will be put, and it is not enough. Even the force of law behind the sample's integrity isn't good enough. Stats Can knows this and will apply many supplemental filters to the data in an attempt to tidy it up a bit, which works only if their assumptions about how the data is skewed by human and other factors are reasonably close to the mark. However, they can never be certain.

If we now degrade the data by making the long form census optional, it becomes even more difficult to filter the data. As Minister Clement has stated, an expected reduction in the response rate will be compensated by increasing the sample size. That's nice, but it doesn't help Stats Can all that much. The thing is there is no good way to know who is opting out and why. There is an unknowable selection bias. Even if they know the location of each responder (this requirement is apparently also in doubt, in addition to who does or does not respond), the additional unknown selection bias degrades any analysis of the collected data. Do people opt out for political reasons, over privacy concerns, because of education or comprehension, because of availability, or because of other factors? Stats Can can't know. They may also not have the ability to compare the 2011 and 2006 samples to determine what new filters to apply to the data.
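A small simulation illustrates why a bigger mailout does not cure selection bias. Everything here is assumed for the sake of the sketch: a made-up trait with a true mean of 50, and made-up response rates (70% for people above the mean, 30% for those below). Real opt-out behaviour is, as argued above, unknowable.

```python
import random
import statistics

def survey_mean(n_sent, rng, mandatory):
    """Mean of a hypothetical trait among those who return the form."""
    responses = []
    for _ in range(n_sent):
        value = rng.gauss(50.0, 10.0)  # hypothetical trait, true mean 50
        if mandatory:
            p_respond = 1.0  # everyone who is sent a form responds
        else:
            # Assumption: willingness to respond depends on the trait itself.
            p_respond = 0.7 if value > 50.0 else 0.3
        if rng.random() < p_respond:
            responses.append(value)
    return statistics.mean(responses)

rng = random.Random(2011)  # fixed seed so the run is repeatable
mandatory_small = survey_mean(2_000, rng, mandatory=True)
optional_small = survey_mean(2_000, rng, mandatory=False)
optional_large = survey_mean(200_000, rng, mandatory=False)

print(f"mandatory, 2,000 forms:  {mandatory_small:.2f}")
print(f"optional, 2,000 forms:   {optional_small:.2f}")
print(f"optional, 200,000 forms: {optional_large:.2f}")
```

Under these assumptions the optional estimate settles near 53 rather than 50 however many forms go out. The larger mailout only makes the wrong answer more precise. Correcting it would require knowing the response rates, which is the information Stats Can does not have.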

At this point there is some question as to whether the long form census should be done at all if the data and its analysis are degraded by some unknowable amount, especially in advance of the census being conducted. If the quality is deemed likely to be too low for its intended purposes, both by government and other organizations at large, it may be appropriate to scrap it entirely. The cost of an optional long form may even turn out to be higher than a mandatory form: more forms sent out and potentially processed, plus more filter analysis to recover some of the lost data quality, countered by lower costs of enforcing compliance.

If the results of the long form census are still deemed to be important even though it is untenable due to the loss of quality, there are alternatives, but perhaps not ones that are doable. Some countries maintain databases that integrate all the data that all government departments know about the country's citizens. This super-database can be good enough to make a long form census superfluous. Except, and (in my opinion) fortunately, this isn't done in Canada. We have set some pretty strong barriers between government departments and agencies which make it very difficult to share and integrate data across those barriers. As an example, reporting income to CRA that was obtained by, shall we say, peculiar means, does not end up in the hands of the RCMP or other police forces. CRA can't even give your identity to Elections Canada without your consent. These barriers are actually respected and so can generally be trusted, which encourages more honesty when dealing with branches of the government. It is highly unlikely that Stats Can's needs would supersede the necessity and enforcement of those barriers. In other words, super-databases are not in the cards, for which I am most thankful.

I don't know how this will eventually shake out since it has become a political hot potato that has already resulted in the resignation of the head of Stats Can (presumably regarding a perception of political interference) and is generating a surprisingly heated debate across the country. In closing I will note that our Mayor, Larry O'Brien, says that the City of Ottawa will if necessary replicate what the long form census produces by some means. It will be amusing to see how he proposes to do this since, at a minimum, the City does not have the funding to do a local "census" and does not have the federal government's power to force compliance in responding to any questionnaire the city sends to a sample of residents.

What a weird issue this is to have captured so much of the country's attention. Who would have thought it?
