There has been some discussion about the difference in Hillary Clinton's vote in New Hampshire with machine counted votes versus hand counted votes -- see here. Overall, Hillary got 39.0% of the NH vote. But she received 40.1% of the machine counted vote and only 34.7% of the hand counted vote.
When my son first showed me this, I jumped to the obvious conclusion that towns in NH that have machines counting the votes are different from towns that count by hand. This is a classic statistical problem: correlation does not prove causation, and this looks like the old omitted variables problem. Hillary probably does better with the kind of voters who live in towns that count votes by machine. If you look at the data, it appears that "machine count" is the variable determining the Hillary vote, when in fact it is an underlying variable -- wealth, race, educational levels -- that really determines the vote difference, and that we are not measuring. So "machine count" is simply "picking up" the effect of the variable(s) omitted from the analysis.
Ah, the wonders of technology. My son showed me a website that did some analysis by town size. Sure enough, Hillary does better in large towns, and large towns tend to do more machine counting. So, it appears that the relationship between Hillary's vote and METHOD (the vote counting method) is just picking up the underlying relationship between Hillary's vote and TOWN SIZE. TOWN SIZE itself is a proxy for things such as wealth, education, etc.
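To make the omitted-variable idea concrete, here is a minimal simulation sketch in Python. Nothing in it comes from the NH data: the 220 towns, the `hidden` variable (standing in for wealth, education, and so on), and every coefficient are made up for illustration. In this toy world the counting method has no effect at all on the vote, yet a regression that leaves out `hidden` gives METHOD a large coefficient.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on X (X must already contain an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def simulate(n_towns=220, seed=0):
    """Return the METHOD coefficient without and with the hidden variable included."""
    rng = np.random.default_rng(seed)
    # Hypothetical unobserved town characteristic (assumption, not real data).
    hidden = rng.normal(size=n_towns)
    # Towns with a higher hidden value are more likely to count by machine.
    method = (hidden + rng.normal(scale=0.5, size=n_towns) > 0).astype(float)
    # The vote share depends on the hidden variable only; METHOD has zero true effect.
    clinton = 36.0 + 3.0 * hidden + rng.normal(scale=2.0, size=n_towns)
    ones = np.ones(n_towns)
    naive = ols(np.column_stack([ones, method]), clinton)[1]
    controlled = ols(np.column_stack([ones, method, hidden]), clinton)[1]
    return naive, controlled

naive, controlled = simulate()
print(f"METHOD coefficient, hidden variable omitted:  {naive:.2f}")
print(f"METHOD coefficient, hidden variable included: {controlled:.2f}")
```

The first number should come out large and the second close to zero, which is the pattern I expected to see once a good control variable was added to the real data.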
Even more wonders of technology. I had my son collect the voting data on the 220 towns of NH. Using PERL, he downloaded the data to Excel for me to use in about 15 minutes. So I had four variables on each town in NH: TOTAL VOTE (a measure of town size), CLINTON PERCENT, OBAMA PERCENT, and METHOD (1 for MACHINE COUNT, 0 for HAND COUNT).
I quickly ran a univariate regression of CLINTON PERCENT on METHOD: Sure enough, the regression equation is
CLINTON PERCENT = 33.68 + 5.64 METHOD
with a standard error of 1.01 on METHOD (t-statistic of 5.58).
That fits with the univariate analysis of the data as presented earlier. Sure enough, Hillary seems to do better when the vote is counted by machine!
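A quick check of that arithmetic, sketched in Python. When the only regressor is a 0/1 variable, the intercept is the average of the hand-count towns and the slope is the gap between the two group averages. The small data set at the bottom is invented purely to demonstrate that identity. Note that the implied group averages (33.68 and 39.32) are not the same as the 34.7% and 40.1% quoted at the top; my assumption is that this is because the regression treats every town equally, while the headline figures weight towns by the number of votes.

```python
import numpy as np

# Figures quoted in the post for the univariate regression.
intercept, slope, std_err = 33.68, 5.64, 1.01

t_stat = slope / std_err
hand_mean = intercept              # predicted CLINTON PERCENT when METHOD = 0
machine_mean = intercept + slope   # predicted CLINTON PERCENT when METHOD = 1
print(f"t = {t_stat:.2f}, hand = {hand_mean:.2f}, machine = {machine_mean:.2f}")

# Made-up data (not the NH towns) showing that OLS on a dummy reproduces group means.
rng = np.random.default_rng(1)
method = np.tile([0.0, 1.0], 25)
share = 34.0 + 5.0 * method + rng.normal(scale=3.0, size=method.size)
X = np.column_stack([np.ones_like(method), method])
b0, b1 = np.linalg.lstsq(X, share, rcond=None)[0]
gap = share[method == 1].mean() - share[method == 0].mean()
print(np.isclose(b0, share[method == 0].mean()), np.isclose(b1, gap))
```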
I was sure that when I added TOTAL VOTE to the regression, the coefficient on METHOD would drop in size and in statistical significance. This does not HAVE to happen with correlated variables such as METHOD and TOTAL VOTE, but I was pretty sure it would.
Here are the multiple regression results:
CLINTON PERCENT = 33.56 + 5.08 METHOD + .00028 TOTAL VOTE
with a standard error on METHOD of 1.12 and on TOTAL VOTE of .00024.
Amazing! METHOD continues to be the variable carrying the weight of the data. Town size is statistically insignificant, and the method of counting still accounts for most of the gap in Clinton's vote between machine-count and hand-count towns.
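For anyone who wants the numbers behind that statement, here is the arithmetic as a short Python sketch. It uses only the coefficients and standard errors reported above. The t-statistics for the multiple regression are not in my output as quoted, so they are computed here as coefficient divided by standard error.

```python
# Coefficients and standard errors quoted in the post.
method_uni = 5.64                      # METHOD, univariate regression
method_multi, se_method = 5.08, 1.12   # METHOD, with TOTAL VOTE added
vote_coef, se_vote = 0.00028, 0.00024  # TOTAL VOTE

t_method = method_multi / se_method
t_vote = vote_coef / se_vote
shrink = (method_uni - method_multi) / method_uni

print(f"t on METHOD:     {t_method:.2f}")
print(f"t on TOTAL VOTE: {t_vote:.2f}")
print(f"METHOD coefficient fell by {shrink:.1%} after adding TOTAL VOTE")
```

So METHOD keeps a t-statistic of about 4.5 and loses only about a tenth of its size, while TOTAL VOTE has a t-statistic of about 1.2, well short of the usual cutoff of roughly 2.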
UPDATE: I still strongly suspect that METHOD is simply correlated with some underlying real determinant of the Clinton vote. If I can get more data by town, I will run those models. It is even possible that my variable TOTAL VOTE is not a good measure of town size, as it involves voter turnout as well. What would be really good is if I had exit polling data by town in NH. If I added that variable to the equation, I would expect METHOD to lose significance.
UPDATE2: I got town population data, estimated for 2006. Using that instead of TOTAL VOTE reduces the size of the METHOD coefficient, but not by much, and it is still significant. I also calculated a new variable, total vote divided by population, which is an attempt to get at a few things related to turnout and other demographics. Using this new variable in addition to population reduces the size of the METHOD coefficient a bit more, but it is still highly significant. This new variable of vote divided by population seems to be important. I think if I got better demographic data -- age and gender, income, turnout -- the METHOD variable would lose more of its significance. It must be picking up something. If somebody has good NH town data in Excel or easily parsed, let me know the source.
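For readers who want to try this kind of model themselves, here is a rough Python sketch of the setup: build the vote-divided-by-population variable, then regress CLINTON PERCENT on METHOD, population, and that ratio. The function name `fit_with_turnout` is my own invention, population is rescaled to thousands only to keep the numbers well behaved, and the data below are randomly generated stand-ins, not the NH town data, so the printed coefficient says nothing about the real election.

```python
import numpy as np

def fit_with_turnout(clinton_pct, method, population, total_vote):
    """OLS of vote share on METHOD, population (thousands), and vote/population.

    Returns (coefficients, standard errors) in the order:
    intercept, METHOD, population, turnout ratio.
    """
    turnout = total_vote / population
    X = np.column_stack([np.ones(len(method)), method, population / 1000.0, turnout])
    coef, *_ = np.linalg.lstsq(X, clinton_pct, rcond=None)
    resid = clinton_pct - X @ coef
    dof = X.shape[0] - X.shape[1]
    # Classical OLS covariance: residual variance times (X'X)^-1.
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return coef, np.sqrt(np.diag(cov))

# Synthetic towns (assumption: every number here is invented for illustration).
rng = np.random.default_rng(42)
n = 220
population = np.exp(rng.uniform(np.log(200), np.log(50000), size=n))
total_vote = population * rng.uniform(0.15, 0.35, size=n)
method = (population > 3000).astype(float)   # bigger towns count by machine
clinton_pct = 34.0 + 4.0 * method + rng.normal(scale=4.0, size=n)

coef, se = fit_with_turnout(clinton_pct, method, population, total_vote)
print(f"METHOD: {coef[1]:.2f} (s.e. {se[1]:.2f}, t = {coef[1] / se[1]:.2f})")
```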
The other thing to note is that in the data on votes by machine vs. hand, Romney also has big differences. That would, I think, support the "underlying demographics" theory.
UPDATE3: And just for the record, I do not believe whatsoever that anything untoward or suspicious happened in the NH election. This is simply a great exercise in statistical analysis, a great example of the problem that omitted variables cause in regression analysis.