Lucia recently posted another in a series of hypothesis tests comparing the central tendency of the IPCC projections for warming to the surface temperature record. It was her website that inspired me to start my own blog. What really motivated me was the thought that it wasn’t ideal to take a trend line from seven years of surface temperature data and compare it to a trend line, with no confidence interval, based on 30 years of model data. I took up the issue in my third post, but I found that my use of ttest2 was flawed. I have since modified the function to perform a two-sample t-test with unequal variances (Welch’s t-test). See the Wikipedia entry for a brief rundown.
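For readers who want to see what a two-sample t-test with unequal variances involves, here is a minimal sketch in Python. It is not the modified ttest2 code used for this post. The function names, and the choice to pass each trend as an estimate plus a standard error and a degrees-of-freedom count, are assumptions made for illustration. The p-value is obtained by simple numerical integration, so it loses accuracy for extremely small p-values.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even).

    Accuracy is limited for extremely small p-values; this is a sketch,
    not a replacement for a statistics library.
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_mass)


def welch_t_test(est1, se1, dof1, est2, se2, dof2):
    """Welch's test for two estimates with unequal variances.

    est*: the two estimates (for example, two trends)
    se*:  their standard errors
    dof*: degrees of freedom behind each standard error (hypothetical inputs)
    Returns (t statistic, Welch-Satterthwaite degrees of freedom, p-value).
    """
    var1, var2 = se1 ** 2, se2 ** 2
    t = (est1 - est2) / math.sqrt(var1 + var2)
    df = (var1 + var2) ** 2 / (var1 ** 2 / dof1 + var2 ** 2 / dof2)
    return t, df, two_sided_p(t, df)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call; not results from this post.
    t, df, p = welch_t_test(2.0, 0.4, 350, -0.5, 1.2, 85)
    print(f"t = {t:.2f}, df = {df:.1f}, p = {p:.4f}")
    print("reject at 5%" if p < 0.05 else "fail to reject at 5%")
```

If a trend is reported as a 95% confidence interval, dividing the half-width by about 1.96 gives an approximate standard error, assuming the interval is based on a normal distribution.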
I took data from 15 models running scenario A1B and calculated two temperature trends for each, one over Jan-2001/July-2008 and one over Jan-2001/2031. The figure below shows the distributions of the 7-year and 30-year trends. The surface temperature record (STR) that’s plotted is a merge of GISS/HADCRUT/NCDC. The data for the 7-year and 30-year trends were baselined on their respective 7-year and 30-year intervals. The method used to calculate the temperature trends was Cochrane-Orcutt.
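For anyone unfamiliar with Cochrane-Orcutt, the idea is to fit a straight line, estimate the lag-1 autocorrelation of the residuals, remove that autocorrelation by quasi-differencing the data, and refit, repeating until the autocorrelation estimate settles. The sketch below is a minimal Python version of that textbook procedure, not the code used to produce the figures here. The function names, tolerance, and iteration cap are assumptions for illustration.

```python
import math


def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, standard error of the slope).
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(ssr / (n - 2) / sxx)
    return intercept, slope, se_slope


def cochrane_orcutt(x, y, tol=1e-8, max_iter=100):
    """Iterative Cochrane-Orcutt trend estimate for AR(1) residuals.

    Returns (slope, standard error of the slope, estimated rho).
    """
    intercept, slope, se_slope = ols(x, y)
    rho = 0.0
    for _ in range(max_iter):
        # Residuals of the current fit on the original data.
        resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
        # Lag-1 autocorrelation of the residuals.
        new_rho = (sum(a * b for a, b in zip(resid[1:], resid[:-1]))
                   / sum(b * b for b in resid[:-1]))
        # Quasi-difference both series; the first observation is dropped.
        xs = [x[i] - new_rho * x[i - 1] for i in range(1, len(x))]
        ys = [y[i] - new_rho * y[i - 1] for i in range(1, len(y))]
        a_star, slope, se_slope = ols(xs, ys)
        # The transformed intercept equals intercept * (1 - rho).
        intercept = a_star / (1.0 - new_rho)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return slope, se_slope, rho
```

With monthly anomalies in y and time in years in x, the returned slope is in °C/year, so multiplying by 100 gives °C/century.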


As I suspected, the temperature trends are more concentrated in the oft-quoted 2 °C/century region when we extend the trend line out to 30 years. Just eyeballing the first graph, we should expect a few models to fail to reject in a hypothesis test. When I was writing the code to generate the graphs, I hadn’t even written anything for the hypothesis tests. I didn’t see the point in doing them for the 30-year trends, since it is obvious from the graphs that every model would fail the test (that is, the null hypothesis would be rejected)! But I did them anyway. The first table shows the calculated temperature trends (95% CI). The second table summarizes the hypothesis test results. The null hypothesis is that the temperature trend in the models equals the temperature trend in the surface temperature record. The alternative hypothesis is that they are not equal. I tested at a 5% significance level.


Sorry if the second table looks dithered (damn WordPress!) but I’m sure you can read it. FGOALS, ECHO G and ECHAM 5 performed excellently. FGOALS, a.k.a. the Alternating Current Model, had a VERY large spread! This model’s results make me glad many models are used, because its mean is about 0 °C/century with half the confidence interval on either side! So is it warming or cooling? ECHO G is the best fit to the STR. ECHAM 5 shows behavior similar to FGOALS. If those two models were politicians and you asked them where they stand on the whole global warming debate, they’d straddle both sides of the issue! (That was a joke.)
For the most part, the p-values in the cases that rejected the null hypothesis were extremely small. But some of the models that rejected at 5% would have failed to reject had we used 1% as our significance level. Compared to GISS, GISS EH and BCM 2.0 rejected with p-values of 2.51% and 1.13%. Compared to HadCrut, INM CM 3.0 rejects at 2.55%. Compared to NCDC, BCM 2.0 and GISS EH reject at 1.43% and 4.89%. Compared to Merged, CM 3.0 passes (fails to reject) at 5.25%. Finally, compared with Merged+RSS, CM 3.0 rejects at 3.65%.
As I suspected, for the 30-year trends all the models rejected at 5% significance, with p-values on the order of 10^-12 or smaller. If the raw numbers weren’t enough for you, I generated the following plot to show the relative distributions of the STRs used in the hypothesis tests.

As most people would suspect, GISS is on the higher end. Interestingly, NCDC is a bit higher. I’ll leave it at that.
Update (8/25/08): I’ve added UAH to the mix.
So was there a reason for leaving out UAH?
Comment by Raven — August 25, 2008 @ 6:36 am
Yes. I began updating all my data sets when I came across UAH. I found that the file Lucia linked to was different from the one I had used. I saw the numbers were very close, but I didn’t want to get bogged down trying to figure out which one to use. I was more interested in getting my code and post written. Which one do you think I should use? I could always add it to my code and update everything. I just ran a quick calculation on the file Lucia links to and found a trend of -1.9 ± 2.5 °C/century, so I don’t think it’ll make much difference.
Edit: I put the wrong value for the trend stated above. Fixed.
Comment by cshme — August 25, 2008 @ 11:32 am
Hi Chad!
I was visiting family all weekend and just got back.
To clarify (so I understand what you are saying):
The first figure of seven-year trends: Are you saying you used Cochrane-Orcutt to calculate the trend associated with each model, and then you plotted the distribution of the trends using a Gaussian in that figure?
The second figure: You did the same thing, but for 30 years?
Now when you say this:
Out of curiosity, are your tests for individual runs? Or the average for each model? I think I have 3 runs for GISS EH with the A1B scenario during the 21st century.
Comment by lucia — August 25, 2008 @ 9:46 pm
Hi Lucia!
Yes, I used CO to get the trends and used normpdf in MATLAB to fit them to a Gaussian distribution. Yes, I did the same thing for the second figure, but on 30 years of data. I neglected to mention that for simulations that have multiple runs, I took the mean of the ensemble.
Comment by cshme — August 25, 2008 @ 10:04 pm