Unit 1: Statistical modelling
If you have ever had to build a bridge for a school project using toothpicks or balsa wood, or played with toy aeroplanes, then you have experience with models. You were building or using scaled-down versions of real-world aeroplanes and bridges. Now, imagine if you had collected real data about bridges such as what materials they were made from, what structures they used and which type of damage was most common. You could then use this information to construct your model, test its strength, and make predictions about how long it may last in the real world.
Someone who builds bridges uses a scaled-down model because it is impractical to build the actual bridge itself. Similarly, we take a sample because we cannot access all the units in the target population. We must make predictions or inferences from models that represent the observed data as accurately as possible.
Think of a statistical model as an adequate summary, i.e. a representative smaller version (like our toy model) of the data collected. It should summarise the data as closely as possible (be 'a good fit') but also be as simple as possible. We cannot measure a population, so the best we can do is make generalisations from a sample to a population using a representative summary, i.e. a statistical model.
We already use statistical models every day without necessarily realising it: the simplest summary model for numerical data is a mean, while the simplest summary model for categorical data is a proportion.
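As a small illustration of these two simplest summary models, here is a sketch using made-up sample values (the heights and pet types below are hypothetical):

```python
# Hypothetical sample data: heights (numerical) and pet types (categorical)
heights = [162, 175, 158, 180, 169]
pets = ["cat", "dog", "dog", "cat", "dog"]

# Simplest summary model for numerical data: the mean
mean_height = sum(heights) / len(heights)

# Simplest summary model for categorical data: a proportion
prop_dog = pets.count("dog") / len(pets)

print(mean_height)  # 168.8
print(prop_dog)     # 0.6
```

Each one-number summary stands in for the whole sample, just as a toy model stands in for the real bridge.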
Categorical data (categorical variable): A variable with values that range over categories, rather than being numerical. Examples include gender (male, female), paint colour (red, white, blue) or type of animal (elephant, leopard, lion). Some categorical variables are ordinal.
Estimation: Estimation is the process by which sample data is used to indicate the value of an unknown quantity in a population. The results of estimation can be expressed as a single value, known as a point estimate. It is usual to also give a measure of the precision of the estimate. This is called the standard error of the estimate. A range of values, known as a confidence interval, can also be given.
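The three quantities in this definition (point estimate, standard error, confidence interval) can be computed directly. A minimal sketch, using a hypothetical sample of ten measurements and the normal-approximation 95% interval:

```python
import math

# Hypothetical sample of 10 measurements
sample = [4.1, 3.8, 4.5, 4.0, 4.2, 3.9, 4.4, 4.1, 4.3, 3.7]
n = len(sample)

# Point estimate: the sample mean
mean = sum(sample) / n

# Standard error of the mean: s / sqrt(n), where s is the sample
# standard deviation (divisor n - 1)
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = s / math.sqrt(n)

# Approximate 95% confidence interval, using 1.96 from the
# normal distribution (a t multiplier would be slightly wider)
ci = (mean - 1.96 * se, mean + 1.96 * se)
```

The point estimate says where the population value probably is; the standard error and confidence interval say how precisely the sample pins it down.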
Variability: The variability (or variation) in data is the extent to which values are different. Variability occurs everywhere, sometimes without us noticing it: repeated measurements of the same quantity rarely give exactly the same value, and no two patients respond identically to the same treatment.
The amount of variability in the data, and the different causes of the variability are often of importance in their own right. The variability (or 'noise') in the data can also obscure the important information (or 'signal').
While a p-value is often needed for publication purposes, it only has a statistical meaning, not a practical one. To gauge the practical importance of your research findings and make valid generalisations to your target population, you need standard error and confidence intervals. These would be of practical use to non-academic users of your research, who could then, say, feed your generalised results into a cost-benefit analysis.
In the traditional approach, hypothesis tests with their p-values on one side, and standard errors with their confidence intervals on the other, were kept separate. Statistical modelling brings the two sides together in the same place. P-values from hypothesis tests from within the statistical model help you to select the simplest summary model that is adequate to represent the data in the sample collected.
Even people who only ever use hypothesis testing are already choosing between simpler and more complex statistical models. For example, a two-sample t-test gives a p-value that allows us to choose between a model with one single overall mean for both groups and a model with two separate means: one for each group.
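To make this model-choice view of the t-test concrete, here is a sketch with two hypothetical groups of scores, computing the classic pooled two-sample t statistic from scratch:

```python
from statistics import mean, variance

# Hypothetical scores for two groups
group_a = [12.0, 14.0, 11.0, 13.0, 15.0]
group_b = [16.0, 18.0, 17.0, 15.0, 19.0]

na, nb = len(group_a), len(group_b)
ma, mb = mean(group_a), mean(group_b)

# Pooled variance (the classic two-sample t-test assumes equal variances)
sp2 = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)

# t statistic: how far apart the two group means are, relative to the noise.
# A large |t| (small p-value) favours the two-mean model; a small |t|
# suggests the single overall mean is an adequate summary of both groups.
t = (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5
```

Here the group means are 13 and 17, giving t = -4.0: strong evidence that the two-separate-means model, not the single-mean model, is the adequate summary.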
So why is the modelling approach 'modern'? While hypothesis tests were established early on in the development of statistical packages, the implementation of the modelling approach has been slower, but it is now widely available. Also, with the advent of the point-and-click interface, statistical packages have become much easier to use.
Why use a statistical package instead of a spreadsheet like Excel? The answer is simple: conducting statistical analysis with Excel is slow and inefficient compared with using a dedicated package, just like washing clothes by hand in a bucket is slower and less efficient than using a washing machine. Admittedly, Excel is not marketed as a dedicated statistical package, but as a spreadsheet with some additional statistical features.
Dr Daniel Smith
Associate Professor of Public Budgeting and Financial Management, New York University
So, the tests I usually conduct are within the regression framework, and regression really is just a fancy version of a t-test. It's just multiple variables instead of two variables. These days, I'm mostly testing the impact of various trust fund balances on other fiscal outcomes. So in a recent paper, we looked at whether having bigger unemployment insurance trust funds in the United States at the state level led to bigger benefits. Potentially, when a state accumulates a bigger fund it doesn't necessarily translate into bigger benefits; it just continues to accumulate funds. We were interested in looking at the relationship between the two.
Dr James Abdey
Course Tutor in Statistics, London School of Economics
As an academic, often we're called upon to do some consultancy projects, and one I've been working on recently is dealing with the art market, and trying to come up with a price level for different types of artwork, be it a Picasso or a Monet, say. So it's possible to come up with a so-called 'art price index', so just trying to value different types of paintings or painters, whatever. That's a very difficult thing to do, but my main focus was actually trying to extrapolate these indices, and trying to come up with forecasts. So, for example, where is a particular part of the art market headed in, let's say, the next three or five years? So, in many fields, often, you know, we have data, obviously it's all historic data, about the past up to the present, but the past has been and gone. What we're interested in is where things are headed in the future. So we come up with a forecast, really our best guess based on some statistical model. Then, we recognise that there's going to be a lot of uncertainty with our forecast, and this degree of uncertainty is going to increase the further into the future we're trying to predict. So we come up with things called 'prediction intervals', which is a form of confidence interval but applied to predictions of some future event.
Ms Anna Bramwell-Dicks
PhD Student in Computer Science, University of York
Sometimes I look for relationships between the variables, and I'll do some correlation analysis where I'll usually, sort of, plot scatterplots and then look for Pearson's r or something, but that's not what I do often. That's only if I have a particular inkling that something might be informing something else. So I might look at, sort of, if my liking ratings of how much they like a particular piece of music are affecting the speed at which they're doing something, because I might have a hypothesis based on that, in which case I'll start looking at the correlations, but I have to have a reason for doing it.
Professor Evan D. Morris
Co-Director for Imaging, Yale PET Center, Yale University
The most important thing that I'm trying to determine in some of my work is, 'Did a certain area of the brain activate?', and to say it in a different way, 'Is there a component of what I'm measuring which I can attribute to a fluctuation in dopamine, a brain chemical?' So the critical question is, 'Did something happen?' Now, that may not necessarily be an obvious statistical question, but here's how we turn it into one. To estimate the amount and the timing of this dopamine phenomenon, we fit our data to models, and to turn this question into a statistical one, we fit the same data twice. First, to a model which has no component in it allowing for dopamine fluctuations. That's the simple model. It's, in a sense, assuming the null hypothesis, 'There is no dopamine component to our signal'.
Then secondly, we fit the data again to a more complicated model which contains a component describing the dopamine fluctuation. Then, the question becomes, 'Is the fit to the data better with the more complicated model which contains, or allows for, dopamine fluctuation?', but it's not as simple as saying, 'Is one thing better than another?' It's a little more subtle than that, because when we ask the question, 'Is it better?', we have to define it. To say, 'Is a fit better?', means, 'Are the residual sums of squares smaller? Is the fit of the model closer to the data?' But not just smaller: 'Is it smaller, allowing for the greater complexity of the model that includes the dopamine component?'
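The comparison Professor Morris describes, a simple model against a more complex one judged on residual sums of squares while penalising the extra complexity, is often formalised as an extra-sum-of-squares F-test on nested models. A minimal sketch with hypothetical data, comparing a constant-mean model against a straight-line model fitted by least squares:

```python
# Hypothetical data with a slight upward trend
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.4, 2.1, 2.3, 3.2, 3.5]
n = len(x)

# Simple (null) model: a single constant mean, 1 parameter
mean_y = sum(y) / n
rss_simple = sum((yi - mean_y) ** 2 for yi in y)

# More complex model: straight line a + b*x, 2 parameters (least squares)
mean_x = sum(x) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
rss_complex = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# F statistic: is the drop in residual sum of squares large enough to
# justify the one extra parameter? Compare against an F(1, n-2)
# distribution to get a p-value.
f_stat = (rss_simple - rss_complex) / (rss_complex / (n - 2))
```

The complex model always fits at least as closely (its residual sum of squares can only shrink); the F statistic asks whether it fits enough better to be worth the added complexity.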