Univariate & Multivariate Methods for the
Analysis of Repeated Measures Data.
Anthony J. Wragg
A thesis submitted in partial fulfilment of the requirements for the degree of Master of
Applied Science (Statistics and Operations Research).
Department of Statistics and Operations Research
Royal Melbourne Institute of Technology
December 1999
Declaration
The work contained in this thesis has not been submitted previously, in whole or in
part, in respect of any academic award.
To the best of my knowledge and belief, this thesis contains no material previously
published or written by any other person except where due reference is given.
Anthony J. Wragg
31st December 1999
Acknowledgments
I would like to thank my thesis supervisor Ms. Kaye E. Marion and Associate
Professor Panlop Zeephongsekul for their encouragement and guidance in the writing
of this minor thesis and throughout the rest of the course.
Abstract
This thesis considers both univariate and multivariate approaches to the analysis of a
set of repeated-measures data. Since repeated measures on the same subject are
correlated over time, the usual analysis of variance assumption of independence is
often violated. The models in this thesis demonstrate different approaches to the
analysis of repeated-measures data, and highlight their advantages and disadvantages.
Milk from two groups of lactating cows, one group vaccinated, the other not, was
analysed every month after calving for eight months in order to measure the amount
of bacteria in the milk. The primary goal of the experiment was to determine if a
vaccine developed by the Royal Melbourne Institute of Technology's Biology
Department led to a significant decrease in mean bacteria production per litre of milk
produced compared to the control group.
A univariate model suitable for repeated measures data was initially tried, with mean
bacteria production in the treatment group not significantly different from the control
group (p < 0.68).
The multivariate approach to repeated measures, profile analysis, yielded similar
results for treatment effects (p < 0.68), while meeting the necessary assumptions for
multivariate analysis.
Finally, a generalised multivariate analysis of variance was carried out in order to fit
polynomial growth curves for both the control and the vaccinated groups and to test if
the growth curves were equal for the two groups. It was found that a slope-intercept
model was adequate to describe both growth curves and that the growth curve for the
treatment group did not differ significantly from that of the control group (p < 0.11).
Table of Contents
1. INTRODUCTION
2. EXPLORING THE DATA
3. TIME BY TIME ANOVA
4. UNIVARIATE APPROACH TO REPEATED MEASURES
4.1 Repeated Measures ANOVA
4.2 Testing For Compound Symmetry
5. MULTIVARIATE APPROACH TO REPEATED MEASURES
5.1 Profile Analysis
5.2 Assumptions of Profile Analysis
5.3 Testing For Multivariate Normality
5.4 Testing For The Equality of Covariance Matrices
5.5 Hypothesis Tests for Profile Analysis
6. THE GENERALISED MULTIVARIATE ANALYSIS OF VARIANCE
6.1 Growth Curves
6.2 Hypothesis Tests For Growth Curves
6.3 Testing Polynomial Adequacy
6.4 Testing For The Equality of Growth Curves
7. CONCLUSION
REFERENCES
APPENDIX A
1. Introduction
Milk from two groups of lactating cows, one group vaccinated, the other not, was
analysed every month after calving for eight months in order to measure the amount
of bacteria in the milk. The primary goal of the experiment was to determine if a
vaccine developed by the Royal Melbourne Institute of Technology's Biology
Department led to a significant decrease in mean bacteria production compared to the
control group.
Experiments such as this fit into the family of designs known in the literature as
repeated measures data, longitudinal models, or growth curves. Data from these
models generally arise whenever more than two observations of the same variable are
made on an individual subject or experimental unit. These models are especially
common in biology, agriculture, and medicine and most often occur when
observations on a group of subjects are repeated over a period of time.
Repeated measures data such as this require somewhat different statistical treatment
than normal because the observations are not independent. This lack of independence
lies at the core of repeated measures analysis, and is what differentiates it from the
more commonly used statistical analyses. The implications of a lack of independence
within a subject's responses are serious, as will be explained in Chapter 4. For the
data collected by the RMIT biology department, this implies that the amount of
bacteria found in a litre of a cow's milk will, at any given time, be correlated with the
amount of bacteria found in that cow's milk at subsequent or preceding times. In
addition, the correlation between the amount of bacteria produced at different times
also tends to be stronger the shorter the time interval. In other words, the amount of
bacteria produced per litre of milk is more dependent on the amount of bacteria
produced one month ago than the amount of bacteria produced five months ago.
Correlation between observations is usually present in these types of experiments.
Nearby plots in a field trial are usually more similar than plots further apart.
When different levels of a factor are applied, the effects of this correlation are generally
overcome by randomisation: the levels are randomly allocated to the experimental
subjects. Randomisation ensures that in the long run there is no correlation between
the factor levels, so that observations with any given factor level are not more similar
to some factor levels than to others. Since time is treated as a factor, with the eight
months as its eight levels, randomisation is impossible: the order of the monthly
observations cannot be randomised, and they must follow their natural sequence of
month 2, month 3, month 4 and so on. As a result of the lack of randomisation, the means
of two milk samples taken a month apart, for example, tend to be more highly correlated
than those taken 6 months apart. As a consequence, the precision of the difference between
two monthly means tends to drop as the time interval increases, which rules out the use of
a single standard error of difference for the time factor.
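The loss of precision with increasing time separation can be illustrated with a short sketch. It assumes, purely for illustration, a common variance `sigma2` at every month and a correlation that decays geometrically with the lag (`rho ** lag`). Neither the decay pattern nor the numbers are taken from the cow data, and the function names are hypothetical.

```python
import math


def var_of_difference(sigma2: float, corr: float) -> float:
    """Variance of Y_j - Y_k when both have variance sigma2 and correlation corr."""
    return 2.0 * sigma2 * (1.0 - corr)


def sed_by_lag(sigma2: float, rho: float, n: int, max_lag: int) -> dict:
    """Standard error of the difference between two monthly means, by lag.

    Assumption (illustrative only): months `lag` apart have correlation
    rho**lag, and the same n subjects are measured at both months.
    """
    return {
        lag: math.sqrt(var_of_difference(sigma2, rho ** lag) / n)
        for lag in range(1, max_lag + 1)
    }


if __name__ == "__main__":
    for lag, sed in sed_by_lag(sigma2=1.0, rho=0.8, n=20, max_lag=7).items():
        print(f"lag {lag}: SED = {sed:.3f}")
```

Under these assumptions the standard error of a difference grows with the lag, so no single value can serve for every pair of months.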
This variation in correlation between levels of the time factor means that it is
inappropriate to analyse the data as if time was a randomised factor. This thesis
covers some of the most commonly used techniques. Frequently, there is no single
best approach to analysis. It depends on what questions need to be answered. Often, it
is useful to use two or more approaches with the same data.
The analysis carried out on a set of repeated measures data is determined largely by
the questions the researcher wants answered. For this study, the most important
question is that concerning the vaccine: is there a significant difference between the
mean number of cells found in a litre of milk produced by cows in the treatment group
compared to the control group, over time? Secondary questions may include such
things as the change in cell production over the months irrespective of group
membership. The first question when studying repeated measures data, or indeed any data,
should not be how to analyse the data, but what the experimenter is interested in
finding out (Lindsey, 1993). Once this is known, together with knowledge of the
techniques available, the selection of an appropriate technique becomes much easier.
2. Exploring The Data
The biology department of RMIT has developed a vaccine which is thought to reduce
the number of cells of mastitis (hereafter referred to as 'cells') found in a cow's milk. The
vaccine was tested on a randomly chosen sample of 23 cows, while a randomly
chosen sample of 18 cows was used as the control. Readings of the cell count of each
cow's milk were taken at 2, 3, 4, 5, 6, 7, 8, and 9 months after calving.
One of the attractive features of repeated measures data is that they can be displayed
in a graphical plot which is readily interpretable: little effort or training is
required to interpret the plots (Lindsey, 1993). Data plotting is
essential in order to get some feeling for what patterns are present in the data, whether
expected trends have occurred, what unexpected features are apparent and what
questions deserve analytical consideration. It should always precede detailed analysis
of the data. Figure 1 (a)-(d) shows various plots of the two groups.
Figure 1. Plots of cell counts by MONTH (months 2 to 9, vertical axis 0 to 3500): (a) CONTROL; (b) TREATMENT; (c) and (d) CELLS by MONTH.
The skew towards higher values of cells can be seen over most of the time periods,
although some months are worse than others. The boxplots, where the boxes contain
50% of the data, tend to vary over time in the control group. The pattern of variation
is roughly similar for the treatment group. The treatment group appears to have
comparatively smaller variances over time than the control group. If it were not for
the larger variance at month 3, the variance of the control group would be increasing
with time. The variance of the treatment group looks as if it decreases slightly over
time. Plots of the individual cows' results suggest that cell production in the control
group increases roughly linearly over time, although the trend is not pronounced. The treatment
group does not show any apparent increase over time. There appears to be little
evidence of a quadratic or cubic growth curve from the plots.
The outliers were noted and checked for accuracy with the RMIT Biology department
to make sure there were no transcription errors or similar problems. The outliers were
all legitimate observations.
The sample means of each group are plotted in Figure 2(a). Plotting the medians
(Figure 2(b)) as well as the means allows one to look at the data from a slightly
different perspective, one that is resistant to outliers. Since the observations for any
given month are generally skewed, the medians are a useful adjunct. The control
group's mean response is not as stable as the treatment group's.
Figure 2. Cell counts (CELLS) by MONTH for the control and treatment groups: (a) means; (b) medians.
Fitting ordinary linear regression equations to the data gives an indication of the linear
trend for each group. Figure 3 (a) shows the control group has a positive slope which
highlights the increase in cells over time, while Figure 3(b) shows the treatment
group's slope is negative and flatter. At first it might appear that a possible model for
these observations is the general linear model. The problem is that linear regression,
like ANOVA, assumes that the observations are independent. In Chapter 6
growth curves will be fitted to the data using a multivariate approach which does not
have the restrictive assumption of independence.
Figure 3. Regression plots of CELLS against MONTH. (a) Control group: Y = 280.095 + 48.0218X, R-Sq = 3.9%. (b) Treatment group: Y = 528.168 - 2.35559X, R-Sq = 0.0%.
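As a rough sketch of what lies behind the two regression plots, the following fits a straight line by ordinary least squares and evaluates the two fitted equations reported in Figure 3. The function names are illustrative, and the check on `ols_line` uses points generated from the reported control-group line rather than the original cow measurements, which are not reproduced here.

```python
def ols_line(x, y):
    """Closed-form least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, month):
    """Fitted value of the straight line at a given month."""
    return intercept + slope * month


months = list(range(2, 10))        # months 2..9 after calving
CONTROL = (280.095, 48.0218)       # fitted line reported in Figure 3(a)
TREATMENT = (528.168, -2.35559)    # fitted line reported in Figure 3(b)

if __name__ == "__main__":
    for m in months:
        print(m, round(predict(*CONTROL, m), 1), round(predict(*TREATMENT, m), 1))
    # Sanity check: points lying exactly on a line are recovered by the fit.
    a, b = ols_line(months, [predict(*CONTROL, m) for m in months])
    print(round(a, 3), round(b, 4))
```

Evaluated this way, the control line rises from about 376 cells at month 2 to about 712 at month 9, while the treatment line stays between roughly 507 and 523 throughout. The low R-Sq values indicate that neither line accounts for much of the variation.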
Although plotting the data is imperative with any analysis, it does not allow the
analyst to make anything more than general statements summarizing the apparent
behaviour of the subjects being studied. What is really needed is to be able to quantify
the responses and formally test the research questions given in chapter 1 as
hypotheses. Chapter 3 describes the simplest method for doing this.
3. Time by Time ANOVA
One of the simplest forms of longitudinal analysis is a time-by-time ANOVA. It
consists of p separate analyses, one for each subset of data corresponding to each time
of observation t. For more than two groups each analysis is a conventional ANOVA;
however, since there are only two groups being compared in this study, the ANOVA
reduces to a two-sample t-test of $H_0: \mu_{\text{control}} = \mu_{\text{treatment}}$ at each of the p = 8 times of
measurement. Table 1 shows the time-by-time ANOVA results for the data.
Table 1.
Month         2      3      4      5      6      7      8      9
t         -0.80   0.14  -1.12  -1.56  -0.45  -0.47   1.02   1.21
p-value    0.43   0.89   0.27   0.13   0.65   0.64   0.32   0.24
The time-by-time analysis indicates that the mean cell count does not differ significantly
between the control and treatment groups in any of the 8 months. This suggests that the two mean
response profiles are alike. Month 5 has the largest test statistic in absolute value
(t = -1.56, p = 0.13), but it is not large enough to be significant.
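The calculation carried out at each month can be sketched as follows. This is a minimal illustration that assumes a pooled-variance (equal-variance) two-sample t-test; a Welch test would give slightly different degrees of freedom and p-values. The data in the example are invented placeholders, not the cow measurements, and the function names are hypothetical. The p-value is obtained by numerically integrating the t density, so only the Python standard library is needed.

```python
import math


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean_a - mean_b) / se, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3.0  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * central_area)


def time_by_time(control, treatment):
    """Run one test per time point; inputs map month -> list of responses."""
    results = {}
    for month in sorted(control):
        t, df = pooled_t(control[month], treatment[month])
        results[month] = (t, df, two_sided_p(t, df))
    return results


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    control = {2: [310, 420, 150, 505], 3: [280, 610, 330, 450]}
    treatment = {2: [400, 380, 520, 610], 3: [300, 290, 410, 350]}
    for month, (t, df, p) in time_by_time(control, treatment).items():
        print(f"month {month}: t = {t:.2f}, df = {df}, p = {p:.2f}")
```

Each month is handled in isolation, which is what makes the approach simple and also why the p separate p-values cannot easily be combined into one conclusion.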
A time-by-time ANOVA is reasonably clear and uncomplicated; however, Diggle,
Liang and Zeger (1994) point out its two major limitations. Firstly, it cannot address
questions concerning treatment effects which relate to the longitudinal growth of the
mean response curves, i.e. the growth rates between successive months. Secondly, the
inferences made within each of the p separate tests are not independent of each other,
nor is it clear how they should be combined. For example, a succession of marginally
significant group mean differences may be compelling with weakly correlated data,
but much less so if there are strong correlations between successive observations on
each cow.
The principal virtue of the time-by-time ANOVA approach to longitudinal studies is
its simplicity. The computational operations are elementary and the approach uses
familiar procedures in finding a solution to the problem.
In summary, whilst the time-by-time ANOVA may be useful in particular
circumstances, Diggle, Liang and Zeger (1994) do not recommend it as a viable
approach to longitudinal data analysis.
4. Univariate Approach to Repeated Measures
A more sophisticated approach than the time-by-time t-tests is a repeated measures
analysis of variance (ANOVA). In contrast to the time-by-time approach, Diggle,
Liang and Zeger (1994) regard it as a first attempt to provide a single analysis of a
complete longitudinal data set.
4.1 Repeated Measures ANOVA
Experiments utilising repeated measures designs differ from the usual ANOVA
models in that the levels of time cannot be randomly assigned to the
experimental units: each unit must be observed at the time points in their natural
order, and thus the usual ANOVA models may not
be valid. Because of the non-random assignment of time, the errors corresponding to
the respective experimental units may have a covariance matrix which does not
conform to those for which the usual ANOVA analyses are valid.
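To see what such a covariance matrix looks like in practice, the sketch below computes the p by p sample covariance matrix of the repeated measurements from an N by p layout (one row per cow, one column per month). The simulated data, the serial dependence used to generate them, and all function names are assumptions made for illustration only.

```python
import random


def sample_covariance(rows):
    """p x p sample covariance matrix from N rows of p repeated measures."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
         for k in range(p)]
        for j in range(p)
    ]


def to_correlation(cov):
    """Convert a covariance matrix to the corresponding correlation matrix."""
    sd = [cov[j][j] ** 0.5 for j in range(len(cov))]
    return [[cov[j][k] / (sd[j] * sd[k]) for k in range(len(cov))]
            for j in range(len(cov))]


def simulate_subject(p, rho, rng):
    """Hypothetical serially dependent series: each value carries over a
    fraction rho of the previous one plus fresh noise."""
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(p - 1):
        y.append(rho * y[-1] + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0))
    return y


if __name__ == "__main__":
    rng = random.Random(1999)
    data = [simulate_subject(p=8, rho=0.7, rng=rng) for _ in range(500)]
    corr = to_correlation(sample_covariance(data))
    # First row: correlation of the first time point with each later one.
    print([round(c, 2) for c in corr[0]])
```

Inspecting the off-diagonal entries of such a matrix shows directly whether the dependence between measurements changes with their separation in time.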
The inherent dependence that is associated with repeated measures data introduces
extra complications into the analysis. Unfortunately, the simplifying properties arising
from data which are independently and identically distributed can no longer be relied
upon. To yield conclusions which are valid the analyst must take into account the
possible dependence within subjects. Fortunately, Diggle, Liang and Zeger (1994) and
Vonesh and Chinchilli (1997) have outlined methods which modify the problem so
that independence based methods like ANOVA can be used.
Superficially, the N cows by p months structure of the data resembles that of a
randomised block or split plot design, so there is a temptation to carry out a standard
two factor group × month ANOVA. This approach, however, holds pitfalls for the unwary.
A standard ANOVA model would regard the control and treatment groups as a factor on two
levels and, more importantly, would regard time as a factor on p levels. One of the
difficulties with this approach is that the allocation of times to the p observations
within each cow cannot be randomised.