## Abstract

The purpose of the present article is to explain the calculation of incidence rates in dynamic populations with the use of simple mathematical and statistical concepts. The first part will consider incidence rates in dynamic populations, and how they can best be taught in basic, intermediate and advanced courses. The second part will briefly explain how and why incidence rates are calculated in cohorts.

## Introduction

The calculation of frequencies of disease is the most basic tool for epidemiology. However, the fundamental concept of the nature of the incidence rate and its calculation in dynamic populations is often not well explained, especially in introductory courses. The concept and the method of calculation were already known to William Farr in the 1850s.^{1}^{,}^{2} He thoroughly understood the dynamic population concept and used it imaginatively, e.g. to calculate incidences of nosocomial infections in hospitals.^{3} Farr clearly distinguished what he called the ‘rate of mortality’, which we call today the ‘incidence rate’ (or one of its synonyms, see Box 1), from what he called ‘the probability of death’, which is today’s ‘cumulative incidence’ (or ‘incidence proportion’ or other synonyms, Box 1). He explained in detail the concept of the dynamic population as the basis for the calculation of the ‘rate of mortality’.

William Farr’s knowledge was subsequently largely forgotten in medicine and epidemiology. It survived, with other generally overlooked scholarship, in demography, a discipline that, like epidemiology, sees Farr as one of its pioneers.^{5} The distinction between the two ways of measuring disease frequency was rediscovered in epidemiology in the 1970s.^{2}^{,}^{6–9} Still, the concept of dynamic populations as the basis for incidence rate calculations often remains inadequately understood. This hampers the understanding not only of one of the most fundamental measures in epidemiology, but also of case–control studies. Insight into the calculation of incidence rates in dynamic populations is necessary to understand how the majority of case–control studies are done, and how the odds ratios from such studies should be interpreted, as will be explained in our companion paper.^{10}

**Box 1**

Note that the terms used do not usually make it clear whether incidence rates or cumulative incidences are meant. The origins and early history of the use of different terms have been traced by Turner and Hanley.^{4}

For incidence rates:

- Force of mortality
- Force of morbidity
- Incidence density
- Hazard and hazard rate (mostly in statistics)

For cumulative incidences:

- Attack rate
- Case fatality rate
- Lethality
- Risk
- Incidence proportion

For both (often unspecified which):

- Mortality (rate)
- Morbidity (rate)
- Death rate
- Incidence


## Dynamic and steady state populations

### Basic teaching

#### Cohorts vs dynamic populations in medicine, demography and epidemiology

When epidemiologists embark on a follow-up study in a group of people, i.e. in a population, that population can present itself to them (or be defined by them) in the following two ways: cohorts or dynamic populations.

##### Cohorts

In clinical research, most groups of people that are followed up over time present themselves as cohorts. Think of a group of people whom we follow up from surgery until death, e.g. ‘the 5-year death rate of a cohort of patients who had surgery for colon cancer in a particular hospital during the year 2005’. The hallmark of a cohort is that its membership is fixed, usually by one defining event (see Box 2 for further details): all those who had surgery during the year(s) in which we accrued patients in a study belong to the cohort. All are followed up until a particular disease end point or until the closing date of the study. For example, if 125 people were operated on in 2005 and 34 died within 5 years after surgery, the 5-year risk of death after colon surgery for cancer is 34 per 125, or 27 per 100 individuals. In today’s epidemiological publications, the commonly used word ‘risk’ is mostly replaced by more technical terms, such as ‘cumulative incidence’ or ‘incidence proportion’ (see below).
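The risk calculation in this example can be sketched in a few lines of Python. This is only an illustration: the function name `cumulative_incidence` is our own, and the sketch assumes, as the example does, that every member of the cohort is followed for the full 5 years or until death.

```python
def cumulative_incidence(events: int, cohort_size: int) -> float:
    """Return the risk: persons with the event divided by persons at time zero."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    if not 0 <= events <= cohort_size:
        raise ValueError("events must lie between 0 and cohort_size")
    return events / cohort_size


# Figures from the colon cancer surgery example in the text.
risk_5y = cumulative_incidence(events=34, cohort_size=125)
print(f"5-year risk of death: {risk_5y * 100:.1f} per 100 persons")  # 27.2 per 100
```

The check that the numerator cannot exceed the denominator mirrors the defining property of a risk: everyone counted as a case was a member of the cohort at time zero.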

**Box 2**

By ‘cohort’, we concur with the description of ‘a group of persons which is determined in permanent fashion or a population, which is determined entirely by a single defining event and so becomes permanent’.^{11} Examples are clinical cohorts, such as patients followed up from date of surgery, from date of initiation of a particular drug, or birth cohorts consisting of all persons born in a particular year. The membership of the cohort is fixed by a common event, which is taken as time zero of follow-up. This usage has been long established in epidemiology^{12} and is consistent with the definition in the fifth edition of the Dictionary of Epidemiology.^{13}

By ‘dynamic population’, we refer to populations in which the members vary over time; the membership is not fixed. This is a general characteristic of most populations that one commonly thinks of, such as the population of a town or country: ‘one can be a member at one time, not a member at a later time, a member again and so on’.^{11} This dynamic aspect has also been described as tantamount to observing persons who are in a particular ‘state’, as long as they are in that state, e.g. as long as they are inhabitants of a town or a country.^{12} This usage of ‘dynamic populations’ is consistent with the definition in the fifth edition of the Dictionary of Epidemiology.^{13}

A more complicated example of a dynamic population, which highlights its fundamental characteristics, is an epidemiological study of driving, cellular phone use and accidents. This takes the following form: the observation periods of interest are the periods in which people drive, which they do only intermittently. These observation periods can be divided into subperiods in which the driver is phoning and those in which (s)he is not. What the investigator wants to compare is the incidence rate of accidents while driving and phoning vs while driving and not phoning. In theory, this could be investigated in a cohort study, but the easiest design for this study is a case–control study, as explained in the companion article.^{10}

It is important to emphasize that not all authors use these words in the same way. The term ‘cohort study’ in a publication may indicate either a ‘cohort’ or a ‘dynamic population’; for this reason, some texts refer to the former as ‘fixed cohorts’ to differentiate them from other types of cohort study (see the fifth edition of the Dictionary of Epidemiology for ‘fixed cohort’).

**Advanced technical point**

A distinction has been made between ‘open’ and ‘closed’ populations, as characteristics that might differ according to the time axis along which the investigator looks at the group (s)he is studying.^{11} This distinction is not essential for our basic description of the calculation of incidence rates in dynamic populations and cohorts, but it may become important if the same data are analysed along different time axes. For example, an occupational cohort study can be regarded as ‘fixed’ (and closed) in that all participants join the cohort on the day they start work and never leave (apart from loss to follow-up, mortality); on the other hand, the study population may be regarded as dynamic (and open) in terms of calendar time, with study participants ‘joining’ at different times. The study might be analysed for differences in incidence of disease between people who enter at different calendar times, if exposures are judged to vary over time. Moreover, participants may never ‘leave’ the cohort if they are being followed up indefinitely for cancer incidence, but they may ‘leave’ if the focus is on workplace injuries, in which case follow-up stops when a participant leaves work.

The example above assumes that all patients were followed up for a minimum of 5 years after surgery or until death. When using such examples in teaching, depending on the aim of the course, the example can be made more complicated to take into account people who disappeared from the cohort in other ways, such as loss to follow-up or censoring at the end date of the study, which usually leads to the use of either life tables or calculation of person-years of follow-up in cohorts (see below).

##### Dynamic populations

When demographers think about a population, they think about entities like ‘the population of a country during a particular year’. In a particular country in a particular year, say 2005, a number of people are living on the first day of the year, a number of people are living on each subsequent day of the year and a number of people are alive on the last day of the year. These are not all the same people. During the year, some people leave the country or die, other people come to live in the country and babies are born. Such a population is called a ‘dynamic population’. The hallmark of a dynamic population is that its members vary, and they are defined by a particular ‘state’, such as living in a particular country. A person can live in a country for a number of years, and while living in that country, (s)he is a member of that dynamic population; before and after (s)he is not (see Box 2 for further details).

A dynamic population can be understood intuitively as a regiment of a given size in a modern army. Imagine a regiment with a size of 5000 persons. Each time a soldier leaves the regiment, for whatever reason (death, disease, pensioning and so forth), he or she is replaced by a new recruit. The size of the regiment varies slightly from day to day: on some days there are slightly fewer than 5000, because the new replacement recruits have not yet arrived; on other days there are slightly more, because the new recruits have arrived before the last day of duty of previous recruits. Even on the battlefield, in today’s armies, numbers are sometimes kept constant by flying in new soldiers to replace the dead and wounded. As long as they are members of the regiment, soldiers belong to this dynamic population. Calculations of death rates based on a regiment are straightforward: on average, each day of the year there are 5000 soldiers. Thus, for a year, there are 5000 soldier-years of follow-up. If 63 soldiers die during the year (e.g. in a continuing entrenched war), this would lead to an incidence rate of 63/5000 soldier-years, or 1.3 per 100 soldier-years. This is an incidence rate of death.
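As a minimal sketch of the regiment calculation, the person-time denominator can be written as the average number present multiplied by the length of the observation period. The function name `incidence_rate` and its parameters are our own choices for illustration.

```python
def incidence_rate(events: int, average_population: float, years: float = 1.0) -> float:
    """Events per person-year, with person-years = average number present x years."""
    person_years = average_population * years
    if person_years <= 0:
        raise ValueError("person-time must be positive")
    return events / person_years


# Figures from the regiment example: 63 deaths, 5000 soldiers present on average, 1 year.
rate = incidence_rate(events=63, average_population=5000, years=1.0)
print(f"{rate * 100:.2f} deaths per 100 soldier-years")  # 1.26, reported as 1.3 in the text
```

Note that nothing in the function refers to who the individual soldiers are; only the average number present and the duration of observation enter the denominator.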

In demography, these concepts were already used in the 19th century to calculate population incidence rates. Today, they are still used to calculate death rates in populations of countries, counties or towns; they are also used to calculate ‘cancer rates’, ‘coronary heart disease rates’, ‘birth rates’ or ‘marriage rates’. The numerator of such rates is the number of people who developed the condition (e.g. died, developed cancer or gave birth) during a particular year in a country, in a county or in a town. The denominator is *not* the number of people, because people move in and out of the town, county or country, are born and die. The denominator is the ‘average’ number of people constantly present (alive), multiplied by the amount of time that they are present in the ‘risk period’ (the particular year, in this example); it is expressed as the number of person-years in the population during that particular year. For example, for a cancer registry of a country in which an average of 2 347 465 women of reproductive age (15–45 years) lived each day of a particular calendar year, and 498 cases of breast cancer occurred in that year, the incidence rate of breast cancer is 498 cases per 2 347 465 women-years or 212 per 1 000 000 women-years (in demographic tables of registries or in vital statistics tables the word ‘person-years’ is not used, but the concept is referred to as ‘1 000 000 persons constantly alive’—as if each day of the year there were 1 000 000 persons). Thus, a ‘mortality rate’, a ‘cancer incidence rate’, a ‘marriage rate’ or a ‘birth rate’ are all incidence rates—they are not cumulative incidences or ‘risks’.

Many synonyms exist for the terms that denote risks and rates (see Box 1), which are rooted in the history of these concepts.^{4} Because many names are used for the same concepts, it is often not clear from the terminology which is which, and the reader of the literature has to know how the calculations were actually done to understand whether a particular term denotes a rate or a risk. In this article, we will use the term ‘cumulative incidence’ to denote ‘risk’ as it is the term most widely used at present, although we should note that the term ‘incidence proportion’ has been advocated because ‘cumulative incidence’ has also been used with a slightly different meaning.^{11}

##### A dynamic population can often be seen as in steady state

As a first approximation, for a short period, say a year, dynamic populations of whole countries, counties or towns can be thought about as in ‘steady state’: on each day, the number of people is more or less the same, although it will fluctuate from day to day. Similarly, the proportion of men and women will be approximately the same on each day, and the age distribution will be roughly the same for all days of the year. Blood group distributions will remain the same (blood group distributions in populations vary only slowly, over decades or even centuries; in general, the genetic make-up of populations remains constant for short periods) and also the number of smokers or the number of vegetarians can be assumed to be in steady state (people stop and start being smokers or vegetarians, and smokers and vegetarians move in and out of town or country, or die). Hence, these subpopulations (women, blood group O carriers, smokers, vegetarians and so forth) can be seen as dynamic subpopulations that are approximately in steady state for a relatively short period.

### Advanced teaching

The underlying concepts about dynamic populations were established in demography in the 19th century and the first half of the 20th century and can be found in classic textbooks of demography, usually in mathematical terms, using calculus (i.e. integration and differentiation).^{14} Some epidemiological textbooks cover the principles in depth, but usually in mathematical notation.^{8}^{,}^{9}^{,}^{11} The following paragraphs give an account of the underlying principles with the use of only elementary mathematics.

#### The steady state assumption in more detail

A small-scale example with only six people, presented and explained in Figure 1, helps one to imagine what a dynamic and steady state population of 30-year-olds would look like.

The steady state population assumption uses the idea that people who ‘leave’, either because they die or because they move out, are constantly replaced by the same type of people. From a demographic point of view, this is less far-fetched than it may seem, at least for short periods. Think about some suburb with which you are familiar: when people move or die, other people come to live in their houses; e.g. when a family with three children moves out of a house, it will be replaced on average by a family that is similar, not only in terms of the number of children, but also with regard to socio-economic factors.

The crucial element of this way of thinking is that the population of a suburb in a particular year, say, the year 2005, is *not* the people who lived in that suburb on 1 January 2005 (which would be the way a clinician would think about a cohort, such as patients after surgery), but the *flow* of people who lived in that suburb throughout the year. That flow is calculated as the number present on average, multiplied by the follow-up time, which yields the ‘person-years’ (see Figure 1).

#### What if a population is dynamic but not in steady state?

In real life, dynamic populations are never totally in steady state: towns grow, populations age, neighbourhoods may lose inhabitants or may change with respect to the type of people who live there. However, demographers use a time-honoured and easy solution, which makes the steady state assumption work, even if the underlying dynamic population is not in steady state. We have already used this implicitly in Figure 1 for the simplified example. It consists of taking the estimated population in the middle of the year as an approximation of the ‘average’ number present for the year. If multiplied by the time of observation (the ‘risk period’, 1 year in this case), this yields an approximation of the total number of person-years. Figure 2 presents a graph and an explanation of what this looks like for a population that is not in steady state, i.e. an ageing population.

A real life example for the calculations in Figure 2 is as follows. Consider the population of ‘males aged 60–64 years in 2001 in The Netherlands’, which is a 5-year age group (from the 60th birthday of a person, until the day of his 65th birthday). This population consists of:

- All men who were already aged 60–64 years on 1 January 2001: the 64-year-olds will count up to the date of their 65th birthday, which will be in the year 2001; all others will remain 60–64 during the year and will count for the entire year;
- Plus all 59-year-old men who turned 60 somewhere in between 1 January and 31 December 2001 and stayed 60 years of age for the rest of 2001; these will count from their 60th birthday.

To calculate the number of person-years, we do not need the amount of ‘60–64-year-old-time’ lived by each individual. Instead, we use the following data from the Central Bureau of Statistics of The Netherlands: on 1 January 2001 there were 368 632 men aged 60–64 years, and on 1 January 2002 there were 375 803 men aged 60–64 years. Thus, on average, there were 372 217.5 men alive each day during 2001. The mortality rate is then calculated as the number of 60–64-year-old males who died in 2001, which is 4648, divided by the number of person-years lived by 60–64-year-old males in 2001. As the average number was 372 217.5 and the risk period is 1 year, the number of person-years becomes 372 217.5 × 1.

Using these person–time estimates, the mortality rate will be 4648 per 372 217.5 person-years, which in vital statistics is usually given as a mortality rate of 12 487 per 1 000 000 person-years (mostly called per ‘million constantly alive’ in demographic or vital statistics tables). It should be noted that in this example, the numbers alive on 1 January of subsequent calendar years, as reported in vital statistics publications, are themselves interpolations.
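The arithmetic of this example can be retraced with a short Python sketch. The variable names are ours; the figures are those quoted above for Dutch men aged 60–64 years in 2001.

```python
# Population counts at the start and end of the risk period (from the text).
pop_1_jan_2001 = 368_632
pop_1_jan_2002 = 375_803
deaths_2001 = 4_648
risk_period_years = 1.0

# Average number present, used as an approximation of the number 'constantly alive'.
average_population = (pop_1_jan_2001 + pop_1_jan_2002) / 2
person_years = average_population * risk_period_years

# Mortality rate, scaled to 1 000 000 person-years as in vital statistics tables.
rate_per_million = deaths_2001 / person_years * 1_000_000

print(f"Average population: {average_population}")            # 372217.5
print(f"Mortality rate: {rate_per_million:.0f} per 1 000 000 person-years")  # 12487
```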

Calculating person–time in this manner is tantamount to calculating an ‘area under the curve’ by numerical integration methods (for more details, treatises on statistics or demography should be consulted). The calculation assumes that for short periods, the increase or decrease of a population can be assumed to be linear (see Figure 2), unless something dramatic happens.
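To make the ‘area under the curve’ idea concrete, the following sketch applies the trapezoidal rule to a series of population counts. It is one possible numerical integration method, chosen here because it matches the linearity assumption; the function name `person_time_trapezoid` and the quarterly counts are hypothetical and serve only as illustration.

```python
def person_time_trapezoid(times, counts):
    """Approximate person-time as the area under the population curve.

    times  -- strictly increasing time points (here in years)
    counts -- population size counted at each time point
    Between two counts the population is assumed to change linearly.
    """
    if len(times) != len(counts) or len(times) < 2:
        raise ValueError("need at least two matching time points and counts")
    total = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        total += (counts[i] + counts[i + 1]) / 2 * width
    return total


# With only two counts, the rule reduces to the average of the two counts
# (figures for Dutch men aged 60-64 years in 2001, from the text).
py_two_points = person_time_trapezoid([0.0, 1.0], [368_632, 375_803])

# Hypothetical quarterly counts for a small town, for illustration only.
py_quarterly = person_time_trapezoid(
    [0.0, 0.25, 0.5, 0.75, 1.0], [1000, 1010, 1030, 1020, 1040]
)
print(py_two_points, py_quarterly)
```

If population counts are available more often than once a year, the same function uses them directly, which relaxes the assumption that change is linear over the whole year.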

#### General properties of incidence rates

The cumulative incidence, or risk (also referred to as the incidence proportion),^{11}^{,}^{16} calculated from a cohort, is a dimensionless number: the number of people with a particular event is divided by the number of people present at time zero (the common starting point of follow-up, e.g. the day of having surgery). The numerator is contained in the denominator, and the resulting quantity can by necessity never exceed 1. In contrast, the incidence rate can be >1, depending on the units that are used for person–time or when more than one event is counted for an individual. This can be seen if half of the regiment in the above example is killed in one battle in a single day; then the mortality rate on that day is 2500 deaths per 5000 soldier-days. As a day is 1/365.25 of a year (the 0.25 corrects for leap years), the ‘annualized incidence’ of death because of this 1-day battle, i.e. the rate that would apply if there were such a battle on each day of the year, is (2500/5000) × 365.25 or 183 per person-year. The other way in which incidence rates can be >1 is when more than one outcome event is counted, e.g. when the outcome event is a short disease state: when surveying the incidence of diarrhoea in infants in developing countries, the number of diarrhoeal episodes may easily become >1 per child-year. Counting more than one event in a person is not possible with cumulative incidences, and in some circumstances, it is a distinct advantage of incidence rates.
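Both routes to a rate above 1 can be checked with a few lines of Python. The battle figures come from the text; the diarrhoea figures (360 episodes in 150 child-years) are hypothetical and chosen only to illustrate repeated events.

```python
DAYS_PER_YEAR = 365.25  # the 0.25 corrects for leap years, as in the text

# 1. Change of time unit: half of a 5000-strong regiment killed in a single day.
deaths = 2500
soldier_days = 5000 * 1
rate_per_soldier_day = deaths / soldier_days
rate_per_soldier_year = rate_per_soldier_day * DAYS_PER_YEAR

# 2. Repeated events: hypothetical survey of diarrhoeal episodes in infants.
episodes = 360
child_years = 150.0
episodes_per_child_year = episodes / child_years

print(f"{rate_per_soldier_day} per soldier-day")           # 0.5
print(f"{rate_per_soldier_year:.1f} per soldier-year")     # 182.6, i.e. 183 rounded
print(f"{episodes_per_child_year:.1f} episodes per child-year")  # 2.4
```

The same mortality experience is thus 0.5 per soldier-day or about 183 per soldier-year: the numerical value of a rate has no meaning without its person-time unit.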

The reporting of incidence rates that were >1 was the cause of acrimonious accusations of possible fraud against William Farr and Florence Nightingale, who in the 1860s calculated and compared incidence rates of death in hospitals. These rates were sometimes >1, which is, of course, logical, as in those times more than one person might have died on each hospital bed in a single year. Interestingly, these accusations were rehashed more than 130 years later, and they needed renewed explanations of the underlying principles.^{17}^{,}^{18} Today, a hospice or palliative care unit in which the few beds are in high demand may also present with an incidence rate of death that is >1.

Although the principles are clear, the incidence rate has occasionally come ‘under attack’ during the past decades, in particular, because the person-years concept is not understood or because the fact that more than one episode can be counted is not understood.^{18}^{,}^{19}

An important caveat with the use of incidence rates is that they are assumed to be constant for the time window in which they are measured. In practice, 10 persons followed up for 100 years will usually show a different incidence rate of death in comparison with 1000 persons followed up for 1 year, although both yield ‘1000 person-years’. Thus, one should always clearly define the time windows (risk periods) when estimating incidence rates and reflect on whether the proposed time window, say, a particular age category of persons for a number of calendar years, is likely to have a reasonably constant incidence rate.^{11} If not, follow-up time should be divided into finer strata, to separately estimate mortality rates in different age groups or periods.
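A small sketch shows why finer strata matter. The deaths and person-years below are entirely hypothetical; they are chosen so that the pooled rate hides a thirty-fold difference between the youngest and oldest age groups.

```python
# Hypothetical age strata: (label, deaths, person-years).
strata = [
    ("40-59", 12, 6000.0),
    ("60-79", 45, 3000.0),
    ("80+", 60, 1000.0),
]


def rate_per_1000(deaths, person_years):
    """Incidence rate expressed per 1000 person-years."""
    return deaths / person_years * 1000


# Stratum-specific rates, each assumed roughly constant within its own stratum.
stratum_rates = {label: rate_per_1000(d, py) for label, d, py in strata}

# Pooled rate over all person-time, ignoring age.
total_deaths = sum(d for _, d, _ in strata)
total_person_years = sum(py for _, _, py in strata)
pooled_rate = rate_per_1000(total_deaths, total_person_years)

for label, value in stratum_rates.items():
    print(f"{label}: {value:.1f} per 1000 person-years")
print(f"Pooled: {pooled_rate:.1f} per 1000 person-years")
```

The pooled figure is a person-time-weighted average of the stratum rates, so it depends on how the person-years happen to be distributed over the strata.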

On the other hand, this property of incidence rates is at the same time their main advantage: an incidence rate gives insight into the strength of the morbidity or mortality in a dynamic population and is a kind of ‘constant’ characteristic of that population. This is in contrast to risk calculations from cohorts, which always approach 1 as the follow-up time becomes longer, because ‘in the long run, we are all dead’.^{20} This is also the reason that incidence rates are sometimes seen as a more basic concept than risks.

Finally, there is an intriguing relationship between incidence rates and life expectancy. In a population that is in perfect steady state, with a constant incidence rate of death, the life expectancy is simply the inverse of the incidence rate. This can be understood because the incidence rate is the number of deaths divided by all years lived, whereas the life expectancy is the number of years lived, divided by the number of persons who lived them.
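The verbal argument can be written out in symbols. The notation is ours and is introduced only for this sketch: $D$ is the number of deaths, $T$ the total person-years lived and $N$ the number of persons who lived them.

$$
\mathrm{IR} = \frac{D}{T}, \qquad e = \frac{T}{N}
$$

In a perfect steady state in which every person is followed until death, each person contributes exactly one death, so $D = N$ and therefore

$$
e = \frac{T}{N} = \frac{T}{D} = \frac{1}{\mathrm{IR}}
$$

As a purely hypothetical illustration, a constant mortality rate of 0.0125 per person-year (12 500 per 1 000 000 person-years) would correspond to a life expectancy of $1/0.0125 = 80$ years.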

## Person-years calculations in cohorts

Person-years can also be calculated from cohorts. Doll and Hill used person-years as denominators in the 1956 report of their follow-up study of smoking and lung cancer in British doctors.^{21} They used an elegant and simple pre-computer-age procedure: they estimated the number of doctors alive in each age category at one particular date of each follow-up year, and then averaged over the successive years, as explained by MacMahon and Pugh.^{22} In his influential 1937 textbook on medical statistics, which was based on a series of educational articles in the *Lancet* and was still being reprinted and revised 40 years later, Austin Bradford Hill advocated calculating person–time in cohorts to avoid the fallacy of ‘neglecting the period of exposure to risk’.^{23} Unfortunately, he did not introduce the concept as formally as he did with life tables and survival in cohorts, to which he devoted a full chapter.

#### Comparisons with the general population

The use of person-years calculations is pivotal to comparing morbidity and mortality in cohorts (with fixed membership) with that in the general population (with variable membership). One application is in occupational health, where incidence rates of diseases in a particular occupational cohort are compared with corresponding incidence rates in the general population. A time-honoured way of making such comparisons is by direct or indirect standardization (indirect standardization yields the standardized mortality ratio: it applies the incidence rates of disease in the general population to the person-years in the cohort to compare the observed and expected numbers of diseased or deceased persons, either cause-specific or general).^{24} Similar calculations were already done in William Farr’s time,^{1} and they are still a standard way to analyse occupational disease and occupational mortality data in today’s medical literature, e.g. for radon exposure in uranium mining.^{25}
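Indirect standardization as described above can be sketched as follows. All numbers are hypothetical: `reference_rates` stands for age-specific mortality rates in the general population, `cohort_person_years` for the person-years accumulated by the occupational cohort in each age group, and `observed_deaths` for the deaths counted in the cohort.

```python
# Hypothetical age-specific death rates in the general population (per person-year).
reference_rates = {"40-49": 0.002, "50-59": 0.005, "60-69": 0.012}

# Hypothetical person-years contributed by the occupational cohort in each age group.
cohort_person_years = {"40-49": 4000.0, "50-59": 3000.0, "60-69": 1000.0}

# Hypothetical number of deaths actually observed in the cohort.
observed_deaths = 45

# Expected deaths: general-population rates applied to the cohort's person-years.
expected_deaths = sum(
    reference_rates[age] * cohort_person_years[age] for age in cohort_person_years
)

# Standardized mortality ratio: observed divided by expected.
smr = observed_deaths / expected_deaths

print(f"Expected deaths: {expected_deaths:.1f}")  # 35.0
print(f"SMR: {smr:.2f}")                          # 1.29
```

An SMR above 1 means that more deaths were observed in the cohort than would be expected if it had experienced the general-population rates, given its own age distribution of person-time.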

Another example is the comparison of patient cohorts with the general population, for instance, the development of ‘secondary malignancies’ after a patient is treated for a first malignancy. The frequency of second malignancies is then compared with the baseline rate of the same type of malignancy in the general population. Such comparisons can also be done for patients who have been treated differently, by radiotherapy or chemotherapy in comparison with the general population—e.g. during long-term follow-up of children treated for acute leukaemia.^{26}

Although all the above examples are about ‘person-years’, of course, one can also use person-months, person-days or even person-hours. Person-days were already used by William Farr to calculate the diminishing mortality due to smallpox during the course of the disease.^{27} They are still used today, e.g. bed-days are used to calculate the incidence of nosocomial infections in the early days of hospitalization vs later days of hospitalization. Person-days of being at a certain level of anticoagulation have been used to look for the optimal level of anticoagulation in patients with different indications, i.e. the level with the least thrombosis, but also the least bleeding.^{28} Person-hours of being at a certain level of heparinization in an intensive care unit have been used to calculate the optimal level of such anticoagulation therapy during acute haemofiltration.^{29}

## Relationships between risks and rates

Incidence rates, as calculated based on person-years, can be used to estimate cumulative incidences. For small time windows or when the disease is rare, which is almost always the case when the follow-up time is short, incidence rates and cumulative incidences (risks) that are estimated for the same follow-up period become numerically indistinguishable. This can be seen if one imagines a population of, say, 341 874 adults who are followed up for a single day; if the number of deaths in that day is 23, then the cumulative incidence of death is 23/341 874 or 6.7 per 100 000 persons, whereas the incidence rate of death would be 23/[341 874 – (23/2)] person-days (which amounts to subtracting half of the number of people who died, as an approximation of the number of half-days not lived on that day) and, when expressed per 100 000 person-days, is also 6.7. If the incidence rate is larger, and/or follow-up time is longer, the calculation involves an exponential assumption, based on principles of calculus, because the same incidence rate will act on an ever-smaller cohort.^{7–9}^{,}^{11}
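The exponential relationship can be illustrated with a short sketch. It assumes a constant incidence rate acting on a closed cohort with no other exits, in which case the risk over a period of length t is 1 − exp(−rate × t). The function name `risk_from_rate` is ours, and the second example (a rate of 0.5 per person-year over 2 years) is hypothetical.

```python
import math


def risk_from_rate(rate: float, duration: float) -> float:
    """Cumulative incidence implied by a constant rate: 1 - exp(-rate * duration)."""
    # expm1 keeps precision when rate * duration is very small.
    return -math.expm1(-rate * duration)


# One-day example from the text: 23 deaths among 341 874 adults.
deaths, persons = 23, 341_874
risk_direct = deaths / persons
rate_per_day = deaths / (persons - deaths / 2)
risk_via_rate = risk_from_rate(rate_per_day, 1.0)

print(f"{risk_direct * 100_000:.1f} per 100 000 persons")       # 6.7
print(f"{rate_per_day * 100_000:.1f} per 100 000 person-days")  # 6.7
print(f"{risk_via_rate * 100_000:.1f} per 100 000 persons")     # 6.7

# Hypothetical high rate and longer follow-up: rate x time would give 1.0,
# but the risk implied by the exponential formula is about 0.63.
risk_high = risk_from_rate(0.5, 2.0)
print(f"{risk_high:.2f}")  # 0.63
```

In the second example, simply multiplying rate by time overstates the risk, because it ignores that the cohort at risk shrinks as events occur.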

The inverse calculation is also possible, e.g. from randomized trials in which estimates of risk are usually given. An incidence rate can be estimated when the initial number of people in the treatment arms and the average follow-up time in the trial are known (often given in trial reports), as the multiplication of average follow-up time by the number of people in the trial equals the number of person-years of follow-up; the incidence rate is obtained when the number of outcomes in the trial (usually also given) is divided by this number of person-years.
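As a sketch of this inverse calculation, the following function derives an approximate incidence rate from the summary figures typically found in a trial report. The function name `rate_from_trial` and the trial figures (2000 participants, mean follow-up of 3.5 years, 140 outcome events) are hypothetical.

```python
def rate_from_trial(events: int, participants: int, mean_follow_up_years: float) -> float:
    """Approximate incidence rate: events / (participants x mean follow-up)."""
    person_years = participants * mean_follow_up_years
    if person_years <= 0:
        raise ValueError("person-time must be positive")
    return events / person_years


# Hypothetical treatment arm of a randomized trial.
trial_rate = rate_from_trial(events=140, participants=2000, mean_follow_up_years=3.5)
print(f"{trial_rate * 100:.1f} events per 100 person-years")  # 2.0
```

The result is an approximation whose quality depends on how the trial report defines and computes average follow-up time.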

In statistics, the ‘hazard’ or ‘hazard rate’ is a peculiar form of incidence rate wherein the follow-up time approaches the limit of zero and becomes infinitesimally small, which is often called an ‘instantaneous hazard’. It creates a situation in which there is no more numerical difference between incidence rates and cumulative incidences. It is used, among other applications, in the proportional hazards model (see our related article on case–control studies).^{10}

Importantly, estimation of incidence rates through person-years (or person-days or person-hours) permits, in principle, total flexibility of multivariable analyses, i.e. adding several variables to the analysis by using a Poisson model and slicing up person–time in different ways.
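The simplest instance of such an analysis is a comparison of two slices of person-time. The sketch below computes a rate ratio with an approximate 95% confidence interval from the standard error of its logarithm; this is what a Poisson model with a single binary exposure and person-time as offset reduces to. The function name and the data are hypothetical, and a full multivariable Poisson regression would need a statistics library, which is not shown here.

```python
import math


def rate_ratio_with_ci(events_exposed, py_exposed, events_unexposed, py_unexposed, z=1.96):
    """Rate ratio with an approximate confidence interval on the log scale."""
    rate_exposed = events_exposed / py_exposed
    rate_unexposed = events_unexposed / py_unexposed
    ratio = rate_exposed / rate_unexposed
    # Standard error of the log rate ratio under the Poisson assumption.
    se_log_ratio = math.sqrt(1 / events_exposed + 1 / events_unexposed)
    lower = ratio * math.exp(-z * se_log_ratio)
    upper = ratio * math.exp(z * se_log_ratio)
    return ratio, lower, upper


# Hypothetical data: 30 events in 10 000 exposed person-years,
# 20 events in 20 000 unexposed person-years.
ratio, lower, upper = rate_ratio_with_ci(30, 10_000.0, 20, 20_000.0)
print(f"Rate ratio {ratio:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

Slicing the same person-time by further variables (age, calendar period, exposure level) yields one such row of events and person-years per combination, which is the input format a Poisson model works with.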

## Conclusions

In addition to the concepts of cumulative incidence (‘risk’) calculation in cohorts, the calculation of incidence rates using person-years in dynamic populations should be taught thoroughly in basic courses of epidemiology. In fact, from a population perspective, incidence rates could be considered a more basic notion than risks. It is important to teach that person–time calculations can be done in dynamic populations and cohorts, whereas risk calculations can only be done directly in cohorts. Moreover, a basic understanding of incidence rates is pivotal to understanding case–control studies,^{10} the analyses of many cohort studies and the basic demographic measures that are used in public health.

## Funding

Jan P Vandenbroucke is an Academy Professor of the Royal Netherlands Academy of Arts and Sciences. The Centre for Public Health Research is supported by a Programme Grant from the Health Research Council of New Zealand.

**Conflict of interest:** None declared.

## Acknowledgements

The authors thank the editors, the anonymous reviewers and Dr J Hanley for their constructive comments that improved the article.

## References

*Vol. 1—The Analysis of Case-Control Studies*. Lyon: IARC, 1980, pp. 42–53

*N Engl J Med* 2003;349:1299