As the need grows for conceptualization, formalization, and abstraction in biology, so too does mathematics' relevance to the field (Fagerström et al. 1996). Mathematics is particularly important for analyzing and characterizing random variation of, for example, size and weight of individuals in populations, their sensitivity to chemicals, and time-to-event cases, such as the amount of time an individual needs to recover from illness. The frequency distribution of such data is a major factor determining the type of statistical analysis that can be validly carried out on any data set. Many widely used statistical methods, such as ANOVA (analysis of variance) and regression analysis, require that the data be normally distributed, but only rarely is the frequency distribution of data tested when these techniques are used.

The Gaussian (normal) distribution is most often assumed to describe the random variation that occurs in the data from many scientific disciplines; the well-known bell-shaped curve can easily be characterized and described by two values: the arithmetic mean x and the standard deviation s, so that data sets are commonly described by the expression x ± s. A historical example of a normal distribution is that of chest measurements of Scottish soldiers made by Quetelet, Belgian founder of modern social statistics (Swoboda 1974). In addition, such disparate phenomena as milk production by cows and random deviations from target values in industrial processes fit a normal distribution.

However, many measurements show a more or less skewed distribution. Skewed distributions are particularly common when mean values are low, variances large, and values cannot be negative, as is the case, for example, with species abundance, lengths of latent periods of infectious diseases, and distribution of mineral resources in the Earth's crust. Such skewed distributions often closely fit the log-normal distribution (Aitchison and Brown 1957, Crow and Shimizu 1988, Lee 1992, Johnson et al. 1994, Sachs 1997). Examples fitting the normal distribution, which is symmetrical, and the log-normal distribution, which is skewed, are given in Figure 1. Note that body height fits both distributions.

Often, biological mechanisms induce log-normal distributions (Koch 1966), as when, for instance, exponential growth is combined with further symmetrical variation: With a mean concentration of, say, 10⁶ bacteria, one cell division more—or less—will lead to 2 × 10⁶—or 5 × 10⁵—cells. Thus, the range will be asymmetrical—to be precise, multiplied or divided by 2 around the mean. The skewed size distribution may be why “exceptionally” big fruit are reported in journals year after year in autumn. Such exceptions, however, may well be the rule: Inheritance of fruit and flower size has long been known to fit the log-normal distribution (Groth 1914, Powers 1936, Sinnot 1937).

What is the difference between normal and log-normal variability? Both forms of variability are based on a variety of forces acting independently of one another. A major difference, however, is that the effects can be additive or multiplicative, thus leading to normal or log-normal distributions, respectively.

Some basic principles of additive and multiplicative effects can easily be demonstrated with the help of two ordinary dice with sides numbered from 1 to 6. Adding the two numbers, which is the principle of most games, leads to values from 2 to 12, with a mean of 7, and a symmetrical frequency distribution. The total range can be described as 7 plus or minus 5 (that is, 7 ± 5) where, in this case, 5 is not the standard deviation. Multiplying the two numbers, however, leads to values between 1 and 36 with a highly skewed distribution. The total variability can be described as 6 multiplied or divided by 6 (or 6 ×/ 6). In this case, the symmetry has moved to the multiplicative level.
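For readers who prefer to experiment, the dice example can be reproduced with a few lines of code. The following sketch (Python with NumPy; not part of the original article) simulates many throws and tabulates the additive and multiplicative combinations; the sample size and random seed are arbitrary choices.

```python
# Sketch of the two-dice example: the same random inputs combined additively
# and multiplicatively give a symmetric and a skewed distribution, respectively.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
d1 = rng.integers(1, 7, size=n)   # first die, values 1..6
d2 = rng.integers(1, 7, size=n)   # second die

sums = d1 + d2        # additive combination: symmetric, values 2..12, centred on 7
products = d1 * d2    # multiplicative combination: skewed, values 1..36

print("sums:     range %d..%d  (7 - 5 to 7 + 5)" % (sums.min(), sums.max()))
print("products: range %d..%d  (6 / 6 to 6 * 6)" % (products.min(), products.max()))
# Frequency tables show the symmetric versus skewed shapes described in the text.
for name, x in (("sum", sums), ("product", products)):
    values, counts = np.unique(x, return_counts=True)
    print(name, list(zip(values.tolist(), counts.tolist())))
```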

Although these examples are neither normal nor log-normal distributions, they do clearly indicate that additive and multiplicative effects give rise to different distributions. Thus, we cannot describe both types of distribution in the same way. Unfortunately, however, common belief has it that quantitative variability is generally bell shaped and symmetrical. The current practice in science is to use symmetrical bars in graphs to indicate standard deviations or errors, and the sign ± to summarize data, even though the data or the underlying principles may suggest skewed distributions (Factor et al. 2000, Keesing 2000, Le Naour et al. 2000, Rhew et al. 2000). In a number of cases the variability is clearly asymmetrical because subtracting three standard deviations from the mean produces negative values, as in the example 100 ± 50. Moreover, the example of the dice shows that the established way to characterize symmetrical, additive variability with the sign ± (plus or minus) has its equivalent in the handy sign ×/ (times or divided by), which will be discussed further below.

Log-normal distributions are usually characterized in terms of the log-transformed variable, using as parameters the expected value, or mean, of its distribution, and the standard deviation. This characterization can be advantageous as, by definition, log-normal distributions are symmetrical again at the log level.

Unfortunately, the widespread aversion to statistics becomes even more pronounced as soon as logarithms are involved. This may be the major reason that log-normal distributions are so little understood in general, which leads to frequent misunderstandings and errors. Plotting the data can help, but graphs are difficult to communicate orally. In short, current ways of handling log-normal distributions are often unwieldy.

To get an idea of a sample, most people prefer to think in terms of the original rather than the log-transformed data. This conception is indeed feasible and advisable for log-normal data, too, because the familiar properties of the normal distribution have their analogies in the log-normal distribution. To improve comprehension of log-normal distributions, to encourage their proper use, and to show their importance in life, we present a novel physical model for generating log-normal distributions, thus filling a 100-year-old gap. We also demonstrate the evolution and use of parameters allowing characterization of the data at the original scale. Moreover, we compare log-normal distributions from a variety of branches of science to elucidate patterns of variability, thereby reemphasizing the importance of log-normal distributions in life.

A physical model demonstrating the genesis of log-normal distributions

There was reason for Galton (1889) to complain about colleagues who were interested only in averages and ignored random variability. In his thinking, variability was even part of the “charms of statistics.” Consequently, he presented a simple physical model to give a clear visualization of binomial and, finally, normal variability and its derivation.

Figure 2a shows a further development of this “Galton board,” in which particles fall down a board and are deviated at decision points (the tips of the triangular obstacles) either left or right with equal probability. (Galton used simple nails instead of the isosceles triangles shown here, so his invention resembles a pinball machine or the Japanese game Pachinko.) The normal distribution created by the board reflects the cumulative additive effects of the sequence of decision points.

A particle leaving the funnel at the top meets the tip of the first obstacle and is deviated to the left or right by a distance c with equal probability. It then meets the corresponding triangle in the second row, and is again deviated in the same manner, and so forth. The deviation of the particle from one row to the next is a realization of a random variable with possible values +c and −c, and with equal probability for both of them. Finally, after passing r rows of triangles, the particle falls into one of the r + 1 receptacles at the bottom. The probabilities of ending up in these receptacles, numbered 0, 1,…,r, follow a binomial law with parameters r and p = 0.5. When many particles have made their way through the obstacles, the height of the particles piled up in the several receptacles will be approximately proportional to the binomial probabilities.

For a large number of rows, the probabilities approach a normal density function according to the central limit theorem. In its simplest form, this mathematical law states that the sum of many (r) independent, identically distributed random variables is, in the limit as r→∞, normally distributed. Therefore, a Galton board with many rows of obstacles shows normal density as the expected height of particle piles in the receptacles, and its mechanism captures the idea of a sum of r independent random variables.
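The additive board can be mimicked numerically. The sketch below (NumPy; our own illustration, not the authors' implementation) draws the r deviations of ±c for each particle and sums them; with many rows the resulting positions are approximately normal with standard deviation c√r. The values of r, c, and the number of particles are arbitrary.

```python
# Sketch of the additive Galton board: each of r rows deviates a particle by +c
# or -c with equal probability; the final position is the sum of the deviations.
import numpy as np

rng = np.random.default_rng(1)
r, c, n_particles = 50, 1.0, 200_000
steps = rng.choice([-c, c], size=(n_particles, r))   # one deviation per row
positions = steps.sum(axis=1)                        # final horizontal positions

print("mean %.3f (expected 0), std %.3f (expected %.3f)"
      % (positions.mean(), positions.std(), c * np.sqrt(r)))
```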

Figure 2b shows how Galton's construction was modified to describe the distribution of a product of such variables, which ultimately leads to a log-normal distribution. To this aim, scalene triangles are needed (although they appear to be isosceles in the figure), with the longer side to the right. Let the distance from the left edge of the board to the tip of the first obstacle below the funnel be xm. The lower corners of the first triangle are at xm · c and xm/c (ignoring the gap necessary to allow the particles to pass between the obstacles). Therefore, the particle meets the tip of a triangle in the next row at X = xm · c, or X = xm /c, with equal probabilities for both values. In the second and following rows, the triangles with the tip at distance x from the left edge have lower corners at x · c and x/c (up to the gap width). Thus, the horizontal position of a particle is multiplied in each row by a random variable with equal probabilities for its two possible values c and 1/c.

Once again, the probabilities of particles falling into any receptacle follow the same binomial law as in Galton's device, but because the receptacles on the right are wider than those on the left, the accumulated particles form a skewed “histogram,” with the tallest piles on the left and a long tail extending to the right. For a large number of rows, the heights approach a log-normal distribution. This follows from the multiplicative version of the central limit theorem, which proves that the product of many independent, identically distributed, positive random variables has approximately a log-normal distribution. Computer implementations of the models shown in Figure 2 also are available at the Web site http://stat.ethz.ch/vis/log-normal (Gut et al. 2000).
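Independently of the web implementation cited above, the multiplicative board can be sketched in the same style (NumPy; illustrative parameters of our own choosing): each row multiplies the horizontal position by c or 1/c, so the final position is x_m times c raised to a sum of ±1 steps, and the logarithm of the position is normally distributed.

```python
# Sketch of the multiplicative board of Figure 2b: the horizontal position is
# multiplied by c or 1/c in each of r rows, giving a log-normal limit.
import numpy as np

rng = np.random.default_rng(2)
r, c, x_m, n_particles = 50, 1.05, 1.0, 200_000
signs = rng.choice([-1, 1], size=(n_particles, r))
positions = x_m * c ** signs.sum(axis=1)     # product of r factors, each c or 1/c

logs = np.log(positions)
print("median of positions: %.3f (stays at the entry point x_m)" % np.median(positions))
print("std of log-positions: %.3f (expected %.3f)" % (logs.std(), np.log(c) * np.sqrt(r)))
print("multiplicative standard deviation s* = %.3f" % np.exp(logs.std()))
```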

J. C. Kapteyn designed the direct predecessor of the log-normal machine (Kapteyn 1903, Aitchison and Brown 1957). For that machine, isosceles triangles were used instead of the skewed shape described here. Because the triangles' width is proportional to their horizontal position, this model also leads to a log-normal distribution. However, the isosceles triangles with increasingly wide sides to the right of the entry point have a hidden logical disadvantage: The median of the particle flow shifts to the left. In contrast, there is no such shift and the median remains below the entry point of the particles in the log-normal board presented here (which was designed by author E. L.). Moreover, the isosceles triangles in the Kapteyn board create additive effects at each decision point, in contrast to the multiplicative, log-normal effects apparent in Figure 2b.

Consequently, the log-normal board presented here is a physical representation of the multiplicative central limit theorem in probability theory.

Basic properties of log-normal distributions

The basic properties of the log-normal distribution were established long ago (Weber 1834, Fechner 1860, 1897, Galton 1879, McAlister 1879, Gibrat 1931, Gaddum 1945), and it is not difficult to characterize log-normal distributions mathematically. A random variable X is said to be log-normally distributed if log(X) is normally distributed (see the box). Only positive values are possible for the variable, and the distribution is skewed, with a long tail toward high values (Figure 3a).

Two parameters are needed to specify a log-normal distribution. Traditionally, the mean µ and the standard deviation σ (or the variance σ²) of log(X) are used (Figure 3b). However, there are clear advantages to using “back-transformed” values (the values are in terms of x, the measured data):

$$\mu^{*} = e^{\mu}, \qquad \sigma^{*} = e^{\sigma}. \tag{1}$$

We then use X ∼ Λ(µ*,σ*) as a mathematical expression meaning that X is distributed according to the log-normal law with median µ* and multiplicative standard deviation σ*.

The median of this log-normal distribution is med(X) = µ* = e^µ, since µ is the median of log(X). Thus, the probability that the value of X is greater than µ* is 0.5, as is the probability that the value is less than µ*. The parameter σ*, which we call multiplicative standard deviation, determines the shape of the distribution. Figure 4 shows density curves for some selected values of σ*. Note that µ* is a scale parameter; hence, if X is expressed in different units (or multiplied by a constant for other reasons), then µ* changes accordingly but σ* remains the same.
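For readers working numerically, this parametrization maps directly onto standard software. The sketch below (SciPy; parameter values are illustrative, not taken from the article) builds a log-normal distribution from µ* and σ*, checks that µ* is the median, and verifies that a change of units changes µ* but not σ*.

```python
# Sketch: X ~ Lambda(mu*, sigma*) expressed with scipy.stats.lognorm, whose shape
# parameter is sigma = log(sigma*) and whose scale is exp(mu) = mu*.
import numpy as np
from scipy import stats

mu_star, sigma_star = 100.0, 2.0   # median and multiplicative standard deviation
dist = stats.lognorm(s=np.log(sigma_star), scale=mu_star)

print("median      : %.1f  (= mu*)" % dist.median())
print("P(X <= mu*) : %.2f" % dist.cdf(mu_star))

# Multiplying X by a constant k only shifts log(X), so mu* scales while sigma* stays put.
x = dist.rvs(size=100_000, random_state=0)
for k in (1.0, 0.001):
    logs = np.log(k * x)
    print("k = %-6g  mu* ~ %.4g   sigma* ~ %.3f" % (k, np.exp(logs.mean()), np.exp(logs.std())))
```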

Distributions are commonly characterized by their expected value µ and standard deviation σ. In applications for which the log-normal distribution adequately describes the data, these parameters are usually less easy to interpret than the median µ* (McAlister 1879) and the shape parameter σ*. It is worth noting that σ* is related to the coefficient of variation by a monotonic, increasing transformation (see the box below, eq. 2).

For normally distributed data, the interval µ ± σ covers a probability of 68.3%, while µ ± 2σ covers 95.5% (Table 1). The corresponding statements for log-normal quantities are  
$$\mathrm{P}\!\left[\mu^{*}/\sigma^{*} \le X \le \mu^{*}\cdot\sigma^{*}\right] = 68.3\%, \qquad \mathrm{P}\!\left[\mu^{*}/(\sigma^{*})^{2} \le X \le \mu^{*}\cdot(\sigma^{*})^{2}\right] = 95.5\%.$$

This characterization shows that the operations of multiplying and dividing, which we denote with the sign ×/ (times/divide), help to determine useful intervals for log-normal distributions (Figure 3), in the same way that the operations of adding and subtracting (±, or plus/minus) do for normal distributions. Table 1 summarizes and compares some properties of normal and log-normal distributions.
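The coverage of these ×/ intervals is easy to confirm numerically. The following sketch (SciPy; µ* and σ* chosen arbitrarily for illustration) integrates the density over µ* ×/ σ* and µ* ×/ (σ*)² and recovers 68.3% and 95.5%.

```python
# Sketch: probability contained in the multiplicative one- and two-sigma* intervals.
import numpy as np
from scipy import stats

mu_star, sigma_star = 100.0, 2.0
dist = stats.lognorm(s=np.log(sigma_star), scale=mu_star)

for k in (1, 2):
    lo, hi = mu_star / sigma_star**k, mu_star * sigma_star**k
    print("mu* x/ sigma*^%d : [%6.1f, %6.1f] covers %.1f%%"
          % (k, lo, hi, 100 * (dist.cdf(hi) - dist.cdf(lo))))
```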

The sum of several independent normal variables is itself a normal random variable. For quantities with a log-normal distribution, however, multiplication is the relevant operation for combining them in most applications; for example, the product of concentrations determines the speed of a simple chemical reaction. The product of independent log-normal quantities also follows a log-normal distribution. The median of this product is the product of the medians of its factors. The formula for σ* of the product is given in the box below (eq. 3).
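A quick simulation illustrates these multiplication rules (NumPy; the medians and multiplicative standard deviations below are invented for the example). The empirical median and s* of the product agree with the product of the medians and with equation 3.

```python
# Sketch: product of two independent log-normal variables; medians multiply and
# the shape parameter follows eq. 3 because log-variances add.
import numpy as np

rng = np.random.default_rng(3)
mu1_star, s1_star = 10.0, 1.5
mu2_star, s2_star = 4.0, 2.0
n = 500_000

x1 = rng.lognormal(np.log(mu1_star), np.log(s1_star), n)
x2 = rng.lognormal(np.log(mu2_star), np.log(s2_star), n)
prod = x1 * x2

s_star_pred = np.exp(np.sqrt(np.log(s1_star) ** 2 + np.log(s2_star) ** 2))
print("median of product: %.2f (predicted %.2f)" % (np.median(prod), mu1_star * mu2_star))
print("s* of product    : %.3f (predicted %.3f)" % (np.exp(np.log(prod).std()), s_star_pred))
```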

For a log-normal distribution, the most precise (i.e., asymptotically most efficient) method for estimating the parameters µ* and σ* relies on log transformation. The mean and empirical standard deviation of the logarithms of the data are calculated and then back-transformed, as in equation 1. These estimators are called x* and s*, where x* is the geometric mean of the data (McAlister 1879; eq. 4 in the box below). More robust but less efficient estimates can be obtained from the median and the quartiles of the data, as described in the box below.
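In code, the estimation procedure of equation 4 amounts to a short calculation on the log scale. The sketch below (NumPy; simulated data with arbitrary true parameters) computes x* as the geometric mean and s* as the back-transformed standard deviation of the logarithms.

```python
# Sketch of eq. 4: log-transform, take mean and standard deviation, back-transform.
import numpy as np

rng = np.random.default_rng(4)
data = rng.lognormal(mean=np.log(3.0), sigma=np.log(2.2), size=1000)  # simulated sample

logs = np.log(data)
x_star = np.exp(logs.mean())          # geometric mean of the data
s_star = np.exp(logs.std(ddof=1))     # multiplicative standard deviation

print("x* = %.2f   s* = %.2f   (summary: %.2f x/ %.2f)" % (x_star, s_star, x_star, s_star))
```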

As noted previously, it is not uncommon for data with a log-normal distribution to be characterized in the literature by the arithmetic mean x and the standard deviation s of a sample, but it is still possible to obtain estimates for µ* and σ* (see the box). For example, Stehmann and De Waard (1996) describe their data as log-normal, with the arithmetic mean x and standard deviation s given as 4.1 ± 3.7. Taking the log-normal nature of the distribution into account, the probability of the corresponding x ± s interval (0.4 to 7.8) turns out to be 88.4% instead of 68.3%. Moreover, 65% of the population lie below the mean, almost all of them within only one standard deviation of it. In contrast, the proposed characterization, which uses the geometric mean x* and the multiplicative standard deviation s*, reads 3.0 ×/ 2.2 (1.36 to 6.6). This interval covers approximately 68% of the data and thus is more appropriate than the other interval for the skewed data.
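The conversion used in this example can be reproduced from the moment formulas in the box. The sketch below (NumPy/SciPy; only the published summary 4.1 ± 3.7 is used as input) recovers x* ×/ s* ≈ 3.0 ×/ 2.2, the 88.4% coverage of the x ± s interval, and the 65% of the population lying below the mean; small deviations from the rounded figures quoted above are due to rounding.

```python
# Sketch: convert an arithmetic summary x_bar +/- s into x* x/ s* under log-normality.
import numpy as np
from scipy import stats

x_bar, s = 4.1, 3.7
omega = 1.0 + (s / x_bar) ** 2             # omega = 1 + cv^2 (see the box)
x_star = x_bar / np.sqrt(omega)            # estimate of the median mu*
s_star = np.exp(np.sqrt(np.log(omega)))    # multiplicative standard deviation

dist = stats.lognorm(s=np.log(s_star), scale=x_star)
print("x* x/ s*              : %.1f x/ %.1f  -> interval [%.2f, %.1f]"
      % (x_star, s_star, x_star / s_star, x_star * s_star))
print("P in x_bar +/- s      : %.1f%%" % (100 * (dist.cdf(x_bar + s) - dist.cdf(x_bar - s))))
print("P(X < arithmetic mean): %.1f%%" % (100 * dist.cdf(x_bar)))
```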

Comparing log-normal distributions across the sciences

Examples of log-normal distributions from various branches of science reveal interesting patterns (Table 2). In general, values of s* vary between 1.1 and 33, with most in the range of approximately 1.4 to 3. The shapes of such distributions are apparent by comparison with selected instances shown in Figure 4.

Geology and mining

In the Earth's crust, the concentration of elements and their radioactivity usually follow a log-normal distribution. In geology, values of s* in 27 examples varied from 1.17 to 5.6 (Razumovsky 1940, Ahrens 1954, Malanca et al. 1996); nine other examples are given in Table 2. A closer look at extensive data from different reefs (Krige 1966) indicates that values of s* for gold and uranium increase in concert with the size of the region considered.

Human medicine

A variety of examples from medicine fit the log-normal distribution. Latent periods (time from infection to first symptoms) of infectious diseases have often been shown to be log-normally distributed (Sartwell 1950, 1952, 1966, Kondo 1977); approximately 70% of 86 examples reviewed by Kondo (1977) appear to be log-normal. Sartwell (1950, 1952, 1966) documents 37 cases fitting the log-normal distribution. A particularly impressive one is that of 5914 soldiers inoculated on the same day with the same batch of faulty vaccine, 1005 of whom developed serum hepatitis.

Interestingly, despite considerable differences in the median x* of latency periods of various diseases (ranging from 2.3 hours to several months; Table 2), the majority of s* values were close to 1.5. It might be worth trying to account for the similarities and dissimilarities in s*. For instance, the small s* value of 1.24 in the example of the Scottish soldiers may be due to limited variability within this rather homogeneous group of people. Survival time after diagnosis of four types of cancer is, compared with latent periods of infectious diseases, much more variable, with s* values between 2.5 and 3.2 (Boag 1949, Feinleib and McMahon 1960). It would be interesting to see whether x* and s* values have changed in accord with the changes in diagnosis and treatment of cancer in the last half century. The age of onset of Alzheimer's disease can be characterized with the geometric mean x* of 60 years and s* of 1.16 (Horner 1987).

Environment

The distribution of particles, chemicals, and organisms in the environment is often log-normal. For example, the amounts of rain falling from seeded and unseeded clouds differed significantly (Biondini 1976), and again s* values were similar (seeding itself accounts for the greater variation with seeded clouds). The parameters for the content of hydroxymethylfurfurol in honey (see Figure 1b) show that the distribution of the chemical in 1573 samples can be described adequately with just the two values. Ott (1978) presented data on the Pollutant Standard Index, a measure of air quality. Data were collected for eight US cities; the extremes of x* and s* were found in Los Angeles, Houston, and Seattle, allowing interesting comparisons.

Atmospheric sciences and aerobiology

Another component of air quality is its content of microorganisms, which was—not surprisingly—much higher and less variable in the air of Marseille than in that of an island (Di Giorgio et al. 1996). The atmosphere is a major part of life support systems, and many atmospheric physical and chemical properties follow a log-normal distribution law. Among other examples are the size distributions of aerosols and clouds and parameters of turbulent processes, which likewise are distributed log-normally (Limpert et al. 2000b).

Phytomedicine and microbiology

Examples from microbiology and phytomedicine include the distribution of sensitivity to fungicides in populations and distribution of population size. Romero and Sutton (1997) analyzed the sensitivity of the banana leaf spot fungus (Mycosphaerella fijiensis) to the fungicide propiconazole in samples from untreated and treated areas in Costa Rica. The differences in x* and s* among the areas can be explained by treatment history. The s* in untreated areas reflects mostly environmental conditions and stabilizing selection. The increase in s* after treatment reflects the widened spectrum of sensitivity, which results from the additional selection caused by use of the chemical.

Similar results were obtained for the barley mildew pathogen, Blumeria (Erysiphe) graminis f. sp. hordei, and the fungicide triadimenol (Limpert and Koller 1990) where, again, s* was higher in the treated region. Mildew in Spain, where triadimenol had not been used, represented the original sensitivity. In contrast, in England the pathogen was often treated and was consequently highly resistant, differing by a resistance factor of close to 450 (x* England / x* Spain). To obtain the same control of the resistant population, then, the concentration of the chemical would have to be increased by this factor.

The abundance of bacteria on plants varies among plant species, type of bacteria, and environment and has been found to be log-normally distributed (Hirano et al. 1982, Loper et al. 1984). In the case of bacterial populations on the leaves of corn (Zea mays), the median population size (x*) increased from July and August to October, but the relative variability, expressed by s*, remained nearly constant (Hirano et al. 1982). Interestingly, whereas s* for the total number of bacteria varied little (from 1.26 to 2.0), that for the subgroup of ice nucleation bacteria varied considerably (from 3.75 to 8.04).

Plant physiology

Recently, convincing evidence was presented from plant physiology indicating that the log-normal distribution fits well to permeability and to solute mobility in plant cuticles (Baur 1997). Across the combinations of species, plant parts, and chemical compounds studied, the median s* for water permeability of leaves was 1.18. The corresponding s* of isolated cuticles, 1.30, appears to be considerably higher, presumably because of the preparation of cuticles. Again, s* was considerably higher for mobility of the herbicides 2,4-dichlorophenoxyacetic acid (2,4-D) and WL110547 (1-(3-fluoromethylphenyl)-5-U-14C-phenoxy-1,2,3,4-tetrazole). One explanation for the differences in s* for water and for the other chemicals may be extrapolated from results from food technology, where, for transport through filters, s* is smaller for simple (e.g., spherical) particles than for more complex particles such as rods (E. J. Windhab [Eidgenössische Technische Hochschule, Zurich, Switzerland], personal communication, 2000).

Chemicals called accelerators can reduce the variability of mobility. For the combination of Citrus aurantium cuticles and 2,4-D, diethyladipate (accelerator 1) caused s* to fall from 1.38 to 1.17. For the same combination, tributylphosphate (accelerator 2) caused an even greater decrease, from 1.56 to 1.03. Statistical reasoning suggests that these data, with s* values of 1.17 and 1.03, are normally distributed (Baur 1997). However, because the underlying principles of permeability remain the same, we think these cases represent log-normal distributions. Thus, considering only statistical reasons may lead to misclassification, which may handicap further analysis. One question remains: What are the underlying principles of permeability that cause log-normal variability?

Ecology

In the majority of plant and animal communities, the abundance of species follows a (truncated) log-normal distribution (Sugihara 1980, Magurran 1988). Interestingly, the range of s* across birds, fish, moths, plants, and diatoms was very close to that found within any one of these groups. Based on the data and conclusions of Preston (1948), we determined the most typical value of s* to be 11.6.

Food technology

Various applications of the log-normal distribution are related to the characterization of structures in food technology and food process engineering. Such disperse structures may be the size and frequency of particles, droplets, and bubbles that are generated in dispersing processes, or they may be the pores in filtering membranes. The latter are typically formed by particles that are also log-normally distributed in diameter. Such particles can also be generated in dry or wet milling processes, in which log-normal distribution is a powerful approximation. The examples of ice cream and mayonnaise given in Table 2 also point to the presence of log-normal distributions in everyday life.

Linguistics

In linguistics, the number of letters per word and the number of words per sentence fit the log-normal distribution. In English telephone conversations, the variability s* of the length of all words used—as well as of different words—was similar (Herdan 1958). Likewise, the number of words per sentence varied little between writers (Williams 1940).

Social sciences and economics

Examples of log-normal distributions in the social sciences and economics include age of marriage, farm size, and income. The age of first marriage in Western civilization follows a three-parameter log-normal distribution; the third parameter corresponds to age at puberty (Preston 1981). For farm size in England and Wales, both x* and s* increased over 50 years, the former by 38.6% (Allanson 1992). For income distributions, x* and s* may facilitate comparisons among societies and generations (Aitchison and Brown 1957, Limpert et al. 2000a).

Typical s* values

One question arises from the comparison of log-normal distributions across the sciences: To what extent are s* values typical for a certain attribute? In some cases, values of s* appear to be fairly restricted, as is the case for the range of s* for latent periods of diseases—a fact that Sartwell recognized (1950, 1952, 1966) and Lawrence reemphasized (1988a). Describing patterns of typical skewness at the established log level, Lawrence (1988a, 1988b) can be regarded as the direct predecessor of our comparison of s* values across the sciences. Aitchison and Brown (1957), using graphical methods such as quantile-quantile plots and Lorenz curves, demonstrated that log-normal distributions describing, for example, national income across countries, or income for groups of occupations within a country, show typical shapes.

A restricted range of variation for a specific research question makes sense. For infectious diseases of humans, for example, the infection processes of the pathogens are similar, as is the genetic variability of the human population. The same appears to hold for survival time after diagnosis of cancer, although the value of s* is higher; this can be attributed to the additional variation caused by cancer recognition and treatment. Other examples with typical ranges of s* come from linguistics. For bacteria on plant surfaces, the range of variation of the total bacterial count is smaller than that of a subgroup of bacteria, which can be expected because of competition. Thus, the ranges of variation in these cases appear to be typical and meaningful, a fact that might well stimulate future research.

Future challenges

A number of scientific areas—and everyday life—will present opportunities for new comparisons and for more far-reaching analyses of s* values for the applications considered to date. Moreover, our concept can also be extended to descriptions based on sigmoid curves, such as dose-response relationships.

Further comparisons of s* values

Permeability and mobility are important not only for plant physiology (Baur 1997) but also for many other fields such as soil sciences and human medicine, as well as for some industrial processes. With the help of x* and s*, the mobility of different chemicals through a variety of natural membranes could easily be assessed, allowing for comparisons of the membranes with one another as well as with those for filter actions of soils or with technical membranes and filters. Such comparisons will undoubtedly yield valuable insights.

Farther-reaching analyses

An adequate description of variability is a prerequisite for studying its patterns and estimating variance components. One component that deserves more attention is the variability arising from unknown reasons and chance, commonly called error variation, or in this case, s*E. Such variability can be estimated if other conditions accounting for variability—the environment and genetics, for example—are kept constant. The field of population genetics and fungicide sensitivity, as well as that of permeability and mobility, can demonstrate the benefits of analyses of variance.

An important parameter of population genetics is migration. Migration among regions leads to population mixing, thus widening the spectrum of fungicide sensitivity encountered in any one of the regions mentioned in the discussion of phytomedicine and microbiology. Population mixing among regions will increase s* but decrease the difference in x*. Migration of spores appears to be greater, for instance, in regions of Costa Rica than in those of Europe (Limpert et al. 1996, Romero and Sutton 1997, Limpert 1999).

Another important aim of pesticide research is to estimate resistance risk. As a first approximation, resistance risk is assumed to correlate with s*. However, s* depends on genetic and other causes of variability. Thus, determining its genetic part, s*G, is a task worth undertaking, because genetic variation has a major impact on future evolution (Limpert 1999). Several aspects of various branches of science are expected to benefit from improved identification of components of s*. In the study of plant physiology and permeability noted above (Baur 1997), for example, determining the effects of accelerators and their share of variability would be illuminating.

Sigmoid curves based on log-normal distributions

Dose-response relations are essential for understanding the control of pests and pathogens (Horsfall 1956). Equally important are dose-response curves that demonstrate the effects of other chemicals, such as hormones or minerals. Typically, such curves are sigmoid and show the cumulative action of the chemical. If plotted against the logarithm of the chemical dose, the sigmoid is symmetrical and corresponds to the cumulative curve of the log-normal distribution at logarithmic scale (Figure 3b). The steepness of the sigmoid curve is inversely proportional to s*, and the geometric mean value x* equals the “ED50,” the chemical dose creating 50% of the maximal effect. Considering the general importance of chemical sensitivity, a wide field of further applications opens up in which progress can be expected and in which researchers may find the proposed characterization x* ×/ s* advantageous.
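As an illustration of this correspondence, the sketch below (SciPy; ED50, s*, and the maximal effect are invented values) evaluates a dose-response curve defined as the cumulative log-normal: the response is 50% at the dose x* = ED50 and about 16% and 84% at ED50 divided or multiplied by s*.

```python
# Sketch: dose-response modelled as the cumulative log-normal distribution.
import numpy as np
from scipy import stats

ed50, s_star, e_max = 1.0, 2.0, 100.0     # x* = ED50; s* controls the steepness

def response(dose):
    """Percent of maximal effect at a given dose (cumulative log-normal)."""
    return e_max * stats.norm.cdf((np.log(dose) - np.log(ed50)) / np.log(s_star))

for dose in (0.25, 0.5, 1.0, 2.0, 4.0):
    print("dose %5.2f -> %5.1f%% of maximal effect" % (dose, response(dose)))
```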

Normal or log-normal?

Considering the patterns of normal and log-normal distributions further, as well as the connections and distinctions between them (Tables 1, 3), is useful for describing and explaining phenomena relating to frequency distributions in life. Some important aspects are discussed below.

The range of log-normal variability

How far can s* values extend beyond the range described, from 1.1 to 33? Toward the high end of the scale of possible s* values, we found one s* larger than 150, for the hail energy of clouds (Federer et al. 1986, calculations by W. A. S.). Values below 1.2 may even be common, and therefore of great interest in science. However, such log-normal distributions are difficult to distinguish from normal ones—see Figures 1 and 3—and thus until now have usually been taken to be normal.

Because of the general preference for the normal distribution, we were asked to find examples of data that followed a normal distribution but did not match a log-normal distribution. Interestingly, original measurements did not yield any such examples. As noted earlier, even the classic example of the height of women (Figure 1a; Snedecor and Cochran 1989) fits both distributions equally well. The distribution can be characterized with 62.54 inches ± 2.38 and 62.48 inches ×/ 1.039, respectively. The examples that we found of normally—but not log-normally—distributed data consisted of differences, sums, means, or other functions of original measurements. These findings raise questions about the role of symmetry in quantitative variation in nature.
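This near-equivalence for small coefficients of variation can be checked directly from the published summary. The sketch below (NumPy; input is only the 62.54 ± 2.38 inches quoted above) converts the arithmetic summary with the moment formulas from the box and lands very close to the reported 62.48 inches ×/ 1.039.

```python
# Sketch: for a small coefficient of variation, x +/- s and x* x/ s* nearly coincide.
import numpy as np

x_bar, s = 62.54, 2.38                      # heights of women, inches
omega = 1.0 + (s / x_bar) ** 2              # omega = 1 + cv^2
x_star = x_bar / np.sqrt(omega)
s_star = np.exp(np.sqrt(np.log(omega)))
print("x* ~ %.2f inches,  s* ~ %.3f" % (x_star, s_star))   # approximately 62.49 x/ 1.039
```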

Why the normal distribution is so popular

Regardless of statistical considerations, there are a number of reasons why the normal distribution is much better known than the log-normal. A major one appears to be symmetry, one of the basic principles realized in nature as well as in our culture and thinking. Thus, probability distributions based on symmetry may have more inherent appeal than skewed ones. Two other reasons relate to simplicity. First, as Aitchison and Brown (1957, p. 2) stated, “Man has found addition an easier operation than multiplication, and so it is not surprising that an additive law of errors was the first to be formulated.” Second, the established, concise description of a normal sample—x ± s—is handy, well-known, and sufficient to represent the underlying distribution, which made it easier, until now, to handle normal distributions than to work with log-normal distributions. Another reason relates to the history of the distributions: The normal distribution has been known and applied more than twice as long as its log-normal sister distribution. Finally, the very notion of “normal” conjures more positive associations for nonstatisticians than does “log-normal.” For all of these reasons, the normal or Gaussian distribution is far more familiar than the log-normal distribution is to most people.

This preference leads to two practical ways to make data look normal even if they are skewed. First, skewed distributions produce large values that may appear to be outliers. It is common practice to reject such observations and conduct the analysis without them, thereby reducing the skewness but introducing bias. Second, skewed data are often grouped together, and their means—which are more normally distributed—are used for further analyses. Of course, following that procedure means that important features of the data may remain undiscovered.

Why the log-normal distribution is usually the better model for original data

As discussed above, the connection between additive effects and the normal distribution parallels that of multiplicative effects and the log-normal distribution. Kapteyn (1903) noted long ago that if data from one-dimensional measurements in nature fit the normal distribution, two- and three-dimensional results such as surfaces and volumes cannot be symmetric. A number of effects that point to the log-normal distribution as an appropriate model have been described in various papers (e.g., Aitchison and Brown 1957, Koch 1966, 1969, Crow and Shimizu 1988). Interestingly, even in biological systematics, which is the science of classification, the number of, say, species per family was expected to fit log-normality (Koch 1966).

The most basic indicator of the importance of the log-normal distribution may be even more general, however. Clearly, chemistry and physics are fundamental in life, and the prevailing operation in the laws of these disciplines is multiplication. In chemistry, for instance, the velocity of a simple reaction depends on the product of the concentrations of the molecules involved. Equilibrium conditions likewise are governed by factors that act in a multiplicative way. From this, a major contrast becomes obvious: The reasons governing frequency distributions in nature usually favor the log-normal, whereas people are in favor of the normal.

For small coefficients of variation, normal and log-normal distributions both fit well. In such cases, it is natural to choose the distribution found appropriate for related cases that exhibit greater variability, the one that reflects the law governing the causes of the variability. This will most often be the log-normal.

Conclusion

This article shows, in a nutshell, the fundamental role of the log-normal distribution and provides insights for gaining a deeper comprehension of that role. Compared with established methods for describing log-normal distributions (Table 3), the proposed characterization by x* and s* offers several advantages, some of which have been described before (Sartwell 1950, Ahrens 1954, Limpert 1993). Both x* and s* describe the data directly at their original scale, they are easy to calculate and imagine, and they even allow mental calculation and estimation. The proposed characterization does not appear to have any major disadvantage.

On the first page of their book, Aitchison and Brown (1957) stated that, compared with its sister distributions, the normal and the binomial, the log-normal distribution “has remained the Cinderella of distributions, the interest of writers in the learned journals being curiously sporadic and that of the authors of statistical textbooks but faintly aroused.” This is indeed true: Despite abundant, increasing evidence that log-normal distributions are widespread in the physical, biological, and social sciences, and in economics, log-normal knowledge has remained dispersed. The question now is this: Can we begin to bring the wealth of knowledge we have on normal and log-normal distributions to the public? We feel that doing so would lead to a general preference for the log-normal, or multiplicative normal, distribution over the Gaussian distribution when describing original data.

Acknowledgments

This work was supported by a grant from the Swiss Federal Institute of Technology (Zurich) and by COST (Coordination of Science and Technology in Europe) at Brussels and Bern. Patrick Flütsch, Swiss Federal Institute of Technology, constructed the physical models. We are grateful to Roy Snaydon, professor emeritus at the University of Reading, United Kingdom, to Rebecca Chasan, and to four anonymous reviewers for valuable comments on the manuscript. We thank Donna Verdier and Herman Marshall for getting the paper into good shape for publication, and E. L. also thanks Gerhard Wenzel, professor of agronomy and plant breeding at Technical University, Munich, for helpful discussions.

Because of his fundamental and comprehensive contribution to the understanding of skewed distributions close to 100 years ago, our paper is dedicated to the Dutch astronomer Jacobus Cornelius Kapteyn (van der Heijden 2000).

References cited

1. Ahrens LH. 1954. The log-normal distribution of the elements (A fundamental law of geochemistry and its subsidiary). Geochimica et Cosmochimica Acta 5: 49–73.
2. Aitchison J, Brown JAC. 1957. The Log-normal Distribution. Cambridge (UK): Cambridge University Press.
3. Allanson P. 1992. Farm size structure in England and Wales, 1939–89. Journal of Agricultural Economics 43: 137–148.
4. Baur P. 1997. Log-normal distribution of water permeability and organic solute mobility in plant cuticles. Plant, Cell and Environment 20: 167–177.
5. Biondini R. 1976. Cloud motion and rainfall statistics. Journal of Applied Meteorology 15: 205–224.
6. Boag JW. 1949. Maximum likelihood estimates of the proportion of patients cured by cancer therapy. Journal of the Royal Statistical Society B 11: 15–53.
7. Crow EL, Shimizu K, eds. 1988. Log-normal Distributions: Theory and Application. New York: Dekker.
8. Di Giorgio C, Krempff A, Guiraud H, Binder P, Tiret C, Dumenil G. 1996. Atmospheric pollution by airborne microorganisms in the City of Marseilles. Atmospheric Environment 30: 155–160.
9. Factor VM, Laskowska D, Jensen MR, Woitach JT, Popescu NC, Thorgeirsson SS. 2000. Vitamin E reduces chromosomal damage and inhibits hepatic tumor formation in a transgenic mouse model. Proceedings of the National Academy of Sciences 97: 2196–2201.
10. Fagerström T, Jagers P, Schuster P, Szathmary E. 1996. Biologists put on mathematical glasses. Science 274: 2039–2041.
11. Fechner GT. 1860. Elemente der Psychophysik. Leipzig (Germany): Breitkopf und Härtel.
12. Fechner GT. 1897. Kollektivmasslehre. Leipzig (Germany): Engelmann.
13. Federer B. 1986. Main results of Grossversuch IV. Journal of Climate and Applied Meteorology 25: 917–957.
14. Feinleib M, McMahon B. 1960. Variation in the duration of survival of patients with the chronic leukemias. Blood 15–16: 332–349.
15. Gaddum JH. 1945. Log normal distributions. Nature 156: 463, 747.
16. Galton F. 1879. The geometric mean, in vital and social statistics. Proceedings of the Royal Society 29: 365–367.
17. Galton F. 1889. Natural Inheritance. London: Macmillan.
18. Gibrat R. 1931. Les Inégalités Economiques. Paris: Recueil Sirey.
19. Groth BHA. 1914. The golden mean in the inheritance of size. Science 39: 581–584.
20. Gut C, Limpert E, Hinterberger H. 2000. A computer simulation on the web to visualize the genesis of normal and log-normal distributions. http://stat.ethz.ch/vis/log-normal
21. Herdan G. 1958. The relation between the dictionary distribution and the occurrence distribution of word length and its importance for the study of quantitative linguistics. Biometrika 45: 222–228.
22. Hirano SS, Nordheim EV, Arny DC, Upper CD. 1982. Log-normal distribution of epiphytic bacterial populations on leaf surfaces. Applied and Environmental Microbiology 44: 695–700.
23. Horner RD. 1987. Age at onset of Alzheimer's disease: Clue to the relative importance of etiologic factors? American Journal of Epidemiology 126: 409–414.
24. Horsfall JG. 1956. Principle of fungicidal actions. Chronica Botanica 30.
25. Johnson NL, Kotz S, Balkrishan N. 1994. Continuous Univariate Distributions. New York: Wiley.
26. Kapteyn JC. 1903. Skew Frequency Curves in Biology and Statistics. Astronomical Laboratory, Groningen (The Netherlands): Noordhoff.
27. Keesing F. 2000. Cryptic consumers and the ecology of an African Savanna. BioScience 50: 205–215.
28. Koch AL. 1966. The logarithm in biology. I. Mechanisms generating the log-normal distribution exactly. Journal of Theoretical Biology 23: 276–290.
29. Koch AL. 1969. The logarithm in biology. II. Distributions simulating the log-normal. Journal of Theoretical Biology 23: 251–268.
30. Kondo K. 1977. The log-normal distribution of the incubation time of exogenous diseases. Japanese Journal of Human Genetics 21: 217–237.
31. Krige DG. 1966. A study of gold and uranium distribution patterns in the Klerksdorp Gold Field. Geoexploration 4: 43–53.
32. Lawrence RJ. 1988a. The log-normal as event–time distribution. Pages 211–228 in Crow EL, Shimizu K, eds. Log-normal Distributions: Theory and Application. New York: Dekker.
33. Lawrence RJ. 1988b. Applications in economics and business. Pages 229–266 in Crow EL, Shimizu K, eds. Log-normal Distributions: Theory and Application. New York: Dekker.
34. Lee ET. 1992. Statistical Methods for Survival Data Analysis. New York: Wiley.
35. Le Naour F, Rubinstein E, Jasmin C, Prenant M, Boucheix C. 2000. Severely reduced female fertility in CD9-deficient mice. Science 287: 319–321.
36. Limpert E. 1993. Log-normal distributions in phytomedicine: A handy way for their characterization and application. Proceedings of the 6th International Congress of Plant Pathology; 28 July–6 August 1993; Montreal, National Research Council Canada.
37. Limpert E. 1999. Fungicide sensitivity: Towards improved understanding of genetic variability. Pages 188–193 in Modern Fungicides and Antifungal Compounds II. Andover (UK): Intercept.
38. Limpert E, Koller B. 1990. Sensitivity of the Barley Mildew Pathogen to Triadimenol in Selected European Areas. Zurich (Switzerland): Institute of Plant Sciences.
39. Limpert E, Finckh MR, Wolfe MS, eds. 1996. Integrated Control of Cereal Mildews and Rusts: Towards Coordination of Research Across Europe. Brussels (Belgium): European Commission. EUR 16884 EN.
40. Limpert E, Fuchs JG, Stahel WA. 2000a. Life is log normal: On the charms of statistics for society. Pages 518–522 in Häberli R, Scholz RW, Bill A, Welti M, eds. Transdisciplinarity: Joint Problem-Solving among Science, Technology and Society. Zurich (Switzerland): Haffmans.
41. Limpert E, Abbt M, Asper R, Graber WK, Godet F, Stahel WA, Windhab EJ. 2000b. Life is log normal: Keys and clues to understand patterns of multiplicative interactions from the disciplinary to the transdisciplinary level. Pages 20–24 in Häberli R, Scholz RW, Bill A, Welti M, eds. Transdisciplinarity: Joint Problem-Solving among Science, Technology and Society. Zurich (Switzerland): Haffmans.
42. Loper JE, Suslow TV, Schroth MN. 1984. Log-normal distribution of bacterial populations in the rhizosphere. Phytopathology 74: 1454–1460.
43. Magurran AE. 1988. Ecological Diversity and its Measurement. London: Croom Helm.
44. Malanca A, Gaidolfi L, Pessina V, Dallara G. 1996. Distribution of 226-Ra, 232-Th, and 40-K in soils of Rio Grande do Norte (Brazil). Journal of Environmental Radioactivity 30: 55–67.
45. May RM. 1981. Patterns in multi-species communities. Pages 197–227 in May RM, ed. Theoretical Ecology: Principles and Applications. Oxford: Blackwell.
46. McAlister D. 1879. The law of the geometric mean. Proceedings of the Royal Society 29: 367–376.
47. Ott WR. 1978. Environmental Indices. Ann Arbor (MI): Ann Arbor Science.
48. Powers L. 1936. The nature of the interaction of genes affecting four quantitative characters in a cross between Hordeum deficiens and H. vulgare. Genetics 21: 398–420.
49. Preston FW. 1948. The commonness and rarity of species. Ecology 29: 254–283.
50. Preston FW. 1962. The canonical distribution of commonness and rarity. Ecology 43: 185–215, 410–432.
51. Preston FW. 1981. Pseudo-log-normal distributions. Ecology 62: 355–364.
52. Razumovsky NK. 1940. Distribution of metal values in ore deposits. Comptes Rendus (Doklady) de l'Académie des Sciences de l'URSS 9: 814–816.
53. Renner E. 1970. Mathematisch-statistische Methoden in der praktischen Anwendung. Hamburg (Germany): Parey.
54. Rhew CR, Miller RB, Weiss RF. 2000. Natural methyl bromide and methyl chloride emissions from coastal salt marshes. Nature 403: 292–295.
55. Romero RA, Sutton TB. 1997. Sensitivity of Mycosphaerella fijiensis, causal agent of black sigatoka of banana, to propiconazole. Phytopathology 87: 96–100.
56. Sachs L. 1997. Angewandte Statistik. Anwendung statistischer Methoden. Heidelberg (Germany): Springer.
57. Sartwell PE. 1950. The distribution of incubation periods of infectious disease. American Journal of Hygiene 51: 310–318.
58. Sartwell PE. 1952. The incubation period of poliomyelitis. American Journal of Public Health and the Nation's Health 42: 1403–1408.
59. Sartwell PE. 1966. The incubation period and the dynamics of infectious disease. American Journal of Epidemiology 83: 204–216.
60. Sinnot EW. 1937. The relation of gene to character in quantitative inheritance. Proceedings of the National Academy of Sciences 23: 224–227.
61. Snedecor GW, Cochran WG. 1989. Statistical Methods. Ames (IA): Iowa University Press.
62. Statistisches Jahrbuch der Schweiz. 1997. Zürich (Switzerland): Verlag Neue Zürcher Zeitung.
63. Stehmann C, De Waard MA. 1996. Sensitivity of populations of Botrytis cinerea to triazoles, benomyl and vinclozolin. European Journal of Plant Pathology 102: 171–180.
64. Sugihara G. 1980. Minimal community structure: An explanation of species abundance patterns. American Naturalist 116: 770–786.
65. Swoboda H. 1974. Knaurs Buch der Modernen Statistik. München (Germany): Droemer Knaur.
66. van der Heijden P. 2000. Jacob Cornelius Kapteyn (1851–1922): A Short Biography. (16 May 2001; www.strw.leidenuniv.nl/~heijden/kapteynbio.html)
67. Weber H. 1834. De pulsa resorptione auditu et tactu. Annotationes anatomicae et physiologicae. Leipzig (Germany): Koehler.
68. Williams CB. 1940. A note on the statistical analysis of sentence length as a criterion of literary style. Biometrika 31: 356–361.

Appendices

Definition and properties of the log-normal distribution

A random variable X is log-normally distributed if log(X) has a normal distribution. Usually, natural logarithms are used, but other bases would lead to the same family of distributions, with rescaled parameters. The probability density function of such a random variable has the form  
$$f(x) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma x}\,\exp\!\left(-\frac{(\log x-\mu)^{2}}{2\sigma^{2}}\right), \qquad x>0.$$

A shift parameter can be included to define a three-parameter family. This may be adequate if the data cannot be smaller than a certain bound different from zero (cf. Aitchison and Brown 1957, page 14). The mean and variance are exp(µ + σ²/2) and (exp(σ²) − 1)·exp(2µ + σ²), respectively, and therefore the coefficient of variation is

$$\mathrm{cv} \;=\; \sqrt{\exp(\sigma^{2})-1}, \tag{2}$$

which is a function of σ only. The product of two independent log-normally distributed random variables has the shape parameter

$$\sigma^{*} \;=\; \exp\!\left(\sqrt{(\log\sigma_{1}^{*})^{2}+(\log\sigma_{2}^{*})^{2}}\right), \tag{3}$$

since the variances of the log-transformed variables add.

Estimation: The asymptotically most efficient (maximum likelihood) estimators are

$$x^{*} \;=\; \exp\!\left(\frac{1}{n}\sum_{i=1}^{n}\log x_{i}\right), \qquad s^{*} \;=\; \exp\!\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\bigl(\log x_{i}-\log x^{*}\bigr)^{2}}\right), \tag{4}$$

where x* is the geometric mean of the data. The lower and upper quartiles q₁ and q₂ lead to a more robust estimate (q₂/q₁)^c for s*, where 1/c = 1.349 = 2·Φ⁻¹(0.75), Φ⁻¹ denoting the inverse standard normal distribution function. If the mean x and the standard deviation s of a sample are available, i.e., the data are summarized in the form x ± s, the parameters µ* and s* can be estimated from them by using x/√ω and exp(√(log ω)), respectively, with ω = 1 + (s/x)² = 1 + cv², where cv is the coefficient of variation. Thus, this estimate of s* is determined only by the cv (eq. 2).
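As a complement to the formulas above, a brief sketch (NumPy/SciPy; data simulated, true parameters arbitrary) implements the robust quartile-based estimate of s* alongside the median as a robust estimate of µ*.

```python
# Sketch of the robust estimators: median for mu*, (q2/q1)^c for s*, with
# 1/c = 1.349 = 2 * Phi^{-1}(0.75).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_s_star = 2.0
data = rng.lognormal(mean=np.log(50.0), sigma=np.log(true_s_star), size=2000)

q1, q2 = np.percentile(data, [25, 75])     # lower and upper quartiles
c = 1.0 / (2.0 * stats.norm.ppf(0.75))     # = 1 / 1.349
s_star_robust = (q2 / q1) ** c
mu_star_robust = np.median(data)

print("robust estimates: mu* ~ %.1f, s* ~ %.2f (true s* = %.1f)"
      % (mu_star_robust, s_star_robust, true_s_star))
```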
Table 1. A bridge between normal and log-normal distributions

Table 2. Comparing log-normal distributions across the sciences in terms of the original data. x* is an estimator of the median of the distribution, usually the geometric mean of the observed data, and s* estimates the multiplicative standard deviation, the shape parameter of the distribution; 68% of the data are within the range x* ×/ s*, and 95% within x* ×/ (s*)². In general, values of s* and some of x* were obtained by transformation from the parameters given in the literature (cf. Table 3). The goodness of fit was tested either by the original authors or by us.


Table 3. Established methods for describing log-normal distributions

Figure 1. Examples of normal and log-normal distributions. While the distribution of the heights of 1052 women (a, in inches; Snedecor and Cochran 1989) fits the normal distribution, with a goodness-of-fit p value of 0.75, that of the content of hydroxymethylfurfurol (HMF, mg·kg⁻¹) in 1573 honey samples (b; Renner 1970) fits the log-normal (p = 0.41) but not the normal (p = 0.0000). Interestingly, the distribution of the heights of women fits the log-normal distribution equally well (p = 0.74).

Figure 2. Physical models demonstrating the genesis of normal and log-normal distributions. Particles fall from a funnel onto tips of triangles, where they are deviated to the left or to the right with equal probability (0.5) and finally fall into receptacles. The medians of the distributions remain below the entry points of the particles. If the tip of a triangle is at distance x from the left edge of the board, triangle tips to the right and to the left below it are placed at x + c and x − c for the normal distribution (panel a), and x · ć and x / ć for the log-normal (panel b, patent pending), c and ć being constants. The distributions are generated by many small random effects (according to the central limit theorem) that are additive for the normal distribution and multiplicative for the log-normal. We therefore suggest the alternative name multiplicative normal distribution for the latter.

Figure 3. A log-normal distribution with original scale (a) and with logarithmic scale (b). Areas under the curve, from the median to both sides, correspond to one and two standard deviation ranges of the normal distribution.

Figure 4. Density functions of selected log-normal distributions compared with a normal distribution. Log-normal distributions Λ(µ*, σ*), shown for five values of the multiplicative standard deviation σ*, are compared with the normal distribution (100 ± 20, shaded). The σ* values cover most of the range evident in Table 2. While the median µ* is the same for all densities, the modes approach zero with increasing shape parameter σ*. A change in µ* affects the scaling in horizontal and vertical directions, but the essential shape σ* remains the same.

1. Species abundance may be described by a log-normal law (Preston 1948), usually written in the form S(R) = S₀ · exp(−a²R²), where S₀ is the number of species at the mode of the distribution. The shape parameter a amounts to approximately 0.2 for all species, which corresponds to s* = 11.6.
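The step from a ≈ 0.2 to s* = 11.6 can be made explicit. Assuming, as in Preston (1948), that R counts octaves (log₂ abundance classes), S(R) = S₀ · exp(−a²R²) is a normal curve in R with standard deviation 1/(a√2); back-transforming to the abundance scale then gives s* = 2^(1/(a√2)). The short sketch below (NumPy; our own illustration) evaluates this.

```python
# Sketch: convert Preston's shape parameter a into the multiplicative standard
# deviation s*, assuming R is measured in octaves (base-2 logarithmic classes).
import numpy as np

a = 0.2
sigma_octaves = 1.0 / (a * np.sqrt(2.0))   # std deviation of the normal curve in R
s_star = 2.0 ** sigma_octaves              # back-transform from log2 to the original scale
print("a = %.1f  ->  sigma = %.2f octaves  ->  s* = %.1f" % (a, sigma_octaves, s_star))
```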

Author notes

1. Eckhard Limpert (email: Eckhard.Limpert@ipw.agrl.ethz.ch) is a biologist and senior scientist in the Phytopathology Group of the Institute of Plant Sciences in Zurich, Switzerland.
2. Werner A. Stahel (email: stahel@stat.math.ethz.ch) is a mathematician and head of the Consulting Service at the Statistics Group, Swiss Federal Institute of Technology (ETH), CH-8092 Zürich, Switzerland.
3. Markus Abbt is a mathematician and consultant at FJA Feilmeier & Junker AG, CH-8008 Zürich, Switzerland.