A robust method for fitting degree distributions of complex networks

This work introduces a method for fitting the degree distributions of complex network datasets, in which the most appropriate distribution from a set of candidate distributions is chosen while maximizing the portion of the data to which the model is fit. Current methods for fitting degree distributions in the literature are inconsistent and often assume a priori what distribution the data are drawn from. Much focus is given to fitting the tail of the distribution, while a large portion of the distribution below the tail is ignored. It is important to account for these low-degree nodes, as they play crucial roles in processes such as percolation. Here we address these issues, using maximum likelihood estimators to fit to the entire dataset, or close to it. This methodology is applicable to any network dataset (or discrete empirical dataset), and we test it on over 25 network datasets from a wide range of sources, achieving good fits in all but a few cases. We also demonstrate that numerical maximization of the likelihood performs better than commonly used analytical approximations. In addition, we have made available a Python package which can be used to apply this methodology.


I. INTRODUCTION
Although complex networks have been studied for around a quarter of a century, there is still a lack of consistency in determining one of their core properties, namely the degree distribution. One of the early observations about complex networks was that their degree distributions differed from those of the random graph models which had been studied for many years before. Specifically, it was observed that complex networks had right-skewed distributions, and Barabási and Albert claim that these right tails often follow power-law distributions [1]. Their work appears to show that different types of networks have different power-law exponents, and that these exponents are often in the range 2 ≤ γ ≤ 3.
However, Amaral et al. [2] argue that power-law distributions are not so common and suggest that for social networks other distributions are sometimes more appropriate. In addition, they define three classes of small-world networks: power-law or 'scale-free' networks, 'broad-scale' networks (power-law distributions with sharp cut-offs), and 'single-scale' networks, which have distributions that decay quickly, such as exponential or Gaussian distributions. The degree distribution of a network determines which of these classes it belongs to.
More recently, Broido & Clauset [3] argue that scale-free networks are rare. They introduce a strict procedure for identifying power-laws, as well as several weaker regimes. This method, however, tends to focus on the tail of the distribution: even the strongest criterion they outline requires only that a power-law be a good fit to the 50 highest-degree nodes, and the method requires comparison to many simulated graphs generated under strict conditions. It is more concerned with whether evidence can be found to support or rule out a power-law distribution than with robustly identifying the most appropriate distribution.
Knowing the degree distribution of a graph is important, as it allows us to model and estimate other properties of the network structure. Similarly, it allows us to build general models for the formation of complex networks, such as the preferential-attachment model [1]. The preferential-attachment model is still one of the most popular models of complex network formation, though it requires the degree distribution to follow a power law, or at least to deviate only slightly from one.
The question of what distribution a given network's degree distribution follows remains a heavily researched topic (see [3,4]). In addition, some controversy exists around the prevalence of power-law distributions in empirical networks [3,5,6].
Despite this, current methods for fitting degree distributions are inconsistent. The method of least squares is still often used to determine the parameter of a distribution, despite work highlighting the issues with this approach [7]. The method of least squares has several drawbacks. For example, in the case of a power-law distribution, one would plot the distribution on a log-log scale and use least squares to determine the slope of the relatively straight line through the data. However, this assumes a priori that the data follow a power-law, and data appearing as a relatively straight line on a log-log scale is not a sufficient condition to rule out other possibilities. Additionally, the approach is not applicable to data that are less easy to linearize.
A better method is described in the work of Clauset et al. [8], which uses the method of maximum likelihood to estimate the parameters of the distribution. This method is intended for general empirical datasets. A problem with using it to fit degree distributions specifically is that it focuses on fitting the tail of the distribution, and in doing so can ignore huge amounts of the available data, as we will discuss in Section III. Nodes with a low degree are important for percolation processes [9] and community identity in social networks [10], and choosing a high degree cut-off will miss them.
In this paper we build on the method of Clauset et al. [8] in order to identify the most appropriate distribution for the degree distribution of a network from a set of candidate distributions, and we apply this methodology to a variety of network datasets. We do this while taking minimal approximations (for example, using a minimum value of k_min), which are discussed in more detail throughout this paper. The methods are applicable to any network dataset and do not rely on choosing quantities such as k_min, or the distribution itself, by eye.
The rest of this paper proceeds as follows. In the next section we cover important background information concerning some basics of network science. In Section III we cover power-law distributions, methods of estimating power-law exponents, and maximum likelihood estimators. Section IV compares the estimates for the power-law exponent covered in the previous section. In Section V we introduce the distributions which we use in our methodology, before discussing model selection and goodness of fit in Section VI. Section VII describes in detail exactly how our methodology works. An in-depth analysis of the results we obtain, and discussion around them, are found in Sections VIII and IX.
A Python package for the implementation of these methods is included in Appendix A.

II. BACKGROUND
A network is defined as an ordered set of N nodes and M edges between them. In this work we are concerned with unweighted and undirected networks; however, the methods described can be easily applied to weighted degree distributions and directed networks. Nodes and edges can represent many things in different networks. For example, nodes may represent people in a social network and the edges represent friendship ties between them.
The degree of a node is the number of edges connected to it. The distribution of the degrees in a network gives some important characteristics related to its structure. The degree distribution p_k is the fraction of vertices in a network with degree k,

p_k = (1/N) Σ_{i=1}^{N} δ_{k_i, k},    (1)

where k_i is the degree of node i and δ_{k_i, k} is the Kronecker delta, which is one if k_i = k and zero otherwise. We assume there are no nodes of degree zero in the network. We use p_k to denote probability mass functions (for discrete distributions) and p(k) to denote probability density functions (for continuous distributions).
For complex networks, degree distributions are often found to have a positive or right skew [11], also referred to as being heavy-tailed. These heavy tails are frequently due to a small number of vertices having very large degrees. Such highly connected nodes are often referred to as hubs. In contrast, Erdős-Rényi graphs tend to lack hubs due to having a small standard deviation in their degrees when compared to complex networks.
In Figure 1 we display the degree distribution for protein interactions in yeast, with data taken from [12]. The distribution has a noisy tail due to many instances of only one vertex with a specific (large) degree. To reduce the noise in the tail we often consider instead the complementary cumulative distribution function (CCDF) for the dataset [11]. This is the probability that a node has a degree greater than or equal to k and is denoted P_k:

P_k = Σ_{k'=k}^{∞} p_{k'}.    (2)
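In practice, the empirical CCDF can be computed directly from a degree sequence. The sketch below is a minimal Python illustration (the function name is our own, not from the paper's package):

```python
import numpy as np

def empirical_ccdf(degrees):
    """Return (k, P_k), where P_k is the fraction of nodes
    whose degree is greater than or equal to k."""
    degrees = np.asarray(degrees)
    k = np.sort(np.unique(degrees))
    # For each distinct degree, count the fraction of nodes with degree >= k
    P = np.array([(degrees >= ki).mean() for ki in k])
    return k, P
```

Plotting P against k on log-log axes gives the smoother tail discussed above.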

III. POWER-LAW DISTRIBUTIONS
One of the most common and thoroughly studied distributions observed in complex networks is the power-law distribution [13,14]. A power-law distribution has the form

p_k ∝ k^{−γ},    (3)

where the exponent of the power-law γ ≥ 1. This is the most basic definition of a power-law, though the criteria for what is classified as a power-law are a much debated topic in network science (see [3] for a summary of different definitions). As discussed in Section I, this behaviour is usually found only in the tail of the distribution, beginning at some value k_min. Taking the logarithm of Equation (3),

ln p_k = −γ ln k + const,    (4)

we see that the relationship between the log of p_k and the log of k is linear. Hence, when a power-law is displayed on a log-log scale, it will appear linear. Therefore, one way in which we can obtain an estimate for the exponent γ is by performing a least squares regression on Equation (4). In this section we look at this and other methods of estimating power-law exponents. First, we must normalize Equation (3). We start with

1 = Σ_{k=1}^{k_min−1} ρ_k + Σ_{k=k_min}^{∞} p_k,    (5)

where ρ_k is the distribution below k_min. Our equation therefore becomes

p_k = (1 − Δ) k^{−γ} / ζ(γ, k_min),  k ≥ k_min,    (6)

where Δ ≡ Σ_{k=1}^{k_min−1} ρ_k. Here, ζ(γ, k_min) is the Hurwitz zeta function. In the case where the entire distribution follows a power-law and k_min = 1, the denominator is just the Riemann zeta function ζ(γ). The complementary cumulative distribution function for the full degree distribution is then

P_k = (1 − Δ) ζ(γ, k) / ζ(γ, k_min),  k ≥ k_min.

Continuous Approach
Often when fitting, the approximation is made that the distribution is continuous rather than supported on discrete integers. We now consider the continuous regime to introduce an approximation for the exponent of a power-law, before returning to the discrete case. In the continuous case, Equation (3) can be normalized by

1 = ∫_0^{k_min} ρ(k) dk + ∫_{k_min}^{∞} p(k) dk,

where ρ(k) is some unknown distribution below k_min. The normalized distribution is then

p(k) = (1 − δ)(γ − 1) k_min^{γ−1} k^{−γ},  k ≥ k_min,    (8)

where δ ≡ ∫_0^{k_min} ρ(k) dk is treated as constant. We can then integrate Equation (8) to determine the complementary cumulative distribution function,

P(k) = (1 − δ)(k/k_min)^{−(γ−1)}.

We can see from this that the cumulative distribution also follows a power-law, but with an exponent differing by one from the original. As with Equation (4), a least squares fit to the cumulative distribution can be used to determine an estimate for the exponent. This estimate will usually provide a better fit than a fit to the probability density function, as the cumulative distribution is less noisy in the tail [11].

Moments of The Distribution
If we make the assumption that the distribution describes the entire dataset (i.e., set k_min = 1), and thus δ → 0, the probability density function in Equation (8) becomes

p(k) = (γ − 1) k^{−γ}.    (9)

Since the probability that k = 1 is bounded by unity, we must have γ < 2. The first moment of Equation (9) is

⟨k⟩ = ∫_1^{∞} k p(k) dk = (γ − 1)/(γ − 2).    (10)

If γ ≤ 2, then ⟨k⟩ will diverge when we evaluate this integral (for the discrete case it will be a large finite value).
As a result, we should use k_min > 1 to remain consistent with the constraint that k_min = 1 implies γ < 2. Empirically, in complex networks we often observe that the mean degree is low compared to the size of the network. Hence, choosing a high value of k_min could lead to the distribution being fitted to only a small number of nodes. Indeed, in the case of 1,000 simulated true power-law distributions with γ = 2.5, 918 of these have an average degree less than three.
It may also be worth considering using the modal degree as the value for k_min, as the distribution likely decays beyond this point. Clauset et al. [8] provide various methods for estimating k_min; however, they assume large values of N, and so these methods may not be applicable to smaller datasets. We aim to provide a methodology that fits datasets of any size. Methods for choosing k_min will be discussed in Section VI. For now we continue in the continuous regime. If γ > 2 in Equation (10), then we can obtain an estimate for γ from the mean,

γ_⟨k⟩ = (2⟨k⟩ − 1)/(⟨k⟩ − 1).    (11)

We will use this as one of the estimates for the exponent.
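As an illustration, a minimal Python sketch of this moment-based estimate follows, written for a general tail start k_min under the assumption that the continuous result ⟨k⟩ = k_min(γ − 1)/(γ − 2) holds (for k_min = 1 it reduces to Equation (11); the function name is ours):

```python
def gamma_from_mean(degrees, k_min=1):
    """Moment-based exponent estimate for a continuous power-law tail:
    gamma = (2<k> - k_min) / (<k> - k_min), valid only when gamma > 2."""
    tail = [k for k in degrees if k >= k_min]
    mean_k = sum(tail) / len(tail)
    return (2 * mean_k - k_min) / (mean_k - k_min)
```

For example, a tail with mean degree three and k_min = 1 yields the estimate (2·3 − 1)/(3 − 1) = 2.5.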
The same procedure can be repeated for the second moment, and we obtain

⟨k²⟩ = ∫_1^{∞} k² p(k) dk = (γ − 1)/(γ − 3).

As N goes to infinity, one observes that the second moment diverges when γ ≤ 3. Therefore, if the degree distributions follow a power-law with γ ≤ 3, they would be expected to have a large value of ⟨k²⟩.

Maximum Likelihood Estimators
From the continuous regime, we have three estimates for the exponent: 1) a least squares fit to the log-log probability mass function, 2) a least squares fit to the log-log complementary cumulative distribution function (whose exponent will differ by one), and 3) an estimate from the first moment of the degree distribution. We now turn our attention to the main focus of this paper, which is using maximum likelihood estimators to obtain estimates for distribution parameters.
In general, the likelihood L is the probability that an independent and identically distributed dataset of N observations was drawn from a model p_k with parameter θ,

L = Π_{i=1}^{N} p_{k_i}(θ).

Here k_i is the degree of node i and N is the number of nodes in the network. As mentioned above, the complementary cumulative distribution in Equation (2) is often used to reduce noise in the tail of the probability distribution. Fits are then often made to the cumulative distribution P_k rather than the original distribution p_k. It has been suggested that applying the method of least squares to cumulative distributions is unsuitable for empirical data [8,15]. This is in part due to the data being discrete, meaning the continuous approach to the complementary cumulative distribution may not be appropriate (particularly for small datasets). Network data will always contain dependencies between the observations, violating one of the assumptions underlying maximum likelihood estimation [16]. Despite this, MLE still shows good results, as we see below and as shown in Ref. [3]. Furthermore, the estimates for the parameters obtained using a least squares fit can be inaccurate, and the error in the parameter value can be difficult to estimate [8]. Instead, maximum likelihood estimators (MLEs) are said to provide better measurements for the parameters, as no estimator has lower asymptotic error in the large-sample limit than the MLE [8].
Sticking with the continuous regime for now, the likelihood corresponding to the power-law in Equation (8) is

L = Π_{i=1}^{N} (1 − δ)(γ − 1) k_min^{γ−1} k_i^{−γ}.

Taking the logarithm, we obtain

ln L = N ln(1 − δ) + N ln(γ − 1) + N(γ − 1) ln k_min − γ Σ_{i=1}^{N} ln k_i.

We can obtain a value for Δ from the data. This does not affect the maximization; however, it is important to have an estimate for Δ when displaying the fit to the distribution. Typically, Δ, or Σ_{k=1}^{k_min−1} ρ_k as in Equation (5), is ignored in works such as Ref. [8] and in any other methodology that focuses on the tail of the distribution. Here we aim to account for it as fully as possible, though it does not affect the maximization.
We set ∂(ln L)/∂γ = 0 and solve analytically for an estimate of the power-law exponent. We obtain

γ̂ = 1 + N [Σ_{i=1}^{N} ln(k_i / k_min)]^{−1},    (14)

where γ̂ denotes an estimate from the data rather than the true value. If a power-law is a good model for the data, then γ̂ ≈ γ. Our tests show, however, that this estimate performs relatively poorly; this will be discussed in the next section (see Tables I, II, III).
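This continuous estimate is straightforward to compute. A short Python sketch of the standard closed form γ̂ = 1 + N [Σ_i ln(k_i/k_min)]^{−1} (the function name is illustrative):

```python
import math

def gamma_continuous(degrees, k_min=1):
    """Analytic continuous MLE for the power-law exponent over the tail
    k >= k_min: gamma_hat = 1 + N / sum_i ln(k_i / k_min)."""
    tail = [k for k in degrees if k >= k_min]
    return 1.0 + len(tail) / sum(math.log(k / k_min) for k in tail)
```

Note that a tail consisting entirely of degrees equal to k_min makes the sum vanish; in practice such degenerate cases must be guarded against.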
In addition, complex network data, even when relatively large, rarely form a complete set of integers, and there may be large gaps between consecutive degrees. It is therefore much more accurate to look at the discrete case in Equation (6). Taking the log likelihood of this, we get

ln L = N ln(1 − Δ) − γ Σ_{i=1}^{N} ln k_i − N ln ζ(γ, k_min).    (15)

This can be numerically maximized to obtain an estimate for γ. Additionally, Clauset et al. [8] find an approximation to the maximum of the log likelihood in Equation (15) as

γ̂ ≈ 1 + N [Σ_{i=1}^{N} ln(k_i / (k_min − 1/2))]^{−1}.    (16)

However, the convenience of this estimate comes at the cost of accuracy for smaller networks, particularly with small k_min, as we show later. We compare all of these estimates in Section IV.
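Numerical maximization of the discrete log likelihood is straightforward with standard scientific libraries. The sketch below uses SciPy's Hurwitz zeta function; the Δ term is constant in γ and so is dropped from the optimization. The function name and the search bounds are our own choices, not the paper's package API:

```python
import numpy as np
from scipy.special import zeta            # zeta(g, q) is the Hurwitz zeta function
from scipy.optimize import minimize_scalar

def gamma_discrete_mle(degrees, k_min=1):
    """Numerically maximize ln L(g) = -g * sum_i ln k_i - N ln zeta(g, k_min)
    over the tail k >= k_min, returning the maximizing exponent."""
    tail = np.asarray([k for k in degrees if k >= k_min], dtype=float)
    n, sum_log_k = len(tail), np.log(tail).sum()

    def neg_log_likelihood(g):
        return g * sum_log_k + n * np.log(zeta(g, k_min))

    return minimize_scalar(neg_log_likelihood, bounds=(1.01, 6.0),
                           method="bounded").x
```

Applied to samples drawn from a discrete power-law with γ = 2.5, the estimate should recover a value close to 2.5.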

IV. COMPARISON OF ESTIMATES
In the previous section we outlined four ways of obtaining estimates for the power-law exponent γ. In this section we will compare these four estimates, along with the estimates obtained by performing least squares on the log-transformed PDF and CCDF of the data.
We denote by γ_c the estimate obtained from the continuous version of the distribution as in Equation (14), by γ_mle the numerical maximization of the discrete log likelihood in Equation (15), and by γ_d the discrete approximation from Equation (16).
Additionally, we define γ_pdf to be the value obtained by performing least squares regression on a log-log plot of the probability density function, and γ_ccdf to be the value obtained by performing least squares regression on a log-log plot of the complementary cumulative distribution. Lastly, γ_⟨k⟩ is the estimate calculated from the first moment of the distribution, as described by Equation (11).
For this comparison we generated synthetic power-law distributions of varying sizes, all with parameter γ = 2.5. We generate these using the NumPy package's built-in random number generators. For each value of N, we generated 1,000 distributions and calculated each estimate. The average estimates, as well as their standard deviations, for N = 1,000 are shown below in Table I. Additional tables for graphs of size N = 100 and N = 10,000 can be found in Appendix B. We see that in this case, with small k_min, the numerical maximization γ_mle performs best when compared to the other estimates.
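For reference, discrete power-law samples with a given exponent can be drawn with NumPy's Zipf generator, which samples p(k) ∝ k^{−a} for k = 1, 2, … The sketch below (a smaller run than the paper's, with our own seed) also checks the low-mean-degree observation from Section III; the exact fraction will vary with the seed:

```python
import numpy as np

rng = np.random.default_rng(42)
# Each entry is one synthetic degree sequence of N = 1,000 nodes with gamma = 2.5
samples = [rng.zipf(2.5, size=1000) for _ in range(100)]
means = [s.mean() for s in samples]
# Fraction of sequences whose mean degree is below three
frac_low_mean = sum(m < 3 for m in means) / len(means)
```

For γ = 2.5 the expected mean degree is ζ(1.5)/ζ(2.5) ≈ 1.95, so most sequences have a mean degree below three, consistent with the 918/1,000 figure quoted above.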
In Ref. [8], the authors state that the continuous estimate γ_c is less accurate than the discrete estimate and no easier to calculate, and so it should not be used. Our results agree with this statement, as we can see the continuous estimate is inaccurate even at large values of k_min. Instead, they suggest using the discrete approximation; however, we observe here that the numerically maximized discrete MLE performs better than the approximation.
As we can see here, the least squares fit to the CCDF, γ_ccdf, is highly consistent under variation of k_min; however, Clauset et al. [8] also point out flaws with this method, such as the underlying assumptions being invalid. The fit to the PDF, γ_pdf, is always less accurate than that to the CCDF, on account of the PDF having a noisier tail [17]. In addition, as mentioned in Sections I and III, there are drawbacks and limitations to applying the method of least squares to cumulative distributions.
The discrete estimate, γ_d, that Clauset et al. [8] recommend using is accurate at k_min = 3 and above; however, at this point already over 90% of the points will be missed. In Ref. [8], it is stated that this method works well only for k_min ⪆ 6. At such values we would run into a more extreme version of this problem, and our tests show that the estimate begins to become less accurate at this point.
The estimate calculated from the first moment of the distribution, γ_⟨k⟩, performs poorly for all values of k_min, and we recommend that it not be used.
We can see from Table I that in the case of small k_min, the numerical maximization of the log likelihood, γ_mle, performs best of all the estimates compared here. This is not surprising, given that all the other methods are approximations; however, we believe the increased accuracy of maximum likelihood estimators justifies the relative increase in difficulty of their implementation. There are several other advantages that MLEs have over the other estimates as well, as mentioned in Section III.
Furthermore, none of the other estimators here are applicable to any distribution in the way that MLEs are: the estimates γ_c and γ_d are unique to power-laws, and least squares regression is applicable only if the data can be easily linearized. The method of moments can be applied to some other distributions (e.g. the exponential); however, its performance is still poor. Finally, all of these other estimates assume the distribution a priori, whereas MLEs allow for easy comparison of multiple distributions using information criteria. For these reasons, we argue that maximum likelihood estimators should be the standard approach for fitting distributions to network data.
Indeed, a clarification is needed when discussing the work of Clauset et al. [8]: the methodology they discuss is applied to general empirical data, not necessarily to network degree data, which are often small and incomplete, especially in the case of poor sampling. It should be noted as well that the above paper is more concerned with determining whether a power-law distribution is an appropriate fit for data than with testing many distributions to find the most appropriate fit. In the succeeding sections we will present all of the common distributions that one should consider when fitting a distribution to network data, as well as equations for the log likelihoods to be maximized, and some of the pitfalls that should be avoided in plotting these distributions against empirical data to ensure correct fitting.

V. OTHER DISTRIBUTIONS
Additionally, we looked at the following distributions: (a) truncated power-law distributions, (b) exponential distributions, (c) stretched exponential distributions, (d) Weibull distributions, (e) Poisson distributions, (f) normal (Gaussian) distributions, and (g) lognormal distributions. Most of these distributions are for continuous variables, but here we discretize and normalize them in order to obtain their corresponding log likelihoods. We choose these as they are common in formation processes and the complex network literature, e.g. [2,8,18].

TABLE I. Values for each power-law exponent estimate discussed in Section IV, obtained on 1,000 datasets each of size N = 1,000 with true parameter value γ = 2.5. Values shown are means with standard deviations in brackets. γ_c is the continuous estimator for the exponent from Eq. (14); γ_pdf and γ_ccdf are the values obtained by least squares regression of the PDF and CCDF respectively; γ_d is the value obtained by the discrete approximation, Eq. (16); γ_⟨k⟩ is the value obtained by the method of moments; and γ_mle is the value obtained by maximum likelihood estimation. Best estimates for each value of k_min are highlighted in bold.

Truncated Power-Law Distributions
The second class of network described by Amaral et al. [2] contains a power-law regime followed by a sharp cut-off. Albert & Barabási [14] provide a list of exponents and cut-offs for a range of complex networks. A truncated power-law, or a power-law with exponential cut-off, has the form

p_k ∝ k^{−γ} e^{−k/κ}.    (17)

A truncated power-law is shown for the collaboration network of the condensed matter arXiv from 1995-1999 in Ref. [19]. The log likelihood is then

ln L = N ln(1 − Δ) − γ Σ_{i=1}^{N} ln k_i − (1/κ) Σ_{i=1}^{N} k_i + N k_min/κ − N ln Z(γ, κ, k_min),    (18)

where Z(γ, κ, k_min) ≡ Σ_{m=0}^{∞} (k_min + m)^{−γ} e^{−m/κ}. We then maximize this numerically. Details of obtaining Equation (18) from Equation (17) are shown in Appendix A.

Exponential and Weibull Distributions
The final class of network described by Amaral et al. comprises so-called single-scale networks, characterized by fast-decaying tails. Although these classifications are rarely used, they provide a useful ordering of the different types of distributions by complexity.
We first look at the exponential distribution, which is given by

p_k ∝ e^{−λk}.

After discretizing and normalizing over k ≥ k_min, its log likelihood is

ln L = N ln(1 − Δ) + N ln(1 − e^{−λ}) − λ Σ_{i=1}^{N} (k_i − k_min).

The exponential distribution can be generalized to a stretched exponential distribution, which has been found to be a good fit for some empirical datasets, such as the jazz band network in Ref. [20]. However, the stretched exponential and Weibull distributions are related, the stretched exponential being the CCDF of the Weibull distribution; hence we include only the Weibull here. The Weibull distribution is given by

p(k) = (β/λ)(k/λ)^{β−1} e^{−(k/λ)^β}.

As we can see, when β = 1 the Weibull simplifies to the exponential distribution. When β = 2, it is closely related to the normal distribution (below). After discretization and normalization, its log likelihood is

ln L = N ln(1 − Δ) + Σ_{i=1}^{N} [(β − 1) ln(k_i/λ) − (k_i/λ)^β] − N ln Σ_{k=k_min}^{∞} (k/λ)^{β−1} e^{−(k/λ)^β}.

Details of this calculation can once again be found in Appendix A.
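As a concrete example of the same numerical procedure applied beyond the power-law, the sketch below maximizes a tail-normalized discrete exponential log likelihood (constant terms in λ are dropped; the function name and search bounds are our own illustration):

```python
import math
from scipy.optimize import minimize_scalar

def fit_discrete_exponential(degrees, k_min=1):
    """MLE for p_k proportional to exp(-lam * k) on k = k_min, k_min+1, ...
    Up to constants, ln L = -lam * sum_i (k_i - k_min) + N ln(1 - e^{-lam})."""
    tail = [k for k in degrees if k >= k_min]
    n = len(tail)
    excess = sum(k - k_min for k in tail)

    def neg_log_likelihood(lam):
        return lam * excess - n * math.log1p(-math.exp(-lam))

    return minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20.0),
                           method="bounded").x
```

For this particular distribution the maximum also has the closed form λ̂ = ln(1 + 1/m), where m is the mean excess degree above k_min, which provides a useful sanity check on the optimizer.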

Normal & Lognormal Distributions
The normal distribution (or Gaussian) is given by

p(k) = (1/(σ√(2π))) exp(−(k − µ)²/(2σ²)),

where µ is the mean and σ² is the variance. This distribution is included here as it has been found to fit some social networks [2]; however, given the skewed nature of network degree sequences, it is unlikely to be chosen in most cases. Up to the discrete normalization over k ≥ k_min, the log likelihood is given by

ln L = −N ln(σ√(2π)) − Σ_{i=1}^{N} (k_i − µ)²/(2σ²).    (24)

The lognormal distribution is then given by

p(k) = (1/(kσ√(2π))) exp(−(ln k − µ)²/(2σ²)),

and its log likelihood by

ln L = −Σ_{i=1}^{N} ln k_i − N ln(σ√(2π)) − Σ_{i=1}^{N} (ln k_i − µ)²/(2σ²).

Poisson Distribution
The last distribution we test for these network datasets is the Poisson distribution. It is given by

p_k = µ^k e^{−µ} / k!.

The log likelihood is then

ln L = Σ_{i=1}^{N} k_i ln µ − Nµ − Σ_{i=1}^{N} ln(k_i!).

Again, details of obtaining this equation, as well as those for the normal and lognormal distributions, can be found in Appendix A. As with the normal distribution, a Poisson is an unlikely candidate for most network datasets given their skewed nature; however, both are included here for completeness given their ubiquity in other empirical datasets and scientific settings [21,22].

VI. MODEL SELECTION, CHOOSING k min , AND GOODNESS OF FIT
As discussed in Section IV, there are several advantages to using maximum likelihood estimation. Most important is the ability to compare the likelihoods of different distributions in order to choose the best distribution for a given dataset. To compare different models we use the Akaike information criterion with correction for finite sample sizes (AIC_c) [23,24] and the Bayesian information criterion (BIC) [25]. The AIC_c is given by

AIC_c = 2n_θ − 2 ln L + 2n_θ(n_θ + 1)/(N − n_θ − 1),    (29)

and the BIC is given by

BIC = n_θ ln N − 2 ln L.

In each equation, n_θ is the number of parameters in the model and N is the sample size. The AIC_c penalizes the number of parameters more strongly, with the penalty diminishing at larger sample sizes. Anderson & Burnham [24] recommend always using the AIC_c instead of the regular AIC (which is the same as Equation (29) but without the last term), as in the case of large N the correction tends to zero. We also make use of AIC weights. For a given set of m models for a given dataset, the AIC weights are given by

w_i = exp(−Δ_i/2) / Σ_{j=1}^{m} exp(−Δ_j/2),

where Δ_i = AIC_c,i − min(AIC_c). These values represent the relative likelihoods of the models and allow us to determine whether there is support for more than one model. The weights sum to 1, and the closer a weight is to one, the more support there is for that model.
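These criteria are simple to compute from the maximized log likelihoods. A minimal Python sketch (function names are ours):

```python
import numpy as np

def aic_c(log_l, n_params, n):
    """Corrected Akaike information criterion."""
    return 2 * n_params - 2 * log_l \
        + (2 * n_params * (n_params + 1)) / (n - n_params - 1)

def bic(log_l, n_params, n):
    """Bayesian information criterion."""
    return n_params * np.log(n) - 2 * log_l

def akaike_weights(aicc_values):
    """Relative likelihoods of the candidate models; the weights sum to one."""
    deltas = np.asarray(aicc_values) - np.min(aicc_values)
    w = np.exp(-deltas / 2)
    return w / w.sum()
```

Subtracting the minimum AIC_c before exponentiating keeps the weights numerically stable even when the raw criterion values are large.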
Usually the weights are accompanied by some cut-off, below which we say a model is not supported. Since the AIC_c and the BIC both depend on N, we can only compare two distributions that start at the same value of k_min. For this reason, the model and k_min must be chosen in tandem. It is important to remember that, to capture the true distribution of a dataset, minimizing k_min should be a priority. How exactly we achieve this is discussed in Sections VII and VIII.
Goodness-of-fit tests for this type of analysis leave much to be desired. Two common means of assessing goodness-of-fit are (i) Kolmogorov-Smirnov (KS) tests and (ii) Q-Q plots, both of which have their drawbacks. Descriptions of how these tests work can be found in Refs. [26] and [27] respectively.
First, with network datasets, we are often dealing with sample sizes of the order of 10,000 nodes or more. These sample sizes are in contrast to the work done by Goldstein et al. [7], who find success using the KS test on sample sizes of the order of 1,000 nodes. The critical value for a KS test at a 5% level of significance in the limit of large n is given by 1.36/√n, where n is the sample size, or in our case the number of nodes. Our tests find that the KS test is simply too restrictive, and will tell the user that even the best of our fits are poor.
Quantile-quantile (Q-Q) plots (shown in Figure 4) are useful in many cases, but since they require visual interpretation, they are not suitable in situations where automation is desired. Furthermore, they are sensitive to outliers and extreme values, which are common across all of our datasets.

VII. FITTING
Here we outline the methodology used to choose distributions for datasets, based on the methods discussed already. The datasets to which this methodology was applied can be found in Table IV. For a given degree sequence, the methodology works as follows: (i) Beginning with k_min = 1, we find the parameter values that maximize the log likelihood functions of each of the distributions discussed above.
(ii) The AIC c and AIC weights for each distribution are calculated.The user may choose to use the AIC c or the BIC to choose the best distribution or both.
(iii) We increase k_min by one and repeat steps (i) and (ii). Once the same distribution is chosen for three consecutive k_min values, we choose this distribution for the dataset. In the case of small datasets (here chosen to be fewer than 2,500 nodes), the same distribution need only be chosen for two consecutive k_min values. The value 2,500 was chosen based on our datasets; below this value, requiring the same distribution to be chosen three times simply resulted in cutting off too much of the data below k_min.
(iv) We use the determined distribution and parameters to fit a curve to the CCDF of the degree sequence for visual purposes. At this point, both the CCDF and the PDF should be inspected, as we find that looking at just the CCDF is insufficient to determine whether we have obtained a good fit. Other goodness-of-fit metrics, such as Q-Q plots, can also be used. If one determines that the fit obtained is poor, then k_min should be increased and the algorithm repeated. We find that we can best determine the degree distribution when using all available information. We will explore this further in the results section.
(v) We then bootstrap the dataset 1,000 times and fit the same distribution to each bootstrapped sample in order to obtain means, standard deviations, and other desired summary statistics for the parameters of our fitted distributions. These summary statistics are shown in Table V.
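The bootstrap step can be sketched as follows; `fit_fn` stands in for whichever maximum likelihood fit was selected, and the function shown is our own illustration rather than the paper's package API:

```python
import numpy as np

def bootstrap_parameter(degrees, fit_fn, n_boot=1000, seed=0):
    """Resample the degree sequence with replacement, refit each resample,
    and return the mean and standard deviation of the fitted parameter."""
    rng = np.random.default_rng(seed)
    degrees = np.asarray(degrees)
    estimates = [fit_fn(rng.choice(degrees, size=len(degrees), replace=True))
                 for _ in range(n_boot)]
    return float(np.mean(estimates)), float(np.std(estimates))
```

The spread of the bootstrap estimates gives a direct, distribution-free error bar on each fitted parameter.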
When fitting we must account for data below k_min. This is done using our empirical CCDF. If, for example, we have k_min = 3, then we multiply the values in our fitted CCDF by the empirical probability that a node has degree at least three, 1 − Σ_{k=0}^{2} p_k ≡ P(3). This ensures that the fitted curve begins at the correct point.
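In code, this rescaling amounts to multiplying the model CCDF, normalized on the tail, by the empirical fraction of nodes at or above k_min. A minimal sketch (function name ours):

```python
import numpy as np

def rescale_fitted_ccdf(fitted_ccdf, degrees, k_min):
    """Scale a model CCDF, normalized on k >= k_min, by the empirical
    probability P(k_min) that a node has degree at least k_min."""
    p_tail = (np.asarray(degrees) >= k_min).mean()
    return p_tail * np.asarray(fitted_ccdf)
```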
There are some limitations on what can be done with a given network dataset. Many networks are poorly sampled or are not a complete sample. Consider the Enron email dataset [28]: here, emails between employees were recorded, but not emails to external addresses, i.e., contacts outside the organisation would not add to an individual's degree. Therefore, we do not have the actual degrees of the nodes, and so conclusions drawn from the dataset about the degree distribution cannot be extended to the network as a whole. In this case we may want to fit only to the tail and not the full distribution, so a high value of k_min may be appropriate.
In contrast to this is the power-grid network studied in Ref. [29]. Here, the degree is the number of physical connections to a generator, a transformer, or a substation, and so the degree of a node is correct with respect to the underlying network. In this case, a high k_min may result in a different distribution being chosen for the tail, one which is not representative of the full network.
Keeping these limitations in mind, this methodology can be applied to any distribution of discrete integers. In the following section we will see how it performs on a wide variety of network types.

VIII. RESULTS
We now look at some of the results obtained by using the methodology described above on a collection of network datasets: some of our own, one sexual contact network taken from Ref. [4], and many taken from the Konect database. Konect stands for the Koblenz Network Collection [30], an open-access online database of network datasets from various areas of science. A full list of the datasets to which we applied the methodology, as well as a description of each dataset, can be seen in Table IV in Appendix B. A summary of some basic properties of these networks can be found in Table V. For most networks that we looked at, we obtain good estimates for the parameters of the chosen distributions, as can be seen from the curves fitting closely to the data points. Note as well that in the majority of these cases we begin at k_min values of one and two, in line with our priority of minimizing k_min.
The results shown in Figure 2 are representative of the results as a whole (results for all other datasets can be seen in Figures 7 and 8). In fact, based on its CCDF, the Google+ dataset (Figure 2 (h)) appears to be the worst fit of any dataset displayed. One possible reason for this relatively poor fit is the very noisy tail of the distribution, which can be seen in the PDF of this distribution in Figure 6. This noisy tail is caused by many hubs of different degrees, i.e., there are many points with different degrees but equal probabilities, hence the long horizontal line of points in Figure 6 (b). This is in contrast to the more 'well-behaved' PDF of the astrophysics dataset in Figure 6 (a). Even in this case, looking at the PDF, we can see that there is not much room for improvement in the fit.

FIG. 2. Results for a selection of the networks whose degree distributions we fitted. As we can see, in most cases we manage to obtain excellent fits, with low k min values in all cases. Notable exceptions are Gene Fusion and Google+, which are discussed in turn in Section VIII.
According to the CCDF in Figure 5 (b) for the Linux dataset, we appear to obtain a bad fit here too. However, looking at the PDF (Figure 5 (a)), we see that we actually have quite a good fit. In this case, the sharp drop-off in CCDF values may have been caused by finite-size effects, cumulative errors in individual probability values, or something else. The lesson here is that we cannot judge a fit based solely on the CCDF.
A counterexample to this is the Twitter dataset, whose CCDF and PDF can be seen in Figures 7 (v) and 8 (v) respectively. Here, we see a poor fit to both the PDF and the CCDF, highlighting the need to look at both before determining whether a distribution is a good fit.
In Figure 2 (g) we have the Gene Fusion dataset. Overall the model fits the data well; however, the dataset only contains approximately 250 nodes, hence we have large distances between the predicted and observed probabilities. As mentioned in step (iii) of the algorithm, in the case of such small datasets the methodology is altered slightly so that only two votes for the same distribution are required for that distribution to be chosen, as we found that a larger value results in too large a portion of the data being excluded.
We can see that in cases such as the Facebook dataset in Figure 2 (c), we obtain a good fit for the entire distribution. Following the approach of Clauset et al. [8], it would be reasonable to choose a k min value even as high as k min ≈ 110. In doing so, all of the data below this point, over 99% of the values, would be ignored, whereas with our methodology we fit to the entire dataset. Other methodologies would similarly ignore the bulk of the data and fit a power-law to the tail, using one of the estimators shown in Table I, based on the tail appearing as a straight line when plotted on a log-log scale. Some fits may be poor while prioritizing a low value of k min, but in this situation we would increase k min to see if we can obtain a better fit. Examples of this can be seen in the Internet (CAIDA), Internet AS 2, and Internet Topology datasets, for which we get a very poor fit for low values of k min but see improvement when we increase k min. This is in contrast to the Twitter dataset, for which even larger values of k min struggle to produce a good fit. In this case, perhaps some less common distribution which we have not covered here would provide a better fit.
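The share of a dataset that a given k min actually retains is simple to compute; a minimal sketch (the degree sequence below is synthetic, for illustration only, not one of the datasets studied here):

```python
import numpy as np

def retained_fraction(degrees, k_min):
    """Fraction of nodes with degree >= k_min, i.e. the share of the
    data that a fit starting at k_min would actually use."""
    degrees = np.asarray(degrees)
    return np.mean(degrees >= k_min)

# Hypothetical heavy-tailed degree sequence, for illustration only.
rng = np.random.default_rng(0)
degrees = rng.zipf(2.5, size=10_000)

print(retained_fraction(degrees, 1))    # 1.0: the whole dataset
print(retained_fraction(degrees, 110))  # only a tiny tail fraction
```

For a heavy-tailed distribution almost all of the mass sits at low degrees, which is why a large k min discards the overwhelming majority of the nodes.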
In the case of the RIP Ireland dataset, we see that the chosen distribution is a Weibull with parameters κ = 183.14, β = 0.99. Since β is so close to one, this is essentially an exponential distribution. However, since an exponential distribution is a special case of a Weibull distribution, and one can always achieve a higher likelihood by adding more parameters [8], the likelihood of an exponential will always be less than that of a Weibull. In this case, we rely on the information criteria to sufficiently penalize the distribution with more parameters. If using the Bayesian information criterion with k min = 1, we obtain an exponential fit for this distribution, though not for k min = 2 or greater. With the advantage of a lower k min and a simpler distribution, an exponential distribution is likely the better choice here. Results for this dataset shown in Table V are for a Weibull distribution, to illustrate our point.
In all other cases, we find that choosing BIC instead of AIC has no effect on what distribution is chosen.
Lastly, we focus on the Petster dataset, which is a more ambiguous example. Initially we fit using the steps outlined in the previous section, which tell us that the data follow a Weibull distribution for k ≥ 3. This fit is shown in Figure 3.
Suppose we feel that, based on the CCDF, we could improve the fit to the tail of the distribution. We increase k min to four and repeat the algorithm, which selects a lognormal, again shown in Figure 3.
This distribution arguably fits better in the tail, so how do we choose between the two options? Looking at the PDF for both distributions, we see there is very little appreciable difference, aside from the different k min values. We then look at the Q-Q plots for each distribution, shown in Figure 4. As we can see, the Weibull appears to be a better fit for the distribution, as the data follow the diagonal more closely (both begin to deviate at the tail, which is to be expected when dealing with extreme values).
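A Q-Q plot of this kind compares empirical quantiles with the quantiles of the fitted model; a small sketch, assuming any inverse CDF callable is available (the normal distribution below is a stand-in for testing, not one of the paper's candidate distributions):

```python
import numpy as np
from scipy.stats import norm  # stand-in model; any fitted ppf works

def qq_points(data, model_ppf, n_points=50):
    """Return (theoretical, empirical) quantile pairs for a Q-Q plot.
    `model_ppf` is the inverse CDF (quantile function) of the fitted
    distribution. A good fit keeps the pairs close to the diagonal."""
    probs = (np.arange(n_points) + 0.5) / n_points
    empirical = np.quantile(np.asarray(data), probs)
    theoretical = model_ppf(probs)
    return theoretical, empirical

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)
t, e = qq_points(sample, norm.ppf)  # plot e against t to inspect fit
```

For discrete degree data the empirical quantiles are step-like, so deviations at the extreme tail, as noted above, are expected rather than alarming.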
When the choice between two distributions is so close, we must look at the AIC weights. At k min = 3, the weights for the Weibull and lognormal distributions were 0.99 and 0.01 respectively. At k min = 5, these values were 0.3 and 0.7. Because of this, at k min = 3 we can rule out a lognormal distribution. However, at k min = 5, there is still some support for a Weibull distribution.
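Akaike weights follow the standard formula w_i = exp(−Δ_i/2) / Σ_j exp(−Δ_j/2), with Δ_i = AIC_i − min_j AIC_j; a minimal sketch (the AIC values below are made up for illustration):

```python
import numpy as np

def akaike_weights(aic_values):
    """Akaike weights: relative support for each candidate model,
    w_i = exp(-Delta_i / 2) / sum_j exp(-Delta_j / 2),
    where Delta_i = AIC_i - min(AIC)."""
    aic = np.asarray(aic_values, dtype=float)
    delta = aic - aic.min()       # subtracting the min avoids overflow
    w = np.exp(-delta / 2)
    return w / w.sum()

# Hypothetical AIC values for two candidate fits.
print(akaike_weights([1000.0, 1009.2]))  # strongly favours the first model
```

A weight near one means the data give essentially no support to the alternatives, while intermediate weights (such as 0.3 versus 0.7) signal genuine ambiguity.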
Combining all of this information with our priority of minimizing k min, we conclude that the Weibull distribution is the better fit for the data. In the case of more ambiguous fits, or when wanting to improve the fit to a distribution, we recommend this approach of examining all available information.
All in all, we can see that the methodology described in this paper performs well in a variety of scenarios, and does poorly in only a few cases. This may be improved upon by testing other distributions, as here we have only tested a handful of commonly encountered distributions.

IX. DISCUSSION
For the majority of the time that the field of complex networks has been studied, questions about the degree distributions of networks have remained without satisfying answers. Here we have built upon the work of Clauset et al. [8] to provide a methodology that introduces more rigour to the process of determining the degree distribution of a network dataset. There are three key differences from their method.
Firstly, we suggest taking a low value of k min to consider as many of the nodes as possible. Depending on the network's distribution, even a value of 3 or larger could result in fitting to less than 10% of the network. This is only desirable when a network is poorly sampled and we are interested solely in the tail. If we suspect the network is well sampled, then a high value of k min will miss the majority of the network. These often-ignored low-degree nodes are important in many contexts [31], playing crucial roles in processes such as percolation [9], as well as community detection in social networks [10].
Secondly, we show that the parameter estimate obtained through numerical maximisation of Equation (15) is superior to the estimate given by Equation (16), especially for low values of k min. Using the approximation for smaller networks with low values of k min will give poor results and may rule out a power-law even when that is the true distribution.
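The numerical maximisation can be done with standard tools; a sketch, not the paper's own package, assuming the usual discrete power-law log-likelihood L(γ) = −γ Σ ln k_i − n ln ζ(γ, k min) (with ζ the Hurwitz zeta function) and the commonly used closed-form approximation γ ≈ 1 + n [Σ ln(k_i / (k min − 1/2))]⁻¹:

```python
import numpy as np
from scipy.special import zeta          # zeta(g, q) is the Hurwitz zeta
from scipy.optimize import minimize_scalar

def gamma_mle(degrees, k_min):
    """Numerical MLE for a discrete power-law: maximize
    L(gamma) = -gamma * sum(ln k) - n * ln zeta(gamma, k_min)."""
    k = np.asarray([d for d in degrees if d >= k_min], dtype=float)
    n, log_sum = len(k), np.log(k).sum()
    neg_ll = lambda g: g * log_sum + n * np.log(zeta(g, k_min))
    return minimize_scalar(neg_ll, bounds=(1.05, 6.0), method="bounded").x

def gamma_approx(degrees, k_min):
    """Widely used closed-form approximation to the discrete MLE."""
    k = np.asarray([d for d in degrees if d >= k_min], dtype=float)
    return 1.0 + len(k) / np.log(k / (k_min - 0.5)).sum()

rng = np.random.default_rng(1)
sample = rng.zipf(2.5, size=20_000)     # synthetic data, true exponent 2.5
print(gamma_mle(sample, 1), gamma_approx(sample, 1))
```

On synthetic data like this, the numerical estimate recovers the true exponent closely at k min = 1, where the closed-form approximation is at its least reliable.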
Thirdly, we use information criteria to choose the model rather than constructing a p-value from Kolmogorov-Smirnov tests. These criteria do not tell us about the absolute goodness of fit, but they do allow us to choose the most appropriate of the candidate models for the dataset.
As we can see from the results in Section VIII, using maximum likelihood estimation to determine the most appropriate fit from a set of candidate models performs well on a variety of network datasets with complex distributions. This work improves the standard of distribution fitting: through minimisation of k min, most if not all of the data are fit in all cases. This process is not entirely automated, and we believe that is as it should be. In some cases Q-Q plots may be necessary to interpret the results, and one may need to check both the CCDF and the PDF. These are noisy empirical datasets, so caution must be taken, and an automated process could easily miss these subtleties.
After checking many different distributions, one may wonder why we are so interested in the specific distribution a complex network has. Networks have different underlying formation processes. For example, if we know the degree distribution follows a Poisson or binomial distribution, then we know the network is formed by a purely random process. If we know nodes prefer to connect to high-degree nodes, we know it is a preferential attachment process. Hence, if we know the underlying distribution, we can develop a theory that gives rise to that distribution. This is one of the reasons why the power-law is so popular: not only is it instrumental in many statistical physics theories, and required for the epidemic threshold in disease transmission models, but it is also the distribution underlying the preferential attachment model. It is also a single-parameter distribution, making it quite desirable for fitting with low statistical complexity. In our results, we find support for power-laws in many internet-based networks. However, this does not seem like a good mechanism for many of the social networks we analyse.

FIG. 6. Probability density functions for the Astrophysics dataset (left) and Google+ dataset (right). As we can see, the Google+ data are far noisier in the tail, making the distribution more difficult to fit.
Power-laws suggest some nodes will have an extremely high degree, often approaching the system size, something not realistic for social systems. The social networks here, for example, are better fitted by Weibull or lognormal distributions. In fact, the RIP Ireland data are well fitted by a Weibull with β ≈ 1, i.e., an exponential. With the exception of two outliers (people known in the media), no one has a lifetime degree even close to 2,000 (for a system size > 4 million). Therefore, preferential attachment seems a poor mechanism for social networks in particular, with their fast-decaying tails.
Many other processes for social networks have been suggested, even in the early years of complex networks; for example, Ebel et al. [18] propose a mechanism which gives rise to a stretched exponential. This is closer to the results that we observe. We aim to test this further going forward, to identify an appropriate mechanism for social networks.
The foundation presented here performs well and can be extended to improve its performance further. In the future, we aim to extend the analysis both to more datasets and to more parametric distributions. Another avenue we aim to explore is fitting more than one distribution function to distributions in which the tail behaves differently to the body, i.e., splitting the distribution at some value k* and fitting two different distributions to the data above and below this value.
We believe that this widely applicable and user-friendly approach provides much-needed rigour in answering a fundamental question about the properties of a network.
Exponential Distribution

To normalize the distribution, we require

C Σ_{k=k_min}^{k_max} e^{−k/κ} = 1.

We multiply the summation here above and below by 1 − e^{−1/κ}, like so:

C (1 − e^{−1/κ})^{−1} Σ_{k=k_min}^{k_max} (e^{−k/κ} − e^{−(k+1)/κ}) = 1.

All but the first and last terms here cancel, leaving us with

C (1 − e^{−1/κ})^{−1} (e^{−k_min/κ} − e^{−(k_max+1)/κ}) = 1,

and so

C = (1 − e^{−1/κ}) / (e^{−k_min/κ} − e^{−(k_max+1)/κ}).

This leads to the normalized exponential distribution, from which the log likelihood, Equation (20), is obtained.
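The resulting constant, C = (1 − e^{−1/κ}) / (e^{−k_min/κ} − e^{−(k_max+1)/κ}), can be checked numerically; a quick sketch with illustrative parameter values (kappa, k_min, and k_max here are free choices, not values from the paper):

```python
import numpy as np

# Illustrative parameter values only.
kappa, k_min, k_max = 5.0, 1, 500

# Normalization constant for the truncated discrete exponential.
C = (1 - np.exp(-1 / kappa)) / (
    np.exp(-k_min / kappa) - np.exp(-(k_max + 1) / kappa)
)

# Summing the normalized probabilities should give 1 up to rounding.
k = np.arange(k_min, k_max + 1)
total = (C * np.exp(-k / kappa)).sum()
print(total)
```

Such a direct summation check is a cheap guard against sign or index errors in telescoping-sum derivations like the one above.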

Weibull Distribution
Starting with Equation (21), we first define the constant C which normalizes the distribution over k_min ≤ k ≤ k_max. With C defined in this way, the likelihood is the product of the normalized probabilities over the observed degrees, from which the log likelihood, Equation (22), is obtained.

Poisson Distribution
Starting with Equation (27), we normalize in the same way. Adding and subtracting the terms of the sum below k_min allows the normalizing sum to be written as a difference of two partial sums starting from k = 0. This gives the constant C; Equation (27) then becomes the normalized distribution, and our log likelihood is as shown in Equation (28).

Lognormal Distribution
Starting with Equation (25), we normalize in the usual way, summing over k_min ≤ k ≤ k_max to obtain the constant C. This gives the normalized distribution, from which the log likelihood, Equation (26), is obtained.

Appendix B: Additional Tables & Figures
Below are a collection of supporting tables and figures referenced throughout the paper.

TABLE II. Values for each power-law exponent estimate discussed in Section IV, obtained on 1,000 datasets each of size N = 100 with true parameter value γ = 2.5. Values shown are means with standard deviations in brackets. γ_c is the continuous estimator for the exponent from Eq. (14); γ_pdf and γ_ccdf are the values obtained by least-squares regression of the PDF and CCDF respectively; γ_d is the value obtained by the discrete approximation, Eq. (16); γ_⟨k⟩ is the value obtained by the method of moments; and γ_mle is the value obtained by maximum likelihood estimation. Best estimates for each value of k min are highlighted in bold.

FIG. 1 .
FIG. 1. Degree distribution for yeast protein interactions on a log-log scale (a), and the complementary cumulative degree distribution, again on a log-log scale (b).

FIG. 3 .
FIG. 3. Weibull distribution and a lognormal distribution fit to the Petster dataset at different k min values as described in Section VIII.

TABLE III .
Values for each power-law exponent estimate discussed in Section IV, obtained on 1,000 datasets each of size N = 10,000 with true parameter value γ = 2.5. Values shown are means with standard deviations in brackets. γ_c is the continuous estimator for the exponent from Eq. (14); γ_pdf and γ_ccdf are the values obtained by least-squares regression of the PDF and CCDF respectively; γ_d is the value obtained by the discrete approximation, Eq. (16); γ_⟨k⟩ is the value obtained by the method of moments; and γ_mle is the value obtained by maximum likelihood estimation. Best estimates for each value of k min are highlighted in bold.

TABLE V .
Quantities for each of our studied networks. N is the number of nodes, L is the number of edges, ⟨k⟩ is the mean degree, and k max is the maximum degree.

TABLE VI .
Network datasets and their corresponding distributions; all figures rounded to two decimal places. x_1 is the first parameter for a distribution, x̄_1 is the mean parameter obtained from fitting this distribution to bootstrapped samples, and σ_x1 is the standard deviation of this parameter. P_2.5(x_1) and P_97.5(x_1) are the 2.5th and 97.5th percentiles of the parameter respectively. These values are then repeated for x_2 in the case of two-parameter distributions. Columns: Network, Distribution, k min, x_1, x̄_1, σ_x1, P_2.5(x_1), P_97.5(x_1), x_2, x̄_2, σ_x2, P_2.5(x_2), P_97.5(x_2).

FIG. 7. Empirical and fitted CCDFs for all studied networks listed in Table IV. When assessing fit, both PDFs and CCDFs should be considered.