Power-law Distributions in Binned Empirical Data

This page is a companion for the paper on power-law distributions in binned empirical data, written by Yogesh Virkar and Aaron Clauset (me). It presents a version of the power-law tools from here that works with binned data. This page hosts our implementations of the methods we describe in the article, including several by developers other than us. Our goal is for the methods to be widely accessible to the community.
NOTE: we cannot provide technical support for code not written by us, and we are busy with other projects now and so may not provide support for our own code.

Journal Reference
Y. Virkar and A. Clauset, Power-law distributions in binned empirical data. Annals of Applied Statistics 8(1), 89-119 (2014). (arXiv:1208.3524).

Fitting a binned power-law distribution
This function fits a power-law model to binned data using the maximum likelihood estimator discussed in the paper. It uses a goodness-of-fit-based method to estimate the lower bound of the scaling region. Additional information can be obtained by typing 'help bplfit' at the Matlab command window.
bplfit.m (Matlab, by Yogesh Virkar)
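
As a rough illustration of the maximum likelihood step only (this is not the code in bplfit.m), the Python sketch below maximizes the binned power-law log-likelihood over the exponent alpha. It assumes the lower bound bmin is already known and equals the first bin edge, whereas bplfit.m estimates it from the data. The function names, the search interval for alpha, and the golden-section search are illustrative choices, not part of the package.

```python
import math

def binned_powerlaw_loglik(alpha, counts, edges):
    """Log-likelihood of a power law for counts[i] observations in [edges[i], edges[i+1])."""
    bmin = edges[0]  # assumption: the scaling region starts at the first bin edge
    n = sum(counts)
    loglik = n * (alpha - 1.0) * math.log(bmin)
    for h, lo, hi in zip(counts, edges[:-1], edges[1:]):
        if h > 0:
            # Unnormalized probability mass of the bin; an infinite upper edge contributes 0.
            loglik += h * math.log(lo ** (1.0 - alpha) - hi ** (1.0 - alpha))
    return loglik

def fit_alpha(counts, edges, lo=1.01, hi=6.0, tol=1e-8):
    """Golden-section search for the alpha that maximizes the log-likelihood."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if binned_powerlaw_loglik(c, counts, edges) > binned_powerlaw_loglik(d, counts, edges):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0

# Idealized counts for alpha = 2.5 in bins whose edges are powers of two.
edges = [2.0 ** i for i in range(11)] + [math.inf]
counts = [1e6 * (lo ** -1.5 - hi ** -1.5) for lo, hi in zip(edges[:-1], edges[1:])]
print(fit_alpha(counts, edges))  # should land close to 2.5
```

The search interval (1.01, 6.0) is an arbitrary assumption and would need widening for data with a steeper tail.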

Visualizing the fitted model
This function plots (on log-log axes) the binned empirical data along with the fitted power-law model. Additional information can be obtained by typing 'help bplplot' at the Matlab command window.
bplplot.m (Matlab, by Yogesh Virkar)

Estimating uncertainty in the fitted parameters
This function estimates the uncertainty in the estimated parameters of the power-law model using a nonparametric bootstrap approach. Additional information can be obtained by typing 'help bplvar' at the Matlab command window.
bplvar.m (Matlab, by Yogesh Virkar)
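
The idea behind a nonparametric bootstrap on binned data can be sketched in a few lines of Python (again, this is not the code in bplvar.m). The sketch assumes logarithmic bins with a constant ratio c between successive edges, no upper cutoff, and a lower bound fixed at the first bin edge; under those assumptions the bin index follows a geometric distribution and the exponent has a closed-form estimate. The function names, the number of replicates, and the seed are hypothetical.

```python
import math
import random

def alpha_log_bins(counts, c):
    """Closed-form exponent estimate for logarithmic bins with edge ratio c.

    Assumes at least one observation falls above the first bin.
    """
    n = sum(counts)
    mean_index = sum(i * h for i, h in enumerate(counts)) / n
    return 1.0 + math.log(1.0 + 1.0 / mean_index) / math.log(c)

def bootstrap_alpha_sd(counts, c, reps=200, seed=0):
    """Standard deviation of the exponent over resampled data sets."""
    rng = random.Random(seed)
    n = sum(counts)
    bins = list(range(len(counts)))
    estimates = []
    for _ in range(reps):
        # Draw n observations with replacement from the empirical bin frequencies.
        draws = rng.choices(bins, weights=counts, k=n)
        resampled = [0] * len(counts)
        for i in draws:
            resampled[i] += 1
        estimates.append(alpha_log_bins(resampled, c))
    mean = sum(estimates) / len(estimates)
    var = sum((a - mean) ** 2 for a in estimates) / (len(estimates) - 1)
    return math.sqrt(var)

counts = [500, 250, 125, 63, 31, 16, 8, 4, 2, 1]  # made-up counts, edge ratio c = 2
print(alpha_log_bins(counts, 2.0), bootstrap_alpha_sd(counts, 2.0))
```

A full treatment would also re-estimate the lower bound on every replicate, so that its uncertainty propagates into the reported spread of alpha.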

Calculating a p-value for the fitted power-law model
Using the Kolmogorov-Smirnov statistic as a distance measure between the data and the fitted model, and a semi-parametric bootstrap to resample the data, this function calculates the plausibility of the fitted power-law model. Additional information can be obtained by typing 'help bplpva' at the Matlab command window.
bplpva.m (Matlab, by Yogesh Virkar)
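
A simplified Python sketch of this procedure follows; it is not the code in bplpva.m. It assumes logarithmic bins with edge ratio c and no upper cutoff, and it holds the lower bound fixed at bin index k_min, whereas the full method re-estimates the lower bound for every synthetic data set. All names and default values are hypothetical.

```python
import math
import random

def alpha_log_bins(tail_counts, c):
    """Closed-form exponent estimate for logarithmic bins with edge ratio c."""
    n = sum(tail_counts)
    mean_index = sum(i * h for i, h in enumerate(tail_counts)) / n
    return 1.0 + math.log(1.0 + 1.0 / mean_index) / math.log(c)

def ks_distance(tail_counts, alpha, c):
    """Largest gap between empirical and model CDFs at the upper bin edges."""
    n = sum(tail_counts)
    q = c ** (1.0 - alpha)
    cumulative, worst = 0, 0.0
    for i, h in enumerate(tail_counts):
        cumulative += h
        model_cdf = 1.0 - q ** (i + 1)
        worst = max(worst, abs(cumulative / n - model_cdf))
    return worst

def p_value(counts, k_min, c, reps=200, seed=0):
    """Fraction of synthetic data sets that fit worse than the observed data."""
    rng = random.Random(seed)
    tail = counts[k_min:]
    n, n_tail = sum(counts), sum(tail)
    alpha = alpha_log_bins(tail, c)
    d_obs = ks_distance(tail, alpha, c)
    log_q = (1.0 - alpha) * math.log(c)
    exceed = 0
    for _ in range(reps):
        synth = {}
        for _ in range(n):
            # With probability n_tail/n draw from the fitted power law; otherwise the
            # draw comes from the empirical bins below the lower bound, which do not
            # enter the tail fit while the lower bound is held fixed.
            if rng.random() < n_tail / n:
                i = int(math.log(1.0 - rng.random()) / log_q)  # geometric bin index
                synth[i] = synth.get(i, 0) + 1
        synth_tail = [synth.get(i, 0) for i in range(max(synth) + 1)]
        synth_alpha = alpha_log_bins(synth_tail, c)
        if ks_distance(synth_tail, synth_alpha, c) >= d_obs:
            exceed += 1
    return exceed / reps

counts = [500, 250, 125, 63, 31, 16, 8, 4, 2, 1]  # made-up counts, edge ratio c = 2
print(p_value(counts, 0, 2.0))
```

A large p-value means the synthetic data sets often deviate from their own fits at least as much as the observed data do, so the power law cannot be ruled out; a small one means the observed deviation is unusually large.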

Comparing to alternative distributions
Using the maximum likelihood estimator for each alternative distribution (exponential, stretched exponential, log-normal, and power law with exponential cutoff), these functions fit the corresponding distribution to binned data. Additional information about usage can be obtained by typing 'help name_of_function' at the Matlab command window.
bexpnfit.m (Matlab, by Yogesh Virkar)
bstexpfit.m (Matlab, by Yogesh Virkar)
blgnormfit.m (Matlab, by Yogesh Virkar)
bplcutfit.m (Matlab, by Yogesh Virkar)

Calculating the likelihood ratio
This function implements the log-likelihood ratio test explained in the paper to compare different fitted models. Additional information about usage can be obtained by typing 'help blrtest' at the Matlab command window.
blrtest.m (Matlab, by Yogesh Virkar)
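
The core of a log-likelihood ratio test on binned data can be sketched as follows in Python (this is not the code in blrtest.m). The sketch assumes the per-bin probabilities of both models have already been computed from their fitted parameters, and it uses the standard normalized ratio with a two-sided normal approximation for the p-value. The function name and argument layout are hypothetical.

```python
import math

def likelihood_ratio_test(counts, probs_a, probs_b):
    """Return (R, p): R > 0 favors model A, R < 0 favors model B.

    counts[i] is the number of observations in bin i; probs_a[i] and probs_b[i]
    are the probabilities each fitted model assigns to that bin.
    """
    n = sum(counts)
    # Log-likelihood ratio contributed by a single observation in each bin.
    ratios = [math.log(pa) - math.log(pb) for pa, pb in zip(probs_a, probs_b)]
    R = sum(h * r for h, r in zip(counts, ratios))
    mean = R / n
    var = sum(h * (r - mean) ** 2 for h, r in zip(counts, ratios)) / n
    if var == 0.0:
        return R, 1.0  # the two models are indistinguishable on these bins
    # Two-sided p-value for the hypothesis that the true ratio is zero.
    p = math.erfc(abs(R) / math.sqrt(2.0 * n * var))
    return R, p

# Made-up counts in bins with edges 1, 2, 4, ..., 1024.
counts = [500, 250, 125, 63, 31, 16, 8, 4, 2, 1]
edges = [2.0 ** i for i in range(11)]
power_law = [1.0 / lo - 1.0 / hi for lo, hi in zip(edges[:-1], edges[1:])]  # alpha = 2
exponential = [math.exp(-0.1 * (lo - 1.0)) - math.exp(-0.1 * (hi - 1.0))
               for lo, hi in zip(edges[:-1], edges[1:])]                    # rate 0.1
print(likelihood_ratio_test(counts, power_law, exponential))
```

The sign of R is only meaningful when p is small; a large p means the data cannot distinguish the two models.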

Calculating probability density functions
This function calculates the probability density function for a specified type of model. Finding PDFs can be useful for model comparison (see 'blrtest.m'). Additional information about usage can be obtained by typing 'help getPDF' at the Matlab command window.
getPDF.m (Matlab, by Yogesh Virkar)
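
For binned data, the quantity that enters a model comparison is the probability mass each model assigns to a bin. The Python sketch below (not the code in getPDF.m) computes these masses for a power law and for an exponential, both conditioned on values at or above the first bin edge. The function names and parameterizations are assumptions made for illustration.

```python
import math

def powerlaw_bin_probs(edges, alpha):
    """Probability of each bin [edges[i], edges[i+1]) under a power law with exponent alpha."""
    bmin = edges[0]
    return [(lo / bmin) ** (1.0 - alpha) - (hi / bmin) ** (1.0 - alpha)
            for lo, hi in zip(edges[:-1], edges[1:])]

def exponential_bin_probs(edges, lam):
    """Probability of each bin under an exponential with rate lam, shifted to start at edges[0]."""
    bmin = edges[0]
    return [math.exp(-lam * (lo - bmin)) - math.exp(-lam * (hi - bmin))
            for lo, hi in zip(edges[:-1], edges[1:])]

edges = [1.0, 2.0, 4.0, math.inf]  # the open-ended last bin makes the masses sum to 1
print(powerlaw_bin_probs(edges, 2.0))
print(exponential_bin_probs(edges, 0.3))
```

If the last bin edge is finite, the masses sum to less than 1 and would need renormalizing before being compared across models.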

Download all files
All the functions implemented above are available as a single downloadable zip file here.
Full Matlab package (by Yogesh Virkar)

A note about bugs and alternative implementations
The code here is provided as-is, with no warranty and no guarantee of technical support or maintenance. If you experience problems while using the code, please let us know via email. We are also happy to host (or link to) implementations of any of these functions in other programming languages, in the interest of facilitating their more widespread use. That said, all such code also comes with no warranty. If you have questions about any of the implementations, please contact the respective function's author.
Finally, if you use our code in an academic publication, it would be courteous of you to thank Yogesh in your acknowledgements for providing you with implementations of the methods.

Data sets used
All data sets used in the paper are either previously published or are available online.

  1. Estimated number of personnel in a terrorist organization, binned by powers of ten, except that the first two bins are merged.
    Link to the data (mirror).

    V. Asal and R. K. Rethemeyer. "The Nature of the Beast: Organizational structures and the lethality of terrorist attacks." Journal of Politics 70(2):437-449 (2008).


  2. Diameter of branches in the plant species Cryptomeria, binned in 30mm intervals.

    Download the data.

    K. Shinozaki, K. Yoda, K. Hozumi, and T. Kira, "A quantitative analysis of plant form-The Pipe Model Theory II: Further evidence of the theory and its application in forest ecology." Japanese Journal of Ecology 14(2):133-139 (1964).

  3. Volume of ice in an iceberg calving event.

    Contact authors for data.

    A. Chapuis and T. Tetzlaff, "The variability of tidewater-glacier calving: origin of event-size and interval distributions." E-print, arXiv:1205.1640 (2012).

  4. Length of a patient's hospital stay within a year.
    Contact authors for data.

    Heritage Provider Network. Heritage Health Prize Data Files, HHP_release3 (2012).

  5. Wind speed (mph) of a tornado in the United States from 2007 to 2011, binned into categories according to the Enhanced Fujita (EF) scale, a roughly logarithmic binning scheme.
    Link to the data.

    Storm Prediction Center, Severe Weather Database Files (1950-2011) (2011).

  6. Maximum wind speed (knots) of tropical storms and hurricanes in the United States between 1949 and 2010.
    Link to the data.

    B. Jarvinen, C. Neumann, and M.A.S. Davis, NHC Data Archive. National Hurricane Center (2012).

  7. The human population of U.S. cities in the 2000 U.S. Census.
    Download the data.

  8. The sizes in acres of wildfires occurring on US federal land between 1986 and 1996.
    Download the data.

    M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46, 323 (2005).

  9. The intensities of earthquakes occurring in California between 1910 and 1992, measured as the maximum amplitude of motion during the quake.
    Download the data. (Magnitudes on the Gutenberg-Richter scale.)

    M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law." Contemporary Physics 46, 323 (2005).

  10. Area (sq. km) of glaciers in Scandinavia.
    Link to the data.

    World Glacier Monitoring Service and National Snow and Ice Data Center. World Glacier Inventory (2012).

  11. Number of cases per 100,000 of various rare diseases.
    Link to the data.

    Orphanet Report Series, Rare Diseases collection. Prevalence of rare diseases: Bibliographic data (2011).

  12. Number of genes associated with a disease.
    Link to the data (Table 1).

    K. Goh, M. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabasi, "The human disease network." Proc. Nat. Acad. Sci. USA 104(21), 8685-8690 (2007).

Updates
8 June 2015: corrected a small bug in the final calculation of the likelihood of the fit in bplfit; this bug did not affect any other aspect of the calculation. Thanks to Babak Fotouhi for finding it.
5 September 2012: data information posted.
3 September 2012: v1.0 of code posted.
16 July 2012: initial page created.