Benford's Law, 2013 and 2025

Wikipedia has this entry for Benford’s Law, which deals with the distribution of leading digits of a collection of samples.

In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time.
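Those percentages come from a short formula: the expected fraction of samples with leading digit d is log10(1 + 1/d). Here is a minimal Go sketch that prints the expected percentage for each digit. The function name benfordPercent is made up for this illustration.

```go
package main

import (
	"fmt"
	"math"
)

// benfordPercent returns the percentage of samples that Benford's Law
// predicts will have d (1 through 9) as their leading decimal digit.
func benfordPercent(d int) float64 {
	return 100 * math.Log10(1+1/float64(d))
}

func main() {
	for d := 1; d <= 9; d++ {
		fmt.Printf("%d %5.1f%%\n", d, benfordPercent(d))
	}
}
```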

In 2013, I empirically verified Benford’s Law for file sizes on four Linux machines. 2013 CE seems like centuries ago. I revisited the topic.

2013 Data

Digit A B C D
0 7901 1236 1233 265
1 63691 91489 25857 11524
2 50970 80713 15732 6791
3 24497 42356 11096 5472
4 18566 32831 8731 4510
5 14738 27335 6390 2787
6 12079 24884 4989 2188
7 12737 22963 4025 1875
8 12578 22362 5503 1781
9 10716 18541 3247 2168
Files 228473 364710 86803 39361

I characterized machines A through D like this:

  • Machine A: x86 Arch Linux development laptop
  • Machine B: Slackware Linux 12.0, x86, SMTP, HTTP, NTP, DHCP server
  • Machine C: Freshly installed Arch Linux x86 SMTP, HTTP, NTP, DHCP server
  • Machine D: Corporate Developer’s $HOME directory, x86_64 RHEL 5.0 server

At that point, I had been a Slackware user since about 2003. In 2012, I did Linux From Scratch, and then tried Arch Linux on a laptop. In October of 2013, I was in the process of spinning up a new Arch Linux server, so I had my previous Slackware server on hand.

The data can be visualized like this:

Linux file sizes, percentage vs leading digit

The yellow bars show the exact values of initial digit percentages according to Benford’s Law. Not a perfect fit, but good enough to say that in 2013, Linux file sizes fit Benford’s Law.

Software

Back in 2013, I wrote a plain C program to walk an entire Linux file system, get file sizes, and count files and leading digits (base 10) of file sizes. It’s 73 lines of code. It uses the GNU version of the ftw() C library function, because it’s better behaved than the POSIX version.

In 2025, 12 years later, that same C code compiles with only 2 warnings:

1743 % cc -g -Wall -Wextra -O -o ben2 ben2.c
ben2.c: In function 'main':
ben2.c:20:10: warning: unused parameter 'ac' [-Wunused-parameter]
   20 | main(int ac, char **av)
      |      ~~~~^~
ben2.c: In function 'per_file':
ben2.c:44:21: warning: unused parameter 'ftwbuf' [-Wunused-parameter]
   44 |         struct FTW *ftwbuf
      |         ~~~~~~~~~~~~^~~~~~

If you are extraordinarily disciplined, you can write C code that’s portable through time, and across operating systems and compiler versions. In 2013, machines A, B and C were 32-bit. All machines I used in 2025 ran 64-bit software, but this time on 2 instruction sets.

2025 Investigation

  • Machine E: Qotom fanless server, an Intel x86_64 machine, running Arch Linux. I’ve run pacman -Syu on it 163 times since February of 2024.
  • Machine F: Dell R530 rack mounted server. I’ve run pacman -Syu 347 times since 2020-07-16.
  • Machine G: Dell E7470 laptop, set up for software development, I’ve run pacman -Syu 428 times since 2021-08-01.
  • Machine H: Apple MacBook Air, Apple M2 silicon
  • Machine J: ASUS AX6000 TUF WiFi “gaming router” running OpenWrt 23.05.5
  • Machine K: Linksys Velop WHW03 V1 WiFi running OpenWrt 24.10.0
  • Machine L: Bosgame Ecolite E2 mini-PC, new-ish, 41 invocations of pacman -Syu

Digit E F G H J K L
1 50946 146305 313394 7931045 866 919 30109
2 28876 85669 174066 4823367 252 320 16645
3 30083 67454 136665 2609875 414 359 13405
4 17990 53372 94731 2268817 257 337 12129
5 14380 39915 87410 1499538 129 194 9157
6 9774 30878 60708 1274907 296 117 6254
7 7901 26062 47971 1334309 93 98 4617
8 8821 23278 47806 1278795 68 92 4688
9 6412 22613 38171 1028309 70 78 4726
0 10402 4708 16181 42772 63 60 403

2025 Linux file size leading digit data visualized

Looks like Linux file size data still fits Benford’s Law fairly closely. I’ll admit to an Arch Linux-heavy selection, but I don’t think that matters much, because OpenWrt and macOS also fit closely.

The worst deviation is at digit 4, which comes up as a leading digit more often than Benford’s Law predicts, both in 2013 and 2025.
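To put numbers on a deviation like that, the counts in a table column can be turned into percentages and set against the Benford values. The sketch below does this for the Machine E column of the 2025 table. It assumes that zero-length files (leading digit 0) are left out of the denominator, since Benford's Law only covers digits 1 through 9. That may not be how the charts were made, and observedPercents is a name invented for the example.

```go
package main

import (
	"fmt"
	"math"
)

// observedPercents converts counts[1..9] into percentages of all files
// with a nonzero leading digit. counts[0] (zero-length files) is ignored,
// which is an assumption of this sketch.
func observedPercents(counts [10]int) [10]float64 {
	total := 0
	for d := 1; d <= 9; d++ {
		total += counts[d]
	}
	var pct [10]float64
	if total == 0 {
		return pct
	}
	for d := 1; d <= 9; d++ {
		pct[d] = 100 * float64(counts[d]) / float64(total)
	}
	return pct
}

func main() {
	// Machine E column from the 2025 table, indexed by leading digit.
	machineE := [10]int{10402, 50946, 28876, 30083, 17990, 14380, 9774, 7901, 8821, 6412}
	pct := observedPercents(machineE)
	fmt.Println("digit observed% benford% difference")
	for d := 1; d <= 9; d++ {
		expected := 100 * math.Log10(1+1/float64(d))
		fmt.Printf("%d %6.2f %6.2f %+6.2f\n", d, pct[d], expected, pct[d]-expected)
	}
}
```

A positive number in the last column means the digit shows up more often than Benford's Law predicts.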

File Size Distribution

The distribution of files I found conforms to the stereotype of lots of small files, and a few larger files. It looks exponential or maybe log-normal.

Dell R530 exponential file size distribution

That’s a histogram of the smallest 75% of file sizes on my Dell R530. It’s representative of the machines I examined in 2025. The largest file represented in the histogram is 9,999 bytes. The largest file on that machine, which doesn’t appear in the histogram, is 1,869,047,133 bytes. Without trimming off the largest 25% of the files, the histogram would have a single tall bucket at about 0, and the remainder of the buckets very short.
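The trimming step can be sketched in Go as: sort the sizes, keep the smallest 75%, then count what is left in buckets. This is only an illustration of the idea, not necessarily how the chart above was produced. The function name trimmedHistogram and the choice of equal-width buckets are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// trimmedHistogram sorts sizes, keeps the smallest "keep" fraction of them,
// and counts the survivors in equal-width buckets. It returns the bucket
// counts and the bucket width in bytes.
func trimmedHistogram(sizes []int64, keep float64, buckets int) ([]int, int64) {
	sorted := append([]int64(nil), sizes...) // copy, so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	n := int(float64(len(sorted)) * keep)
	if n > len(sorted) {
		n = len(sorted)
	}
	if n <= 0 || buckets <= 0 {
		return nil, 0
	}
	sorted = sorted[:n]
	largest := sorted[n-1]
	// The +1 guarantees that the largest kept size lands in the last bucket.
	width := largest/int64(buckets) + 1
	hist := make([]int, buckets)
	for _, s := range sorted {
		hist[s/width]++
	}
	return hist, width
}

func main() {
	// Made-up sizes: mostly small files plus two very large ones.
	sizes := []int64{120, 300, 45, 9000, 2200, 80, 610, 1869047133, 15, 750000, 4096, 512}
	hist, width := trimmedHistogram(sizes, 0.75, 5)
	for i, count := range hist {
		fmt.Printf("%6d-%6d bytes: %d\n", int64(i)*width, int64(i+1)*width-1, count)
	}
}
```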

Software - 2025

This time around, I wrote a Go program. I knew I wanted to run it on a number of different architectures, and two OSes, so portability seemed important. Performance was not paramount, because I was only going to run the program two or three times per machine.

Code repo
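The repo has the real program. For readers who just want the shape of the idea, here is a minimal sketch, not the author's code, of walking a directory tree in Go and tallying leading digits of file sizes. It assumes only regular files are counted and that unreadable entries are silently skipped. The names leadingDigit and countLeadingDigits are invented for this example.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// leadingDigit returns the first base-10 digit of a file size.
// A zero-length file yields 0.
func leadingDigit(size int64) int {
	for size >= 10 {
		size /= 10
	}
	return int(size)
}

// countLeadingDigits walks the tree under root and tallies the leading
// digit of every regular file's size. Entries that can't be read are skipped.
func countLeadingDigits(root string) (counts [10]int, files int) {
	_ = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip what we can't read, keep walking
		}
		if !d.Type().IsRegular() {
			return nil // directories, symlinks, devices and so on
		}
		info, err := d.Info()
		if err != nil {
			return nil
		}
		counts[leadingDigit(info.Size())]++
		files++
		return nil
	})
	return counts, files
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	counts, files := countLeadingDigits(root)
	for d, n := range counts {
		fmt.Printf("%d %d\n", d, n)
	}
	fmt.Printf("Files %d\n", files)
}
```

Run it with a directory as the only argument. With no argument it walks the current directory.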

Portability

I made four executables, all using the same source code. For the cross-compiled ones, all I had to do was specify a target operating system and Instruction Set Architecture.

  • go build benford.go for machines E, F, G, L
  • GOOS=linux GOARCH=arm64 go build -o benford_aarch64 benford.go for machine J
  • GOOS=linux GOARCH=arm go build -o benford_arm benford.go for machine K
  • GOOS=darwin GOARCH=arm64 go build -o benford_macos benford.go for machine H

Performance

Machine File Count Time to find sizes files/sec Storage type
E 185585 5.0 sec 37117 SATA HDD
F 500254 50.3 sec 9945 SAS HDD
G 1017103 44.2 sec 23011 SATA SSD
H 24091734 1734 sec 13894 Apple SSD
J 2508 0.6 sec 4108 256NAND FLASH
K 2574 0.5 sec 5148 eMMC FLASH
L 102133 1.0 sec 102133 NVMe

Alas, I didn’t keep performance data in 2013. It’s pretty clear that SSDs are faster than hard disks, and NVMe devices are faster than SATA SSDs. That is most likely a consequence of NVMe devices being directly attached to the PCIe bus.