Benford's Law, 2013 and 2025
Wikipedia has this entry for Benford’s Law, which deals with the distribution of leading digits of a collection of samples.
In this distribution, the digit 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less and less frequently: 9 appears as the first digit less than 5% of the time.
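For reference, Benford’s Law predicts that leading digit d turns up with probability log10(1 + 1/d). Here’s a short Go sketch, separate from the measurement programs described below, that prints those expected percentages:

```go
// Print the leading-digit frequencies that Benford's Law predicts:
// P(d) = log10(1 + 1/d) for d = 1..9.
package main

import (
	"fmt"
	"math"
)

func main() {
	for d := 1; d <= 9; d++ {
		p := math.Log10(1 + 1/float64(d))
		fmt.Printf("digit %d: %5.2f%%\n", d, 100*p)
	}
}
```

That works out to roughly 30.1% for a leading 1, down to about 4.6% for a leading 9.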
In 2013, I empirically verified Benford’s Law for file sizes on four Linux machines. 2013 CE seems like centuries ago. I revisited the topic.
2013 Data
| Digit | A | B | C | D |
|---|---|---|---|---|
| 0 | 7901 | 1236 | 1233 | 265 |
| 1 | 63691 | 91489 | 25857 | 11524 |
| 2 | 50970 | 80713 | 15732 | 6791 |
| 3 | 24497 | 42356 | 11096 | 5472 |
| 4 | 18566 | 32831 | 8731 | 4510 |
| 5 | 14738 | 27335 | 6390 | 2787 |
| 6 | 12079 | 24884 | 4989 | 2188 |
| 7 | 12737 | 22963 | 4025 | 1875 |
| 8 | 12578 | 22362 | 5503 | 1781 |
| 9 | 10716 | 18541 | 3247 | 2168 |
| Files | 228473 | 364710 | 86803 | 39361 |
I characterized machines A through D like this:
- Machine A: x86 Arch Linux development laptop
- Machine B: Slackware Linux 12.0, x86, SMTP, HTTP, NTP, DHCP server
- Machine C: Freshly installed Arch Linux x86 SMTP, HTTP, NTP, DHCP server
- Machine D: Corporate Developer’s $HOME directory, x86_64 RHEL 5.0 server
I had been a Slackware user since about 2003 at that point. In 2012, I did Linux From Scratch, and then tried Arch Linux on a laptop. In October of 2013, I was in the process of spinning up a new Arch Linux server, so I had my previous Slackware server on hand.
The data can be visualized like this:

The yellow bars show the exact values of initial digit percentages according to Benford’s Law. Not a perfect fit, but good enough to say that in 2013, Linux file sizes fit Benford’s Law.
Software
Back in 2013, I wrote a plain C program to walk an entire Linux file system, get file sizes, and count files and the leading digits (base 10) of those file sizes. It’s 73 lines of code. It uses the GNU version of the ftw() library function, because it’s better behaved than the POSIX version.
In 2025, 12 years later, that same C code compiles with only 2 warnings:
```
1743 % cc -g -Wall -Wextra -O -o ben2 ben2.c
ben2.c: In function 'main':
ben2.c:20:10: warning: unused parameter 'ac' [-Wunused-parameter]
   20 | main(int ac, char **av)
      |      ~~~~^~
ben2.c: In function 'per_file':
ben2.c:44:21: warning: unused parameter 'ftwbuf' [-Wunused-parameter]
   44 | struct FTW *ftwbuf
      | ~~~~~~~~~~~~^~~~~~
```
If you are extraordinarily disciplined, you can write C code that’s portable through time, and across operating systems and compiler versions. In 2013, machines A, B and C were 32-bit. All machines I used in 2025 ran 64-bit software, but this time on 2 instruction sets.
2025 Investigation
- Machine E: Qotom fanless server, an Intel x86_64 machine, running Arch Linux. I’ve run `pacman -Syu` on it 163 times since February of 2024.
- Machine F: Dell R530 rack mounted server. I’ve run `pacman -Syu` 347 times since 2020-07-16.
- Machine G: Dell E7470 laptop, set up for software development. I’ve run `pacman -Syu` 428 times since 2021-08-01.
- Machine H: Apple MacBook Air, Apple M2 silicon
- Machine J: ASUS AX6000 TUF WiFi “gaming router” running OpenWrt 23.05.5
- Machine K: Linksys Velop WHW03 V1 WiFi running OpenWrt 24.10.0
- Machine L: Bosgame Ecolite E2 mini-PC, new-ish, 41 invocations of `pacman -Syu`
| Digit | E | F | G | H | J | K | L |
|---|---|---|---|---|---|---|---|
| 1 | 50946 | 146305 | 313394 | 7931045 | 866 | 919 | 30109 |
| 2 | 28876 | 85669 | 174066 | 4823367 | 252 | 320 | 16645 |
| 3 | 30083 | 67454 | 136665 | 2609875 | 414 | 359 | 13405 |
| 4 | 17990 | 53372 | 94731 | 2268817 | 257 | 337 | 12129 |
| 5 | 14380 | 39915 | 87410 | 1499538 | 129 | 194 | 9157 |
| 6 | 9774 | 30878 | 60708 | 1274907 | 296 | 117 | 6254 |
| 7 | 7901 | 26062 | 47971 | 1334309 | 93 | 98 | 4617 |
| 8 | 8821 | 23278 | 47806 | 1278795 | 68 | 92 | 4688 |
| 9 | 6412 | 22613 | 38171 | 1028309 | 70 | 78 | 4726 |
| 0 | 10402 | 4708 | 16181 | 42772 | 63 | 60 | 403 |

Looks like Linux file size data still fits Benford’s Law fairly closely. I’ll admit to an Arch Linux heavy selection, but I don’t think that matters much, because OpenWrt and macOS also fit closely.
The worst deviation is at 4: it shows up as a leading digit more often than Benford’s Law predicts, both in 2013 and in 2025.
File Size Distribution
The distribution of file sizes I found conforms to the stereotype: lots of small files and a few larger ones. It looks exponential, or maybe log-normal.

That’s a histogram of the smallest 75% of file sizes on my Dell R530. It’s representative of the machines I examined in 2025. The largest file represented in the histogram is 9,999 bytes. The largest file on that machine, which doesn’t appear in the histogram, is 1,869,047,133 bytes. Without trimming off the largest 25% of the files, the histogram would have a single tall bucket at about 0, and the remainder of the buckets would be very short.
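To make the trimming concrete, here’s a small Go sketch of the same operation. The sizes slice and the 1000-byte bin width are invented examples, not the script or the data behind the histogram above.

```go
// Sketch of the trimming described above: sort a slice of file sizes,
// keep the smallest 75%, and bucket what's left. The sizes and the
// 1000-byte bin width are made-up examples, not the R530 data.
package main

import (
	"fmt"
	"sort"
)

func main() {
	sizes := []int64{0, 37, 512, 812, 2048, 4096, 9999, 65536, 1869047133}

	sort.Slice(sizes, func(i, j int) bool { return sizes[i] < sizes[j] })
	kept := sizes[:len(sizes)*3/4] // drop the largest 25% so they don't flatten everything

	const bin = 1000
	counts := make(map[int64]int)
	for _, s := range kept {
		counts[s/bin]++
	}
	for b := int64(0); b <= kept[len(kept)-1]/bin; b++ {
		fmt.Printf("%5d-%5d bytes: %d files\n", b*bin, (b+1)*bin-1, counts[b])
	}
}
```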
Software - 2025
This time around, I wrote a Go program. I knew I wanted to run it on a number of different architectures and two OSes, so portability seemed important. Performance was not paramount; I was only going to run the program two or three times per machine.
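The core of such a program is just a recursive walk plus a tally of first digits. A minimal Go sketch of that idea, using filepath.WalkDir (this is an illustration, not the actual benford.go):

```go
// Minimal sketch of the idea: walk a file tree and tally the leading
// base-10 digit of every regular file's size. Not the actual benford.go.
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	root := "/"
	if len(os.Args) > 1 {
		root = os.Args[1]
	}

	var counts [10]int64 // counts[0] ends up holding zero-byte files
	var total int64

	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return nil // skip unreadable entries instead of aborting the walk
		}
		if !d.Type().IsRegular() {
			return nil // ignore directories, symlinks, devices, etc.
		}
		info, err := d.Info()
		if err != nil {
			return nil
		}
		total++
		s := strconv.FormatInt(info.Size(), 10)
		counts[s[0]-'0']++ // first character of the decimal size string
		return nil
	})

	for digit, n := range counts {
		fmt.Printf("%d %d\n", digit, n)
	}
	fmt.Printf("files %d\n", total)
}
```

A real run would also have to decide what to do about pseudo-filesystems like /proc and /sys; the sketch just walks whatever it’s pointed at and ignores errors.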
Portability
I made four executables, all from the same source code. All I had to do was specify a target operating system and instruction set architecture.
- `go build benford.go` for machines E, F, G, L
- `GOOS=linux GOARCH=arm64 go build -o benford_aarch64 benford.go` for machine J
- `GOOS=linux GOARCH=arm go build -o benford_arm benford.go` for machine K
- `GOOS=darwin GOARCH=arm64 go build -o benford_macos benford.go` for machine H
Performance
| Machine | File Count | Time to find sizes | files/sec | Storage type |
|---|---|---|---|---|
| E | 185585 | 5.0 sec | 37117 | SATA HDD |
| F | 500254 | 50.3 sec | 9945 | SAS HDD |
| G | 1017103 | 44.2 sec | 23011 | SATA SSD |
| H | 24091734 | 1734 sec | 13894 | Apple SSD |
| J | 2508 | 0.6 sec | 4108 | 256NAND FLASH |
| K | 2574 | 0.5 sec | 5148 | eMMC FLASH |
| L | 102133 | 1.0 sec | 102133 | NVMe |
Alas, I didn’t keep performance data in 2013. It’s pretty clear that SSDs are faster than hard disks, and NVMe devices are faster than SATA SSDs. That has to be a consequence of NVMe devices being attached directly to the PCIe bus.
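For what it’s worth, the files/sec column is essentially the file count divided by the elapsed wall-clock time of the walk. A self-contained Go sketch of that measurement (again, not the actual benford.go):

```go
// Sketch: time a walk of the file tree and report files/sec, i.e. the
// file count divided by the elapsed seconds.
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
	"time"
)

func main() {
	root := "/"
	start := time.Now()

	var count int64
	filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err == nil && d.Type().IsRegular() {
			count++
		}
		return nil
	})

	elapsed := time.Since(start).Seconds()
	fmt.Printf("%d files in %.1f sec, %.0f files/sec\n",
		count, elapsed, float64(count)/elapsed)
}
```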