Mergesort Investigation Summary
- Problem statement and some experimentation
- Explanation of the algorithm I came up with in 2021.
- Recursive merge sort algorithm to have something to compare
- Re-using linked lists while benchmarking
- Effect of address-ordered linked lists
- Traversing linked lists does not exhibit weird performance drops at specific list lengths
- Effect of size of linked list nodes tests whether virtual memory is somehow involved. It is not.
- Bottom-up implementation with linked lists from Wikipedia has the same performance drops as my July 2021 algorithm
- Memory access effects: incrementing .Data values or arbitrarily changing .Next pointers does not cause the performance drops, and looking for other anomalies
- Effect of garbage collection before preparing a new, unsorted, linked list
- The memory access patterns of the recursive and Wikipedia bottom up algorithms are identical for linked lists that are powers-of-2 long. For other list lengths, the memory access patterns are not identical.
- I wrote an iterative, explicit stack frame variation of recursive mergesort to try to get the same abrupt performance drops as the Wikipedia bottom up algorithm. No joy.
- Recursive mergesort variations that explore specific differences between iterative and recursive algorithms do not impart any revelations.
- Counts of comparison operations finally reveal the cause of the performance drops.
- More counts of comparison operations, this time featuring pre-sorted and reverse-sorted initial linked lists
The Answer
My July 2021 iterative algorithm and the Wikipedia bottom up algorithm both work with pieces of the unsorted list that are powers of 2 in length. For a list of 2^N+1 nodes, the main part of the algorithm sorts the first 2^N nodes of the original list. The “+1” node, the last node in the original list, gets merged with the first 2^N nodes after they’ve been mergesorted. On average, it takes 2^N/2 comparisons to find the slot in the newly merged list for the “+1” node.
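To make that final merge concrete, here is a minimal Go sketch, not the benchmarked code, of merging the “+1” node into the already-sorted 2^N nodes. The function name mergeOneNode and the comparison counter are mine; the Node type uses the .Data and .Next field names from the rest of this writeup.

```go
type Node struct {
	Data int
	Next *Node
}

// mergeOneNode splices the single leftover node into an already-sorted
// list, returning the new head and the number of comparisons performed.
// For randomly-chosen data the loop runs about halfway down the sorted
// list before the leftover node finds its slot: roughly 2^N/2 comparisons.
func mergeOneNode(head, extra *Node) (*Node, int) {
	comparisons := 0
	var dummy Node
	tail := &dummy
	for head != nil && extra != nil {
		comparisons++
		if extra.Data < head.Data {
			tail.Next = extra
			tail = extra
			extra = nil // placed; the rest of the sorted list attaches below
		} else {
			tail.Next = head
			tail = head
			head = head.Next
		}
	}
	if extra != nil {
		tail.Next = extra // the leftover node was larger than everything
	} else {
		tail.Next = head // attach whatever sorted nodes remain
	}
	return dummy.Next, comparisons
}
```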
All of my recursive mergesort variations handle 2^N+1 length lists, and indeed any odd-length list, differently. They split the 2^N+1 length list into sublists of lengths 2^N/2 and (2^N/2)+1. The last node in the list gets sorted with ordinary merges. There are a few extra comparisons made relative to a list of length 2^N, and a few extra .Next pointer changes. That last node is always in a sorted location in the smaller sublists.
Why didn’t I see the answer sooner?
I first did the Daily Coding Problem asking for a mergesort algorithm of a linked list with O(1) extra space in July of 2021. I did some fiddling around back then, ran it on multiple machines, transliterated to C, to try to figure out the performance drops. I started deliberately working to find an answer in June of 2024. I got an answer that makes sense in November of 2024. I tried many things, some of them pointed me in the right direction. Why didn’t I see the answer sooner?
The first discovery I made that pointed me in the correct direction was that exactly 2^N+1 length lists exhibited the performance drop. There’s no gradient from good to bad performance: adding a single node to a list with randomly-chosen data values causes a performance drop.
The second discovery was that a recursive mergesort algorithm didn’t have the performance drop. There was something peculiar to my July 2021 algorithm that caused it.
Finding out that re-using a linked list for multiple sortings caused an overall performance drop for the July 2021 iterative and recursive algorithms should have eliminated a lot of alternative answers.
Finding out that an address-ordered initial node ordering by .Next pointers increased performance should have reinforced that Linux’s virtual memory subsystem wasn’t to blame. Besides, the same performance oddities with re-using lists and address-ordered initial lists showed up in MacOS benchmarks. The OS isn’t to blame.
Transliterating the Wikipedia bottom up algorithm to Go and discovering that it, too, had performance drops at list lengths of 2^N+1 nodes should have been an enormous clue. Reading the code convinced me that this algorithm accessed memory (reading .Data values, updating .Next pointers) in the same pattern that the recursive algorithm did.
It was only when I realized that I wasn’t taking into account the “roll up” of 2^N sized lists at the end of the bottom up algorithm that I figured out that any algorithm that worked with 2^N sized sublists would have to deal with the last node of odd length lists all on its own.
After that realization, the pieces fell into place.
When originally transliterating Wikipedia’s pseudo-code, I had some doubts that it would deal with odd-length lists correctly, but trying it with shorter odd-length lists and seeing it work made me put those doubts behind me. That was a mistake. I should have examined that doubt closely, or at least kept it in mind.
The fact that sorting pre-sorted and reverse pre-sorted lists did not show the performance drops was misleading. I now believe the two initial list conditions don’t show performance drops at specific list lengths due to two different causes. The visual appearance of the benchmarking for both kinds of initial list data values clouded my perception.
I believe that cognitive biases, and limitations of my own intelligence, kept me from seeing the cause. The “roll up” of 2^N sized lists as a place where special cases could exist really didn’t dawn on me until I looked at counts of comparisons. It’s difficult to keep special cases like that in your head, which is why I re-wrote the list merging code to do its work with pointers-to-pointers instead of having a slightly faster if-then-else block. I know that exceptions and special cases lead to problems, usually bugs, but in this case, performance anomalies.
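For illustration, here is a sketch of that pointers-to-pointers style of merge, not necessarily line-for-line what’s in the repo: a single splice site replaces the if-then-else special casing of which list supplies the head of the merged result. It uses the same Node type as the earlier sketch.

```go
// merge combines two sorted lists. pp always addresses the slot (the head
// variable at first, then some node's .Next field) that receives the next
// smallest node, so there is no special case for the head of the result
// or for either list running out partway through the merge.
func merge(a, b *Node) *Node {
	var head *Node
	pp := &head
	for a != nil && b != nil {
		if a.Data <= b.Data {
			*pp = a
			pp = &a.Next
			a = a.Next
		} else {
			*pp = b
			pp = &b.Next
			b = b.Next
		}
	}
	// whichever list still has nodes is already sorted; splice it in whole
	if a != nil {
		*pp = a
	} else {
		*pp = b
	}
	return head
}
```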
Things I learned solving this puzzle
- Recursive is at least as fast as iterative in this case. Again.
- Sometimes in-lining code really does perform better.
- Memory location of a data structure matters.
- Go’s heap allocator gives back smaller allocations from 8192-byte chunks.
- Mergesort truly is O(n log n) in time complexity.
- Different machines really do have differences in performance, sometimes incomprehensibly so.
- Detecting anomalies, like abrupt performance drops, isn’t particularly objective
- Output files need to have their provenance in them. Don’t depend on file names.
- The pattern of memory accesses (read .Data and possibly update .Next pointer) isn’t the problem. Recursive mergesort doesn’t share the performance anomalies, despite mostly identical memory access patterns to Wikipedia’s bottom up algorithm for power-of-2 length lists.
- Two different CPUs/ISAs show the same performance anomalies, because the anomaly derives from the afflicted algorithms leaving the last node in the list for the final merge.
- Linked list node size doesn’t affect where the performance anomalies occur.
- Layout or order of linked list nodes affects performance generally, but does not change the list lengths where anomalies occur.
- Garbage collection after each benchmark iteration doesn’t affect where anomalies occur.
- That feeling of “That’s the solution!” is well worth the work.
Software Engineering Notes
Recursive vs Iterative
I found once again that a purely recursive algorithm is just as fast as a non-recursive, iterative algorithm. I’ve only experienced a few situations where an iterative algorithm is faster than the equivalent recursive algorithm. In the case of mergesort, the iterative algorithms also had performance blips that were very difficult to suss out, which decreased my confidence in the correctness of my implementation.
Data Provenance
I found that keeping data in human-readable form is a vast benefit. You can look at the data with a text editor to verify that a graph is correct. You can embed comments in the data to keep track of what the data represents, and how it was generated.
What metadata I put in the output changed. At first, it was simple: two columns, tab-separated, linked list length and the mean of 10 repetitions of whatever sorting algorithm. Then I added a third column, the total time of all 10 repetitions, so I could get an idea of the overhead of preparing lengthy linked lists, possibly sorting the lengthy linked lists on their addresses, etc. I added fourth and fifth columns, the minimum and maximum time it took to sort a list. I started adding provenance: info like the time-of-day the benchmarking started, beginning list length, increment, and ending list length. I added more and more provenance: a timestamp when the entire run started, which machine it was running on, which mergesort algorithm, any linked list preparation before sorting. I believe the final provenance I added was a timestamp at the very end of the data noting when the run finished.
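As an illustration only, here is a hypothetical sketch of writing that kind of provenance. The package name, function name, parameters, and exact comment layout are made up for this example, not the repo’s actual output format.

```go
package bench // hypothetical package name for this sketch

import (
	"fmt"
	"io"
	"time"
)

// writeProvenance emits comment lines describing a benchmark run, then the
// names of the tab-separated columns that the data lines will follow.
func writeProvenance(w io.Writer, machine, algorithm, listPrep string, begin, increment, end int) {
	fmt.Fprintf(w, "# run started %s on %s\n", time.Now().Format(time.RFC3339), machine)
	fmt.Fprintf(w, "# algorithm: %s, list preparation: %s\n", algorithm, listPrep)
	fmt.Fprintf(w, "# list lengths %d through %d, increment %d\n", begin, end, increment)
	fmt.Fprintf(w, "# columns: length\tmean\ttotal\tmin\tmax\n")
}
```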
Provenance for graphs is important too. Graph titles need to indicate what goes on in them, so you can look at them outside the context of the embedding text and still figure out what they mean.
Implicit Binary Tree
Recursive mergesorts have a binary tree implicit in the recursive function calls. The recursive function terminates on 1-node-long lists. Once the code has split a list into left and right hand parts, it calls itself on the left part, and the right part. The only thing that’s missing is a struct named “tree” with .Left and .Right fields.
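A sketch of what that looks like, reusing the merge from the earlier sketch; the function name is mine, but the call tree it builds is the implicit binary tree.

```go
// mergesortRecursive sorts a linked list by splitting it in two and calling
// itself on each half. The two recursive calls stand in for the .Left and
// .Right fields of the tree struct that never gets written.
func mergesortRecursive(head *Node) *Node {
	if head == nil || head.Next == nil {
		return head // 0- or 1-node list: a leaf of the implicit tree
	}
	// find the midpoint; an odd-length list leaves the shorter piece on
	// the left, as described in "The Answer"
	slow, fast := head, head.Next
	for fast.Next != nil && fast.Next.Next != nil {
		slow = slow.Next
		fast = fast.Next.Next
	}
	right := slow.Next
	slow.Next = nil // terminate the left half
	// the "left child" and "right child" of this node of the implicit tree
	return merge(mergesortRecursive(head), mergesortRecursive(right))
}
```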
Some Meditations on Big O
The O(n log n) complexity only shows up in very long lists, longer than anyone would sort in practice.
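One way to see why, if the hard-to-observe part is the log factor: log2(n) grows so slowly that it barely changes across the list lengths anyone actually benchmarks. A small, self-contained illustration:

```go
package main

import (
	"fmt"
	"math"
)

// Print n, log2(n), and n*log2(n) for a few list lengths. The log factor
// merely doubles while n grows a thousandfold, which is why an n*log2(n)
// curve looks nearly linear at practical list lengths.
func main() {
	for _, n := range []float64{1e3, 1e6, 1e9} {
		fmt.Printf("n=%.0f  log2(n)=%.1f  n*log2(n)=%.3g\n", n, math.Log2(n), n*math.Log2(n))
	}
}
```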
Code Repo
Supporting software
I feel like I need to acknowledge the supporting software I used in this investigation.
- I wrote code for this in the Go programming language, because Go programs are easier to get right.
- I used Gnuplot to generate the graphs. I used Gnuplot extensively when visualizing the benchmarking, most of that use doesn’t show up in this writeup.
- I used GraphViz to make the nice linked list diagrams.