Mergesort Investigation Summary

The Answer

My July 2021 iterative algorithm and the Wikipedia bottom-up algorithm work on pieces of the unsorted list that are powers of 2 in length. For a list of 2^N+1 nodes, the main part of the algorithm sorts the first 2^N nodes of the original list. The “+1” node, the last node in the original list, gets merged with the first 2^N nodes after they’ve been mergesorted. On average, it takes 2^N/2 comparisons to find the slot in the newly merged list for the “+1” node. For example, in a 1,025-node (2^10+1) list, the first 1,024 nodes get mergesorted, and the 1,025th node gets merged in afterward, taking 512 comparisons on average.

All of my recursive mergesort variations handle lists of 2^N+1 nodes, and indeed any odd-length list, differently. They split a 2^N+1 length list into sublists of lengths 2^N/2 and (2^N/2)+1. The last node in the list gets sorted with ordinary merges. There are a few extra comparisons relative to a list of length 2^N, and a few extra .Next pointer changes, but that last node is always in a sorted location in the smaller sublists.
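For illustration, here’s a minimal sketch of the kind of split a recursive mergesort makes. The Node type uses the .Data and .Next field names from this writeup; the slow/fast pointer walk is one standard way to find the middle, and isn’t necessarily the repo’s exact code.

    type Node struct {
        Data int
        Next *Node
    }

    // split divides a list of at least 2 nodes into two halves with the
    // slow/fast pointer walk. A list of 2^N+1 nodes comes apart into
    // sublists of (2^N/2)+1 and 2^N/2 nodes, so the final node always
    // takes part in ordinary merges instead of being handled on its own.
    func split(head *Node) (left, right *Node) {
        slow, fast := head, head.Next
        for fast != nil && fast.Next != nil {
            slow = slow.Next
            fast = fast.Next.Next
        }
        left, right = head, slow.Next
        slow.Next = nil // terminate the left half
        return left, right
    }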

Why didn’t I see the answer sooner?

In July of 2021, I first did the Daily Coding Problem that asks for a mergesort of a linked list using O(1) extra space. I did some fiddling around back then, running it on multiple machines and transliterating it to C, trying to figure out the performance drops. I started deliberately working to find an answer in June of 2024, and got an answer that makes sense in November of 2024. I tried many things; some of them pointed me in the right direction. Why didn’t I see the answer sooner?

The first discovery that pointed me in the correct direction was that lists of exactly 2^N+1 nodes exhibited the performance drop. There’s no gradient from good to bad performance: adding a single node to a list with randomly-chosen data values causes the drop.

The second discovery was that a recursive mergesort algorithm didn’t have the performance drop. Something peculiar to my July 2021 algorithm caused it.

Finding out that re-using a linked list for multiple sortings caused an overall performance drop for both the July 2021 iterative and the recursive algorithms should have eliminated a lot of alternative answers. Finding out that ordering the initial list’s .Next pointers by node address increased performance should have reinforced that Linux’s virtual memory subsystem wasn’t to blame. Besides, the same performance oddities with re-used lists and address-ordered initial lists showed up in macOS benchmarks. The OS isn’t to blame.

Transliterating the Wikipedia bottom-up algorithm to Go and discovering that it, too, had performance drops at list lengths of 2^N+1 nodes should have been an enormous clue. Reading the code convinced me that this algorithm accessed memory (reading .Data values, updating .Next pointers) in the same pattern the recursive algorithm did. Only when I realized that I wasn’t taking into account the “roll up” of 2^N-sized lists at the end of the bottom-up algorithm did I figure out that any algorithm that works with 2^N-sized sublists has to deal with the last node of odd-length lists all on its own. After that realization, the pieces fell into place.

When originally transliterating Wikipedia’s pseudo-code, I had some doubts that it would deal with odd-length lists correctly, but trying it with shorter odd-length lists and seeing it work made me put those doubts behind me. That was a mistake. I should have examined that doubt closely, or at least kept it in mind.

The fact that sorting pre-sorted and reverse pre-sorted lists did not show the performance drops was misleading. I now believe those two initial list conditions avoid drops at specific list lengths for two different reasons. The visual similarity of the benchmark graphs for the two kinds of initial list data clouded my perception.

I believe that cognitive biases, and limitations of my own intelligence, kept me from seeing the cause. The “roll up” of 2^N-sized lists as a place where special cases could exist didn’t dawn on me until I looked at counts of comparisons. It’s difficult to keep special cases like that in your head, which is why I rewrote the list merging code to do its work with pointers-to-pointers instead of a slightly faster if-then-else block. I know that exceptions and special cases lead to problems, usually bugs, but in this case, performance anomalies.
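Here’s a minimal sketch of that pointer-to-pointer merge style, my reconstruction of the idea rather than the repo’s exact code, using the Node type from the earlier sketch. The pointer-to-pointer removes the special case for the head of the merged list.

    // merge joins two sorted lists. pp always points at the *Node slot
    // (the head variable, or some node's .Next field) that receives the
    // next smallest node, so the first node needs no special treatment.
    func merge(a, b *Node) *Node {
        var head *Node
        pp := &head
        for a != nil && b != nil {
            if a.Data <= b.Data {
                *pp = a
                a = a.Next
            } else {
                *pp = b
                b = b.Next
            }
            pp = &(*pp).Next
        }
        if a != nil { // one list ran out; append the other's remainder
            *pp = a
        } else {
            *pp = b
        }
        return head
    }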

Things I learned solving this puzzle

  • Recursive is at least as fast as iterative in this case. Again.
  • Sometimes in-lining code really does perform better.
  • Memory location of a data structure matters.
  • Go’s heap allocator gives back smaller allocations from 8192-byte chunks (see the sketch after this list).
  • Mergesort truly is O(n log n) in time complexity.
  • Different machines really do have differences in performance, sometimes incomprehensibly so.
  • Detecting anomalies, like abrupt performance drops, isn’t particularly objective.
  • Output files need to have their provenance in them. Don’t depend on file names.
  • The pattern of memory accesses (read .Data and possibly update .Next pointer) isn’t the problem. Recursive mergesort doesn’t share the performance anomalies, despite memory access patterns nearly identical to Wikipedia’s bottom-up with lists.
  • Two different CPUs/ISAs show the same performance anomalies, because the anomaly derives from the afflicted algorithms leaving the last node in the list for the final merge.
  • Linked list node size doesn’t affect where the performance anomalies occur.
  • Layout or order of linked list nodes affects performance generally, but does not change the list lengths where anomalies occur.
  • Garbage collection after each benchmark iteration doesn’t affect where anomalies occur.
  • That feeling of “That’s the solution!” is well worth the work.
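On the heap allocator point above, a small sketch like this lets you watch where consecutive node allocations land. It relies on the Go runtime handing out small objects from 8192-byte chunks; the exact addresses printed will vary from run to run.

    package main

    import (
        "fmt"
        "unsafe"
    )

    type Node struct {
        Data int
        Next *Node
    }

    func main() {
        nodes := make([]*Node, 0, 8)
        for i := 0; i < 8; i++ {
            n := &Node{Data: i}
            nodes = append(nodes, n) // keep the nodes alive on the heap
            p := uintptr(unsafe.Pointer(n))
            fmt.Printf("node %d at %#x, offset %4d into an 8192-byte chunk\n",
                i, p, p%8192)
        }
    }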

Software Engineering Notes

Recursive vs Iterative

I found once again that a purely recursive algorithm is just as fast as a non-recursive, iterative algorithm. I’ve only experienced a few situations where an iterative algorithm is faster than the equivalent recursive algorithm. In the case of mergesort, the iterative algorithms also had performance blips that were very difficult to suss out, which decreased my confidence in the correctness of my implementation.

Data Provenance

I found that keeping data in human-readable form is a vast benefit. You can look at the data with a text editor to verify that a graph is correct. You can embed comments in the data to keep track of what the data represents, and how it was generated.

What metadata I put in the output changed over time. At first it was simple: two tab-separated columns, linked list length and the mean of 10 repetitions of whatever sorting algorithm. Then I added a third column, the total time of all 10 repetitions, so I could get an idea of the overhead of preparing lengthy linked lists, possibly sorting them on their addresses, and so on. I added fourth and fifth columns, the minimum and maximum time it took to sort a list. I started adding provenance: the time of day the benchmarking started, the beginning list length, the increment, and the ending list length. I added more and more provenance: a timestamp when the entire run started, which machine it was running on, which mergesort algorithm, any linked list preparation before sorting. I believe the final provenance I added was a timestamp at the very end of the data noting when the run finished.
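For a concrete picture, an output file with that kind of provenance might look something like the excerpt below. Every value here is made up for illustration, and the exact comment format differs from the repo’s; the point is that Gnuplot skips the “#” comment lines, while a human reading the file gets the whole story.

    # run start: 2024-11-02T09:15:23-0500, machine: example-laptop
    # algorithm: recursive mergesort, list preparation: address-ordered
    # columns: list length, mean, total, min, max (ns, 10 repetitions)
    1000    412000    4120000    398000    431000
    2000    903000    9030000    881000    927000
    # run end: 2024-11-02T11:02:41-0500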

Provenance for graphs is important too. Graph titles need to indicate what goes on in them, so you can look at a graph outside the context of the embedding text and still figure out what it means.

Implicit Binary Tree

Recursive mergesorts have a binary tree implicit in the recursive function calls. The recursive function terminates on 1-node-long lists. Once the code has split a list into left and right parts, it calls itself on the left part and then on the right part. The only thing that’s missing is a struct named “tree” with .Left and .Right fields.
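A sketch of that shape, reusing the split and merge sketches from earlier; the tree struct is hypothetical, there only to make the comparison concrete.

    // The struct the recursion implies but never actually builds.
    type tree struct {
        Left  *tree
        Right *tree
    }

    // mergesort's call graph is that binary tree: each call is a node,
    // its two recursive calls are the .Left and .Right children, and
    // 1-node lists are the leaves.
    func mergesort(head *Node) *Node {
        if head == nil || head.Next == nil {
            return head // leaf: an empty or 1-node list is already sorted
        }
        left, right := split(head) // this call's two children
        return merge(mergesort(left), mergesort(right))
    }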

Some Meditations on Big O

The log n factor only shows up in very long lists, longer than anyone would sort in practice; at realistic lengths the n log n curve is hard to distinguish from a straight line.
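A quick worked comparison shows why:

    n = 2^10 = 1,024:     n · log2(n) = 1,024 × 10     ≈ 1.0 × 10^4
    n = 2^20 = 1,048,576: n · log2(n) = 1,048,576 × 20 ≈ 2.1 × 10^7

Growing the list a thousandfold only doubles the per-node log factor, from 10 to 20.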

Code Repo

Github repo

Supporting software

I feel like I need to acknowledge the supporting software I used in this investigation.

  • I wrote code for this in the Go programming language, because Go programs are easier to get right.
  • I used Gnuplot to generate the graphs, and used it extensively when visualizing the benchmarking; most of that use doesn’t show up in this writeup.
  • I used GraphViz to make the nice linked list diagrams.