Hello, author of glidesort (and for context, pdqsort) here.

No it's not released yet but it's getting real close. The code is virtually done, with some minor cleanup required. Besides some personal stuff, a large source of my delay has been the headache that is panic safety in Rust. Non-trivial sorting code becomes extra difficult if a forced stack unwinding can occur every time you compare two elements, where you are then required to fully restore the input array to a valid state. Doubly so if you can't even assume the comparison operator is valid and obeys the strict order semantics.

I am currently in the process of writing a paper that I want to publish alongside glidesort, which is also mostly done. Since the linked talk I've also had some more performance wins, making it ~4.5 times faster than std::stable_sort for uniform random integers and much more than that for low-cardinality data and input patterns.

Glidesort can use arbitrary amounts of auxiliary memory if necessary, but is fastest when given a fraction of the original input array worth of memory - I'm currently planning on releasing it with n / 8 as the default. I haven't looked at blitsort in detail but I am skeptical of the O(n log n) claim, I believe it is O(n (log n)^2) like glidesort is when given a constant amount of memory, as blitsort's partition and merge functions are recursive. Not that this really matters in practice - ultimately the real runtime is what matters. But I wouldn't be surprised to see blitsort slow down more relative to the competition for larger inputs.

scandum 58 days ago [-]

It should be O(n log n) comparisons and technically O(n (log n)^2) moves. The moves are reduced by a relatively large constant however, and blitsort might qualify as O(n log n) moves when given sqrt(n) aux.

In your Youtube presentation you do seem to skip mentioning that many of the performance innovations in glidesort were derived from quadsort and fluxsort. Some credit in your upcoming paper would be much appreciated. Feel free to email me if you have any questions, some things like my first publication of a "branchless" binary search in Aug 2014 may be hard to find, though there might be prior claim.

~4.5 times faster than std::stable_sort for uniform random integers is pretty impressive. Is this primarily from increasing the memory regions from 2 to 4 for parity merges / partitions? I'm benching on somewhat dated hardware and had mixed results (including slowdowns), so I never went further down that rabbit hole.
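
(Aside, for anyone unfamiliar with the term: a "branchless" binary search replaces the unpredictable comparison branch with a conditional select that compilers typically turn into a conditional move. A minimal, generic lower-bound sketch in C, not necessarily the 2014 formulation, looks roughly like this:)

    #include <stddef.h>

    /* Index of the first element >= key in a sorted array.
       The ternary usually compiles to cmov, so the loop has no
       data-dependent branch to mispredict. */
    size_t branchless_lower_bound(const int *arr, size_t n, int key)
    {
        const int *base = arr;
        size_t len = n;
        while (len > 1) {
            size_t half = len / 2;
            base = (base[half - 1] < key) ? base + half : base;
            len -= half;
        }
        return (size_t)(base - arr) + (len == 1 && *base < key);
    }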

orlp 58 days ago [-]

Hi Igor,

> It should be O(n log n) comparisons and technically O(n (log n)^2) moves.

Yes, the same applies to glidesort. Looking at your repository again more carefully you do indeed only claim O(n log n) comparisons, I was not careful enough.

> blitsort might qualify as O(n log n) moves when given sqrt(n) aux

I have a similar conjecture for glidesort but I'll have to do some thinking to see if I can prove it.

> In your Youtube presentation you do seem to skip mentioning that many of the performance innovations in glidesort were derived from quadsort and fluxsort. Some credit in your upcoming paper would be much appreciated.

Good artists borrow, great artists steal. I do try to give credit where due though, so to clear my name...

...I do cite you in the presentation in the section on ping-pong merges (at the 10-minute mark), and in the upcoming paper I do also cite you for what you call parity merges (something I did not use at the time of the presentation, but I do use now in the small sorting routine), where you don't have to do bounds checks when merging two equal-size arrays from both ends. As you could notice, I didn't have a large time slot for my presentation, so I could not go into more detail regarding the branchless nature of merging and partitioning. I'll make sure to mention you in the paper for inspiring the out-of-place branchless partition as well. I do take full credit myself for making the bidirectional branchless out-of-place partition however: https://i.imgur.com/EiVi8Y2.png.
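
(To illustrate for readers following along: a minimal sketch of an out-of-place branchless partition in C. This is illustrative only, not the fluxsort or glidesort code; the function name, the second buffer, and the two-lane trick for stability are just for the example.)

    #include <stddef.h>

    /* Out-of-place branchless partition sketch (stable).
       Elements < pivot are packed, in order, into the front of dst;
       elements >= pivot are collected, in order, in tmp and appended.
       Each iteration stores into both lanes unconditionally and bumps
       one of the two counters, so there is no unpredictable branch. */
    size_t branchless_partition(const int *src, int *dst, int *tmp,
                                size_t n, int pivot)
    {
        size_t lt = 0, ge = 0;

        for (size_t i = 0; i < n; i++) {
            int v = src[i];
            size_t less = v < pivot;   /* 0 or 1 */
            dst[lt] = v;               /* kept only if less      */
            tmp[ge] = v;               /* kept only if not less  */
            lt += less;
            ge += 1 - less;
        }
        for (size_t i = 0; i < ge; i++)
            dst[lt + i] = tmp[i];      /* append the >= side after the < side */
        return lt;                     /* number of elements < pivot */
    }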

> Feel free to email me if you have any questions, some things like my first publication of a "branchless" binary search in Aug 2014 may be hard to find, though there might be prior claim.

I'll email you for sure, and I have a summary I posted on HN 7 months earlier as well: https://news.ycombinator.com/item?id=31101056. Hopefully this does show my good intent and that I have never tried to cover up the inspiration I took from your work. That said, glidesort does not use a branchless binary search.

> Is this primarily from increasing the memory regions from 2 to 4 for parity merges / partitions? I'm benching on somewhat dated hardware and had mixed results (including slowdowns), so I never went further down that rabbit hole.

I believe it is primarily from having multiple independent interleaved loops, which can use instruction-level parallelism. The speed-up is most significant on Apple M1, less so on my AMD Threadripper machine. I disagree with the term 'parity partition' entirely, however, and when I say 'parity merge' I specifically refer to your trick where bounds checks can be eliminated entirely when merging two equal-sized arrays. I call my partitioning method bidirectional partitioning. The way I see it, every parity merge is a bidirectional merge, but not every bidirectional merge is a parity merge.
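
To make the distinction concrete, a minimal parity-merge sketch in C (illustrative only, not the quadsort or glidesort implementation): because both runs have the same length n and each loop runs exactly n steps, neither loop needs a bounds check, and the forward and backward picks are independent dependency chains, which is where the interleaving benefit comes from.

    #include <stddef.h>

    /* Parity merge sketch: merge two sorted runs of equal length n,
       stored back to back in src[0..2n), into dst[0..2n). Assumes n >= 1.
       The forward loop and the backward loop each run exactly n steps,
       so neither run's pointers can leave that run: no bounds checks. */
    void parity_merge(const int *src, int *dst, size_t n)
    {
        const int *l  = src;              /* front of the left run   */
        const int *r  = src + n;          /* front of the right run  */
        const int *lb = src + n - 1;      /* back of the left run    */
        const int *rb = src + 2 * n - 1;  /* back of the right run   */
        int *d  = dst;                    /* output, written forward  */
        int *db = dst + 2 * n - 1;        /* output, written backward */

        for (size_t i = 0; i < n; i++) {
            *d++  = (*l  <= *r)  ? *l++  : *r++;   /* smaller head, ties from the left */
            *db-- = (*lb >  *rb) ? *lb-- : *rb--;  /* larger tail, ties from the right */
        }
    }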

def-lkb 57 days ago [-]

I wrote a similar sorting algorithm around 14 years ago (a fast, in-place, adaptive merge sort using rotation/swap, and, I think, stable). This is all I remember from the time...

I was able to prove (on paper) that the whole algorithm was actually O(n log n). The moves seem to have a bad recursion pattern, but assuming my proof was right, it is possible to have a well-behaved implementation in practice.
However, the recursion pattern didn't fit the master theorem; I remember having to do a Taylor expansion of some formula giving an upper bound on the number of computation steps to prove the O(n log n) bound.

And the runtime measurement graphs showed the expected O(n log n) curve, as in your cases. So maybe they are not O(n (log n)^2) as you feared...
I did not keep much information from that time :P.

def-lkb 57 days ago [-]

Correction: technically, it uses O(log n) space for the recursion, so it is not strictly speaking in-place, which would mean O(1) space.

scandum 57 days ago [-]

It's possible to implement a 64 item stack so you can call it O(1) and in-place.

But that's such an effort in futility that I refuse to do so. So it's theoretically and practically in-place, but not technically.

def-lkb 57 days ago [-]

Of course, this is nitpicking :). For all practical purposes, this O(log n) is O(1).

If you are interested, I can try to recover the proof that the rotation step can be done in O(n), thus allowing the master theorem to be applied to the main recursion and giving the O(n log n) result.

scandum 57 days ago [-]

I'm not that interested as I'd prefer to count each move and prove it that way, but perhaps Peter (orlp) is interested?

slekker 58 days ago [-]

Oh HN, only here can you see this kind of discussion :^) Amazing!

jqr- 57 days ago [-]

Right?! It made me feel like I was witnessing a letter exchange between great mathematicians of lore.

scandum 57 days ago [-]

Hoi Peter,

>I have a similar conjecture for glidesort but I'll have to do some thinking to see if I can prove it.

It probably is technically O(n (log n)^2) moves, but I suspect that when the rotations are fast enough (trinity / bridge rotations) the cost savings from localization nullify the additional moves.

>I do cite you in the presentation in the section on ping-pong merges

Wikisort does have prior claim to ping-pong merges, and it likely goes further back than that. I did independently implement them in quadsort.

>I do also cite you for what you call parity merges (something I did not use at the time of the presentation

This feels like a tricky claim to me, since I assume your bidirectional merge was fully inspired by my parity merge. If so you weren't the first to implement one, I did so while working on the parity merge, but I figured it was self-evident the 2x gain would apply for a more traditional merge, and I did state the gain was from accessing two memory regions in the quadsort article.

I've since made a quadsort release that uses a galloping 'bidirectional merge', though it still takes advantage of the parity principle, so properly speaking it's a bidirectional galloping parity merge hybrid that sorts 2 blocks of 8 elements at a time, without boundary checks for the 16 merge operations.

>As you could notice, I didn't have a large time slot for my presentation

I did take that into account, I guess it's human nature to get butthurt over stuff like this and I couldn't help myself from whining a little. It is better than bottling it up and I apologize if I caused you unease in turn.

>Hopefully this does show my good intent

It does, and I appreciate it.

>That said, glidesort does not use a branchless binary search.

I assume that's one of the advantages of building upon powersort? The blocky nature of quadsort is both a blessing and a curse in that regard.

>I call my partitioning method bidirectional partitioning.

I meant "parity merge" / "partition". If you think about it quicksort is naturally bidirectional, which is why branchless partitioning works without further work, and undoubtedly why it wasn't self-evident to most. I did try to write a bidirectional partition early on, but it's either my hardware or I messed up somehow.

I agree it's possible to write a bidirectional merge without utilizing the parity principle. One of the trickiest things in programming is properly naming things.

As for instruction-level parallelism, I do think it's actually almost entirely memory-level parallelism. I could be wrong though.

orlp 57 days ago [-]

> Hoi Peter,

A nitpick, but my name is Orson Peters, Peters is my last name :)

> I assume that's one of the advantages of building upon powersort? The blocky nature of quadsort is both a blessing and a curse in that regard.

I don't know, I only use a binary search when splitting up merges, and almost no time is spent in this routine. I use this for the low-memory case, as well as to create more independent work to exploit with instruction-level parallelism.

> As for instruction-level parallelism, I do think it's actually almost entirely memory-level parallelism. I could be wrong though.

I didn't do specific research into which effect it is, when I say ILP I also mean the memory effects of that.

scandum 57 days ago [-]

>A nitpick, but my name is Orson Peters

I actually knew that, not sure what went wrong in my brain.

>I don't know, I only use a binary search when splitting up merges, and almost no time is spent in this routine.

You'll still get a definite speed-up, worth benching the difference just in case. I guess there's no easy way to avoid a binary search.

>I didn't do specific research into which effect it is, when I say ILP I also mean the memory effects of that.

They're indeed related. I did some empirical testing and I'm quite sure it's cache related. Thinking about it, one issue I may have had was not benching sorting 100M elements, which might be where a bidirectional partition might benefit on my system.

JaneLovesDotNet 58 days ago [-]

Dumb questions from a programmer who is weak at math.

1) Is there a theoretical minimum asymptotic speed for sorting?

2) Some of these newer sorts seem to have arbitrary parameters in them (e.g. do X if size of set is smaller than Y). Are we likely to see more and more complex rule sets like that lead to increased sorting speeds?

magnio 58 days ago [-]

1) All comparison sorting algorithms, i.e. algorithms that sort by comparing elements, have a hard lower bound of Ω(log n!) = Ω(n log n) worst-case comparisons. For non-comparison sorting, the limit depends on the algorithm.

2) Hybrid algorithms, i.e. those that change the underlying strategy based on some parameters, are already quite common in real-world implementations. Examples include timsort (stable, mix of merge sort and insertion sort), used in Python, Java, and Rust, and pdqsort (unstable, mix of quicksort and heap sort), used in Rust and Go.

m12k 58 days ago [-]

1) IIRC the best case is O(n) (you just verify that it is already sorted, can't get much faster than that) while the worst case is O(nlog(n)). But some of the best algorithms in real world usage have worse worst case asymptotic speed.

2) I think things like cache and machine word size have a huge impact on real-world speed, so it makes sense to have knobs to tweak to fit within those limits, even though a theoretical analysis does away with constants like that

brap 58 days ago [-]

> the worst case is O(nlog(n))

I think it's worth clarifying that this is the best worst case possible, i.e. for every sorting algorithm you could create a certain input in a way that it won't be able to beat O(nlog(n)). In other words, O(nlog(n)) is a minimum hard limit for worst case speed: no algorithm can do better than O(nlog(n)) on all possible inputs (but it can do better on some inputs, O(n) being the hard limit there).

I don't really remember the theory behind this, but hopefully someone here can answer: is it theoretically possible for a sorting algorithm to achieve sub-O(nlog(n)) speeds on 99.99% (or some other %) of randomly selected inputs? Or even O(n)?

GuB-42 57 days ago [-]

> is it theoretically possible for a sorting algorithm to achieve sub-O(nlog(n)) speeds on 99.99% (or some other %) of randomly selected inputs? Or even O(n)?

No, you can get better than nlog(n) for specific cases that may happen a lot in practice, but not in the general case of randomly selected continuous inputs.

The explanation comes from information theory, and it is the same idea as for why you can't compress random data.

In essence, a sorting algorithm has to pick the correct reordering of the sequence. There are n! possible reorderings, so the algorithm needs log(n!) bits of information about the sequence; each comparison yields at most one bit, and log(n!) is asymptotically equivalent to nlog(n), so you need at least nlog(n) comparisons to cover every case. In practice, some reorderings are more common than others, in particular the "already sorted" case, so it is worth making your algorithm check these cases first, but on randomly selected inputs these special cases represent a negligible fraction of all the possible cases.
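
Spelled out, the counting argument gives (in LaTeX notation):

    \log_2(n!) \;=\; \sum_{k=1}^{n} \log_2 k
               \;\ge\; \sum_{k=\lceil n/2 \rceil}^{n} \log_2\frac{n}{2}
               \;\ge\; \frac{n}{2}\,\log_2\frac{n}{2}
               \;=\; \Theta(n \log n),

so any comparison sort needs on the order of n log n comparisons in the worst case, and also on average over uniformly random permutations.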

brap 56 days ago [-]

> you can get better than nlog(n) for specific cases that may happen a lot in practice, but not in the general case of randomly selected continuous inputs.

I guess what I'm asking is how "specific" do these cases have to be, or how "general" is the general case? Can specific cases be 0.0001%? How about 1%?

kjeetgill 57 days ago [-]

People forget! The O(nlogn) limit for the best worst case is for comparison sorts. I don't know if you consider it a "special case", but for more cases than not you can do guaranteed time linear in the number of elements: radix and bucket sort. This is O(n) for things like ints because the number of bits is constant, but for strings the length plays a factor: O(k*n). This performance isn't dependent on the distribution of items being sorted, so I'd consider that pretty general.

You could also consider things like sleep sort or spaghetti sort: googling them I'll leave to the reader. Oh, and sorting networks are a good read too.
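
To make the O(k*n) shape from the radix/bucket point above concrete, a minimal LSD (least-significant-byte-first) radix sort for 32-bit keys might look like this; an illustrative sketch, not any particular library's implementation:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* LSD radix sort on 32-bit unsigned keys: 4 passes of 8 bits each.
       Every pass is a counting sort, so total work is O(4 * n) = O(n). */
    void radix_sort_u32(uint32_t *a, size_t n)
    {
        uint32_t *buf = malloc(n * sizeof *buf);
        if (!buf) return;                       /* allocation failed; bail out */

        for (int shift = 0; shift < 32; shift += 8) {
            size_t count[257] = {0};

            for (size_t i = 0; i < n; i++)      /* histogram of this byte */
                count[((a[i] >> shift) & 0xFF) + 1]++;
            for (int b = 0; b < 256; b++)       /* prefix sums -> start offsets */
                count[b + 1] += count[b];
            for (size_t i = 0; i < n; i++)      /* stable scatter into buf */
                buf[count[(a[i] >> shift) & 0xFF]++] = a[i];

            memcpy(a, buf, n * sizeof *a);      /* copy back for the next pass */
        }
        free(buf);
    }

Each of the four passes is a stable counting sort on one byte, which is what lets the passes compose into a correct sort on the full key.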

devit 57 days ago [-]

This is only for algorithms whose output is solely dependent on the outcomes of comparisons of input values.

Otherwise you can for instance sort an input made of zeroes and ones in O(n) time and O(log(n)) space by counting the ones.
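
A sketch of that counting idea (illustrative):

    #include <stddef.h>

    /* Sort an array known to contain only 0s and 1s: count the ones,
       then rewrite the array. One pass each way, O(n) time, and the
       only extra state is the counter (O(log n) bits). */
    void sort_bits(int *a, size_t n)
    {
        size_t ones = 0;
        for (size_t i = 0; i < n; i++)
            ones += (a[i] != 0);
        for (size_t i = 0; i < n; i++)
            a[i] = (i >= n - ones);     /* zeros first, then ones */
    }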

Y_Y 57 days ago [-]

Regarding your last question, Quantum Bogosort comes to mind. Don't know about classical algorithms though.

Paedor 58 days ago [-]

For 1)
The asymptotic best for sorting is O(n log n), if you're only allowed to compare the size of elements. If you allow operations like array indexing with elements, which is possible for integers, it gets as good as O(n) with radix sort.

hakuseki 58 days ago [-]

Assuming all values are unequal, there are n factorial (abbreviated n!) possible cases that we need to distinguish.

If we are sorting by comparison, then each comparison will eliminate at most half of the possible cases. So we need at least log_2(n!) comparisons in the worst case.

scandum 57 days ago [-]

Hard to answer. I think the main thing is that there are a lot of stumbling blocks, things that most people overlook, yet logical once you have it explained.

Occasionally one is discovered and solved, and it can lead to a brief domino effect, but I have no idea how many are left.

blondin 58 days ago [-]

thanks for sharing this with us. i admire this approach that builds on small improvements here and there. and these improvements are interesting in their own way. it reminds me of micro-optimizations with encoding routines. i didn't see SIMD in the code at first glance. you might gain significant speed with SIMD.

janwas 57 days ago [-]

:) Indeed, we're seeing 10x speedups from SIMD for some distributions.
It's useful both for sorting networks and quicksort partitioning.
Code here: https://github.com/google/highway/tree/master/hwy/contrib/so... (disclosure: I'm one of the co-authors).

dleslie 58 days ago [-]

Woah, I like the solid memory guarantees. Could be useful for embedded projects.

mmoskal 58 days ago [-]

Seems to use quite a bit of stack for an embedded usage. An interesting thing to note: microcontrollers are typically memory-bound when using common algorithms - while they are maybe 100x slower than a desktop computer they have say 1000000x less RAM. So, for example, a GC cycle would be often in the range of 1ms.

odo1242 58 days ago [-]

Wouldn’t a GC cycle be short if the microcontroller had less RAM? (as a genuine question)

mmoskal 55 days ago [-]

Well yes. The less heap to scan, the faster it is. I was just referring to the unusual ratio between speed and RAM.

naasking 58 days ago [-]

1ms is pretty short. GC cycles on desktops and servers can take hundreds of ms.

adgjlsfhk1 58 days ago [-]

I've seen pathological examples where a gc takes 30 seconds

rurban 57 days ago [-]

But only with very very bad GC's. Like stop the world mark&sweep. Nobody should ever use that, but they are quite common still.

logicallee 58 days ago [-]

GC is the bubblesort of our era. I mean it seems to be a very inefficient way to solve a problem.

I hope computer scientists will come up with something better than GC.

adgjlsfhk1 57 days ago [-]

Honestly I'm not sure. Modern GCs are pretty good, and manual memory management isn't a free lunch either. One major advantage of a GC is it can batch all the `free`s together which is a lot more efficient than the manual approach where you free objects individually in a pseudo-random order.

scandum 57 days ago [-]

Isn't that what Rust is all about?

skelpmargyar 57 days ago [-]

Rust is for memory safety. GC software is a way to achieve some form of memory safety while also being much simpler and faster to write compared to Rust.

layer8 57 days ago [-]

ZGC (JVM) now has guaranteed maximum pause times of 500 microseconds, with average pause times of just 50 microseconds: https://malloc.se/blog/zgc-jdk16

Pause time is not the same as cycle time, but it’s what’s important for reactivity.

naasking 57 days ago [-]

Sure, but deferring collection requires increasing memory use. Since the OP was discussing microcontrollers, that's generally not desirable because memory is in short supply. A 1ms GC cycle is totally tolerable for most microcontroller applications though.

layer8 57 days ago [-]

I was responding to your statement regarding “on desktops and servers”, which I believe doesn’t quite represent the state of the art anymore. I certainly agree that 1 ms is already pretty short.

dleslie 57 days ago [-]

Oof, heavy stack usage is ungood. But that's probably correctable.

FWIW, I'm thinking of devices that don't have an MMU - like the Sega 32X. It's a hobby of mine. Having any GC time at all is too much suffering, even 1ms.

girvo 58 days ago [-]

The bigger problem with GC and standard allocators on embedded system is heap fragmentation, in my experience.

mmoskal 55 days ago [-]

Couldn't agree more! One way I was fighting it (in Static Typescript [0]) is by increasing frequency of scans and allocating as close as possible to one side of the heap. Next time I think I'll just try to compact the heap.

[0] https://www.microsoft.com/en-us/research/uploads/prod/2019/0...

Since the sorting scene seems to be fully assembled in this thread, I'll take the opportunity to ask a quick question about a use case that does not seem uncommon: Is there an algorithm that is especially efficient if the array contains many sorted sequences, and still works alright for fully shuffled arrays? And are there algorithms that should be avoided in these cases?

scandum 57 days ago [-]

Quadsort, fluxsort, blitsort, and crumsort all qualify depending on your needs.

skasort_cpy is pretty good on 32 bit integers if you give it n auxiliary memory.

rhsort is very good and likely the best for 31 bit integers, but a bit rough around the edges still, and doesn't work well on arrays above 1M elements. Avoid radix sorts for 64 bit integers, they're ideal for 16 bit.

glidesort is promising, though I haven't seen it benched against the latest fluxsort / blitsort.

Timsort's main problem is that it's slow on shuffled arrays.

pdqsort and other introsorts aren't good on semi-ordered data.

k2xl 57 days ago [-]

Question: are any of these novel sorting algorithms being used in modern databases or tech stacks?

ismailmaj 57 days ago [-]

pdqsort by Orson Peters is used in Rust std for `sort_unstable`. https://doc.rust-lang.org/std/vec/struct.Vec.html#method.sor...

Is there a TrollSort? I'm thinking of an algorithm which initially seems to be fast and efficient but takes exponentially longer time with larger arrays and exponentially longer towards the end of sorting.

Radix sort is O(n) (or O(n * number of bytes in the int), or whatever is being compared).

It's omitted from the comparisons, I see. Radix sort can also have predictable memory overhead.

naasking 58 days ago [-]

Radix sort is theoretically O(N), but memory access is logarithmic so in reality you can't do better than O(log N) no matter what algorithm you use. Only constant factors matter at that point.

Edit: I misremembered, memory access is actually O(sqrt(N)): https://github.com/emilk/ram_bench

Nothing theoretical about it: Sorting a list of all IP addresses can absolutely and trivially be done in O(N)

> in reality you can't do better than O(log N)

You can't traverse the list once in <N so the complexity of any sort must be ≥N.

> but memory access is logarithmic

No it's not, but it's also irrelevant: A radix sort doesn't need any reads if the values are unique and dense (such as the case with IP addresses, permutation arrays, and so on).

> Edit: I misremembered, memory access is actually O(sqrt(N)): https://github.com/emilk/ram_bench

It's not that either.

The author ran out of memory; They ran a program that needs 10GB of ram on a machine with only 8GB of ram in it. If you give that program enough memory (I have around 105gb free) it produces a silly graph that looks nothing like O(√N): https://imgur.com/QjegDVI

The latency of accessing memory is not a function of N.

FeepingCreature 57 days ago [-]

The latency of accessing physical memory is asymptotically a function of N for sufficiently large N - ie. big enough that signal propagation delay to the storage element becomes a noticeable factor.

This is not generally relevant for PCs because the distance between cells in the DIMM does not affect timing; ie. the memory is timed based on worst-case delay.

patrec 57 days ago [-]

> The latency of accessing memory is not a function of N.

How could it not be, given that any physical medium has finite information density, and information cannot propagate faster than the speed of light?

And on a practical computer the size of N will determine if you can do all your lookups from registers, L1-L3, or main RAM (plus SSD, unless you disable paging).

geocar 57 days ago [-]

> How could it not be, given that any physical medium has finite information density, and information cannot propagate faster than the speed of light?

You could have a tape of infinite length, and if you only ever request "the next one" then clearly the latency is constant.

> And on a practical computer the size of N ...

Don't be silly. N is the size of the input set, not the size of the universe.

patrec 57 days ago [-]

I feel we must be talking past each other. I thought the examples so far were about random access of N units of memory. The standard convention of pretending this can be done at O(1) always struck me as bizarre, because it's neither true for practical, finite, values of N (memory hierarchies are a thing) nor asymptotically for any physically plausible idealization of reality.

aidenn0 57 days ago [-]

Random access times are irrelevant for radix sort since no random access to the input data is needed.

naasking 57 days ago [-]

You still need to write each element to the target, which costs O(sqrt(N)), so the "strict" running time of radix sort would be O(N^1.5) in layered memory hierarchies. In flat memory hierarchies (no caching) as found in microcontrollers, it reduces back to O(N).

BeeOnRope 57 days ago [-]

But radix sort doesn't write randomly to the output array like that. Yes, each element ends up in the right place, but through a series of steps which each have better locality of reference than random writes.

In this way, radix sort is usually faster than an "omniscient sort" that simply scans over the input writing each element to its final location in the output array.

naasking 57 days ago [-]

> The author ran out of memory; They ran a program that needs 10GB of ram on a machine with only 8GB of ram in it.

No, you can clearly see the O(sqrt(N)) trend at every level of the memory hierarchy.

geocar 57 days ago [-]

> No, you can clearly see the O(sqrt(N)) trend at every level of the memory hierarchy.

Lines goes up, then goes down. Very much unlike sqrt.

naasking 57 days ago [-]

If you can't see the clear sqrt(N) behaviour in those plots, then I recommend using an online graphic calculator to see what such a graph actually looks like. The sqrt trend is plain as day.

hinkley 58 days ago [-]

Also radix is a pretty special case because it assumes you want to sort by some relatively uninteresting criterion (be honest, how often are you sorting things by a number and only a number?). What happens in the real world is that the size of the fields you want to sort on tends to grow like log n. If you had half a billion John Smiths using your service you'd use some other identifier that is unique, and unique values grow in length faster than log n.

I’m glad other people are having this conversation now and not just me.

geocar 57 days ago [-]

> be honest, how often are you sorting things by a number and only a number?

All the time.

IP addresses

permutation arrays (⍋⍋)

Sometimes I pretend characters are numbers; short fixed-length strings (like currencies or country codes or even stock tickers) can be numbers.

If I can get away with it, a radix sort is better than anything else.

mschuetz 57 days ago [-]

> be honest, how often are you sorting things by a number and only a number?

More often than not. Sorting by morton code, sorting by index, etc.

hinkley 56 days ago [-]

Here’s a different take: if you find yourself needing to sort a lot of data all the time, maybe you should be storing it sorted in the first place.

Stop faffing about with college text book solutions to problems like you’re filling in a bingo card and use a real database. Otherwise known as Architecture.

I haven’t touched a radix sort for almost twenty years. In that time, hardware has sprouted an entire other layer of caching. I bet you’ll find that on real hardware, with complex objects, a production-ready Timsort is competitive with your hand written radix sort.

mschuetz 54 days ago [-]

I develop programs that take unsorted data as input, and sort them as part of the processing... I haven't touched a database in 10 years, because I do graphics programming, nothing that needs a database. Also, I don't actually use radix sort but a hand written CUDA counting sort that beats the crap out of radix sort for my given use cases, since counting sort is often used as part of radix sort, but simple counting sort is all I need.

> and use a real database.

I'm not going to use a database to store tens of billions of 3D vertices. And I'm not going to use a database to sort them, because it's multiple times, probably orders of magnitude faster to sort them yourself.

It's weird to impose completely out-of-place suggestions onto someone who does something completely different to what you're thinking of.

henrydark 57 days ago [-]

Radix sort works for numbers and so also for characters. Then it also works for lexicographic ordering of finite lists of any type it supports. So it can sort strings. But also tuples like (int, string, float). So it can actually sort all plain old data types.

anonymoushn 57 days ago [-]

MSB Radix sort is a pretty good fit for this John Smiths input and will definitely outperform a comparison-based sort that has to check that "John Smith" equals itself n log n times.

ben-schaaf 58 days ago [-]

Random memory access has a non-constant upper bound (assuming infinitely many, ever larger and slower caches), but radix sort is mostly linear memory access.

gpderetta 58 days ago [-]

It seems a bit of non-sense from very adventurous interpolation. Remove caching and now memory access is O(1).

naasking 57 days ago [-]

Sure, you can force a fiction of O(1) by dramatically increasing latency and strictly limiting the size of memory, as we do with microcontrollers. This would now be O(1) with a very large constant factor overhead, basically pinning memory access to the latency of the memory cells that are furthest from the CPU.

gpderetta 57 days ago [-]

If there is an upper bound on the latency then it is constant no matter how you look at it.

naasking 57 days ago [-]

I'm not disputing you can establish an upper bound on latency. You can always do this by using a system's slowest component as the upper bound and pin everything else's latency to that bound, as I said. I'm just pointing out that this upper bound a) doesn't scale well/leaves a lot of performance on the floor, and b) the upper bound is very sensitive to size and geometry.

In high performance systems, constant time random access is just not constant.

Beating quicksort alone is almost always as easy as swapping to insertion sort once you get down to around 14 elements. This is used by C#'s quicksort (which also swaps to heapsort after a depth of 32 IIRC) and also in Timsort in JS, Java, Python, etc.

Further to that, many use cases of Quicksort have unique keys, so stable is the same as unstable. Example: ascending indices appended to the key, when sorting large records where we don't want to move around the entire record.
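
A minimal sketch of that hybrid shape in C (illustrative only, not C#'s or any library's actual code; pivot selection is deliberately naive and the 14-element cutoff is just the figure mentioned above):

    #include <stddef.h>

    #define CUTOFF 14   /* small-array threshold; real libraries tune this */

    static void insertion_sort(int *a, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            int v = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
            a[j] = v;
        }
    }

    void hybrid_quicksort(int *a, size_t n)
    {
        while (n > CUTOFF) {
            int pivot = a[n / 2];                 /* naive pivot choice */
            size_t i = 0, j = n - 1;
            for (;;) {                            /* Hoare partition */
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i >= j) break;
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
            hybrid_quicksort(a, j + 1);           /* recurse on the left part */
            a += j + 1; n -= j + 1;               /* loop on the right part   */
        }
        insertion_sort(a, n);                     /* finish small ranges      */
    }

Real implementations add better pivot selection and, as noted above, a heapsort fallback once the recursion gets too deep.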

that 58 days ago [-]

Heads up: No license is given in the repo; be careful if you are thinking of using this for a project.

EDIT: Retracted -- I did a search for "license", but apparently the search results omit variants like "sublicense", which would have caught the MIT license at the beginning of the source file: https://github.com/scandum/blitsort/search?q=license vs. https://github.com/scandum/blitsort/search?q=sublicense
