Sorted Containers in Ruby inspired by Python

Posted on April 26, 2024

Introduction

I converted Grant Jenks’s Python library Sorted Containers to Ruby. If you are interested in the details of this data structure, I recommend reading his website. His documentation is much more detailed, and I used it as a reference for this implementation.

The library provides a fast sorted array, sorted set, and sorted hash implemented in pure Ruby with no dependencies.

SortedArray, SortedSet, and SortedHash are meant to be drop-in replacements for Array, Set, and Hash but with the extra property that the elements can be accessed in sorted order.

I compare the performance of SortedContainers to SortedSet a C extension red-black tree implementation. You can see the benchmarks below. The performance is comparable for add and delete, and much better for iteration, initialization, and lookup.

Some methods from Array, Set, and Hash have not been implemented in SortedContainers. I want to complete these before version 1.0.0. Feel free to open an issue if you would like a method added or add a pull request if you would like to contribute.

Feedback is welcome. I hope you find this library useful.

How it works

Modern computers are good at shifting arrays. For that reason, it’s often faster to keep an array sorted than to use the usual tree-based data structures.

For example, if you have the array [1,2,4,5] and want to insert the element 3, you can shift 4, 5 to the right and insert 3 in the correct position. This is a O(n) operation, but in practice it’s fast.

You also save memory by not having to store pointers to children nodes, and you benifit from the cache locality of arrays. When you iterate over a sorted array, you are more likely to access elements that are close together in memory.

But we can do better if we have a lot of elements. We can break up the array so fewer elements have to be moved when a new element is inserted. For example, if you have the array [[1,2,4],[5,6,7]] and you want to insert the element 3, you can insert 3 into the first array to get [[1,2,3,4],[5,6,7]] and only the element 4 has to be shifted.

This often outperforms the more common tree-based data structures like red-black trees with O(log n) insertions, deletions, and lookup. We sacrifice theoretical time complexity for practical performance.

The size of the subarrays is a trade-off. You can modify how big you want to subarrays by setting the load_factor. The default is set to DEFAULT_LOAD_FACTOR = 1000. The subarray is split when its size is 2*load_factor. There is no perfect value. The ideal value will depend on your use case and may require some experimentation.

SortedSet and SortedHash are implemented using a SortedArray to keep track of the order, and then also use a standard Set and Hash for quick lookups.

Caveats

Benchmarks

SortedSet is a C extension red-black tree implementation. I used it as a benchmark to compare the performance of SortedContainers.

Every test was run 5 times and the average was taken.

You can see that SortedContainers has compariable performance for add and delete, and much better performance for iteration, initialization, and include.

I did my best to make the tests as fair as possible, but it’s possible that I made a mistake. It’s also difficult to predict real-world performance from these tests. If you have any suggestions for improvement, please let me know by opening an issue. The code for the benchmarks can be found in the github repository.

Results (Lower is better)

Initialize a Sorted Set with N elements Graph showing Initialization performance comparison
Add N elements to a Sorted Set Graph showing Add performance comparison
Delete N elements from a Sorted Set Graph showing Delete performance comparison
Iterate over a Sorted Set with N elements Graph showing Iteration performance comparison
Check if N elements are included in a Sorted Set Graph showing Include performance comparison

Conclusion

Feedback is welcome. Please open an issue on Github if you have any suggestions or find any bugs.

If you like Python libraries converted to Ruby, you might also like my conversion of heapq to Ruby called heapify