tl;dr
- keyed: yes/no
- duplicates allowed: yes/no
- ordered: yes/no
ord assoc dict --------- list /| / | / | multi | dict --------- dict | | | | | | | | | | ord | | | set --------|- list | / | / | / | / set ---------- bag
Ugh, fine… Details
Take a peek at the data structures on offer in various high-level languages, and you’re likely to see some patterns.
For our purposes here, we’ll brush aside variations based on performance such as HashMap
vs. TreeMap
or LinkedList
vs. ArrayList
.
Everybody has some notion of list and map/dictionary in easy reach;
other common entrants are sets and ordered dictionaries;
poke around a bit and you might find ordered set and multi-dictionary.
To summarize (and clarify my terminology):
- list: not keyed, allows duplicate values, ordered
- dict: keyed, no duplicate keys, not ordered
- set : not keyed, no duplicate values, not ordered
- ordered dict: keyed, no duplicate keys, ordered
- ordered set: not keyed, no duplicate values, ordered
- multi-dict: keyed, allows duplicate keys, not ordered
The other two entries in this pattern are fairly widely-used, but don’t appear in standard libraries so much.
-
association list: keyed, allows duplicate keys, ordered
Easily implemented as a list of pairs.
-
bag: not keyed, allows duplicate values, not ordered
Present in combinatorics (it even has its own delimiter symbols
⟅1,2,2,5⟆
analogous to curly braces for sets), but not so often used in programming. When I’ve needed this, I generally use a dictionary with integer values, but it’s a bit of a pain to add the first item with a given key.
Eight possibilities generated by three boolean parameters is just begging me to draw a cube. I guess I’ll draw it again, with the magic of copy-paste!
ord assoc dict --------- list /| / | / | multi | dict --------- dict | | | | | | | | | | ord | | | set --------|- list | / | / | / | / set ---------- bag
On the bottom we have the common the non-keyed data structures, and on top are the keyed ones. In the front are those which do not care about order, and in the back are those that do. The only oddity in discussing the left-right dimension is that on top (keyed data structures) we are interested in duplicates keys, whereas on the bottom (non-keyed data structures) we are interested in duplicate values. Let’s just say that on the right are those which allow duplicates (keys if keyed or values if non-keyed), and on the left are those which do not. We’ll see in the next section that this double-meaning of duplicate is quite natural idea, despite seeming ad-hoc at the moment.
Implementation Relationships Spelled Out
The association list (a.k.a. assoc list, or a-list) it the most general of these data structures. That is, it is very easy to implement all the others in terms of a-lists. To see exactly how, let’s assume we’re given some module with the following interface (I’ll be using a Haskell-like syntax):
module AList where type AList k v empty :: AList k v hasKey :: AList k v -> k -> Bool index :: AList -> Nat -> Maybe (k, v) lookup :: AList k v -> k -> [v] insert :: AList k v -> k -> v -> AList k v delete :: AList k v -> k -> AList k v -- we also have the following non-obvious properties: -- index (insert alist x y) 0 === Just (x, y) -- index (insert alist x y) (n + 1) === index alist n -- (delete (insert alist x y) x) === alist
Moving left, we have the ordered dictionary, which is obtained simply by simply overwriting existing keys:
module OrdDict where import qualified AList type OrdDict k v = AList k v empty = AList.empty hasKey = AList.hasKey index = AList.index lookup odict k = case AList.lookup odict key of { [] -> Nothing; [v] -> Just v } insert odict k v' = AList.insert (AList.delete odict k) k v' delete = AList.delete
Honestly, I haven’t used an ordered dictionary before, but it apparently comes up in enough problems that it got implemented in Python’s standard library.
Moving forward from AList
, all we need to do is not export the order-related functions:
module MultiDict where import qualified AList type MultiDict k v = AList k v empty = AList.empty hasKey = AList.hasKey lookup = AList.lookup insert = AList.insert delete = AList.delete
The most common place you’ve seen this used is in http request parameters.
Somehow, this hasn’t made its way into Python’s standard library yet,
but then again, I have a feeling that only the minority of web programmers know that ?id=1&id=2
is a valid and meaningful query string.
For the last vertex of the top face, we have the ubiquitous Dict
, which I won’t spell out here.
It’s just OrdDict
but missing the index
function, or MultiDict
but with insert
implemented like OrdDict
‘s.
Now we look at the bottom face, starting by implementing lists from a-lists. This we simply do by choosing a unit type (one which carries no information) for the second type parameter. Since we aren’t interested in key-value mapping, we also drop any functions that focus on them.
module List where import qualified AList type List a = AList a () empty = AList.empty index = fmap fst . AList.index prepend x = AList.insert x () delete = AList.delete
To keep it brief, I’ll skip over ordered sets and bags, but I would like to demonstrate sets:
module Set where import qualified AList type Set a = AList a () empty = AList.empty hasElem = AList.hasKey insert set v' = if hasElem set v' then set else AList.insert set k () delete = AList.delete
Previously, we made mention of structures along the bottom being interested in duplicate values, whereas those on the top are interested in duplicate keys. We now see—already with list, but especially with sets—that the “values” of the bottom are implemented most naturally by the same mechanisms as the keys from the top. That is, if we carry over the meaning of “allows duplicates” from a-lists (which naturally means duplicate keys) over to any of the non-keyed data structures, the English meaning shifts into allowing duplicate values. The math stays unchanged; it’s human language that is inconsistent.
Functoriality
Since lists are the ur-functor of functional programming, I figured it would be worth a look into how the other data structures in this cube might implement a functor interface as well. Recall the functor laws:
fmap id === id
fmap (f . g) === fmap f . fmap g
It turns out… they’re all functors, and the keyed data structures are bi-functors! This is easiest to see by noting that a-lists are bifunctors (because pairs are bifunctors and lists are functors). So simply perform the operation on the a-list representation, then ensure the resulting data structure’s invariants by managing duplicates. Of course, it’s unclear whether to use the earlier or later (or some other) index for ordered dicts/sets whose keys are mapped to the same value, or which function should be used to combine values a multi-dict under the same circumstances. Still, the fact that there are choices to be made doesn’t detract from the functoriality.
Conclusion
This isn’t the most insightful pattern to have recognized; I’d be astounded if I were the first to spot it. (That said, I’m pretty sure I tried to find this pattern some years ago and failed because I attempted to also account for fixed-size containers like tuples, but they don’t fit especially well.) What’s interesting to me is that these data structures don’t (to my knowledge) come together in a single, popular library. Admittedly, the implementations I’ve laid out here (in addition to being woefully incomplete for practical use) are dreadfully inefficient; mature libraries should use the textbook implementations on performance grounds. Nevertheless, if I saw a library actually expose a consistent api for all for these data structures, I’d think “Wow, these people know how to design a good api!”
So, if you’re writing (more likely maintaining) a popular containers api, maybe try to fill out your cube. You likely already have high-performance versions of the hits; perhaps it’s okay to piggyback off of those for the less-well-used vertices. Sure, ordered sets, multi-dicts, and bags could be implemented on the spot every time someone needs them. Then again, despite how rarely bags appear in standard libraries, there’s a common beginner exercise that just begs for bags. If you, the api maintainer, take an hour to implement the “easy” stuff, the law of large numbers takes control and you can save other people combined orders of magnitude more.
At the very least, every high-level language should “include batteries” by providing a multi-dict data structure for http arguments! Seriously, it’s the information age, people!