Data Cube

published: 2021-03-10
tags: computing, programming

tl;dr

keyed: yes/no
duplicates allowed: yes/no
ordered: yes/no

    ord           assoc
    dict --------- list
    /|            / |
   / |        multi |
dict --------- dict |
 |   |          |   |
 |   |          |   |
 |  ord         |   |
 |  set --------|- list
 |  /           |  /
 | /            | /
set ---------- bag

Ugh, fine… Details

Take a peek at the data structures on offer in various high-level languages, and you’re likely to see some patterns. For our purposes here, we’ll brush aside variations based on performance such as HashMap vs. TreeMap or LinkedList vs. ArrayList. Everybody has some notion of list and map/dictionary in easy reach; other common entrants are sets and ordered dictionaries; poke around a bit and you might find ordered set and multi-dictionary. To summarize (and clarify my terminology):

list: not keyed, allows duplicate values, ordered
dict: keyed, no duplicate keys, not ordered
set : not keyed, no duplicate values, not ordered
ordered dict: keyed, no duplicate keys, ordered
ordered set: not keyed, no duplicate values, ordered
multi-dict: keyed, allows duplicate keys, not ordered

The other two entries in this pattern are fairly widely-used, but don’t appear in standard libraries so much.

association list: keyed, allows duplicate keys, ordered

Easily implemented as a list of pairs.
bag: not keyed, allows duplicate values, not ordered

Present in combinatorics (it even has its own delimiter symbols ⟅1,2,2,5⟆ analogous to curly braces for sets), but not so often used in programming. When I’ve needed this, I generally use a dictionary with integer values, but it’s a bit of a pain to add the first item with a given key.

Eight possibilities generated by three boolean parameters is just begging me to draw a cube. I guess I’ll draw it again, with the magic of copy-paste!

    ord           assoc
    dict --------- list
    /|            / |
   / |        multi |
dict --------- dict |
 |   |          |   |
 |   |          |   |
 |  ord         |   |
 |  set --------|- list
 |  /           |  /
 | /            | /
set ---------- bag

On the bottom we have the common the non-keyed data structures, and on top are the keyed ones. In the front are those which do not care about order, and in the back are those that do. The only oddity in discussing the left-right dimension is that on top (keyed data structures) we are interested in duplicates keys, whereas on the bottom (non-keyed data structures) we are interested in duplicate values. Let’s just say that on the right are those which allow duplicates (keys if keyed or values if non-keyed), and on the left are those which do not. We’ll see in the next section that this double-meaning of duplicate is quite natural idea, despite seeming ad-hoc at the moment.

Implementation Relationships Spelled Out

The association list (a.k.a. assoc list, or a-list) it the most general of these data structures. That is, it is very easy to implement all the others in terms of a-lists. To see exactly how, let’s assume we’re given some module with the following interface (I’ll be using a Haskell-like syntax):

module AList where

type AList k v

empty :: AList k v
hasKey :: AList k v -> k -> Bool
index :: AList -> Nat -> Maybe (k, v)
lookup :: AList k v -> k -> [v]
insert :: AList k v -> k -> v -> AList k v
delete :: AList k v -> k -> AList k v

-- we also have the following non-obvious properties:
--   index (insert alist x y) 0 === Just (x, y)
--   index (insert alist x y) (n + 1) === index alist n
--   (delete (insert alist x y) x) === alist

Moving left, we have the ordered dictionary, which is obtained simply by simply overwriting existing keys:

module OrdDict where
import qualified AList

type OrdDict k v = AList k v
empty = AList.empty
hasKey = AList.hasKey
index = AList.index
lookup odict k = case AList.lookup odict key of { [] -> Nothing; [v] -> Just v }
insert odict k v' = AList.insert (AList.delete odict k) k v'
delete = AList.delete

Honestly, I haven’t used an ordered dictionary before, but it apparently comes up in enough problems that it got implemented in Python’s standard library.

Moving forward from AList, all we need to do is not export the order-related functions:

module MultiDict where
import qualified AList

type MultiDict k v = AList k v
empty = AList.empty
hasKey = AList.hasKey
lookup = AList.lookup
insert = AList.insert
delete = AList.delete

The most common place you’ve seen this used is in http request parameters. Somehow, this hasn’t made its way into Python’s standard library yet, but then again, I have a feeling that only the minority of web programmers know that ?id=1&id=2 is a valid and meaningful query string.

For the last vertex of the top face, we have the ubiquitous Dict, which I won’t spell out here. It’s just OrdDict but missing the index function, or MultiDict but with insert implemented like OrdDict‘s.

Now we look at the bottom face, starting by implementing lists from a-lists. This we simply do by choosing a unit type (one which carries no information) for the second type parameter. Since we aren’t interested in key-value mapping, we also drop any functions that focus on them.

module List where
import qualified AList

type List a = AList a ()
empty = AList.empty
index = fmap fst . AList.index
prepend x = AList.insert x ()
delete = AList.delete

To keep it brief, I’ll skip over ordered sets and bags, but I would like to demonstrate sets:

module Set where
import qualified AList

type Set a = AList a ()
empty = AList.empty
hasElem = AList.hasKey
insert set v' = if hasElem set v' then set else AList.insert set k ()
delete = AList.delete

Previously, we made mention of structures along the bottom being interested in duplicate values, whereas those on the top are interested in duplicate keys. We now see—already with list, but especially with sets—that the “values” of the bottom are implemented most naturally by the same mechanisms as the keys from the top. That is, if we carry over the meaning of “allows duplicates” from a-lists (which naturally means duplicate keys) over to any of the non-keyed data structures, the English meaning shifts into allowing duplicate values. The math stays unchanged; it’s human language that is inconsistent.

Functoriality

Since lists are the ur-functor of functional programming, I figured it would be worth a look into how the other data structures in this cube might implement a functor interface as well. Recall the functor laws:

fmap id === id
fmap (f . g) === fmap f . fmap g

It turns out… they’re all functors, and the keyed data structures are bi-functors! This is easiest to see by noting that a-lists are bifunctors (because pairs are bifunctors and lists are functors). So simply perform the operation on the a-list representation, then ensure the resulting data structure’s invariants by managing duplicates. Of course, it’s unclear whether to use the earlier or later (or some other) index for ordered dicts/sets whose keys are mapped to the same value, or which function should be used to combine values a multi-dict under the same circumstances. Still, the fact that there are choices to be made doesn’t detract from the functoriality.

Conclusion

This isn’t the most insightful pattern to have recognized; I’d be astounded if I were the first to spot it. (That said, I’m pretty sure I tried to find this pattern some years ago and failed because I attempted to also account for fixed-size containers like tuples, but they don’t fit especially well.) What’s interesting to me is that these data structures don’t (to my knowledge) come together in a single, popular library. Admittedly, the implementations I’ve laid out here (in addition to being woefully incomplete for practical use) are dreadfully inefficient; mature libraries should use the textbook implementations on performance grounds. Nevertheless, if I saw a library actually expose a consistent api for all for these data structures, I’d think “Wow, these people know how to design a good api!”

So, if you’re writing (more likely maintaining) a popular containers api, maybe try to fill out your cube. You likely already have high-performance versions of the hits; perhaps it’s okay to piggyback off of those for the less-well-used vertices. Sure, ordered sets, multi-dicts, and bags could be implemented on the spot every time someone needs them. Then again, despite how rarely bags appear in standard libraries, there’s a common beginner exercise that just begs for bags. If you, the api maintainer, take an hour to implement the “easy” stuff, the law of large numbers takes control and you can save other people combined orders of magnitude more.

At the very least, every high-level language should “include batteries” by providing a multi-dict data structure for http arguments! Seriously, it’s the information age, people!