More NumPy, Please
What are the good ideas from NumPy?
To what can NumPy attribute its wild success? I’d say, along with others, that it comes down to a handful of features.
NumPy:
- enables numerical computing with Python via the N-dimensional array, where arrays of homogeneous simple types are paramount,
- provides efficient vectorized functions (ufuncs) that work well with ndarrays,
- allows flexible manipulation of an ndarray’s metadata with views, reshapes, and transposes, all with O(1) performance (see the sketch after this list),
- gets you to the next order of magnitude of performance with ufuncs, compared to looping over a Python list,
- shrinks your memory footprint and keeps your data dense compared to built-in Python containers, removing pointer indirections,
- enables efficient interfacing with C code,
- and enables threaded programming, since ndarray elements are typically ints or floats rather than Python objects, allowing C code to release the GIL and benefit from true parallelism.
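To make a couple of those points concrete, here’s a small sketch of O(1) metadata manipulation and a vectorized ufunc – all standard NumPy, nothing hypothetical:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Metadata manipulations are O(1): each returns a new view over
# the same buffer with different shape/strides, copying no data.
b = a.reshape(1000, 1000)   # reshape: new shape, same memory
c = b.T                     # transpose: strides swapped, no copy
d = b[::2, ::2]             # strided slice: still a view

# One ufunc call replaces a million-iteration Python loop,
# running as a tight C loop over the dense buffer.
e = np.sqrt(a) + 1.0
```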
The cumulative effect is that, with NumPy, Python is a very productive language both for high-level programming that can reach good-enough performance and for easily interfacing with C-compatible languages. Given its timing and the vagaries of history, that has helped make Python one of the most widely used languages around.
What is NumPy not good at? Some of these are not entirely fair, since they’re things I would never expect or want a simple ndarray library to support, but they have come up in various contexts in the past:
- Fast lookup – checking whether an element is in an array requires, in general, O(n) operations. Wouldn’t it be great if there were a way to index a NumPy array to allow fast O(1) lookups?
- Missing values – NumPy gets NaN for free thanks to IEEE-754, but a missing value (a `None` equivalent) for any other data type is not a first-class entity and is inefficient (I’m looking at you, masked arrays),
- Strings – NumPy only supports arrays of fixed-length strings, unless you use an object array, which is flexible but has many downsides,
- Getting values – extracting a single element, one at a time, is quite slow due to object-creation overhead.
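The first and last of those pain points are easy to demonstrate with plain NumPy:

```python
import numpy as np

arr = np.arange(1_000_000, dtype=np.int64)

# Membership testing scans the whole array: O(n) per query.
found = 999_999 in arr           # True, but via a linear scan
mask = np.isin(arr, [999_999])   # vectorized, still O(n)

# Extracting elements one at a time boxes each value into a new
# Python object (here an np.int64), which is why scalar loops
# over ndarrays are slower than loops over plain Python lists.
total = 0
for i in range(len(arr)):
    total += arr[i]              # each arr[i] allocates a Python-level object
```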
I’ve come across a few ML libraries, implemented in C with Python bindings, that use NumPy arrays extensively but have to go through painful and error-prone contortions because C has so few built-in data structures. For instance, one library had to resort to sorting arrays and doing a binary search via libc’s `bsearch` to do a set-contains equivalent.
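In NumPy terms, that contortion looks something like this – a sorted copy plus a binary search standing in for a real set-contains:

```python
import numpy as np

haystack = np.array([7, 2, 9, 4], dtype=np.int64)
needles = np.array([4, 5, 9], dtype=np.int64)

# Sort once, then binary-search each query (the NumPy analogue
# of what bsearch does in C).
sorted_hay = np.sort(haystack)
idx = np.searchsorted(sorted_hay, needles)
idx = np.clip(idx, 0, len(sorted_hay) - 1)  # guard out-of-range insert points
contains = sorted_hay[idx] == needles       # array([ True, False,  True])
```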
What if we apply ideas from NumPy to other data structures?
What if we did for the dictionary or the set (or myriad other data structures, but let’s start with the basics) what NumPy did for the list? By that I mean: what if we had a Python dict-like data structure (let’s call it a “dtyped-dictionary” for lack of a better term) that had typed keys and values and was backed by an efficient, low-level hash map? This is almost certainly not a new idea and has likely been implemented in various forms all over the place. I think it’s still worth exploring, particularly in conjunction with some other ideas.
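To make the idea concrete, here’s a minimal, hypothetical sketch of what such a structure might look like: int64 keys, float64 values, open addressing with linear probing, everything stored in flat NumPy arrays. The class name and API are invented for illustration; a real implementation would live at the C level and handle resizing, deletion, and arbitrary dtypes.

```python
import numpy as np

class DtypedDict:
    """Hypothetical sketch: an int64 -> float64 hash map backed by
    flat NumPy arrays. Resizing and deletion are omitted, so assume
    the load factor stays low."""

    def __init__(self, capacity=64):
        self._keys = np.zeros(capacity, dtype=np.int64)
        self._vals = np.zeros(capacity, dtype=np.float64)
        self._used = np.zeros(capacity, dtype=bool)

    def _slot(self, key):
        # Linear probing: walk from the hash slot to the first slot
        # that is empty or already holds this key.
        n = len(self._keys)
        i = hash(key) % n
        while self._used[i] and self._keys[i] != key:
            i = (i + 1) % n
        return i

    def __setitem__(self, key, value):
        i = self._slot(key)
        self._keys[i] = key
        self._vals[i] = value
        self._used[i] = True

    def __getitem__(self, key):
        i = self._slot(key)
        if not self._used[i]:
            raise KeyError(key)
        return self._vals[i]
```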
In particular, if this could be made to work well with the other dtyped data structures, here are a few things it would enable:
- Creating a dictionary from a NumPy array would return a dtyped-dictionary,
- Extracting the keys or values from a dtyped-dictionary could return (immutable) NumPy arrays, with all the attendant performance and interoperability improvements,
- Indexing NumPy arrays would be a natural extension of this, and we could build a simple Indexed array object that composes a dtyped-dictionary built from an ndarray,
- We could define other common ufuncs that work on the other data structures and give efficient, vectorized operations analogous to NumPy’s – think `toolz.dicttoolz`, with all the attendant performance improvements that static typing can bring,
- Python only “sees” a single top-level Python object that holds a reference to all the underlying data. This effectively squeezes all the Python objects out of the data structure, providing memory and time efficiencies, and making it easier to release the GIL.
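Building on the hypothetical `DtypedDict` sketch above (this snippet assumes that class is in scope), the first two bullets and a `dicttoolz`-style operation might look like:

```python
import numpy as np

# Hypothetical construction from parallel NumPy arrays.
d = DtypedDict(capacity=64)
keys_in = np.arange(10, dtype=np.int64)
vals_in = keys_in.astype(np.float64) ** 2
for k, v in zip(keys_in, vals_in):
    d[k] = v

# Keys and values come back as dense NumPy arrays (a real API
# would expose these as immutable views rather than copies).
keys = d._keys[d._used]
vals = d._vals[d._used]

# A toolz.dicttoolz.valmap analogue collapses to one ufunc call
# over the dense values buffer -- no per-item Python overhead.
d._vals[d._used] = np.sqrt(d._vals[d._used])
```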
We’d still have to address some of the NA issues with NumPy, we’d have to implement O(n) `to_python()` and `from_python()` functions or methods that convert between the pure-Python and dtyped versions, and we’d have to think carefully about nested JSON-like data, but I think this could be really interesting, especially if the pieces compose well together.
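Those converters are straightforward in principle – each is a single O(n) boxing or unboxing pass. A sketch with plain arrays (the function names are just placeholders):

```python
import numpy as np

def to_python(keys, vals):
    # O(n): box every element back into Python objects.
    return {int(k): float(v) for k, v in zip(keys, vals)}

def from_python(d):
    # O(n): unbox a Python dict into dense, homogeneous arrays.
    keys = np.fromiter(d.keys(), dtype=np.int64, count=len(d))
    vals = np.fromiter(d.values(), dtype=np.float64, count=len(d))
    return keys, vals
```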
Other thoughts
- With dictionaries we often want nested data for the values (and sometimes for the keys, too). Think of a multi-level dictionary of dictionaries of lists of tuples. At one end of the expressiveness spectrum we’d be able to represent arbitrary JSON data, but without nested Python objects (see the offsets sketch after this list).
- Numba-like kernels for fast computations could be interesting here.
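For the nested-data point, Apache Arrow’s list layout suggests one way to do it: store the flat values in one array and the nesting structure in an offsets array, so even a list-of-lists contains no nested Python objects. A sketch:

```python
import numpy as np

# Arrow-style layout for the nested list [[1, 2], [], [3, 4, 5]]:
values = np.array([1, 2, 3, 4, 5], dtype=np.int64)
offsets = np.array([0, 2, 2, 5], dtype=np.int64)  # len = n_lists + 1

def get_list(i):
    # Slicing returns a view into the flat buffer -- no boxing.
    return values[offsets[i]:offsets[i + 1]]

get_list(2)  # array([3, 4, 5])
```

And Numba already gestures at both the typed containers and the kernels: its typed `Dict` holds unboxed keys and values and can be used inside compiled functions. A sketch, assuming Numba is installed:

```python
import numpy as np
from numba import njit, types
from numba.typed import Dict

@njit
def count_values(arr):
    # A compiled kernel that builds an int64 -> int64 typed dict;
    # no Python objects are created inside the loop.
    d = Dict.empty(key_type=types.int64, value_type=types.int64)
    for x in arr:
        d[x] = d.get(x, 0) + 1
    return d

counts = count_values(np.array([1, 2, 2, 3, 3, 3], dtype=np.int64))
```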