Creating NumPy Arrays
NumPy arrays are a core part of the numerical computing stack in Python. NumPy provides a large number of functions for creating these arrays, of which np.array is the best known (due to its ubiquitous use in tutorials). However, np.array is not always the right function to use. This post will explore some other common array creation functions.
NumPy is at the base of most numerical computing in Python. The ndarray is the central class provided by NumPy, which implements a C-style array.
Background
Unlike the list object from base Python, ndarrays are statically typed: every element of an array has the same type. In most cases, this makes ndarrays significantly faster than base Python objects. The static type of an ndarray is termed the ‘data-type’ of the array and is denoted with the keyword dtype.
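As a minimal sketch of what a single data-type means in practice (the variable names here are illustrative, not from any particular codebase), a mixed list is coerced to one dtype when it becomes an ndarray:

```python
import numpy as np

# a Python list can hold values of different types
mixed_list = [1, 2.5, 3]

# converting to an ndarray coerces every element to one common dtype
arr = np.array(mixed_list)
print(arr.dtype)   # float64

# the dtype can also be requested explicitly with the dtype keyword
ints = np.array([1, 2, 3], dtype=np.int32)
print(ints.dtype)  # int32
```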
In addition to a data-type, an array has a ‘shape’. An array shape is a tuple representing how the array is structured. The shape primarily has to do with how the ndarray is interpreted and the types of operations that are allowed. Array shape will not be discussed in detail here. For the purposes of this post, I will focus on 1-D arrays, as the ideas are easily extended to other array shapes.
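For a quick illustration of shape, the sketch below builds a 1-D and a 2-D array and inspects the shape attribute of each (the names vec and mat are only for this example):

```python
import numpy as np

vec = np.zeros(4)        # 1-D array with 4 elements
mat = np.zeros((2, 3))   # 2-D array with 2 rows and 3 columns

print(vec.shape)  # (4,)
print(mat.shape)  # (2, 3)

# the shape is an ordinary Python tuple
print(isinstance(vec.shape, tuple))  # True
```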
Basic Creation Functions
The most common array creation functions from NumPy can be put into three categories:
- Arrays with uninitialized values
- Arrays with initialized values
- Arrays built from iterable objects
Arrays with Uninitialized Values
These are the most basic type of functions for creating NumPy arrays.
- np.empty: creates an ndarray with uninitialized values, i.e. it is empty.
- np.empty_like: creates an empty ndarray based on another array (same shape and dtype).
These functions are typically used when an array must be filled in a loop, much as list.append() might be used with a list. However, they should only be used when the size of the array is known in advance (as in a for-loop over a fixed sequence), because iteratively resizing a NumPy array is slow.
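The contrast between the two patterns can be sketched as follows. This is a toy example (squaring integers) chosen only to show the mechanics; the variable names are illustrative:

```python
import numpy as np

n = 5

# list-style: grow the container one element at a time
squares_list = []
for i in range(n):
    squares_list.append(i ** 2)

# array-style: allocate once for the known size, then fill by index
squares = np.empty(n)
for i in range(n):
    squares[i] = i ** 2

# np.empty_like borrows shape and dtype from an existing array;
# its contents are arbitrary until they are assigned
scratch = np.empty_like(squares)
```

Note that the values in a freshly created empty array are whatever happened to be in memory, so every element should be written before it is read.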
As an example, np.empty could be used to collect the silhouette scores of k-means clustering for given values of k. In the code below, the empty array is built based on the number of k values that will be scored. The silhouette score is calculated for each k and stored in the array (indexed on the enumeration of k).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is assumed to be a feature matrix defined elsewhere

# initialize variables
k_values = list(range(2, 7))
silhouette_scores = np.empty(len(k_values))

# iterate over k values, enumerate to access array
for i, k in enumerate(k_values):
    # init KMeans with k clusters
    km = KMeans(n_clusters=k, random_state=42)
    # get cluster labels
    labels = km.fit_predict(X)
    # calc and store score
    silhouette_scores[i] = silhouette_score(X, labels)
Arrays with Initialized Values
Single-Valued
The following functions create arrays and initialize the values to a particular value.
- np.zeros: create an array and initialize all values to zero.
- np.zeros_like: create an array like another (shape and dtype) and initialize all values to zero.
- np.ones: create an array and initialize all values to one.
- np.ones_like: create an array like another (shape and dtype) and initialize all values to one.
- np.full: create an array and initialize all values to a given value.
- np.full_like: create an array like another (shape and dtype) and initialize all values to a given value.
Use these functions when every element of the array should start at the same value. np.full is the most general of this set of functions, while np.zeros and np.ones are essentially special cases of np.full. The usage is straightforward.
# create a 2 element vector with values initialized to 10
# (the dtype is inferred from the fill value, here an integer)
>>> np.full(2, 10)
array([10, 10])
# create a 3 element vector with values initialized to 0
>>> np.zeros(3)
array([0., 0., 0.])
# create a ones vector like the 3 element zeros vector
>>> np.ones_like(np.zeros(3))
array([1., 1., 1.])
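The relationship between these functions can be checked directly. The sketch below is only an illustration (variable names are made up for the example); it compares np.zeros and np.ones against np.full, and shows how the dtype of np.full and np.full_like is determined:

```python
import numpy as np

# np.zeros and np.ones default to float64, so use float fill values to match
same_as_zeros = np.array_equal(np.zeros(3), np.full(3, 0.0))
same_as_ones = np.array_equal(np.ones(3), np.full(3, 1.0))
print(same_as_zeros, same_as_ones)  # True True

# np.full infers the dtype from the fill value unless dtype is given
int_fill = np.full(2, 10)                 # integer dtype
float_fill = np.full(2, 10, dtype=float)  # float64

# the *_like variants copy shape and dtype from the template array
template = np.arange(4)
sevens = np.full_like(template, 7)
```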
Sequences and Spaces
The following functions create arrays containing sequences or numeric spaces.
- np.arange: create an array containing a sequence of numbers.
- np.linspace: create an array containing evenly spaced numbers over an interval.
- np.logspace: create an array containing evenly spaced numbers over an interval on a log scale.
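A small side-by-side sketch of the three functions, using arbitrary illustrative ranges, shows how their arguments are interpreted:

```python
import numpy as np

# integers from 2 up to, but not including, 7
seq = np.arange(2, 7)        # 2, 3, 4, 5, 6

# 5 evenly spaced values from 0 to 1, endpoint included
lin = np.linspace(0, 1, 5)   # 0, 0.25, 0.5, 0.75, 1

# 5 values from 10**-2 to 10**2, evenly spaced on a log scale
log = np.logspace(-2, 2, 5)  # 0.01, 0.1, 1, 10, 100
```

Note that np.arange takes a start, stop and optional step, while np.linspace and np.logspace take the number of points; np.logspace also expects its endpoints as exponents (base 10 by default) rather than as the values themselves.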
np.arange can be used to create a general sequence of numbers. Often, np.arange will be used to generate sequences of model hyperparameters for tuning, such as k values for k-means clustering. For instance, the variable creation section in the example above could be written with np.arange:
# initialize variables
k_values = np.arange(2, 7)
# note: dtype is overridden because k_values holds integers
# while the silhouette scores are floats
silhouette_scores = np.empty_like(k_values, dtype=float)
The pattern np.array([v for v in range(start, stop)]) is sometimes found in data science code. These instances are a good opportunity to use np.arange instead of a list comprehension wrapped in np.array.
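As a quick check that the two approaches produce the same values, here is a minimal sketch with illustrative start and stop values:

```python
import numpy as np

start, stop = 2, 7

# list comprehension wrapped in np.array
from_list = np.array([v for v in range(start, stop)])

# the same sequence built directly
from_arange = np.arange(start, stop)

print(np.array_equal(from_list, from_arange))  # True
```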
An example use case of np.linspace and np.logspace is building sequences for model hyperparameter searches. For instance, it is common to search for a good regularization value when using Ridge or LASSO regression. The code example below shows the creation of a logspace that is used for grid searching the regularization hyperparameter in logistic regression.
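from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# create a logspace from 0.0001 to 10000 with 20 values evenly spaced on a log scale
tuned_parameters = [{'C': np.logspace(-4, 4, 20)}]
# grid search 'C' with logistic regression
# (random_state, X_train and y_train are assumed to be defined elsewhere)
clf = GridSearchCV(
    LogisticRegression(random_state=random_state),
    tuned_parameters,
    scoring='recall_macro'
)
clf.fit(X_train, y_train)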
Arrays Built from Iterable Objects
The following functions will create arrays from iterables.
- np.array: creates an np.ndarray from an existing object, for instance converting a list to an ndarray.
- np.asarray: converts an object to an ndarray, but does not create a copy of an existing ndarray.
np.array is by far the best-known array creation function in NumPy. An example is provided below.
# create an ndarray from a list
>>> np.array([2, 1, 4])
array([2, 1, 4])
np.asarray is a really useful function for coercing the arguments of a function to ndarrays. Functions that typically accept arrays commonly also accept any ‘list_like’ or ‘array_like’ object. Any function that takes an ‘array_like’ object, expecting to use it like an ndarray, must convert it to an array internally. However, if an ndarray is passed to the function, there is no need to use np.array, which by default creates a copy of the array. Instead, use np.asarray, which checks whether the input object is already an array or exposes an __array__ method (i.e. it contains an array representation, like pd.DataFrame or pd.Series) and only creates a new ndarray when necessary.
def to_array(iterable_object):
    """Convert a list like object to a numpy array
    """
    # do not create a new array if object is an ndarray
    # or contains an ndarray representation
    array = np.asarray(iterable_object)
    return array
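The difference in copy behaviour can be observed with an identity check. The following is a minimal sketch with illustrative variable names, relying on the default arguments of both functions:

```python
import numpy as np

original = np.array([2, 1, 4])

copied = np.array(original)    # default behaviour: a new array is created
same = np.asarray(original)    # the input is already an ndarray, so it is returned as-is

print(copied is original)  # False
print(same is original)    # True

# a list has no array representation yet, so a new ndarray must be created
from_list = np.asarray([2, 1, 4])
print(type(from_list))
```

Because np.asarray can return the input object itself, modifying the result in place also modifies the caller's array, which is worth keeping in mind when writing functions that mutate their converted arguments.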
Note: ‘list_like’ or ‘array_like’ are colloquialisms used in the PyData stack to describe an object that ‘looks’ like a list or an array.