4. NumPy Basics: Arrays and Vectorized Computation

Learning Objectives

Learn about NumPy, a package for numerical computing in Python
Use NumPy for array-based data: operations, algorithms

Import NumPy

import numpy as np # Recommended standard NumPy convention

Array-based operations

A fast, flexible container for large datasets in Python
Stores multiple items of the same type together
Can perform operations on whole blocks of data with similar syntax

Image of an array with 10 length and the first index, 8th element, and indicies denoted by text

Create an array
Perform operation

arr = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])
arr

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

All of the elements have been multiplied by 10.

arr * 10

array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

Every array has a shape indicating the size of each dimension
and a dtype, an object describing the data type of the array

Shape
dtype

arr.shape

(2, 3)

arr.dtype

dtype('float64')

ndarray

Generic one/multi-dimensional container where all elements are the same type
Created using numpy.array function

1D
Multi-dimensional

data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

print(arr1.ndim)
print(arr1.shape)

1
(5,)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

print(arr2.ndim)
print(arr2.shape)

2
(2, 4)

Special array creation

numpy.zeros creates an array of zeros with a given length or shape
numpy.ones creates an array of ones with a given length or shape
numpy.empty creates an array without initialized values
numpy.arange creates a range
Pass a tuple for the shape to create a higher dimensional array

Zeros
Multi-dimensional

np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

numpy.empty does not return an array of zeros, though it may look like it.

np.empty(1)

array([0.])

Wes provides a table of array creation functions in the book.

Data types for ndarrays

Unless explicitly specified, numpy.array tries to infer a good data created arrays.
Data type is stored in a special dtype metadata object.
Can be explict or converted (cast)
It is important to care about the general kind of data you’re dealing with.

arr1.dtype

dtype('float64')

arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

dtype('int32')

float_arr = arr1.astype(np.float64)
float_arr.dtype

dtype('float64')

int_array = arr1.astype(arr2.dtype)
int_array.dtype

dtype('int32')

Calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type.

Wes provides a table of supported data types in the book.

Arithmetic with NumPy Arrays

Batch operations on data without for loops

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

Propagate the scalar argument to each element in the array

1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

of the same size yield boolean arrays

arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

Basic Indexing and Slicing

select a subset of your data or individual elements

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Array views are on the original data. Data is not copied, and any modifications to the view will be reflected in the source array. If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy().

arr[5]

arr[5:8]

array([5, 6, 7])

arr[5:8] = 12

Example of “not copied data”

Original

arr_slice = arr[5:8]
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

Change values in new array

Notice that arr is now changed.

arr_slice[1] = 123
arr

array([  0,   1,   2,   3,   4,  12, 123,  12,   8,   9])

Change all values in an array

This is done with bare slice [:]:

arr_slice[:] = 64
arr_slice

array([64, 64, 64])

Higher dimensional arrays have 1D arrays at each index:

arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

To slice, can pass a comma-separated list to select individual elements:

arr2d[0][2]

Omitting indicies will reduce number of dimensions:

arr2d[0]

array([1, 2, 3])

Can assign scalar values or arrays:

arr2d[0] = 9
arr2d

array([[9, 9, 9],
       [4, 5, 6],
       [7, 8, 9]])

Or create an array of the indices. This is like indexing in two steps:

arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d[1,0]

Indexing with slices

ndarrays can be sliced with the same syntax as Python lists:

arr = np.arange(10)

arr[1:6]

array([1, 2, 3, 4, 5])

This slices a range of elements (“select the first row of arr2d”):

# arr2d[row, column]
arr2d[:1]

array([[1, 2, 3]])

Can pass multiple indicies:

arr2d[:3, :1] # colons keep the dimensions
# arr2d[0:3, 0] # does not keep the dimensions

array([[1],
       [4],
       [7]])

Boolean Indexing

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])
data

array([[  4,   7],
       [  0,   2],
       [ -5,   6],
       [  0,   0],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized.

names == "Bob"

array([ True, False, False,  True, False, False, False])

This boolean array can be passed when indexing the array:

data[names == "Bob"]

array([[4, 7],
       [0, 0]])

Select from the rows where names == “Bob” and index the columns, too:

data[names == "Bob", 1:]

array([[7],
       [0]])

Select everything but “Bob”:

names != "Bob" # or ~(names == "Bob")

array([False,  True,  True, False,  True,  True,  True])

Use boolean arithmetic operators like & (and) and | (or):

mask = (names == "Bob") | (names == "Will")
mask

array([ True, False,  True,  True,  True, False, False])

Selecting data from an array by boolean indexing and assigning the result to a new variable always creates a copy of the data.

Setting values with boolean arrays works by substituting the value or values on the righthand side into the locations where the boolean array’s values are True.

data[data < 0] = 0

You can also set whole rows or columns using a one-dimensional boolean array:

data[names != "Joe"] = 7

Fancy Indexing

A term adopted by NumPy to describe indexing using integer arrays.

arr = np.zeros((8, 4)) # 8 × 4 array

for i in range(8):
  arr[i] = i

arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

Pass a list or ndarray of integers specifying the desired order to subset rows in a particular order:

arr[[4, 3, 0, 6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

Use negative indices selects rows from the end:

arr[[-3, -5, -7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

Passing multiple index arrays selects a one-dimensional array of elements corresponding to each tuple of indices (go down then across):

arr = np.arange(32).reshape((8, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

Here, the elements (1, 0), (5, 3), (7, 1), and (2, 2) are selected.

arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

Fancy indexing, unlike slicing, always copies the data into a new array when assigning the result to a new variable.

Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping using the special T attribute:

arr = np.arange(15).reshape((3, 5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

np.dot(arr.T, arr)

array([[125, 140, 155, 170, 185],
       [140, 158, 176, 194, 212],
       [155, 176, 197, 218, 239],
       [170, 194, 218, 242, 266],
       [185, 212, 239, 266, 293]])

arr.T @ arr

array([[125, 140, 155, 170, 185],
       [140, 158, 176, 194, 212],
       [155, 176, 197, 218, 239],
       [170, 194, 218, 242, 266],
       [185, 212, 239, 266, 293]])

ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1], [3, 5, 6]])
arr
arr.swapaxes(0, 1)

array([[ 0,  1,  6, -1,  1,  3],
       [ 1,  2,  3,  0,  0,  5],
       [ 0, -2,  2, -1,  1,  6]])

Pseudorandom Number Generation

The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.

Much faster than Python’s built-in random module

samples = np.random.standard_normal(size=(4, 4))
samples

array([[ 0.17488936,  1.40484911,  0.15183398, -1.02194459],
       [ 0.69530047,  1.69838274, -0.5782449 , -0.32245913],
       [ 1.30932161, -0.48999345, -0.13171682,  0.67943756],
       [-0.12637043, -0.82355441, -0.86697578, -0.06906716]])

Can use an explicit generator:

seed determines initial state of generator

rng = np.random.default_rng(seed=12345)
data = rng.standard_normal((2, 3))
data

array([[-1.42382504,  1.26372846, -0.87066174],
       [-0.25917323, -0.07534331, -0.74088465]])

Wes provides a table of NumPy random number generator methods

Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays.

Many ufuncs are simple element-wise transformations:

One array

arr = np.arange(10)
np.sqrt(arr)

array([0.        , 1.        , 1.41421356, 1.73205081, 2.        ,
       2.23606798, 2.44948974, 2.64575131, 2.82842712, 3.        ])

arr1 = rng.standard_normal(10)
arr2 = rng.standard_normal(10)
np.maximum(arr1, arr2)

array([ 0.78884434,  0.6488928 ,  0.57585751,  1.39897899,  2.34740965,
        0.96849691,  0.90291934,  0.90219827, -0.15818926,  0.44948393])

remainder, whole_part = np.modf(arr1)
remainder

array([-0.3677927 ,  0.6488928 ,  0.36105811, -0.95286306,  0.34740965,
        0.96849691, -0.75938718,  0.90219827, -0.46695317, -0.06068952])

Use the out argument to assign results into an existing array rather than create a new one:

out = np.zeros_like(arr)
np.add(arr, 1, out=out)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

Array-Oriented Programming with Arrays

Evaluate the function sqrt(x^2 + y^2) across a regular grid of values: use the numpy.meshgrid function takes two one-dimensional arrays and produce two two-dimensional matrices corresponding to all pairs of (x, y) in the two arrays:

points = np.arange(-5, 5, 0.01) # 100 equally spaced points
xs, ys = np.meshgrid(points, points)
xs

array([[-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       ...,
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99],
       [-5.  , -4.99, -4.98, ...,  4.97,  4.98,  4.99]])

ys

array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])

Evaluate the function as if it were two points:

z = np.sqrt(xs ** 2 + ys ** 2)
z

array([[7.07106781, 7.06400028, 7.05693985, ..., 7.04988652, 7.05693985,
        7.06400028],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       ...,
       [7.04988652, 7.04279774, 7.03571603, ..., 7.0286414 , 7.03571603,
        7.04279774],
       [7.05693985, 7.04985815, 7.04278354, ..., 7.03571603, 7.04278354,
        7.04985815],
       [7.06400028, 7.05692568, 7.04985815, ..., 7.04279774, 7.04985815,
        7.05692568]])

Bonus: matplotlib visualization

import matplotlib.pyplot as plt

plt.imshow(z, cmap=plt.cm.gray)  #, extent=[-25, 10, -10, 10])
plt.colorbar() 
plt.title("Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")

Text(0.5, 1.0, 'Image plot of $\\sqrt{x^2 + y^2}$ for a grid of values')

plt.close("all")

Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else.

second and third arguments to numpy.where can also be scalars
can also combine scalars and arrays

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr:

x if condition else
numpy.where

result = [(x if c else y)
  for x, y, c in zip(xarr, yarr, cond)]
  
result

[1.1, 2.2, 1.3, 1.4, 2.5]

result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4, 2.5])

Can also do this with scalars, or combine arrays and scalars:

arr = rng.standard_normal((4,4))
arr

array([[-1.34360107, -0.08168759,  1.72473993,  2.61815943],
       [ 0.77736134,  0.8286332 , -0.95898831, -1.20938829],
       [-1.41229201,  0.54154683,  0.7519394 , -0.65876032],
       [-1.22867499,  0.25755777,  0.31290292, -0.13081169]])

np.where(arr > 0, 2, -2)

array([[-2, -2,  2,  2],
       [ 2,  2, -2, -2],
       [-2,  2,  2, -2],
       [-2,  2,  2, -2]])

# set only positive to 2
np.where(arr > 0,2,arr)

array([[-1.34360107, -0.08168759,  2.        ,  2.        ],
       [ 2.        ,  2.        , -0.95898831, -1.20938829],
       [-1.41229201,  2.        ,  2.        , -0.65876032],
       [-1.22867499,  2.        ,  2.        , -0.13081169]])

Mathematical and Statistical Methods

Use “aggregations’ like sum, mean, and std

If using NumPy, must pass the array you want to aggregate as the first argument

arr = rng.standard_normal((5, 4))

arr.mean()

0.06622379901441691

np.mean(arr)

0.06622379901441691

Can use axis to specify which axis to computer the statistic

“compute across the columns”
“compute across the rows”

arr.mean(axis=1)

array([ 0.00066383,  0.40377331,  0.44452789, -0.36983452, -0.14801151])

arr.mean(axis=0)

array([ 0.54494867, -0.10500845,  0.15080113, -0.32584615])

Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:

arr.cumsum()

array([1.26998312e+00, 1.17702066e+00, 1.11086977e+00, 2.65530664e-03,
       1.38612157e-01, 1.48568992e+00, 1.54683394e+00, 1.61774854e+00,
       2.05140308e+00, 2.32888674e+00, 2.85913913e+00, 3.39586010e+00,
       4.01421011e+00, 3.21919265e+00, 3.51922360e+00, 1.91652201e+00,
       2.18332084e+00, 9.21697056e-01, 8.50426250e-01, 1.32447598e+00])

In multidimensional arrays, accumulation functions like cumsum compute along the indicated axis:

“compute across the columns”
“compute across the rows”

arr.cumsum(axis=1)

array([[ 1.26998312,  1.17702066,  1.11086977,  0.00265531],
       [ 0.13595685,  1.48303461,  1.54417864,  1.61509324],
       [ 0.43365454,  0.7111382 ,  1.24139058,  1.77811155],
       [ 0.61835001, -0.17666744,  0.1233635 , -1.47933809],
       [ 0.26679883, -0.99482495, -1.06609576, -0.59204603]])

arr.cumsum(axis=0)

array([[ 1.26998312, -0.09296246, -0.06615089, -1.10821447],
       [ 1.40593997,  1.25411531, -0.00500687, -1.03729987],
       [ 1.83959451,  1.53159897,  0.52524552, -0.5005789 ],
       [ 2.45794452,  0.73658151,  0.82527646, -2.10328049],
       [ 2.72474335, -0.52504227,  0.75400566, -1.62923076]])

Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:

(arr > 0).sum() # Number of positive values

any tests whether one or more values in an array is True, while all checks if every value is True:

bools = np.array([False, False, True, False])
bools.any()

True

Sorting

NumPy arrays can be sorted in place with the sort method:

arr = rng.standard_normal(6)
arr.sort()
arr

array([-1.64041784, -1.15452958, -0.85725882, -0.41485376,  0.0977165 ,
        0.68828179])

Can sort multidimensional section by providing an axis:

arr = rng.standard_normal((5, 3))

“compute across the columns”
“compute across the rows”

arr.cumsum(axis=1)

array([[ 0.65045239, -0.73790756, -1.64529002],
       [-1.09542531, -1.08827961, -0.55391971],
       [-1.06580785, -1.24728059,  0.37467121],
       [-0.31739195, -1.13320691, -0.7466279 ],
       [-0.22363893, -0.92532973, -2.72104291]])

arr.cumsum(axis=0)

array([[ 0.65045239, -1.38835995, -0.90738246],
       [-0.44497292, -1.38121426, -0.37302255],
       [-1.51078076, -1.562687  ,  1.24892924],
       [-1.82817271, -2.37850196,  1.63550826],
       [-2.05181164, -3.08019277, -0.16020491]])

The top-level method numpy.sort returns a sorted copy of an array (like the Python built-in function sorted) instead of modifying the array in place:

arr2 = np.array([5, -10, 7, 1, 0, -3])
sorted_arr2 = np.sort(arr2)
sorted_arr2

array([-10,  -3,   0,   1,   5,   7])

Unique and Other Set Logic

numpy.unique returns the sorted unique values in an array:

np.unique(names)

array(['Bob', 'Joe', 'Will'], dtype='<U4')

numpy.in1d tests membership of the values in one array in another, returning a boolean array:

np.in1d(arr1, arr2)

array([False, False, False, False, False, False, False, False, False,
       False])

File Input and Output with Arrays

NumPy is able to save np.save and load np.load data to and from disk in some text or binary formats.

Arrays are saved by default in an uncompressed raw binary format with file extension .npy:

arr = np.arange(10)
np.save("some_array", arr)

np.load("some_array.npy")

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Save multiple arrays in an uncompressed archive using numpy.savez
If your data compresses well, use numpy.savez_compressed instead

4.6 Linear Algebra

Linear algebra operations, like matrix multiplication, decompositions, determinants, and other square matrix math, can be done with Numpy (np.dot(y) vs x.dot(y)):

np.dot(arr1, arr)

7.221776767282354

Example: Random Walks

import matplotlib.pyplot as plt
#! blockstart
import random
position = 0
walk = [position]
nsteps = 1000
for _ in range(nsteps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)
#! blockend

plt.plot(walk[:100])
plt.show()