Zodiac Wang

Some Useful NumPy Functions and Methods


Problems, techniques, and useful functions encountered while using NumPy

Table of Contents

  • 1  Funcs
    • 1.1  numpy.roll
    • 1.2  Statistical functions
    • 1.3  Matrix library
    • 1.4  numpy.random.choice
    • 1.5  numpy.allclose
    • 1.6  numpy.array_split
    • 1.7  numpy.linalg.norm
    • 1.8  numpy.apply_along_axis
    • 1.9  numpy.bincount
    • 1.10  numpy.unique
    • 1.11  np.hypot
    • 1.12  np.unravel_index
  • 2  Routines
    • 2.1  Custom dtypes
    • 2.2  NumPy import and export
      • 2.2.1  Export
        • 2.2.1.1  savetxt
        • 2.2.1.2  tofile
        • 2.2.1.3  np.save
      • 2.2.2  Import
        • 2.2.2.1  loadtxt
        • 2.2.2.2  genfromtxt
        • 2.2.2.3  fromfile
        • 2.2.2.4  np.load
    • 2.3  Converting an ndarray to a DataFrame (N-D -> 2-D)
      • 2.3.1  A more conservative conversion
      • 2.3.2  A more aggressive conversion
    • 2.4  NumPy functional programming
      • 2.4.1  numpy.vectorize
    • 2.5  Adding a new dimension to an array
      • 2.5.1  Using newaxis or None
      • 2.5.2  Using reshape
    • 2.6  Finding the most frequent element in an ndarray
      • 2.6.1  1-D
      • 2.6.2  N-D
    • 2.7  Swapping the axes of an ndarray (generalized transpose)
  • 3  Issues
    • 3.1  mgrid, ogrid and meshgrid
      • 3.1.1  mgrid
      • 3.1.2  ogrid
      • 3.1.3  meshgrid
    • 3.2  ndarray and matrix
    • 3.3  Automatic dimension reduction on slicing
    • 3.4  Broadcasting rules for (n,) and (n, 1)
    • 3.5  numpy.argwhere and numpy.where
      • 3.5.1  numpy.where
        • 3.5.1.1  Giving condition together with x and y
        • 3.5.1.2  Giving only a condition, without x and y
      • 3.5.2  numpy.argwhere
    • 3.6  numpy.tile and numpy.repeat
      • 3.6.1  tile
    • 3.7  The reshape operation in NumPy
    • 3.8  Swapping and comparing data in NumPy
      • 3.8.1  Swapping
      • 3.8.2  Comparing
    • 3.9  ufunc.outer
    • 3.10  numpy.argsort and numpy.sort
      • 3.10.1  numpy.sort
      • 3.10.2  numpy.argsort
In [1]:
import numpy as np

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Funcs¶

numpy.roll¶

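Shifts (rolls) the elements of an ndarray along the given axis; elements pushed past the last position re-enter at the first. With axis=None the array is flattened, rolled, and then restored to its original shape.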

np.roll(a, shift, axis=None)
In [2]:
s = np.random.randn(3, 3)
s
np.roll(s, 2)
np.roll(s, 2, 0)
np.roll(s, -2, 1)
Out[2]:
array([[ 0.09212846, -2.17681524,  1.66132633],
       [-0.40566573, -1.16840867,  0.74303226],
       [-0.07633158,  1.12224718,  0.76305757]])
Out[2]:
array([[ 1.12224718,  0.76305757,  0.09212846],
       [-2.17681524,  1.66132633, -0.40566573],
       [-1.16840867,  0.74303226, -0.07633158]])
Out[2]:
array([[-0.40566573, -1.16840867,  0.74303226],
       [-0.07633158,  1.12224718,  0.76305757],
       [ 0.09212846, -2.17681524,  1.66132633]])
Out[2]:
array([[ 1.66132633,  0.09212846, -2.17681524],
       [ 0.74303226, -0.40566573, -1.16840867],
       [ 0.76305757, -0.07633158,  1.12224718]])
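
Because the random values above make the shifts hard to follow, here is a minimal sketch of the same three calls on a small integer array (the array is chosen purely for illustration):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)      # [[0, 1, 2], [3, 4, 5]]

# axis=None: flatten, shift by one position, restore the shape
flat_roll = np.roll(a, 1)           # [[5, 0, 1], [2, 3, 4]]

# axis=0 shifts whole rows down, axis=1 shifts whole columns right
row_roll = np.roll(a, 1, axis=0)    # [[3, 4, 5], [0, 1, 2]]
col_roll = np.roll(a, 1, axis=1)    # [[2, 0, 1], [5, 3, 4]]
```

A negative shift moves elements in the opposite direction.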

Statistical functions¶

Some of the statistical functions in NumPy

Basic descriptive statistics with scipy.stats

In [3]:
from scipy import stats
arr = np.random.random((3, 3))


stats.describe(arr)
Out[3]:
DescribeResult(nobs=3, minmax=(array([0.04891304, 0.08643263, 0.16321102]), array([0.61292039, 0.85464147, 0.63543607])), mean=array([0.29364756, 0.42151269, 0.40286062]), variance=array([0.08369304, 0.15474637, 0.05578666]), skewness=array([ 0.44191581,  0.42945111, -0.0549739 ]), kurtosis=array([-1.5, -1.5, -1.5]))
In [4]:
# Sum of all elements

np.sum(arr)
Out[4]:
3.3540626174876325
In [5]:
# Sum of each column; note that axis is 0

np.sum(arr, axis=0)
Out[5]:
array([0.88094269, 1.26453807, 1.20858186])
In [6]:
# Sum of each row; note that axis is 1

np.sum(arr, axis=1)
Out[6]:
array([0.54528044, 1.09959537, 1.7091868 ])
In [7]:
# Cumulative sum over all elements (row by row, left to right within each row): each step adds the current element to the running total

np.cumsum(arr)
Out[7]:
array([0.04891304, 0.13534567, 0.54528044, 1.15820083, 1.4816648 ,
       1.64487582, 1.86398508, 2.71862655, 3.35406262])
In [8]:
# Cumulative sum down each column; returns a 2-D array

np.cumsum(arr, axis=0)
Out[8]:
array([[0.04891304, 0.08643263, 0.40993478],
       [0.66183342, 0.4098966 , 0.57314579],
       [0.88094269, 1.26453807, 1.20858186]])
In [9]:
# Cumulative product along each row; returns a 2-D array

np.cumprod(arr, axis=1)
Out[9]:
array([[0.04891304, 0.00422768, 0.00173307],
       [0.61292039, 0.19825766, 0.03235783],
       [0.21910926, 0.18725986, 0.11899167]])
In [10]:
# Minimum of all elements

np.min(arr)
Out[10]:
0.048913036920193664
In [11]:
# Maximum of each column

np.max(arr, axis=0)
Out[11]:
array([0.61292039, 0.85464147, 0.63543607])
In [12]:
# Mean of all elements

np.mean(arr)
Out[12]:
0.3726736241652925
In [13]:
# Mean of each row

np.mean(arr, axis=1)
Out[13]:
array([0.18176015, 0.36653179, 0.56972893])
In [14]:
# Median of all elements

np.median(arr)
Out[14]:
0.3234639717516794
In [15]:
# Median of each column

np.median(arr, axis=0)
Out[15]:
array([0.21910926, 0.32346397, 0.40993478])
In [16]:
# Variance of all elements

np.var(arr)
Out[16]:
0.0685641129690908
In [17]:
# Standard deviation of each row

np.std(arr, axis=1)
Out[17]:
array([0.16206928, 0.18610169, 0.2635822 ])

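There are also the set routines:

  • unique(x): the unique elements of x, returned sorted
  • intersect1d(x, y): the elements common to x and y, i.e. the intersection
  • union1d(x, y): the union of x and y
  • setdiff1d(x, y): the set difference, i.e. the elements in x that are not in y
  • setxor1d(x, y): the symmetric difference, i.e. the elements in exactly one of the two arrays
  • in1d(x, y): test whether each element of x is also present in y (np.isin is the newer equivalent)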
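
A minimal sketch of the set routines listed above, on two small arrays picked for illustration. It uses np.isin rather than in1d; the variable names are hypothetical.

```python
import numpy as np

x = np.array([3, 1, 2, 3, 1])
y = np.array([2, 3, 4])

uniq = np.unique(x)            # [1, 2, 3]
common = np.intersect1d(x, y)  # [2, 3]
union = np.union1d(x, y)       # [1, 2, 3, 4]
only_x = np.setdiff1d(x, y)    # [1]
either = np.setxor1d(x, y)     # [1, 4]
mask = np.isin(x, y)           # [True, False, True, True, False]
```

All of these except np.isin return sorted, de-duplicated 1-D arrays; np.isin returns a boolean mask with the shape of x.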

Matrix library¶

numpy.matlib

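This module contains all the functions in the numpy namespace, except that the functions below are replaced by versions that return matrix objects instead of ndarrays.

Functions in the numpy namespace that return a matrix: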

  • mat(data[, dtype]) # Interpret the input as a matrix.
  • matrix # Returns a matrix from an array-like object, or from a string of data.
  • asmatrix(data[, dtype]) # Interpret the input as a matrix.
  • bmat(obj[, ldict, gdict]) # Build a matrix object from a string, nested sequence, or array.

Replacement functions in matlib:

  • empty(shape[, dtype, order]) # Return a new matrix of given shape and type, without initializing entries.
  • zeros(shape[, dtype, order]) # Return a matrix of given shape and type, filled with zeros.
  • ones(shape[, dtype, order]) # Matrix of ones.
  • eye(n[, M, k, dtype]) # Return a matrix with ones on the diagonal and zeros elsewhere.
  • identity(n[, dtype]) # Returns the square identity matrix of given size.
  • repmat(a, m, n) # Repeat a 0-D to 2-D array or matrix MxN times.
  • rand(*args) # Return a matrix of random values with given shape.
  • randn(*args) # Return a random matrix with data from the “standard normal” distribution.

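Compare the results of stacking arrays of different shapes. As the output shows, a (10,) array stacks roughly like a (1, 10) row, not like a (10, 1) column.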

In [18]:
x = np.arange(10)  # (10,) shape
y = x.reshape(-1, 1)  # (10, 1) shape
z = x.reshape(1, -1)  # (1, 10) shape

np.vstack([x, x]).shape
np.hstack([x, x]).shape

np.vstack([y, y]).shape
np.hstack([y, y]).shape

np.vstack([z, z]).shape
np.hstack([z, z]).shape
Out[18]:
(2, 10)
Out[18]:
(20,)
Out[18]:
(20, 1)
Out[18]:
(10, 2)
Out[18]:
(2, 10)
Out[18]:
(1, 20)

numpy.random.choice¶

Generates a random sample from a given 1-D array.

numpy.random.choice(a, size=None, replace=True, p=None)

  • https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html

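The replace parameter controls whether sampling is done with replacement, i.e. whether the same element may be drawn more than once.

random.choices from the Python standard library is similar.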

In [19]:
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])

aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
Out[19]:
array([2, 3, 2], dtype=int64)
Out[19]:
array([3, 2, 0])
Out[19]:
array(['Christopher', 'pooh', 'piglet', 'pooh', 'Christopher'],
      dtype='<U11')
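
The outputs above change on every run. As a hedged sketch, the same sampling can be made reproducible with a seeded Generator from np.random.default_rng (the newer random API, NumPy 1.17+); the seed value and variable names here are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded Generator; the seed is arbitrary

# Without replacement, no element can be drawn twice
sample = rng.choice(10, size=5, replace=False)

# Entries with probability 0 (here 1 and 4) are never drawn
weighted = rng.choice(5, size=100, p=[0.1, 0, 0.3, 0.6, 0])
```

Generator.choice takes the same a, size, replace, and p arguments as numpy.random.choice.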

numpy.allclose¶

numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)

Returns True if two arrays are element-wise equal within a tolerance, i.e. if absolute(a - b) <= (atol + rtol * absolute(b)) holds for every element.

  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html

By default two NaNs are not considered equal; set equal_nan=True to treat them as equal.

See also:

  • np.array_equal
In [20]:
np.allclose([1e10, 1e-7], [1.00001e10, 1e-8])

np.allclose([1e10, 1e-8], [1.00001e10, 1e-9])

np.allclose([1e10, 1e-8], [1.0001e10, 1e-9])

np.allclose([1.0, np.nan], [1.0, np.nan])

np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
Out[20]:
False
Out[20]:
True
Out[20]:
False
Out[20]:
False
Out[20]:
True
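
The tolerance test used by allclose is absolute(a - b) <= atol + rtol * absolute(b), which has two consequences that are easy to miss. The sketch below, with values chosen only for illustration, shows that the test is not symmetric in a and b, and that the default atol dominates for very small numbers.

```python
import numpy as np

# rtol scales with the second argument, so swapping arguments can change the result
close_ab = np.allclose(1.0, 1.1, rtol=0.095, atol=0)  # 0.1 <= 0.095 * 1.1 -> True
close_ba = np.allclose(1.1, 1.0, rtol=0.095, atol=0)  # 0.1 <= 0.095 * 1.0 -> False

# For tiny values the default atol=1e-08 hides a factor-of-two difference
tiny_default = np.allclose(1e-9, 2e-9)                # True
tiny_strict = np.allclose(1e-9, 2e-9, atol=0)         # False
```

When comparing numbers much smaller than 1, it is usually worth setting atol explicitly.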

numpy.array_split¶

numpy.array_split(ary, indices_or_sections, axis=0)

Splits an array into several sub-arrays and returns them as a list.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_split.html

In [21]:
x = np.arange(8.0)

np.array_split(x, 3)
Out[21]:
[array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7.])]
In [22]:
x = np.arange(7.0)

np.array_split(x, 3)
Out[22]:
[array([0., 1., 2.]), array([3., 4.]), array([5., 6.])]

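Note how the split is handled when the length is not evenly divisible: for an array of length l split into n sections, the first l % n sub-arrays have size l//n + 1 and the rest have size l//n.

Splitting along a given axis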

In [23]:
x = np.arange(12).reshape(3, 4)

np.array_split(x, 2, axis=1)  # split along axis 1 (columns)
Out[23]:
[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]
In [24]:
np.array_split(x, 3, axis=1)
Out[24]:
[array([[0, 1],
        [4, 5],
        [8, 9]]), array([[ 2],
        [ 6],
        [10]]), array([[ 3],
        [ 7],
        [11]])]

A similar function is numpy.split(), which raises an error if the array cannot be divided into equal parts.
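
A small sketch contrasting the two functions on a length-7 array (variable names are illustrative): array_split accepts an uneven division, while split raises ValueError.

```python
import numpy as np

x = np.arange(7)

# array_split tolerates uneven division: sizes are 3, 2, 2
sizes = [len(part) for part in np.array_split(x, 3)]

# split insists on equal parts and raises otherwise
try:
    np.split(x, 3)
    split_failed = False
except ValueError:
    split_failed = True

# Both also accept explicit split indices: x[:2], x[2:5], x[5:]
parts = np.split(x, [2, 5])
```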

numpy.linalg.norm¶

numpy.linalg.norm(x, ord=None, axis=None, keepdims=False)

Returns one of several matrix or vector norms, selected by the ord parameter.

  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html
  • Algebra Routine
ord     norm for matrices               norm for vectors
None    Frobenius norm                  2-norm
'fro'   Frobenius norm                  –
'nuc'   nuclear norm                    –
inf     max(sum(abs(x), axis=1))        max(abs(x))
-inf    min(sum(abs(x), axis=1))        min(abs(x))
0       –                               sum(x != 0)
1       max(sum(abs(x), axis=0))        as below
-1      min(sum(abs(x), axis=0))        as below
2       2-norm (largest sing. value)    as below
-2      smallest singular value         as below
other   –                               sum(abs(x)**ord)**(1./ord)

The Frobenius norm is given by:

$$ \|A\|_F = \left[ \sum_{i,j} |a_{i,j}|^2 \right]^{1/2} $$
The nuclear norm is the sum of the singular values.
In [25]:
from numpy import linalg as LA
a = np.arange(9) - 4
a

b = a.reshape((3, 3))
b
Out[25]:
array([-4, -3, -2, -1,  0,  1,  2,  3,  4])
Out[25]:
array([[-4, -3, -2],
       [-1,  0,  1],
       [ 2,  3,  4]])
In [26]:
LA.norm(a)
LA.norm(b)
LA.norm(b, 'fro')
LA.norm(a, np.inf)
LA.norm(b, np.inf)
LA.norm(a, -np.inf)
LA.norm(b, -np.inf)
Out[26]:
7.745966692414834
Out[26]:
7.745966692414834
Out[26]:
7.745966692414834
Out[26]:
4.0
Out[26]:
9.0
Out[26]:
0.0
Out[26]:
2.0
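
To connect the table with the numbers above, the following sketch recomputes a few of the matrix norms by hand for the same 3x3 array b and leaves the comparison against LA.norm to the reader (or to an assert). The variable names are illustrative.

```python
import numpy as np

b = (np.arange(9) - 4).reshape(3, 3)

# Frobenius norm: square root of the sum of squared absolute values
fro = np.sqrt((np.abs(b) ** 2).sum())

# ord=inf / ord=-inf: largest / smallest absolute row sum
inf_norm = np.abs(b).sum(axis=1).max()      # row sums are 9, 2, 9
neg_inf_norm = np.abs(b).sum(axis=1).min()

# Nuclear norm: sum of the singular values
nuc = np.linalg.svd(b, compute_uv=False).sum()
```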

numpy.apply_along_axis¶

numpy.apply_along_axis(func1d, axis, arr, *args, **kwargs)
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.apply_along_axis.html

Applies a function to the 1-D slices taken along the given axis.

In [27]:
def my_func(a):
    """Average first and last element of a 1-D array"""
    return (a[0] + a[-1]) * 0.5


b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.apply_along_axis(my_func, 0, b)

np.apply_along_axis(my_func, 1, b)
Out[27]:
array([4., 5., 6.])
Out[27]:
array([2., 5., 8.])

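The result keeps the shape of arr except along axis: that axis is removed and replaced by dimensions matching the shape of func1d's return value. A scalar return value drops the axis, a 1-D return value of the same length keeps the shape, and a higher-dimensional return value adds dimensions.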

In [28]:
b = np.array([[8, 1, 7], [4, 3, 9], [5, 2, 6]])
np.apply_along_axis(sorted, 1, b)
Out[28]:
array([[1, 7, 8],
       [3, 4, 9],
       [2, 5, 6]])
In [29]:
b = np.array([[1, 2], [4, 5], [7, 8]])
res = np.apply_along_axis(np.diag, -1, b)
res
res.shape
Out[29]:
array([[[1, 0],
        [0, 2]],

       [[4, 0],
        [0, 5]],

       [[7, 0],
        [0, 8]]])
Out[29]:
(3, 2, 2)
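
apply_along_axis calls func1d once per slice from Python, so when a direct array expression exists it is usually the simpler choice. As a sketch, the my_func example from above can be written with plain indexing and gives the same values:

```python
import numpy as np

def my_func(a):
    """Average first and last element of a 1-D array"""
    return (a[0] + a[-1]) * 0.5

b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Along axis 0 the 1-D slices are columns: average the first and last row
cols_loop = np.apply_along_axis(my_func, 0, b)
cols_vec = (b[0] + b[-1]) * 0.5             # [4., 5., 6.]

# Along axis 1 the slices are rows: average the first and last column
rows_loop = np.apply_along_axis(my_func, 1, b)
rows_vec = (b[:, 0] + b[:, -1]) * 0.5       # [2., 5., 8.]
```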

numpy.bincount¶

numpy.bincount(x, weights=None, minlength=0)
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html

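Counts the occurrences of each value in x; values that do not occur get a count of 0.

The input must be a 1-D array of non-negative integers.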

In [30]:
np.bincount(np.arange(5))

np.bincount(np.array([0, 1, 1, 3, 2, 1, 7]))
Out[30]:
array([1, 1, 1, 1, 1], dtype=int64)
Out[30]:
array([1, 3, 1, 1, 0, 0, 0, 1], dtype=int64)
In [31]:
x = np.array([0, 1, 1, 3, 2, 1, 7, 23])

np.bincount(x).size
np.bincount(x).size == np.max(x)+1
Out[31]:
24
Out[31]:
True
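
The two optional parameters in the signature are not exercised above. A minimal sketch with illustrative data: minlength pads the result with zero counts, and weights sums a weight per bin instead of counting.

```python
import numpy as np

x = np.array([0, 1, 1, 3])
w = np.array([0.5, 1.0, 2.0, 4.0])

# minlength guarantees at least 6 bins, padding with zeros
counts = np.bincount(x, minlength=6)   # [1, 2, 0, 1, 0, 0]

# weights: each bin holds the sum of the weights of its elements
sums = np.bincount(x, weights=w)       # [0.5, 3.0, 0.0, 4.0]
```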

numpy.unique¶

numpy.unique(ar, return_index=False, return_inverse=False, return_counts=False, axis=None)
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

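Returns the sorted unique elements. With the default axis=None the array is flattened first. return_index controls whether the indices of the first occurrences are returned as well, and return_counts controls whether the counts are returned.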

In [32]:
a = np.array([[1, 2, 1], [2, 3, 4]])
np.unique(a)
Out[32]:
array([1, 2, 3, 4])
In [33]:
a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
np.unique(a, axis=0)
Out[33]:
array([[1, 0, 0],
       [2, 3, 4]])

return_index controls whether to return the indices, in the flattened original ndarray, of the first occurrence of each unique value

In [34]:
a = np.array([[1, 2, 1], [2, 3, 4]])
u, indices = np.unique(a, return_index=True)

u
indices

a[np.unravel_index(indices, a.shape)]
Out[34]:
array([1, 2, 3, 4])
Out[34]:
array([0, 1, 4, 5], dtype=int64)
Out[34]:
array([1, 2, 3, 4])

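return_inverse controls whether to return, for every element of the original array, its index into the unique array; these indices can be used to reconstruct the original ndarray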

In [35]:
u, indices = np.unique(a, return_inverse=True)
u
indices

u[indices].reshape(a.shape)
Out[35]:
array([1, 2, 3, 4])
Out[35]:
array([0, 1, 0, 1, 2, 3], dtype=int64)
Out[35]:
array([[1, 2, 1],
       [2, 3, 4]])

return_counts controls whether to return the number of occurrences of each unique value

In [36]:
u, counts = np.unique(a, return_counts=True)
u
counts
Out[36]:
array([1, 2, 3, 4])
Out[36]:
array([2, 2, 1, 1], dtype=int64)

np.hypot¶

Given the two legs of a right triangle, returns the hypotenuse (element-wise)

In [37]:
np.hypot(3*np.ones((3, 3)), 4*np.ones((3, 3)))
Out[37]:
array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])

np.unravel_index¶

unravel_index(indices, shape, order='C')

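Converts flat (1-D) indices into a tuple of coordinate arrays for an ndarray of the given shape; each array in the tuple holds the coordinates along one dimension.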

In [38]:
np.unravel_index([22, 41, 37], (7, 6))

np.unravel_index([1621, 1929], (6, 7, 8, 9))
Out[38]:
(array([3, 6, 6], dtype=int64), array([4, 5, 1], dtype=int64))
Out[38]:
(array([3, 3], dtype=int64),
 array([1, 5], dtype=int64),
 array([4, 6], dtype=int64),
 array([1, 3], dtype=int64))
In [39]:
x = np.arange(2000)

np.unravel_index(x, (1000, 2))
Out[39]:
(array([  0,   0,   1, ..., 998, 999, 999], dtype=int64),
 array([0, 1, 0, ..., 1, 0, 1], dtype=int64))
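
A common use of unravel_index is turning the flat index returned by argmax into a position. The sketch below also uses np.ravel_multi_index, the inverse operation; the array is chosen only for illustration.

```python
import numpy as np

a = np.array([[1, 9, 3],
              [4, 5, 6]])

flat = np.argmax(a)                          # flat index of the maximum: 1
pos = np.unravel_index(flat, a.shape)        # (row, column) = (0, 1)
largest = a[pos]                             # 9

# ravel_multi_index converts coordinates back into a flat index
back = np.ravel_multi_index(pos, a.shape)    # 1
```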

Routines¶

Custom dtypes¶

A custom (structured) dtype can be specified when creating an array; its fields may also be of str type

In [40]:
custom_ndarray = np.zeros(5, dtype=[('position', float, 2),
                                    ('size', float),    # scalar field; (float, 1) is deprecated in newer NumPy
                                    ('growth', float),  # scalar field
                                    ('color', float, 4),
                                    ('name', str, 1)])
custom_ndarray
custom_ndarray[0]
custom_ndarray[0]['position']
Out[40]:
array([([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
       ([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
       ([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
       ([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
       ([0., 0.], 0., 0., [0., 0., 0., 0.], '')],
      dtype=[('position', '<f8', (2,)), ('size', '<f8'), ('growth', '<f8'), ('color', '<f8', (4,)), ('name', '<U1')])
Out[40]:
([0., 0.], 0., 0., [0., 0., 0., 0.], '')
Out[40]:
array([0., 0.])
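
Once a structured array exists, each field behaves like an ordinary array view. The following is a sketch with a smaller, hypothetical dtype (the field names and values are made up) showing field assignment and ordering records by one field:

```python
import numpy as np

dt = np.dtype([('name', 'U8'), ('position', float, (2,)), ('size', float)])
pts = np.zeros(3, dtype=dt)

# Assigning to a field fills that field for every record
pts['name'] = ['c', 'a', 'b']
pts['size'] = [3.0, 1.0, 2.0]

# A sub-array field is a (3, 2) view; writing to it changes pts
pts['position'][:, 0] = [10.0, 20.0, 30.0]

# Reorder the records by the 'size' field
by_size = pts[np.argsort(pts['size'])]
names_sorted = by_size['name'].tolist()      # ['a', 'b', 'c']
```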

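NumPy import and export¶

NumPy provides several import and export routines for different situations. Where possible, the officially recommended way is to use the save and load functions with .npy files: the format stores the dtype and shape together with the data, so an array can be written and read back with a single call each.

Export¶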

savetxt¶

np.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)

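By default savetxt writes numbers in scientific notation (fmt='%.18e')

When saving data with savetxt it is best to specify the encoding explicitly; header, comments, and encoding can all be set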

In [41]:
arr = np.arange(6).reshape((2, 3))

# Export as CSV
np.savetxt('data/output.csv', arr, delimiter=',', header='',
           comments='', encoding='utf-8')  # values are written in float format

# Export as TXT
np.savetxt('data/output.txt', arr, delimiter=' ',
           header='', comments='', encoding='utf-8')

tofile¶

a.tofile(fid, sep="", format="%s")
In [42]:
x = np.zeros(
    (2,), dtype=[('time', [('min', int), ('sec', int)]), ('temp', float)])
x[0]['time']['min'] = 10
x['temp'] = 98.25
x
Out[42]:
array([((10, 0), 98.25), (( 0, 0), 98.25)],
      dtype=[('time', [('min', '<i4'), ('sec', '<i4')]), ('temp', '<f8')])
In [43]:
import os
x.tofile('data/temp.b')

np.save¶

np.save(file, arr, allow_pickle=True, fix_imports=True)
In [44]:
np.save("output-1.npy", arr)
np.save("output-2.npy", x)

Import¶

loadtxt¶

np.loadtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0, encoding='bytes')
In [45]:
np.loadtxt('data/output.csv', delimiter=',')
Out[45]:
array([[0., 1., 2.],
       [3., 4., 5.]])

genfromtxt¶

np.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes')
In [46]:
np.genfromtxt('data/output.txt', delimiter=' ')
Out[46]:
array([[0., 1., 2.],
       [3., 4., 5.]])

fromfile¶

numpy.fromfile(file, dtype=float, count=-1, sep='')

Construct an array from data in a text or binary file

In [47]:
np.fromfile("data/temp.b",
            dtype=[('time', [('min', int), ('sec', int)]), ('temp', float)])
Out[47]:
array([((10, 0), 98.25), (( 0, 0), 98.25)],
      dtype=[('time', [('min', '<i4'), ('sec', '<i4')]), ('temp', '<f8')])

Importing from a binary file

In [48]:
import struct
from struct import Struct

# Binary write


def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))


records = [(1, 2.3, 4.5),
           (6, 7.8, 9.0),
           (12, 13.4, 56.7)]

with open('data/data.b', 'wb') as f:
    write_records(records, '<idd', f)

records = np.fromfile('data/data.b', dtype='<i,<d,<d')

records
Out[48]:
array([( 1,  2.3,  4.5), ( 6,  7.8,  9. ), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])

np.load¶

np.load(file, mmap_mode=None, allow_pickle=True, fix_imports=True, encoding='ASCII')
In [49]:
np.load("output-1.npy")
np.load("output-2.npy")
Out[49]:
array([[0, 1, 2],
       [3, 4, 5]])
Out[49]:
array([((10, 0), 98.25), (( 0, 0), 98.25)],
      dtype=[('time', [('min', '<i4'), ('sec', '<i4')]), ('temp', '<f8')])
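
When several arrays belong together, np.savez stores them in a single .npz archive. This is a sketch, assuming a temporary directory instead of the data/ folder used above; the keyword names arr and rec are arbitrary.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6).reshape(2, 3)
rec = np.zeros(2, dtype=[('min', int), ('temp', float)])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bundle.npz')

    # Each keyword becomes the name of one array inside the archive
    np.savez(path, arr=arr, rec=rec)

    # np.load returns a dict-like NpzFile; arrays are read on access
    with np.load(path) as bundle:
        arr_back = bundle['arr']
        rec_back = bundle['rec']
```

As with .npy files, dtype and shape (including structured dtypes) survive the round trip.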

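Converting an ndarray to a DataFrame (N-D -> 2-D)¶

A multidimensional ndarray is an efficient data structure, but it is awkward to process with pandas, because pandas mostly works with 2-D data. The extra dimensions of the ndarray then have to be folded into columns of a 2-D DataFrame.

A more conservative conversion¶

Suppose we have data of shape (50, 100, 3), where the first dimension is time, the second is the individual's ID, and the third holds the individual's coordinates x, y, z. To process it with pandas, we want to convert the ndarray into a 2-D DataFrame of shape (5000, 5), where 5000 = 50 x 100 and the columns are x, y, z plus two new columns t and ID, giving the column labels t, ID, x, y, z.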

In [50]:
data = np.load("data/sample.npy")
data.shape
data
Out[50]:
(50, 100, 3)
Out[50]:
array([[[ 2.42442956e+02,  7.76911920e+01,  6.64777151e-01],
        [ 2.61380074e+02,  2.01793185e+02,  2.94516922e+00],
        [ 4.12767690e+02,  1.35482822e+02, -4.92483385e-01],
        ...,
        [ 4.10753164e+02,  2.02361917e+02, -1.47121999e-01],
        [ 2.69633830e+02,  2.68148789e+02,  1.27458590e+00],
        [ 3.30322105e+02,  1.75890005e+02, -7.97956043e-01]],

       [[ 2.45365704e+02,  7.83676099e+01,  2.27428156e-01],
        [ 2.58458717e+02,  2.02475590e+02,  2.91211536e+00],
        [ 4.14913593e+02,  1.33386372e+02, -7.73741839e-01],
        ...,
        [ 4.11856535e+02,  1.99572191e+02, -1.19416454e+00],
        [ 2.70467390e+02,  2.71030660e+02,  1.28923763e+00],
        [ 3.32290845e+02,  1.73626366e+02, -8.54962584e-01]],

       [[ 2.48239647e+02,  7.75071160e+01, -2.90917521e-01],
        [ 2.55461149e+02,  2.02596349e+02, -3.18185644e+00],
        [ 4.17462899e+02,  1.31804904e+02, -5.55250026e-01],
        ...,
        [ 4.09707081e+02,  1.97479383e+02, -2.36954656e+00],
        [ 2.68405895e+02,  2.73210163e+02,  2.32837616e+00],
        [ 3.33943130e+02,  1.71122378e+02, -9.87520127e-01]],

       ...,

       [[ 3.56975004e+02,  5.97239259e+00, -5.75102134e-01],
        [ 1.30143542e+02,  1.64376675e+02, -2.87853593e+00],
        [ 4.74988523e+02,  1.15517297e+01, -8.53119091e-01],
        ...,
        [ 4.27604192e+02,  8.84718553e+01, -7.13370919e-01],
        [ 1.53469992e+02,  2.17281949e+02, -3.01382082e+00],
        [ 3.95602341e+02,  5.33207351e+01, -9.52802521e-01]],

       [[ 3.58316991e+02,  3.28928418e+00, -1.10701967e+00],
        [ 1.27151855e+02,  1.64599848e+02, -3.21605251e+00],
        [ 4.75983562e+02,  8.72155337e+00, -1.23271295e+00],
        ...,
        [ 4.29080172e+02,  8.58600584e+01, -1.05641846e+00],
        [ 1.50495321e+02,  2.16892942e+02, -3.01155720e+00],
        [ 3.97560036e+02,  5.10475365e+01, -8.59831857e-01]],

       [[ 3.60269963e+02,  1.01202655e+00, -8.61907762e-01],
        [ 1.24279619e+02,  1.63733672e+02, -2.84869734e+00],
        [ 4.77039363e+02,  5.91347873e+00, -1.21116002e+00],
        ...,
        [ 4.31150993e+02,  8.36894137e+01, -8.08928892e-01],
        [ 1.47569414e+02,  2.16230318e+02, -2.91888170e+00],
        [ 3.99619625e+02,  4.88662316e+01, -8.14090703e-01]]])
In [51]:
dim_1, dim_2, dim_3 = data.shape

# Generate the values used to fill the new index columns
indice = np.mgrid[0:dim_1, 0:dim_2]
indice
indice = indice.reshape((2,-1))
indice

indice = indice.T
indice
Out[51]:
array([[[ 0,  0,  0, ...,  0,  0,  0],
        [ 1,  1,  1, ...,  1,  1,  1],
        [ 2,  2,  2, ...,  2,  2,  2],
        ...,
        [47, 47, 47, ..., 47, 47, 47],
        [48, 48, 48, ..., 48, 48, 48],
        [49, 49, 49, ..., 49, 49, 49]],

       [[ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        ...,
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99]]])
Out[51]:
array([[ 0,  0,  0, ..., 49, 49, 49],
       [ 0,  1,  2, ..., 97, 98, 99]])
Out[51]:
array([[ 0,  0],
       [ 0,  1],
       [ 0,  2],
       ...,
       [49, 97],
       [49, 98],
       [49, 99]])
In [52]:
data = data.reshape((-1, dim_3))
data.shape
data
Out[52]:
(5000, 3)
Out[52]:
array([[242.44295567,  77.69119197,   0.66477715],
       [261.38007362, 201.79318451,   2.94516922],
       [412.76769009, 135.48282173,  -0.49248338],
       ...,
       [431.15099334,  83.68941366,  -0.80892889],
       [147.56941378, 216.23031823,  -2.9188817 ],
       [399.61962538,  48.86623161,  -0.8140907 ]])
In [53]:
result = np.empty((dim_1*dim_2, dim_3+2))

result[:, :2] = indice
result[:, 2:] = data
result
Out[53]:
array([[  0.        ,   0.        , 242.44295567,  77.69119197,
          0.66477715],
       [  0.        ,   1.        , 261.38007362, 201.79318451,
          2.94516922],
       [  0.        ,   2.        , 412.76769009, 135.48282173,
         -0.49248338],
       ...,
       [ 49.        ,  97.        , 431.15099334,  83.68941366,
         -0.80892889],
       [ 49.        ,  98.        , 147.56941378, 216.23031823,
         -2.9188817 ],
       [ 49.        ,  99.        , 399.61962538,  48.86623161,
         -0.8140907 ]])

Convert to a DataFrame

In [54]:
import pandas as pd
df = pd.DataFrame(result, columns=["t", "ID", "x", "y", "z"])
df.head()
Out[54]:
t ID x y z
0 0.0 0.0 242.442956 77.691192 0.664777
1 0.0 1.0 261.380074 201.793185 2.945169
2 0.0 2.0 412.767690 135.482822 -0.492483
3 0.0 3.0 135.073158 406.116724 2.671991
4 0.0 4.0 235.803192 187.694907 2.775133
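
A more compact variant of the same conversion lets pandas build the t and ID columns with MultiIndex.from_product, which also keeps them as integers. This is only a sketch: data/sample.npy is not available here, so a small synthetic array (an assumption) stands in for it.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data/sample.npy (hypothetical values, smaller shape)
data = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
n_t, n_id, n_xyz = data.shape

# from_product varies the last level fastest, matching the C-order reshape below
index = pd.MultiIndex.from_product([range(n_t), range(n_id)], names=['t', 'ID'])

df = pd.DataFrame(data.reshape(-1, n_xyz), index=index,
                  columns=['x', 'y', 'z']).reset_index()
```

Leaving out reset_index() keeps (t, ID) as a MultiIndex, which can be convenient for groupby and selection.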

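A more aggressive conversion¶

Suppose we also want to collapse the three values of the last dimension into a new column, i.e. convert the (50, 100, 3) data into (15000, 4) data with the column labels t, ID, cat, value, where cat takes the three categories x, y, z.

This is probably the data layout best suited to processing with pandas.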

In [55]:
data = np.load("data/sample.npy")
data.shape
data
Out[55]:
(50, 100, 3)
Out[55]:
array([[[ 2.42442956e+02,  7.76911920e+01,  6.64777151e-01],
        [ 2.61380074e+02,  2.01793185e+02,  2.94516922e+00],
        [ 4.12767690e+02,  1.35482822e+02, -4.92483385e-01],
        ...,
        [ 4.10753164e+02,  2.02361917e+02, -1.47121999e-01],
        [ 2.69633830e+02,  2.68148789e+02,  1.27458590e+00],
        [ 3.30322105e+02,  1.75890005e+02, -7.97956043e-01]],

       [[ 2.45365704e+02,  7.83676099e+01,  2.27428156e-01],
        [ 2.58458717e+02,  2.02475590e+02,  2.91211536e+00],
        [ 4.14913593e+02,  1.33386372e+02, -7.73741839e-01],
        ...,
        [ 4.11856535e+02,  1.99572191e+02, -1.19416454e+00],
        [ 2.70467390e+02,  2.71030660e+02,  1.28923763e+00],
        [ 3.32290845e+02,  1.73626366e+02, -8.54962584e-01]],

       [[ 2.48239647e+02,  7.75071160e+01, -2.90917521e-01],
        [ 2.55461149e+02,  2.02596349e+02, -3.18185644e+00],
        [ 4.17462899e+02,  1.31804904e+02, -5.55250026e-01],
        ...,
        [ 4.09707081e+02,  1.97479383e+02, -2.36954656e+00],
        [ 2.68405895e+02,  2.73210163e+02,  2.32837616e+00],
        [ 3.33943130e+02,  1.71122378e+02, -9.87520127e-01]],

       ...,

       [[ 3.56975004e+02,  5.97239259e+00, -5.75102134e-01],
        [ 1.30143542e+02,  1.64376675e+02, -2.87853593e+00],
        [ 4.74988523e+02,  1.15517297e+01, -8.53119091e-01],
        ...,
        [ 4.27604192e+02,  8.84718553e+01, -7.13370919e-01],
        [ 1.53469992e+02,  2.17281949e+02, -3.01382082e+00],
        [ 3.95602341e+02,  5.33207351e+01, -9.52802521e-01]],

       [[ 3.58316991e+02,  3.28928418e+00, -1.10701967e+00],
        [ 1.27151855e+02,  1.64599848e+02, -3.21605251e+00],
        [ 4.75983562e+02,  8.72155337e+00, -1.23271295e+00],
        ...,
        [ 4.29080172e+02,  8.58600584e+01, -1.05641846e+00],
        [ 1.50495321e+02,  2.16892942e+02, -3.01155720e+00],
        [ 3.97560036e+02,  5.10475365e+01, -8.59831857e-01]],

       [[ 3.60269963e+02,  1.01202655e+00, -8.61907762e-01],
        [ 1.24279619e+02,  1.63733672e+02, -2.84869734e+00],
        [ 4.77039363e+02,  5.91347873e+00, -1.21116002e+00],
        ...,
        [ 4.31150993e+02,  8.36894137e+01, -8.08928892e-01],
        [ 1.47569414e+02,  2.16230318e+02, -2.91888170e+00],
        [ 3.99619625e+02,  4.88662316e+01, -8.14090703e-01]]])
In [56]:
dim_1, dim_2, dim_3 = data.shape

# Generate the values used to fill the new index columns
indice = np.mgrid[0:dim_1, 0:dim_2]
indice
indice = indice.reshape((2,-1))
indice

indice = indice.T
indice

data = data.reshape(-1)  # alternatively, take the 3 columns separately and then combine them

indice = np.repeat(indice, dim_3, axis=0)
indice.shape

cat = np.tile(np.arange(dim_3).reshape((dim_3,-1)), (dim_1*dim_2,1))
cat
Out[56]:
array([[[ 0,  0,  0, ...,  0,  0,  0],
        [ 1,  1,  1, ...,  1,  1,  1],
        [ 2,  2,  2, ...,  2,  2,  2],
        ...,
        [47, 47, 47, ..., 47, 47, 47],
        [48, 48, 48, ..., 48, 48, 48],
        [49, 49, 49, ..., 49, 49, 49]],

       [[ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        ...,
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99],
        [ 0,  1,  2, ..., 97, 98, 99]]])
Out[56]:
array([[ 0,  0,  0, ..., 49, 49, 49],
       [ 0,  1,  2, ..., 97, 98, 99]])
Out[56]:
array([[ 0,  0],
       [ 0,  1],
       [ 0,  2],
       ...,
       [49, 97],
       [49, 98],
       [49, 99]])
Out[56]:
(15000, 2)
Out[56]:
array([[0],
       [1],
       [2],
       ...,
       [0],
       [1],
       [2]])
In [57]:
result = np.empty((dim_1*dim_2*dim_3, 3+1))

result[:, :2] = indice
result[:, 2] = cat.reshape(-1)
result[:, -1] = data
result
Out[57]:
array([[  0.        ,   0.        ,   0.        , 242.44295567],
       [  0.        ,   0.        ,   1.        ,  77.69119197],
       [  0.        ,   0.        ,   2.        ,   0.66477715],
       ...,
       [ 49.        ,  99.        ,   0.        , 399.61962538],
       [ 49.        ,  99.        ,   1.        ,  48.86623161],
       [ 49.        ,  99.        ,   2.        ,  -0.8140907 ]])
In [58]:
import pandas as pd
df = pd.DataFrame(result, columns=["t", "ID", "cat", "value"])
df.head()
Out[58]:
t ID cat value
0 0.0 0.0 0.0 242.442956
1 0.0 0.0 1.0 77.691192
2 0.0 0.0 2.0 0.664777
3 0.0 1.0 0.0 261.380074
4 0.0 1.0 1.0 201.793185

Replace the cat values with strings

In [59]:
df.loc[df['cat'] == 0, "cat"] = "x"
df.loc[df['cat'] == 1, "cat"] = "y"
df.loc[df['cat'] == 2, "cat"] = "z"
df.head()
Out[59]:
t ID cat value
0 0.0 0.0 x 242.442956
1 0.0 0.0 y 77.691192
2 0.0 0.0 z 0.664777
3 0.0 1.0 x 261.380074
4 0.0 1.0 y 201.793185
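
The same long-format table can be sketched in one step with np.indices, which returns one index array per dimension of the data. Building the DataFrame from a dict keeps t and ID as integers and lets cat be a string column from the start. The synthetic array below is an assumption standing in for data/sample.npy.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data/sample.npy (hypothetical values, smaller shape)
data = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)

# One index array per axis, each with the same shape as data
t_idx, id_idx, cat_idx = np.indices(data.shape)
labels = np.array(['x', 'y', 'z'])

long_df = pd.DataFrame({
    't': t_idx.ravel(),
    'ID': id_idx.ravel(),
    'cat': labels[cat_idx.ravel()],   # map 0, 1, 2 to 'x', 'y', 'z'
    'value': data.ravel(),
})
```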

NumPy functional programming¶

Reference:

  • NumPy Functional programming

NumPy's functional programming routines are mainly the following:

  • apply_along_axis(func1d, axis, arr, *args, ...) Apply a function to 1-D slices along the given axis.
  • apply_over_axes(func, a, axes) Apply a function repeatedly over multiple axes.
  • vectorize(pyfunc[, otypes, doc, excluded, ...]) Generalized function class.
  • frompyfunc(func, nin, nout) Takes an arbitrary Python function and returns a NumPy ufunc.
  • piecewise(x, condlist, funclist, *args, **kw) Evaluate a piecewise-defined function.

numpy.vectorize¶

class numpy.vectorize(pyfunc, otypes=None, doc=None, excluded=None, cache=False, signature=None)

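Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

Returns a vectorized function built from the given Python function. According to the NumPy documentation, vectorize is provided primarily for convenience, not for performance; the implementation is essentially a for loop.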

In [60]:
def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    if a > b:
        return a - b
    else:
        return a + b


vfunc = np.vectorize(myfunc)
vfunc([1, 2, 3, 4], 2)
Out[60]:
array([3, 4, 1, 2])
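
For a simple branch like the one in myfunc, np.where expresses the same logic directly on arrays. A minimal sketch comparing the two on the input used above:

```python
import numpy as np

def myfunc(a, b):
    "Return a-b if a>b, otherwise return a+b"
    return a - b if a > b else a + b

a = np.array([1, 2, 3, 4])

# Element-by-element Python calls, with broadcasting of the scalar 2
looped = np.vectorize(myfunc)(a, 2)

# Same result computed by array operations: condition, value if true, value if false
direct = np.where(a > 2, a - 2, a + 2)
```

Note that np.where evaluates both branches for every element, which matters if one branch is invalid for some inputs.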

Adding a new dimension to an array¶

Using newaxis or None¶

In [61]:
from numpy import newaxis

x = np.arange(6).reshape((2, 3))
x.shape
Out[61]:
(2, 3)
In [62]:
x[:, newaxis].shape
x[:, None].shape
Out[62]:
(2, 1, 3)
Out[62]:
(2, 1, 3)
In [63]:
x[newaxis, :].shape
x[None, :].shape
Out[63]:
(1, 2, 3)
Out[63]:
(1, 2, 3)

Using reshape¶

In [64]:
x = np.arange(6).reshape((2, 3))
x.shape

x.reshape((-1, 2, 3)).shape
x.reshape((2, 3, -1)).shape
Out[64]:
(2, 3)
Out[64]:
(1, 2, 3)
Out[64]:
(2, 3, 1)
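
A third option, not used above, is np.expand_dims, which names the position of the new axis explicitly. A short sketch on the same (2, 3) array:

```python
import numpy as np

x = np.arange(6).reshape((2, 3))

front = np.expand_dims(x, axis=0).shape    # (1, 2, 3)
middle = np.expand_dims(x, axis=1).shape   # (2, 1, 3)
back = np.expand_dims(x, axis=-1).shape    # (2, 3, 1)

# Ellipsis plus newaxis appends an axis regardless of the number of dimensions
also_back = x[..., np.newaxis].shape       # (2, 3, 1)
```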

Finding the most frequent element in an ndarray¶

https://stackoverflow.com/questions/12297016/how-to-find-most-frequent-values-in-numpy-ndarray

1-D¶

In [65]:
arr = np.array([5, 4, -2, 1, -2, 0, 4, 4, -6, -1])
u, indices = np.unique(arr, return_inverse=True)
u
indices

count = np.bincount(indices)
count

u[np.argmax(count)]
Out[65]:
array([-6, -2, -1,  0,  1,  4,  5])
Out[65]:
array([6, 5, 1, 4, 1, 3, 5, 5, 0, 2], dtype=int64)
Out[65]:
array([1, 2, 1, 1, 1, 3, 1], dtype=int64)
Out[65]:
4
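
For the 1-D case, a shorter sketch asks np.unique for the counts directly, which avoids the separate bincount step:

```python
import numpy as np

arr = np.array([5, 4, -2, 1, -2, 0, 4, 4, -6, -1])

# u holds the sorted unique values, counts their number of occurrences
u, counts = np.unique(arr, return_counts=True)

# argmax returns the first maximum, so ties resolve to the smallest value
most_frequent = u[np.argmax(counts)]   # 4 (it occurs three times)
```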

N-D¶

Find the most frequent element along a given dimension

In [66]:
arr = np.array([[5, 5, 5, 5, -2, 0, 4, 4, -6, -1],
                [0, 1,  1, 2,  3, 4, 5, 6,  7,  8]])

u, indices = np.unique(arr, return_inverse=True)
u
indices


# bincount's minlength must be given here so that every row yields a count array of the same length
counted = np.apply_along_axis(np.bincount, 1, indices.reshape(arr.shape),
                              None, np.max(indices) + 1)
counted

u[np.argmax(counted, axis=1)]
Out[66]:
array([-6, -2, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8])
Out[66]:
array([ 8,  8,  8,  8,  1,  3,  7,  7,  0,  2,  3,  4,  4,  5,  6,  7,  8,
        9, 10, 11], dtype=int64)
Out[66]:
array([[1, 1, 1, 1, 0, 0, 0, 2, 4, 0, 0, 0],
       [0, 0, 0, 1, 2, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)
Out[66]:
array([5, 1])

Swapping the axes of an ndarray (generalized transpose)¶

  • numpy.transpose
numpy.transpose(a, axes=None)

Not an in-place operation; it returns a view of the original data whenever possible

In [67]:
x = np.arange(4).reshape((2, 2))
x

np.transpose(x)

x = np.ones((1, 2, 3))
np.transpose(x, (1, 0, 2)).shape
Out[67]:
array([[0, 1],
       [2, 3]])
Out[67]:
array([[0, 2],
       [1, 3]])
Out[67]:
(2, 1, 3)
In [68]:
x = np.arange(24).reshape((2, 3, 4))
x
np.transpose(x, (0, 2, 1))
Out[68]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Out[68]:
array([[[ 0,  4,  8],
        [ 1,  5,  9],
        [ 2,  6, 10],
        [ 3,  7, 11]],

       [[12, 16, 20],
        [13, 17, 21],
        [14, 18, 22],
        [15, 19, 23]]])
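
Because the result is a view, writing to the transposed array also changes the original. The sketch below checks this with np.shares_memory and shows np.swapaxes and np.moveaxis, two related functions for the common cases of swapping two axes or moving one.

```python
import numpy as np

x = np.arange(24).reshape((2, 3, 4))
t = np.transpose(x, (0, 2, 1))

shares = np.shares_memory(x, t)                       # the view reuses x's data

# Swapping axes 1 and 2 is the same permutation as (0, 2, 1)
same_as_swap = np.array_equal(t, np.swapaxes(x, 1, 2))

# moveaxis relocates one axis and keeps the order of the others
moved_shape = np.moveaxis(x, 0, -1).shape             # (3, 4, 2)

# Writing through the view modifies x as well
t[0, 0, 0] = -1
changed = x[0, 0, 0]
```

Call .copy() on the result when an independent array is needed.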

Issues¶

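mgrid, ogrid and meshgrid¶

These generate grid-like coordinate arrays

mgrid¶

np.mgrid is not actually a function but an object that is indexed with slice notation; it returns a dense ndarray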

  • https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.mgrid.html
In [69]:
arr = np.mgrid[0:4:2, 0:6:2]  # a step can be given
arr
Out[69]:
array([[[0, 0, 0],
        [2, 2, 2]],

       [[0, 2, 4],
        [0, 2, 4]]])
In [70]:
arr.shape
Out[70]:
(2, 2, 3)

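The resulting ndarray has shape (n, d1, d2), where n is the number of grid dimensions (here 2) and (d1, d2) is the grid shape; arr[k] holds the k-th coordinate of every grid point.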

In [71]:
np.mgrid[-1:1:5j]  # a complex step such as 5j means 5 points, with the endpoint included
Out[71]:
array([-1. , -0.5,  0. ,  0.5,  1. ])

ogrid¶

Returns a list of arrays (recent NumPy versions return a tuple instead)

  • https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.ogrid.html
In [72]:
ls = np.ogrid[0:4:2, 0:6:2]
ls
Out[72]:
[array([[0],
        [2]]), array([[0, 2, 4]])]

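The layout is the same as mgrid's, except that repeated elements are omitted (each array has length 1 along every other dimension, so the arrays broadcast to the full grid) and the result is a list of arrays rather than a single ndarray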

In [73]:
np.ogrid[-1:1:5j]  # a complex step such as 5j means 5 points, with the endpoint included
Out[73]:
array([-1. , -0.5,  0. ,  0.5,  1. ])

meshgrid¶

Returns a list of coordinate arrays (recent NumPy versions return a tuple instead)

numpy.meshgrid(*xi, **kwargs)
  • https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.meshgrid.html

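Taking one value from the same position in each of the returned ndarrays gives the (x, y) coordinates of one grid point, which matches the usual axis convention and is convenient for evaluating multivariate functions

Generates a grid from the given arrays or lists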

In [74]:
x = np.arange(0, 4, 2)
y = np.arange(0, 6, 2)
ls = np.meshgrid(x, y)
ls
Out[74]:
[array([[0, 2],
        [0, 2],
        [0, 2]]), array([[0, 0],
        [2, 2],
        [4, 4]])]

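Besides returning a list, it differs from mgrid in axis order: with the default indexing='xy' the output arrays have shape (len(y), len(x)), the reverse of mgrid, while indexing='ij' gives the mgrid order.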
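
The relationship between the three can be sketched on the same ranges used above: meshgrid with indexing='ij' reproduces mgrid, and adding sparse=True gives the open, ogrid-style arrays.

```python
import numpy as np

x = np.arange(0, 4, 2)   # [0, 2]
y = np.arange(0, 6, 2)   # [0, 2, 4]

# Default 'xy' indexing: shape is (len(y), len(x))
xx, yy = np.meshgrid(x, y)
xy_shape = xx.shape                                   # (3, 2)

# 'ij' indexing: shape is (len(x), len(y)), the same layout as mgrid
ii, jj = np.meshgrid(x, y, indexing='ij')
grid = np.mgrid[0:4:2, 0:6:2]
matches_mgrid = np.array_equal(grid[0], ii) and np.array_equal(grid[1], jj)

# sparse=True keeps only one non-trivial axis per array, like ogrid
si, sj = np.meshgrid(x, y, indexing='ij', sparse=True)
sparse_shapes = (si.shape, sj.shape)                  # ((2, 1), (1, 3))
```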

ndarray and matrix¶

Reference: What are the differences between numpy arrays and matrices? Which one should I use?

A matrix is strictly 2-D, whereas an ndarray can be N-D. matrix is a subclass of ndarray and inherits all of its methods. Its main advantage is convenient matrix multiplication: a*b is the matrix product

In [75]:
a = np.mat('4 3; 2 1')
a

b = np.mat('1 2; 3 4')
b

a*b
Out[75]:
matrix([[4, 3],
        [2, 1]])
Out[75]:
matrix([[1, 2],
        [3, 4]])
Out[75]:
matrix([[13, 20],
        [ 5,  8]])

However, since Python 3.5, NumPy also supports the @ operator on ndarray, which performs matrix multiplication as well.

In [76]:
a@b
Out[76]:
matrix([[13, 20],
        [ 5,  8]])
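
For comparison, here is a sketch of the same product with plain ndarrays (the cell above applies @ to matrix objects). On ndarrays, * and ** are elementwise, while @ and np.linalg.matrix_power give the matrix product and matrix power.

```python
import numpy as np

a = np.array([[4, 3], [2, 1]])
b = np.array([[1, 2], [3, 4]])

elementwise = a * b                       # elementwise product
matmul = a @ b                            # matrix product
elementwise_square = a ** 2               # elementwise power
matrix_square = np.linalg.matrix_power(a, 2)  # same as a @ a

print(elementwise)
print(matmul)
```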

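Both matrix and ndarray have .T, but matrix additionally has .I (inverse) and .H (conjugate transpose). Because the * operator behaves differently, ** differs as well: it is a matrix power for matrix and an elementwise power for ndarray.

The two types can be converted into each other with np.asmatrix and np.asarray.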

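Automatic dimension reduction when slicing¶

When an ndarray is indexed with an integer, that dimension is dropped automatically. For example, taking one column from an array of shape (n, m) gives shape (n,).

These two kinds of shape behave very differently in matrix operations: transposing an array of shape (n,) returns the same shape, whereas a matrix object is always 2-dimensional.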

In [77]:
s = np.random.random((3, 3))
s

sliced = s[0, :]
sliced.shape  # not (1, 3)
sliced.T.shape
Out[77]:
array([[0.07986677, 0.00445615, 0.82639448],
       [0.43608133, 0.20068197, 0.30155395],
       [0.88031929, 0.6758882 , 0.67013298]])
Out[77]:
(3,)
Out[77]:
(3,)
In [78]:
s_mat = np.asmatrix(s)
s_mat

sliced_mat = s_mat[0, :]
sliced_mat.shape
sliced_mat.T.shape
Out[78]:
matrix([[0.07986677, 0.00445615, 0.82639448],
        [0.43608133, 0.20068197, 0.30155395],
        [0.88031929, 0.6758882 , 0.67013298]])
Out[78]:
(1, 3)
Out[78]:
(3, 1)

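Computing with an ndarray of shape (n,) can give results that are easy to misread: matmul treats a 1-D operand as a row vector on the left and as a column vector on the right, then drops the added dimension from the result.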

In [79]:
sliced@np.ones((3, 1))

np.ones((1, 3))@sliced

try:
    sliced@np.ones((1, 3))
except Exception as e:
    print(e)

try:
    np.ones((3, 1))@sliced
except Exception as e:
    print(e)
Out[79]:
array([0.9107174])
Out[79]:
array([0.9107174])
matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1)

To keep things under control, it is best to reshape an ndarray of shape (n,) to (n, 1) or (1, n) first.
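
A short sketch of a few ways to get an explicit 2-D row or column. The array s here is a stand-in with known values rather than the random one above.

```python
import numpy as np

s = np.arange(9.0).reshape(3, 3)

row_1d = s[0, :]                   # shape (3,): the axis is dropped
row_2d = s[0:1, :]                 # shape (1, 3): a length-1 slice keeps the axis
row_2d_alt = row_1d[np.newaxis, :] # shape (1, 3): add the axis back
col_2d = row_1d.reshape(-1, 1)     # shape (3, 1)

# With explicit 2-D shapes the product is unambiguous: (3, 1) @ (1, 3) -> (3, 3)
outer = col_2d @ row_2d
print(row_1d.shape, row_2d.shape, col_2d.shape, outer.shape)
```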

Broadcasting rules for (n,) and (n, 1)¶

In [80]:
arr = np.random.random((3))
arr
mask = np.random.random((3, 3)) > 0.5
mask
Out[80]:
array([0.77198178, 0.16089539, 0.96831058])
Out[80]:
array([[ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])
In [81]:
arr * mask
Out[81]:
array([[0.77198178, 0.        , 0.96831058],
       [0.        , 0.16089539, 0.96831058],
       [0.77198178, 0.        , 0.96831058]])
In [82]:
arr.reshape(3, 1)*mask
Out[82]:
array([[0.77198178, 0.        , 0.77198178],
       [0.        , 0.16089539, 0.16089539],
       [0.96831058, 0.        , 0.96831058]])

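The two give completely different results: shape (3,) is broadcast as a row (1, 3), so each column of mask is scaled, while shape (3, 1) is broadcast as a column, so each row is scaled.

To keep ndarray dimensions under control, avoid ndarrays with shapes like (5,).

Solutions:

  • Add a reshape(n, 1) wherever a 1-D ndarray may appear, and add an assert statement where necessary to guard against mistakes
  • Use the keepdims parameter of reductions; slicing has no such parameter, but a length-1 slice such as s[0:1, :] keeps the dimension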
In [83]:
sum_ = np.sum(s, axis=0)
sum_
sum_.shape

sum__ = np.sum(s, axis=0, keepdims=True)
sum__
sum__.shape
Out[83]:
array([1.39626739, 0.88102632, 1.7980814 ])
Out[83]:
(3,)
Out[83]:
array([[1.39626739, 0.88102632, 1.7980814 ]])
Out[83]:
(1, 3)
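
The broadcasting difference can be made explicit with fixed values. This is a sketch in which arr and mask are stand-ins for the random arrays above.

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
mask = np.ones((3, 3))

# (3,) is aligned with the LAST axis of (3, 3): column j is scaled by arr[j].
row_scaled = arr * mask
# (3, 1) is aligned with the FIRST axis: row i is scaled by arr[i].
col_scaled = arr.reshape(3, 1) * mask

# A guard of the kind suggested above: fail early on an unexpected shape.
column = arr.reshape(-1, 1)
assert column.shape == (3, 1)

print(row_scaled)
print(col_scaled)
```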

numpy.argwhere and numpy.where¶

numpy.where¶

numpy.where(condition[, x, y])

Returns elements chosen from x or y, depending on condition.

  • https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.where.html?highlight=where#numpy.where

numpy.where has two uses

  • Given condition and x, y
  • Given only condition

Given condition together with x and y¶

If x and y are 1-D, this is roughly equivalent to

[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]

This is not limited to 1-D input; the 1-D case is only used for explanation.

In [84]:
np.where([[True, False], [True, True]], [[1, 2], [3, 4]], [[9, 8], [7, 6]])
Out[84]:
array([[1, 8],
       [3, 4]])
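
A small 1-D sketch of the equivalence stated above, with illustrative values:

```python
import numpy as np

condition = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# Vectorized selection ...
vectorized = np.where(condition, x, y)
# ... and the list comprehension it corresponds to in the 1-D case.
looped = [xv if c else yv for c, xv, yv in zip(condition, x, y)]

print(vectorized)
print(looped)
```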

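Only a condition, without x and y¶

In this case it returns "coordinates" in a less intuitive form: a tuple with one array per dimension, where each array holds the indices of all matching elements along that dimension. These coordinates can be used directly as an index to get the matching data back.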

In [85]:
np.where([[0, 1], [1, 0]])  # equivalent to np.where(np.array([[0, 1], [1, 0]]) != 0)

np.where(np.array([[0, 1], [1, 0]]) != 0)
Out[85]:
(array([0, 1], dtype=int64), array([1, 0], dtype=int64))
Out[85]:
(array([0, 1], dtype=int64), array([1, 0], dtype=int64))
In [86]:
x = np.arange(9.).reshape(3, 3)
x

np.where(x > 5)  # returns "coordinates"
Out[86]:
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]])
Out[86]:
(array([2, 2, 2], dtype=int64), array([0, 1, 2], dtype=int64))

Indexing this way gives a 1-D result.

In [87]:
indice = np.where(x > 5)
indice

x[indice]
Out[87]:
(array([2, 2, 2], dtype=int64), array([0, 1, 2], dtype=int64))
Out[87]:
array([6., 7., 8.])
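
As a sketch of a common shortcut: when only the values are needed, indexing with the boolean mask itself gives the same 1-D result as indexing with the tuple from np.where.

```python
import numpy as np

x = np.arange(9.0).reshape(3, 3)

index_tuple = np.where(x > 5)   # one index array per dimension
via_where = x[index_tuple]
via_mask = x[x > 5]             # boolean-mask indexing

print(via_where)
print(via_mask)
```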
In [88]:
# similar to [xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
np.where(x < 5, x, -1)
Out[88]:
array([[ 0.,  1.,  2.],
       [ 3.,  4., -1.],
       [-1., -1., -1.]])

numpy.argwhere¶

numpy.argwhere(a)

Find the indices of array elements that are non-zero, grouped by element.

  • https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.argwhere.html

The result is a list of individual coordinates (the intuitive kind: each row is one (x, y, ...) coordinate).

In [89]:
x = np.arange(6).reshape(3, 2)
x

np.argwhere(x > 1)
Out[89]:
array([[0, 1],
       [2, 3],
       [4, 5]])
Out[89]:
array([[1, 0],
       [1, 1],
       [2, 0],
       [2, 1]], dtype=int64)

When this array is used directly as an index, it is treated as a single integer index array, so every entry in it indexes the first dimension only.

In [90]:
x[np.argwhere(x > 1)]
Out[90]:
array([[[2, 3],
        [0, 1]],

       [[2, 3],
        [2, 3]],

       [[4, 5],
        [0, 1]],

       [[4, 5],
        [2, 3]]])
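
To index with argwhere output and get the matching elements, the coordinates can be transposed into the one-array-per-dimension form. This is a sketch of one way to do it.

```python
import numpy as np

x = np.arange(6).reshape(3, 2)

coords = np.argwhere(x > 1)      # shape (4, 2): one (row, col) pair per match
# Transpose to one array per dimension, then index with the tuple.
values = x[tuple(coords.T)]

# argwhere is the transposed form of nonzero.
same = np.array_equal(coords, np.transpose(np.nonzero(x > 1)))

print(values)
print(same)
```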

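Note the difference in how where and argwhere arrange their "coordinates".

There is also the similar numpy.nonzero().

Its result is likewise in the less intuitive one-array-per-dimension form.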

In [91]:
x = np.array([[1, 0, 0], [0, 2, 0], [1, 1, 0]])
x

np.nonzero(x)

x[np.nonzero(x)]  # index back into x to get the matching data
Out[91]:
array([[1, 0, 0],
       [0, 2, 0],
       [1, 1, 0]])
Out[91]:
(array([0, 1, 2, 2], dtype=int64), array([0, 1, 0, 1], dtype=int64))
Out[91]:
array([1, 2, 1, 1])

numpy.tile and numpy.repeat¶

tile¶

numpy.tile(A, reps)

Construct an array by repeating A the number of times given by reps.

  • https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.tile.html

Repeats the array as a whole.

In [92]:
a = np.array([0, 1, 2])

np.tile(a, 2)

np.tile(a, (2, 2))

np.tile(a, (2, 1, 2))
Out[92]:
array([0, 1, 2, 0, 1, 2])
Out[92]:
array([[0, 1, 2, 0, 1, 2],
       [0, 1, 2, 0, 1, 2]])
Out[92]:
array([[[0, 1, 2, 0, 1, 2]],

       [[0, 1, 2, 0, 1, 2]]])

numpy.repeat(a, repeats, axis=None)

Repeat elements of an array.

Repeats element by element.

With the default axis=None, the input is flattened before repeating.

Examples

In [93]:
np.repeat(3, 4)
Out[93]:
array([3, 3, 3, 3])
In [94]:
x = np.array([[1, 2], [3, 4]])

np.repeat(x, 2)
np.repeat(x, 3, axis=1)
np.repeat(x, [1, 2], axis=0)
Out[94]:
array([1, 1, 2, 2, 3, 3, 4, 4])
Out[94]:
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])
Out[94]:
array([[1, 2],
       [3, 4],
       [3, 4]])
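
A side-by-side sketch of the two functions on the same input:

```python
import numpy as np

a = np.array([0, 1, 2])

tiled = np.tile(a, 2)       # whole array repeated
repeated = np.repeat(a, 2)  # each element repeated

# tile can be rebuilt from repeat by repeating along a new leading axis.
tiled_via_repeat = np.repeat(a[np.newaxis, :], 2, axis=0).ravel()

print(tiled)
print(repeated)
```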

reshape operations in NumPy¶

Changing the shape of a NumPy array sometimes works and sometimes does not, which seems odd at first.

In [95]:
arr = np.zeros((4, 4))

sliced = arr[:3, :]
sliced.shape
Out[95]:
(3, 4)
In [96]:
sliced.shape = 4, 3
sliced
Out[96]:
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
In [97]:
sliced_1 = arr[:, :3]
sliced_1.shape
Out[97]:
(4, 3)
In [98]:
try:
    sliced_1.shape = 3, 4
except AttributeError as e:
    print(e)
incompatible shape for a non-contiguous array

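Here the shape of the slice cannot be changed, which comes down to how arr is stored in memory. When arr is created, its data is laid out in order. After taking the first three columns, the elements of the slice are no longer evenly spaced in memory, so the new shape cannot be described over the same buffer without copying. resize cannot change the shape either.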

In [99]:
try:
    sliced_1.resize(3, 4, refcheck=False)
except ValueError as e:
    print(e)
resize only works on single-segment arrays

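The message from resize is clearer: "only works on single-segment arrays", i.e. the data is split into segments. The flags attribute shows that sliced_1 does not own its data; it references the data of its base array.

An array that does not own its data, and whose referenced data is not contiguous in memory, cannot be resized.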

In [100]:
sliced_1.flags
Out[100]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
In [101]:
sliced_1.base
Out[101]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

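reshape returns a new array with the new shape without modifying the original (a view when possible, otherwise a copy, as here), so the problem above does not arise. resize, by contrast, operates on the original array in place and returns nothing.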

In [102]:
sliced_1.reshape(3, 4)
Out[102]:
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
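
Whether reshape produced a view or a copy can be checked with np.shares_memory. This is a sketch using the same two slices as above, under the illustrative names contiguous and non_contiguous.

```python
import numpy as np

arr = np.zeros((4, 4))
contiguous = arr[:3, :]       # first three rows: still one contiguous block
non_contiguous = arr[:, :3]   # first three columns: gaps between rows

as_view = contiguous.reshape(4, 3)       # can reuse arr's buffer
as_copy = non_contiguous.reshape(3, 4)   # has to copy the data

print(np.shares_memory(as_view, arr))
print(np.shares_memory(as_copy, arr))
```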

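A column slice is used as the example because taking the first few columns of a 2-D array gives a view of the original array whose data is not contiguous in memory. There are many other ways to get a view, for example newarr = arr.view().

Some other NumPy functions also require a single-segment array.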

In [103]:
a = np.random.random((5, 1))
a

b = a[::2]
b
b.flags
Out[103]:
array([[0.70848474],
       [0.98770374],
       [0.26119058],
       [0.75281279],
       [0.941384  ]])
Out[103]:
array([[0.70848474],
       [0.26119058],
       [0.941384  ]])
Out[103]:
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
In [104]:
s = np.sort(b)
s.flags
Out[104]:
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

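np.sort(b) returns a new array that owns contiguous data. np.sort always sorts a copy, so b, which is only a sliced view of a and not a single-segment array, is first copied to a fresh block of memory and then sorted.

Swapping data and comparing in NumPy¶

Swapping¶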

In [105]:
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]

a[1:3], b[1:3] = b[1:3], a[1:3]
a
b
Out[105]:
[1, 6, 7, 4]
Out[105]:
[5, 2, 3, 8]
In [106]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

a[1:3], b[1:3] = b[1:3], a[1:3]
a
b
Out[106]:
array([1, 6, 7, 4])
Out[106]:
array([5, 6, 7, 8])

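The cause is that NumPy slices return views, so variables cannot be swapped the way they can in plain Python. The right-hand side holds views rather than copies: once a[1:3] has been overwritten, the view of a[1:3] already contains the new values, and those are what get assigned to b[1:3].

Solution: use copy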

In [107]:
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

a[1:3], b[1:3] = b[1:3].copy(), a[1:3].copy()
a
b
Out[107]:
array([1, 6, 7, 4])
Out[107]:
array([5, 2, 3, 8])
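
An equivalent sketch with a single explicit temporary. Only the side that is overwritten first needs to be copied; tmp is an illustrative name.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

tmp = a[1:3].copy()   # save the values before they are overwritten
a[1:3] = b[1:3]
b[1:3] = tmp

print(a)
print(b)
```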

Comparison¶

In [108]:
a == b
Out[108]:
array([False, False, False, False])

Comparison in NumPy is elementwise, so to compare arrays as a whole use numpy.array_equal() or numpy.allclose() instead.

In [109]:
np.array_equal(a, b)
Out[109]:
False
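
A sketch of how the two differ on floating-point data, with values chosen to show rounding error: array_equal is exact, allclose compares within a tolerance.

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

elementwise = a == b           # one boolean per element
exact = np.array_equal(a, b)   # True only if shapes and all elements match exactly
close = np.allclose(a, b)      # True if all elements match within tolerance

print(elementwise, exact, close)
```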

ufunc.outer¶

ufunc.outer(A, B, **kwargs)
  • https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ufunc.outer.html

Applies the ufunc to every combination of a in A and b in B.

The mechanism is similar to a double for loop

r = np.empty((len(A), len(B)))
for i in range(len(A)):
    for j in range(len(B)):
        r[i,j] = op(A[i], B[j]) # op = ufunc in question
In [110]:
A = np.array([1, 2, 3])
B = np.array([4, 5, 6])

np.multiply.outer(A, B)
Out[110]:
array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])

With higher-dimensional inputs

In [111]:
A = np.array([[1, 2, 3], [4, 5, 6]])
A.shape

B = np.array([[1, 2], [3, 4]])
B.shape

C = np.multiply.outer(A, B)
C
C.shape
Out[111]:
(2, 3)
Out[111]:
(2, 2)
Out[111]:
array([[[[ 1,  2],
         [ 3,  4]],

        [[ 2,  4],
         [ 6,  8]],

        [[ 3,  6],
         [ 9, 12]]],


       [[[ 4,  8],
         [12, 16]],

        [[ 5, 10],
         [15, 20]],

        [[ 6, 12],
         [18, 24]]]])
Out[111]:
(2, 3, 2, 2)

In general the result has shape A.shape + B.shape, e.g. (2,3) (1,4) -> (2,3,1,4)
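
The shape rule can be checked against plain broadcasting. This is a sketch with illustrative inputs of shapes (2, 3) and (1, 4).

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(4).reshape(1, 4)

C = np.multiply.outer(A, B)

# Same result via broadcasting: give A one trailing axis per dimension of B.
C_broadcast = A[..., np.newaxis, np.newaxis] * B

print(C.shape)
print(np.array_equal(C, C_broadcast))
```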

numpy.argsort and numpy.sort¶

numpy.sort¶

numpy.sort(a, axis=-1, kind='quicksort', order=None)

https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html

Returns a sorted copy of the ndarray.

In [112]:
a = np.array([[1, 4], [3, 2]])
np.sort(a)  # sort along the last axis

np.sort(a, axis=None)  # sort the flattened array

np.sort(a, axis=0)  # sort along the given axis
Out[112]:
array([[1, 4],
       [2, 3]])
Out[112]:
array([1, 2, 3, 4])
Out[112]:
array([[1, 2],
       [3, 4]])

Sorting by key (field)

In [113]:
dtype = [('name', 'S10'), ('height', float), ('age', int)]
values = [('Arthur', 1.8, 41), ('Lancelot', 1.9, 38),
          ('Galahad', 1.7, 38)]
a = np.array(values, dtype=dtype)       # create a structured array

np.sort(a, order='height')
Out[113]:
array([(b'Galahad', 1.7, 38), (b'Arthur', 1.8, 41),
       (b'Lancelot', 1.9, 38)],
      dtype=[('name', 'S10'), ('height', '<f8'), ('age', '<i4')])
In [114]:
np.sort(a, order=['age', 'height'])
Out[114]:
array([(b'Galahad', 1.7, 38), (b'Lancelot', 1.9, 38),
       (b'Arthur', 1.8, 41)],
      dtype=[('name', 'S10'), ('height', '<f8'), ('age', '<i4')])

numpy.argsort¶

numpy.argsort(a, axis=-1, kind='quicksort', order=None)
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html

Returns the indices that would sort the array, rather than the sorted data.

In [115]:
x = np.array([[0, 3, 4], [2, 2, 2]])
x
Out[115]:
array([[0, 3, 4],
       [2, 2, 2]])
In [116]:
np.argsort(x, axis=None)  # sort the flattened array
np.argsort(x, axis=0)  # sort along an axis
np.argsort(x, axis=1)
Out[116]:
array([0, 3, 4, 5, 1, 2], dtype=int64)
Out[116]:
array([[0, 1, 1],
       [1, 0, 0]], dtype=int64)
Out[116]:
array([[0, 1, 2],
       [0, 1, 2]], dtype=int64)

Getting the multi-dimensional index (only works for axis=None)

In [117]:
indice = np.unravel_index(np.argsort(x, axis=None), x.shape)
indice

x[indice]  # same as np.sort(x, axis=None)
Out[117]:
(array([0, 1, 1, 1, 0, 0], dtype=int64),
 array([0, 0, 1, 2, 1, 2], dtype=int64))
Out[117]:
array([0, 2, 2, 2, 3, 4])

indice acts as coordinates: two separate groups, one for the first dimension and one for the second.

In [118]:
ind = np.unravel_index(np.argsort(x, axis=0), x.shape)
ind

x[ind]  # not the same as np.sort(x, axis=0)
Out[118]:
(array([[0, 0, 0],
        [0, 0, 0]], dtype=int64), array([[0, 1, 1],
        [1, 0, 0]], dtype=int64))
Out[118]:
array([[0, 3, 3],
       [3, 0, 0]])

The correct way to use the argsort result when axis is not None, to get the sorted ndarray

It amounts to supplying the "coordinates" for the second dimension.

In [119]:
indice0 = np.argsort(x, axis=0)
indice0

indice1 = np.mgrid[0:2, 0:3][1]
indice1

x[indice0, indice1]
Out[119]:
array([[0, 1, 1],
       [1, 0, 0]], dtype=int64)
Out[119]:
array([[0, 1, 2],
       [0, 1, 2]])
Out[119]:
array([[0, 2, 2],
       [2, 3, 4]])
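
As an alternative sketch (not the approach used above), np.take_along_axis, available since NumPy 1.15, applies per-axis argsort indices without building the second index array by hand.

```python
import numpy as np

x = np.array([[0, 3, 4], [2, 2, 2]])

order = np.argsort(x, axis=0)
# Pick elements along axis 0 according to the argsort result.
sorted_cols = np.take_along_axis(x, order, axis=0)

print(sorted_cols)
print(np.array_equal(sorted_cols, np.sort(x, axis=0)))
```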

Sorting by key (field)

In [120]:
x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
x

np.argsort(x, order=('x', 'y'))
np.argsort(x, order=('y', 'x'))
Out[120]:
array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
Out[120]:
array([1, 0], dtype=int64)
Out[120]:
array([0, 1], dtype=int64)


Published

Jun 6, 2019

Category

posts

Tags

  • NumPy 2

Contact

  • Zodiac Wang - A Fantastic Learner