MindSpore 25-Day Learning Camp, Day 2 | Tensors, Datasets, and Data Transforms

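Tensor

A tensor is a multilinear function that can represent linear relations among vectors, scalars, and other tensors. Basic examples of such relations include the inner product, the outer product, linear maps, and the Cartesian product. In an $n$-dimensional space, a tensor of rank $r$ is a quantity with $n^{r}$ components, each of which is a function of the coordinates; under a coordinate transformation, these components transform linearly according to fixed rules. $r$ is called the rank or order of the tensor (it is unrelated to the rank or order of a matrix).

As in other mainstream deep learning frameworks, a tensor in MindSpore is a special data structure used mainly for network computation. Compared with arrays and matrices, the Tensor data structure is better suited to the computations of a deep learning framework.

Import the following modules to work with tensors in MindSpore: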

import numpy as np
import mindspore
from mindspore import ops
from mindspore import Tensor, CSRTensor, COOTensor

Creating Tensors

The Tensor constructor accepts Tensor, float, int, bool, tuple, list, and numpy.ndarray inputs.

Creating directly from data

data = [1, 0, 1, 0]
x_data = Tensor(data)
print(x_data, x_data.shape, x_data.dtype)

This prints x_data together with its shape and data type:

[1 0 1 0] (4,) Int64

Creating from a NumPy array

np_array = np.array(data)
x_np = Tensor(np_array)
print(x_np, x_np.shape, x_np.dtype)

This prints x_np together with its shape and data type:

[1 0 1 0] (4,) Int64

Creating with an init initializer

from mindspore.common.initializer import One, Normal

# Initialize a tensor with ones
tensor1 = mindspore.Tensor(shape=(2, 2), dtype=mindspore.float32, init=One())
# Initialize a tensor from normal distribution
tensor2 = mindspore.Tensor(shape=(2, 2), dtype=mindspore.float32, init=Normal())

print("tensor1:\n", tensor1)
print("tensor2:\n", tensor2)

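The output is as follows (the values of tensor2 are drawn at random, so they will differ from run to run):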

tensor1:
 [[1. 1.]
 [1. 1.]]
tensor2:
 [[-0.01557738 -0.01042705]
 [ 0.00435887  0.02999963]]

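init is intended mainly for delayed initialization in parallel mode; under normal circumstances, initializing parameters with init is not recommended.

Inheriting the attributes of another tensor to form a new tensor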

from mindspore import ops

x_ones = ops.ones_like(x_data)
print(f"Ones Tensor: \n {x_ones} \n")

x_zeros = ops.zeros_like(x_data)
print(f"Zeros Tensor: \n {x_zeros} \n")

The output is:

Ones Tensor: 
 [1 1 1 1] 

Zeros Tensor: 
 [0 0 0 0]

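When creating tensors with any of the methods above, pay close attention to memory usage: creation can fail if the tensor is too large.

Tensor Attributes

A tensor's attributes include: shape (shape), data type (dtype), size of a single element in bytes (itemsize), total number of bytes occupied (nbytes), number of dimensions (ndim), number of elements (size), and the stride of each dimension in bytes (strides).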
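To make the relationship between these attributes concrete, here is a minimal pure-Python sketch that derives ndim, size, nbytes, and strides from a shape and an element size. The helper name `describe` is hypothetical, and the sketch assumes a dense, row-major (C-contiguous) layout; it illustrates the arithmetic only and is not MindSpore code.

```python
from math import prod

def describe(shape, itemsize):
    """Derive size-related attributes for a dense, row-major array (assumed layout)."""
    ndim = len(shape)          # number of dimensions
    size = prod(shape)         # total number of elements
    nbytes = size * itemsize   # total bytes occupied
    # The stride of a dimension is the number of bytes to skip
    # in order to advance one step along that dimension.
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return {
        "ndim": ndim,
        "size": size,
        "nbytes": nbytes,
        "strides": tuple(reversed(strides)),
    }

# A 2 x 3 array of 4-byte elements (for example, 32-bit integers).
attrs = describe(shape=(2, 3), itemsize=4)
print(attrs)
```

For a 2 x 3 array of 4-byte elements, moving one row down skips 3 elements (12 bytes), while moving one column right skips a single element (4 bytes).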

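Tensor Indexing

Tensor indexing works much like NumPy indexing: indices start at 0, negative indices count from the end, and the colon : and the ellipsis ... are used for slicing.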
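The indexing rules can be illustrated without any framework. The sketch below uses a plain nested list as a stand-in for a 2 x 2 tensor; the comments name the tensor expression each line is meant to mirror. The variable names are illustrative assumptions, and nested lists do not support : or ... across dimensions, so the column selections are spelled out with comprehensions.

```python
# A plain nested list standing in for a 2 x 2 tensor.
matrix = [[0, 1],
          [2, 3]]

first_row = matrix[0]                        # mirrors tensor[0]
last_row = matrix[-1]                        # mirrors tensor[-1]
first_column = [row[0] for row in matrix]    # mirrors tensor[:, 0]
last_column = [row[-1] for row in matrix]    # mirrors tensor[..., -1]

print(first_row, last_row, first_column, last_column)
```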

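Tensor Operations

Tensors support many kinds of operations, including arithmetic, linear algebra, matrix manipulation (transposition, indexing, slicing), and sampling. They are used in much the same way as in NumPy.

Converting Between Tensor and NumPy

Tensor to NumPy

Use Tensor.asnumpy() to convert a Tensor to a NumPy array.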

t = Tensor([1., 1., 1., 1., 1.])
print(f"t: {t}", type(t))
n = t.asnumpy()
print(f"n: {n}", type(n))

The output is:

t: [1. 1. 1. 1. 1.] <class 'mindspore.common.tensor.Tensor'>
n: [1. 1. 1. 1. 1.] <class 'numpy.ndarray'>

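NumPy to Tensor

Use Tensor.from_numpy() to convert a NumPy array to a Tensor. In the example below, the in-place update of n also shows up in t, which indicates that the two share the same underlying data.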

n = np.ones(5)
t = Tensor.from_numpy(n)
np.add(n, 1, out=n)
print(f"n: {n}", type(n))
print(f"t: {t}", type(t))

The output is:

n: [2. 2. 2. 2. 2.] <class 'numpy.ndarray'>
t: [2. 2. 2. 2. 2.] <class 'mindspore.common.tensor.Tensor'>

Sparse Tensors

A sparse tensor is a special kind of tensor in which the vast majority of elements are zero.

CSRTensor

indptr = Tensor([0, 1, 2])
indices = Tensor([0, 1])
values = Tensor([1, 2], dtype=mindspore.float32)
shape = (2, 4)

# Make a CSRTensor
csr_tensor = CSRTensor(indptr, indices, values, shape)

print(csr_tensor.astype(mindspore.float64).dtype)
print(csr_tensor)

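where

indptr: a 1-D integer tensor giving, for each row of the sparse data, the start and end positions of that row's non-zero elements in values

indices: a 1-D integer tensor giving the column position of each non-zero element; it has the same length as values

values: a 1-D tensor holding the values of the non-zero elements of the CSRTensor

shape: a tuple giving the shape of the compressed sparse tensor; currently only 2-D CSRTensor is supported

The output is: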

Float64
CSRTensor(shape=[2, 4], dtype=Float32, indptr=Tensor(shape=[3], dtype=Int64, value=[0 1 2]), indices=Tensor(shape=[2], dtype=Int64, value=[0 1]), values=Tensor(shape=[2], dtype=Float32, value=[ 1.00000000e+00  2.00000000e+00]))

This is equivalent to creating the following CSRTensor:

$\begin{bmatrix} 1& 0& 0& 0\\ 0& 2& 0& 0 \end{bmatrix}$
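To see how indptr, indices, and values determine that matrix, the following pure-Python sketch expands the CSR triple into a dense nested list. The function name `csr_to_dense` is hypothetical; it only illustrates the CSR layout and is not how MindSpore stores or converts a CSRTensor.

```python
def csr_to_dense(indptr, indices, values, shape):
    """Expand a CSR description into a dense list of rows (illustrative only)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        # indptr[row]:indptr[row + 1] is the slice of `values` that belongs to this row.
        for k in range(indptr[row], indptr[row + 1]):
            dense[row][indices[k]] = values[k]
    return dense

# Same numbers as the CSRTensor example above.
csr_dense = csr_to_dense(indptr=[0, 1, 2], indices=[0, 1], values=[1, 2], shape=(2, 4))
for row in csr_dense:
    print(row)
```

Row 0 owns values[0:1], placed in column 0, and row 1 owns values[1:2], placed in column 1.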

COOTensor

A COOTensor represents the set of non-zero elements of a tensor at the given indices.

indices = Tensor([[0, 1], [1, 2]], dtype=mindspore.int32)
values = Tensor([1, 2], dtype=mindspore.float32)
shape = (3, 4)

# Make a COOTensor
coo_tensor = COOTensor(indices, values, shape)

print(coo_tensor.values)
print(coo_tensor.indices)
print(coo_tensor.shape)
print(coo_tensor.astype(mindspore.float64).dtype)  # COOTensor to float64

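where

indices: a 2-D integer tensor in which each row is the index of one non-zero element

values: a 1-D tensor holding the values of the corresponding non-zero elements

shape: the shape of the compressed sparse tensor; currently only 2-D COOTensor is supported

The output is: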

[1. 2.]
[[0 1]
 [1 2]]
(3, 4)
Float64

This is equivalent to creating the following COOTensor:

$\begin{bmatrix} 0 &1 &0 &0 \\ 0&0 &2 &0 \\ 0& 0 &0 &0 \end{bmatrix}$
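The COO layout can be expanded in the same spirit. The pure-Python sketch below places each value at its (row, column) index; the function name `coo_to_dense` is hypothetical and the code is an illustration of the format, not MindSpore's implementation.

```python
def coo_to_dense(indices, values, shape):
    """Expand a COO description into a dense list of rows (illustrative only)."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    # Each entry of `indices` is the (row, column) position of one non-zero value.
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense

# Same numbers as the COOTensor example above.
coo_dense = coo_to_dense(indices=[[0, 1], [1, 2]], values=[1, 2], shape=(3, 4))
for row in coo_dense:
    print(row)
```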

Dataset

Data is the foundation of deep learning, and high-quality input data benefits the entire deep neural network.

Import the following modules:

import numpy as np
from mindspore.dataset import vision
from mindspore.dataset import MnistDataset, GeneratorDataset
import matplotlib.pyplot as plt

Loading a Dataset

Because mindspore.dataset only supports decompressed data files, use the download library to download the dataset and extract it.

# Download data from open datasets
from download import download

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/" \
      "notebook/datasets/MNIST_Data.zip"
path = download(url, "./", kind="zip", replace=True)

Once the files are extracted, load the training dataset:

train_dataset = MnistDataset("MNIST_Data/train", shuffle=False)

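Common Dataset Operations

shuffle

shuffle randomizes the order of the samples, which removes the uneven distribution that the original ordering of the data can cause.

It is used as follows: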

train_dataset = train_dataset.shuffle(buffer_size=64)
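The role of buffer_size is easier to see in a small model of buffer-based shuffling. The sketch below is a pure-Python illustration under the assumption that samples are drawn at random from a fixed-size buffer that is refilled from the input stream; the function name `buffered_shuffle` and its `seed` parameter are hypothetical, and this is not MindSpore's actual implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a randomized order using a fixed-size buffer (illustrative only)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

In this model a larger buffer mixes the data more thoroughly at the cost of more memory, and a buffer of size 1 leaves the order unchanged.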

map

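map applies a data transform to a specified column of the dataset; the transform is applied to every element of that column.

For example, scale every image by 1/255, which also changes the data type from uint8 to float32: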

train_dataset = train_dataset.map(vision.Rescale(1.0 / 255.0, 0), input_columns='image')

batch

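batch packs the dataset into batches of a fixed size, which preserves the randomness of gradient descent while keeping the amount of computation manageable on limited hardware.

For example, set the batch size (batch_size) to 32: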

train_dataset = train_dataset.batch(batch_size=32)
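As a rough picture of what batching does to a stream of samples, here is a pure-Python sketch. The generator name `make_batches` is hypothetical; its `drop_remainder` flag is modeled on the option of the same name in MindSpore's batch, but the code is only an illustration of the grouping logic.

```python
def make_batches(items, batch_size, drop_remainder=False):
    """Group consecutive items into lists of length batch_size (illustrative only)."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    # Emit the final, shorter batch unless the caller asked to drop it.
    if current and not drop_remainder:
        yield current

sizes = [len(b) for b in make_batches(range(100), batch_size=32)]
sizes_dropped = [len(b) for b in make_batches(range(100), batch_size=32, drop_remainder=True)]
print(sizes, sizes_dropped)
```

With 100 samples and a batch size of 32, the last batch holds only the 4 leftover samples unless it is dropped.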

Transforms

Raw data usually cannot be fed into a neural network directly after loading; it has to be preprocessed first.

Import the following modules:

import numpy as np
from PIL import Image
from download import download
from mindspore.dataset import transforms, vision, text
from mindspore.dataset import GeneratorDataset, MnistDataset

Common Transforms

Compose

Compose takes a sequence of data augmentation operations and combines them into a single operation.

# Download data from open datasets

url = "https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/" \
      "notebook/datasets/MNIST_Data.zip"
path = download(url, "./", kind="zip", replace=True)

train_dataset = MnistDataset('MNIST_Data/train')
image, label = next(train_dataset.create_tuple_iterator())
print(image.shape)
composed = transforms.Compose(
    [
        vision.Rescale(1.0 / 255.0, 0),
        vision.Normalize(mean=(0.1307,), std=(0.3081,)),
        vision.HWC2CHW()
    ]
)
train_dataset = train_dataset.map(composed, 'image')
image, label = next(train_dataset.create_tuple_iterator())
print(image.shape)

The image shape is (28, 28, 1) before the composed transform is applied and (1, 28, 28) after it.
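The idea behind composing transforms can be sketched in a few lines of plain Python: apply each function in order and pass the result along. The helper `compose` below is a hypothetical stand-in written for illustration, not the transforms.Compose implementation.

```python
def compose(transforms):
    """Chain callables so that each one receives the previous one's output (illustrative only)."""
    def apply(value):
        for transform in transforms:
            value = transform(value)
        return value
    return apply

# Scale a pixel value into [0, 1], then shift it by 0.5.
scale_then_shift = compose([lambda x: x / 255.0, lambda x: x - 0.5])
result = scale_then_shift(255)
print(result)
```

Order matters: swapping the two steps would shift the raw pixel value first and give a different result.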

Vision Transforms

Rescale

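The Rescale transform scales the pixel values of an image. It takes two parameters:

rescale: the scaling factor

shift: the shift (offset) factor

Each output pixel is related to the corresponding input pixel by

$output_{i} = input_{i} \times rescale + shift$

The code is as follows: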

random_np = np.random.randint(0, 255, (48, 48), np.uint8)
random_image = Image.fromarray(random_np)
rescale = vision.Rescale(1.0 / 255.0, 0)
rescaled_image = rescale(random_image)

The output pixel values are the rescaled input values.
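The rescale formula is simple enough to check by hand. The sketch below applies it to a tiny 2 x 2 "image" stored as a nested list; the function name `rescale_pixels` and the sample pixel values are assumptions made for illustration, and the code is not vision.Rescale itself.

```python
def rescale_pixels(pixels, rescale, shift):
    """Apply output = input * rescale + shift to every pixel (illustrative only)."""
    return [[p * rescale + shift for p in row] for row in pixels]

image = [[0, 51],
         [102, 255]]
scaled = rescale_pixels(image, rescale=1.0 / 255.0, shift=0)
print(scaled)
```

With rescale = 1/255 and shift = 0, the uint8 range 0 to 255 is mapped onto the range 0 to 1 (up to floating-point rounding).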

Normalize

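The Normalize transform normalizes the input image. It takes three parameters:

mean: the mean of each image channel

std: the standard deviation of each image channel

is_hwc: a bool giving the format of the input image: True for (height, width, channel), False for (channel, height, width)

Each output pixel in channel $c$ is related to the corresponding input pixel by

$output_{c}=\frac{input_{c}-mean_{c}}{std_{c}}$

The code is as follows: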

normalize = vision.Normalize(mean=(0.1307,), std=(0.3081,))
normalized_image = normalize(rescaled_image)
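For a single-channel image, the normalization formula reduces to subtracting one mean and dividing by one standard deviation. The sketch below checks this on two already rescaled pixel values, using the same mean and std as above; the function name `normalize_pixels` is hypothetical and the code only illustrates the arithmetic, not vision.Normalize.

```python
def normalize_pixels(pixels, mean, std):
    """Apply output = (input - mean) / std to every pixel of one channel (illustrative only)."""
    return [[(p - mean) / std for p in row] for row in pixels]

# The darkest and brightest possible values after rescaling to [0, 1].
rescaled = [[0.0, 1.0]]
normalized = normalize_pixels(rescaled, mean=0.1307, std=0.3081)
print(normalized)
```

A pixel equal to the mean maps to 0; values below the mean become negative and values above it become positive.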

HWC2CHW

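The HWC2CHW transform converts the image layout from (height, width, channel) to (channel, height, width).

The code is as follows: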

hwc_image = np.expand_dims(normalized_image, -1)
hwc2chw = vision.HWC2CHW()
chw_image = hwc2chw(hwc_image)
print(hwc_image.shape, chw_image.shape)

The output is:

(48, 48, 1) (1, 48, 48)

The shape changes from (height, width, channel) before the transform to (channel, height, width) after it.
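The axis reordering can be written out explicitly on nested lists. In the sketch below, `hwc_to_chw` and `shape_of` are hypothetical helpers written for illustration; they show which element ends up where, and are not the vision.HWC2CHW implementation.

```python
def hwc_to_chw(image):
    """Reorder a nested list from (height, width, channel) to (channel, height, width)."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]

def shape_of(nested):
    """Return the shape of a regular nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

# A 2 x 3 image with a single channel.
hwc = [[[1], [2], [3]],
       [[4], [5], [6]]]
chw = hwc_to_chw(hwc)
print(shape_of(hwc), shape_of(chw))
```

The pixel values themselves are unchanged; only the order of the axes differs.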

Text Transforms

Define a piece of text for the steps that follow

texts = ['Welcome to Beijing']
test_dataset = GeneratorDataset(texts, 'text')

PythonTokenizer

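PythonTokenizer lets users implement their own tokenization strategy. The map operation then applies this tokenizer to the input text to split it into tokens.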

def my_tokenizer(content):
    return content.split()

test_dataset = test_dataset.map(text.PythonTokenizer(my_tokenizer))
print(next(test_dataset.create_tuple_iterator()))

The output is:

[Tensor(shape=[3], dtype=String, value= ['Welcome', 'to', 'Beijing'])]

Lookup

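Lookup is a vocabulary-mapping transform that converts tokens to indices. A vocabulary must be built before Lookup can be used; typically you either load an existing vocabulary or generate one with Vocab.

The code is as follows: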

vocab = text.Vocab.from_dataset(test_dataset)
print(vocab.vocab())
test_dataset = test_dataset.map(text.Lookup(vocab))
print(next(test_dataset.create_tuple_iterator()))

The output is:

{'to': 2, 'Welcome': 1, 'Beijing': 0}
[Tensor(shape=[3], dtype=Int32, value= [1, 2, 0])]
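The tokenize-then-look-up flow can be mimicked with a dictionary. The sketch below is a pure-Python illustration in which `build_vocab` and `lookup` are hypothetical helpers; it assigns indices in sorted token order, which is an assumption made for this example and not a statement about how text.Vocab orders its entries.

```python
def build_vocab(tokens):
    """Map each distinct token to an index (sorted order is an assumption of this sketch)."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}

def lookup(tokens, vocab):
    """Replace every token with its index in the vocabulary."""
    return [vocab[token] for token in tokens]

tokens = "Welcome to Beijing".split()
vocab = build_vocab(tokens)
ids = lookup(tokens, vocab)
print(vocab)
print(ids)
```

A token that is missing from the vocabulary raises a KeyError in this sketch, so a real pipeline needs a policy for unknown tokens.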

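Lambda Transforms

A Lambda function is an anonymous function that needs no name and consists of a single expression, which is evaluated when the function is called.

Here a Lambda function multiplies the input data by 2: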

test_dataset = GeneratorDataset([1, 2, 3], 'data', shuffle=False)
test_dataset = test_dataset.map(lambda x: x * 2)
print(list(test_dataset.create_tuple_iterator()))

The output is:

[[Tensor(shape=[], dtype=Int64, value= 2)], [Tensor(shape=[], dtype=Int64, value= 4)], [Tensor(shape=[], dtype=Int64, value= 6)]]

You can also define a named function and use it together with a Lambda function to process the data:

def func(x):
    return x * x + 2

test_dataset = test_dataset.map(lambda x: func(x))
print(list(test_dataset.create_tuple_iterator()))

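Because test_dataset already holds the doubled values 2, 4, and 6 at this point, the result is:

[[Tensor(shape=[], dtype=Int64, value= 6)], [Tensor(shape=[], dtype=Int64, value= 18)], [Tensor(shape=[], dtype=Int64, value= 38)]]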

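Summary

Tensor handling, dataset loading and splitting, and data transforms are all very important in deep learning, and many papers describe them and present them as contributions. When a dataset is complex, choices such as the preprocessing method and the batch size often decide whether the model can be trained successfully at all.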