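Building and installing tensorflow from source on linux-aarch64
1. Pin the versions
Start by pinning down your system. This article installs tensorflow 1.14.0 on ubuntu 20.04 (aarch64 architecture). The prebuilt .whl files I found online did not work, so I built from source. The version combinations below follow the tensorflow website.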
Tensorflow | GCC | Bazel | Python | Numpy |
---|---|---|---|---|
1.14.0 | 5.3.1 | 0.24.1 | 3.7.6 | 1.16.5 |
1.13.1 | 5.3.1 | 0.19.2 | 3.5.3 | 1.16.5 |
Note:
The tensorflow website lists GCC 4.8, but building with it fails because of differences between C89 and C99, so GCC 5.3.1 is used here.
Note:
Keep the numpy version no higher than 1.19.0.
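The NumPy ceiling from the note above can be checked before starting a long build. The following is a minimal sketch of a version comparison; the function name version_le is made up for illustration, and it assumes a sort that supports -V (GNU coreutils, as shipped with Ubuntu).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 <= version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example with the values from this section: numpy 1.16.5 against the 1.19.0 ceiling.
if version_le "1.16.5" "1.19.0"; then
    echo "numpy 1.16.5 is within the recommended range"
fi
```

On a real system the first argument could come from the interpreter itself, for example python3.7 -c 'import numpy; print(numpy.__version__)'.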
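2. Manage the versions
Getting the right combination of versions rarely works on the first try, so this article uses Ubuntu's update-alternatives command to manage the Python and GCC versions. Reference: https://blog.csdn.net/a1809032425/article/details/122729307. The menus printed by update-alternatives --config for python and gcc look like this: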
There are 4 choices for the alternative python (providing /usr/bin/python).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/local/bin/python3.5 3 auto mode
1 /usr/bin/python2.7 1 manual mode
2 /usr/bin/python3.8 1 manual mode
3 /usr/local/bin/python3.5 3 manual mode
* 4 /usr/local/bin/python3.7 1 manual mode
Press <enter> to keep the current choice[*], or type selection number:
There are 5 choices for the alternative gcc (providing /usr/bin/gcc).
Selection Path Priority Status
------------------------------------------------------------
0 /usr/bin/aarch64-linux-gnu-gcc-4.8 2 auto mode
1 /usr/bin/aarch64-linux-gnu-gcc-4.8 2 manual mode
* 2 /usr/bin/aarch64-linux-gnu-gcc-5 1 manual mode
3 /usr/bin/aarch64-linux-gnu-gcc-9 1 manual mode
4 /usr/bin/gcc-4.8 1 manual mode
5 /usr/bin/gcc-9 1 manual mode
Press <enter> to keep the current choice[*], or type selection number:
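Entries like the ones in the menus above are registered with update-alternatives --install <link> <name> <path> <priority>. The sketch below only prints such commands so they can be reviewed first; the helper name print_alternatives_cmds and the "path:priority" argument format are assumptions made for this example, while the paths and priorities are taken from the listings above.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) the commands that register alternatives.
# Syntax: update-alternatives --install <link> <name> <path> <priority>
print_alternatives_cmds() {
    local link="$1" name="$2"
    shift 2
    # Remaining arguments are "<path>:<priority>" pairs.
    local pair
    for pair in "$@"; do
        echo "update-alternatives --install $link $name ${pair%%:*} ${pair##*:}"
    done
}

print_alternatives_cmds /usr/bin/python python \
    /usr/local/bin/python3.5:3 /usr/local/bin/python3.7:1
print_alternatives_cmds /usr/bin/gcc gcc \
    /usr/bin/aarch64-linux-gnu-gcc-4.8:2 /usr/bin/aarch64-linux-gnu-gcc-5:1
```

After checking the output, the lines can be run with root privileges, and sudo update-alternatives --config python (or gcc) then switches between the registered versions interactively.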
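3. Build and install Bazel 0.24.1
3.1 Install the Python packages with pip
If several Python versions are installed, keep them straight and install each package with the pip that belongs to the intended version. Unlike the tensorflow website, I did not install with --user, so that each Python version keeps its packages in its own directory and they do not conflict.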
sudo pip3.7 install numpy==1.16.5 wheel -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
sudo pip3.7 install keras_preprocessing --no-deps -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
3.2 Build Bazel
Likewise, installing Bazel with the installer script did not work, so Bazel is also built from source. See "Build Bazel from scratch (bootstrapping)" in the official Bazel documentation.
Note:
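When downloading the Bazel source from GitHub, be sure to pick the file named bazel-<version>-dist.zip; otherwise the build may fail.
While building Bazel I hit error: ambiguating new declaration of ‘long int gettid()’. It comes from compiling grpc: glibc 2.30 and later (ubuntu 20.04 ships 2.31) declare their own gettid(), which clashes with grpc's static helper of the same name. The fix is to edit the following two files and rename gettid to sys_gettid.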
vim bazel/third_party/grpc/src/core/lib/gpr/log_linux.cc:
/* before */
43:static long gettid(void) { return syscall(__NR_gettid); }
73: if (tid == 0) tid = gettid();
/* after */
43:static long sys_gettid(void) { return syscall(__NR_gettid); }
73: if (tid == 0) tid = sys_gettid();
vim bazel/third_party/grpc/src/core/lib/gpr/log_posix.cc:
/* before */
33:static intptr_t gettid(void) { return (intptr_t)pthread_self(); }
86: gpr_asprintf(&prefix, "%s%s.%09d %7tu %s:%d]",
gpr_log_severity_string(args->severity), time_buffer,
(int)(now.tv_nsec), gettid(), display_file, args->line);
/* after */
33:static intptr_t sys_gettid(void) { return (intptr_t)pthread_self(); }
86: gpr_asprintf(&prefix, "%s%s.%09d %7tu %s:%d]",
gpr_log_severity_string(args->severity), time_buffer,
(int)(now.tv_nsec), sys_gettid(), display_file, args->line);
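The same rename can be scripted instead of done by hand in vim. This is a sketch under assumptions: the helper name rename_gettid is made up, and the pattern only touches gettid( when it is not part of a longer identifier, so __NR_gettid stays as it is and running it twice changes nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename definitions and calls of gettid( to sys_gettid(
# in the given files. The match requires the start of the line or a
# non-identifier character before "gettid(", so identifiers that merely end
# in "gettid" (for example sys_gettid or __NR_gettid) are left alone.
rename_gettid() {
    local f
    for f in "$@"; do
        sed -E 's/(^|[^A-Za-z0-9_])gettid\(/\1sys_gettid(/g' "$f" > "$f.tmp" \
            && mv "$f.tmp" "$f"
    done
}

# Usage, from the directory that contains the extracted bazel source tree:
# rename_gettid bazel/third_party/grpc/src/core/lib/gpr/log_linux.cc \
#               bazel/third_party/grpc/src/core/lib/gpr/log_posix.cc
```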
After the changes, run env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh again and Bazel builds successfully.
Note:
Building Bazel requires a Java environment; JDK 1.8 is the best choice.
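Put together, the bootstrap might look like the sequence below. It is a sketch, not the exact commands I ran: the helper names, the bazel extraction directory, and the /usr/local/bin install location are assumptions, and the steps are printed rather than executed so they can be reviewed first. The compile.sh invocation is the one given above, and compile.sh leaves the resulting binary in output/bazel.

```shell
#!/usr/bin/env bash
# Sketch of the bootstrap sequence for a Bazel dist archive.
# BAZEL_VERSION and the directory names are illustrative assumptions.
BAZEL_VERSION="0.24.1"

# Build the GitHub release URL of the -dist archive for a given version.
bazel_dist_url() {
    echo "https://github.com/bazelbuild/bazel/releases/download/$1/bazel-$1-dist.zip"
}

# Print the steps instead of running them.
print_bootstrap_steps() {
    local v="$1"
    cat <<EOF
wget $(bazel_dist_url "$v")
unzip bazel-$v-dist.zip -d bazel
cd bazel
env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh
sudo cp output/bazel /usr/local/bin/bazel
EOF
}

print_bootstrap_steps "$BAZEL_VERSION"
```

If the gettid error shows up, the two grpc files are patched after the unzip step and compile.sh is run again.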
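4. Build tensorflow 1.14.0
4.1 Download the source
Now for tensorflow itself. First download the 1.14.0 source from GitHub: https://github.com/tensorflow/tensorflow/releases?q=1.14.0&expanded=true.
4.2 Set the build options
After extracting the archive, enter the tensorflow 1.14.0 directory and run the ./configure script to set the build options: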
Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.24.1- (@non-git) installed.
Please specify the location of python. [Default is /usr/bin/python]:
Found possible Python library paths:
/usr/local/lib/python3.7/site-packages
Please input the desired Python library path to use. Default is [/usr/local/lib/python3.7/site-packages]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]:
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]:
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]:
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]:
No CUDA support will be enabled for TensorFlow.
Do you wish to download a fresh release of clang? (Experimental) [y/N]:
Clang will not be downloaded.
Do you wish to build TensorFlow with MPI support? [y/N]:
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=numa # Build with NUMA support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apache Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished
Note:
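Answering no to everything is recommended; enable CUDA if you have it.
4.3 Build tensorflow
Run the following command to build. --local_resources sets the resources Bazel may use for the build: available RAM (MB), number of CPU cores, and available I/O (1.0 corresponds to an average workstation). The two --conlyopt flags apply only to C sources: -std=gnu99 selects the GNU C99 dialect and -w suppresses warnings.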
bazel build --conlyopt="-std=gnu99" --conlyopt="-w" --local_resources 14436,4.0,1.0 //tensorflow/tools/pip_package:build_pip_package
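The three --local_resources values depend on the machine. The sketch below formats them and, where /proc/meminfo and nproc are available, derives a suggestion from the hardware. The helper name local_resources and the choice to hand Bazel about 80% of physical RAM are assumptions for illustration, not values from the tensorflow documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format the three --local_resources values.
#   $1 = RAM in MB, $2 = CPU cores (integer); I/O is fixed at 1.0.
local_resources() {
    printf '%s,%s.0,1.0\n' "$1" "$2"
}

# The values used in this article: 14436 MB and 4 cores.
echo "--local_resources $(local_resources 14436 4)"

# Assumption: leave roughly 20% of physical RAM to the rest of the system.
if [ -r /proc/meminfo ] && command -v nproc >/dev/null 2>&1; then
    mem_mb="$(awk '/^MemTotal:/ {print int($2 / 1024 * 0.8)}' /proc/meminfo)"
    echo "--local_resources $(local_resources "$mem_mb" "$(nproc)")"
fi
```

Lowering the RAM and core values is the usual first step when the build gets killed for running out of memory.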
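4.3.1 Dependency download failures
Building tensorflow needs many dependency packages. This is not an offline build, so they are downloaded over the network, but some downloads fail and have to be fetched by hand. Then, in the four files ./WORKSPACE, ./third_party/icu/workspace.bzl, ./tensorflow/workspace.bzl, and ./third_party/flatbuffers/workspace.bzl, replace the URL of each affected package with the path of the downloaded file.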
Note:
Most download failures are network problems, so instead of downloading by hand you can also simply retry a few times.
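Retrying can be automated with a small loop. This is a minimal sketch; the function name retry and the attempt count are arbitrary, and the bazel fetch call in the usage comment is one possible way to pre-download the external dependencies of the pip package target.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to $1 times, stopping at the first success.
retry() {
    local attempts="$1" i
    shift
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i of $attempts failed: $*" >&2
    done
    return 1
}

# Usage, from the tensorflow source directory:
# retry 5 bazel fetch //tensorflow/tools/pip_package:build_pip_package
```

Bazel keeps what it has already downloaded, so each retry only has to fetch the packages that failed earlier.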
4.3.2 C++ compilation of rule ‘@grpc//:gpr_base’ failed (Exit 1):
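This error resembles the Bazel build failure above and also comes from compiling grpc. Likewise, change gettid to sys_gettid in /home/wsn/.cache/bazel/_bazel_wsn/a7132f72a8b641c1ebbdd6a7fd1fd5fb/external/grpc/src/core/lib/gpr/log_linux.cc and /home/wsn/.cache/bazel/_bazel_wsn/a7132f72a8b641c1ebbdd6a7fd1fd5fb/external/grpc/src/core/lib/gpr/log_posix.cc, and the build succeeds. The user name and the hash in these paths are specific to my machine.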
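Because the cache path differs per user and per workspace, it can be looked up rather than typed. The sketch below assumes the files sit under the Bazel output base, which bazel info output_base prints; the helper name grpc_log_sources is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the two grpc log sources under a Bazel output base.
grpc_log_sources() {
    local base="$1"
    echo "$base/external/grpc/src/core/lib/gpr/log_linux.cc"
    echo "$base/external/grpc/src/core/lib/gpr/log_posix.cc"
}

# Usage, from inside the tensorflow source directory:
# grpc_log_sources "$(bazel info output_base)"
```

The external/grpc directory only exists once Bazel has downloaded grpc, so this lookup works after the first failed build attempt.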
4.3.3 depthwiseconv_uint8_3x3_filter.h:3957:58: error:
Compiling this file fails with cannot convert ‘uint8x16_t {aka __vector(16) unsigned char}’ to ‘const int8x16_t {aka const __vector(16) signed char}’. Adding the lines marked + below to ./tensorflow/lite/build_def.bzl fixes the build. Reference: https://github.com/tensorflow/tensorflow/pull/29515.
"/DTF_COMPILE_LIBRARY",
"/wd4018", # -Wno-sign-compare
],
+ str(Label("//tensorflow:linux_aarch64")): [
+ "-flax-vector-conversions",
+ "-fomit-frame-pointer",
+ ],
"//conditions:default": [
"-Wno-sign-compare",
],
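After that, just wait for tensorflow to finish building. It takes a long time.
5. Build the .whl file and install tensorflow
Run the following command to build the .whl file; the output goes to the /tmp/tensorflow_pkg directory: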
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
Change into the output directory and install tensorflow 1.14.0 with:
sudo pip3.7 install tensorflow-1.14.0-cp37-cp37m-linux_aarch64.whl -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
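The wheel file name encodes the interpreter (cp37) and the platform (linux_aarch64), and pip refuses a wheel whose tags do not match. The sketch below compares the platform tag with the current machine; the helper name wheel_platform is an assumption for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the platform tag (last dash-separated field)
# from a wheel file name.
wheel_platform() {
    local name="${1##*/}"   # drop any directory part
    name="${name%.whl}"     # drop the .whl extension
    echo "${name##*-}"      # keep what follows the last dash
}

whl="tensorflow-1.14.0-cp37-cp37m-linux_aarch64.whl"
if [ "$(wheel_platform "$whl")" = "linux_$(uname -m)" ]; then
    echo "$whl matches this machine"
else
    echo "$whl was built for $(wheel_platform "$whl"), this machine is linux_$(uname -m)"
fi
```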
Testing import tensorflow fails with TypeError: Descriptors cannot not be created directly, followed by this hint:
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
Downgrade protobuf as the hint suggests, for example to a 3.20.x release:
sudo pip3.7 install protobuf==3.20.3 -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Test again; the installation works.
>>> import tensorflow
>>> print(tensorflow.__version__)
1.14.0
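6. Pitfalls with tensorflow 1.13.1
6.1 Build-time pitfalls
Besides the problems in 4.3.1 and 4.3.2 above, aws-sdk in tensorflow 1.13.1 has poor support for linux-aarch64, which leads to ImportError: ....undefined symbol: _ZN3Aws11Environment6GetEnvB5cxx11EPKc during the build. The fix is to edit ./tensorflow/BUILD and add the lines marked + below: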
visibility = ["//visibility:public"],
)
+ config_setting(
+ name = "linux_aarch64",
+ values = {"cpu": "aarch64"},
+ visibility = ["//visibility:public"],
+ )
config_setting(
name = "linux_x86_64",
values = {"cpu": "k8"},
Then edit ./third_party/aws/BUILD.bazel and add the lines marked + below:
cc_library(
name = "aws",
srcs = select({
+ "@org_tensorflow//tensorflow:linux_aarch64": glob([
+ "aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
+ ]),
"@org_tensorflow//tensorflow:linux_x86_64": glob([
"aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
]),
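Reference: https://github.com/tensorflow/tensorflow/pull/22856.
6.2 Install-time pitfalls
While installing tensorflow, the dependency h5py failed to install. I looked for a .whl file instead, and after a long search the only aarch64 build I could find was for python3.7. So I gave up on tensorflow 1.13.1 and rebuilt python3.7, bazel-0.24.1, and tensorflow 1.14.0 to complete the installation. There may well be other solutions; I reinstalled everything out of frustration without thinking it through.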
Note:
According to the official tensorflow documentation, tensorflow 1.13.1 can also be built with python3.7, but I did not try it.