Compiling and installing TensorFlow 1.14.0 on linux-aarch64

1. Pin down the versions

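       First, know your system. This article installs TensorFlow 1.14.0 on Ubuntu 20.04 (aarch64 architecture). The prebuilt .whl files I found online did not work, so I built from source instead. Based on the TensorFlow website, the matching versions are listed below.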

TensorFlow   GCC     Bazel    Python   NumPy
1.14.0       5.3.1   0.24.1   3.7.6    1.16.5
1.13.1       5.3.1   0.19.2   3.5.3    1.16.5

Note: The TensorFlow website lists GCC 4.8, but with that version the build fails because of differences between C89 and C99, so GCC 5.3.1 is used here.
 
Note: NumPy should preferably be no newer than 1.19.0.
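
Before starting, it can help to compare the installed toolchain against the table. The following is only a sketch: the helper name version_le is made up for this example, and it assumes GNU sort with -V support is available.

```shell
#!/usr/bin/env bash
# version_le A B: succeed when version A is lower than or equal to version B.
# Hypothetical helper for illustration; relies on GNU `sort -V`.
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example usage (commented out so that sourcing this file changes nothing):
# gcc --version | head -n 1
# python3.7 --version
# numpy_ver=$(python3.7 -c 'import numpy; print(numpy.__version__)')
# version_le "$numpy_ver" 1.19.0 || echo "numpy $numpy_ver is newer than 1.19.0"
```

Comparing with sort -V rather than plain string comparison matters here, because as strings 1.9.0 would sort after 1.19.0.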

2. Managing versions

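       Getting the right combination of versions rarely works on the first try, so this article uses Ubuntu's update-alternatives command to switch between Python and GCC versions. Reference: https://blog.csdn.net/a1809032425/article/details/122729307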

There are 4 choices for the alternative python (providing /usr/bin/python).

  Selection    Path                      Priority   Status
------------------------------------------------------------
  0            /usr/local/bin/python3.5   3         auto mode
  1            /usr/bin/python2.7         1         manual mode
  2            /usr/bin/python3.8         1         manual mode
  3            /usr/local/bin/python3.5   3         manual mode
* 4            /usr/local/bin/python3.7   1         manual mode

Press <enter> to keep the current choice[*], or type selection number: 
There are 5 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path                                Priority   Status
------------------------------------------------------------
  0            /usr/bin/aarch64-linux-gnu-gcc-4.8   2         auto mode
  1            /usr/bin/aarch64-linux-gnu-gcc-4.8   2         manual mode
* 2            /usr/bin/aarch64-linux-gnu-gcc-5     1         manual mode
  3            /usr/bin/aarch64-linux-gnu-gcc-9     1         manual mode
  4            /usr/bin/gcc-4.8                     1         manual mode
  5            /usr/bin/gcc-9                       1         manual mode

Press <enter> to keep the current choice[*], or type selection number: 
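
The listings above come from update-alternatives --config. A minimal sketch of how such entries could be registered is shown below. The wrapper name register_alt is hypothetical, and the paths and priorities are copied from the listings, so adjust them to your own system.

```shell
#!/usr/bin/env bash
# register_alt NAME PATH PRIORITY
# Register PATH as a candidate for /usr/bin/NAME (hypothetical wrapper
# around the real `update-alternatives --install` command).
register_alt() {
    sudo update-alternatives --install "/usr/bin/$1" "$1" "$2" "$3"
}

# Example usage (commented out; paths and priorities are from the listings above):
# register_alt python /usr/local/bin/python3.7 1
# register_alt gcc /usr/bin/aarch64-linux-gnu-gcc-5 1
# sudo update-alternatives --config python   # pick a version interactively
# sudo update-alternatives --config gcc
```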

3. Building and installing Bazel 0.24.1

3.1 Installing Python packages with pip

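       If several Python versions are installed, keep them well organized and install packages with the pip that belongs to the intended version. Unlike the TensorFlow website, the commands here do not use --user; the goal is for each Python version to install packages into its own directory so that they do not conflict.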

sudo pip3.7 install numpy==1.16.5 wheel -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
sudo pip3.7 install keras_preprocessing --no-deps -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

3.2 Building Bazel

       Likewise, installing Bazel with the installer script did not work, so Bazel is also built from source. See "Build Bazel from scratch (bootstrapping)" in the official Bazel documentation.

Note: When downloading the Bazel source from GitHub, be sure to pick the file named like bazel-<version>-dist.zip; otherwise the build may fail.
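
The bootstrap steps might look roughly like the sketch below. The helper bazel_dist_url is hypothetical and the URL pattern is an assumption based on how Bazel releases are laid out on GitHub, so verify it on the release page. The target directory bazel and the install location /usr/local/bin are also just choices made for this example.

```shell
#!/usr/bin/env bash
# bazel_dist_url VERSION: print the assumed download URL of the -dist archive.
# Hypothetical helper; check the URL against the GitHub release page.
bazel_dist_url() {
    echo "https://github.com/bazelbuild/bazel/releases/download/$1/bazel-$1-dist.zip"
}

# Example usage (commented out; run the steps one at a time):
# wget "$(bazel_dist_url 0.24.1)"
# unzip bazel-0.24.1-dist.zip -d bazel
# cd bazel
# env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh
# sudo cp output/bazel /usr/local/bin/bazel   # compile.sh leaves the binary in output/
```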

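       While building Bazel, the error "error: ambiguating new declaration of ‘long int gettid()’" appeared. It comes from compiling grpc: newer glibc (2.30 and later) declares its own gettid(), which clashes with grpc's static helper of the same name. To fix it, edit the following two files, renaming gettid to sys_gettid.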

vim bazel/third_party/grpc/src/core/lib/gpr/log_linux.cc
/* Before (lines 43 and 73) */
static long gettid(void) { return syscall(__NR_gettid); }
  if (tid == 0) tid = gettid();
/* After */
static long sys_gettid(void) { return syscall(__NR_gettid); }
  if (tid == 0) tid = sys_gettid();

vim bazel/third_party/grpc/src/core/lib/gpr/log_posix.cc
/* Before (lines 33 and 86) */
static intptr_t gettid(void) { return (intptr_t)pthread_self(); }
  gpr_asprintf(&prefix, "%s%s.%09d %7tu %s:%d]",
               gpr_log_severity_string(args->severity), time_buffer,
               (int)(now.tv_nsec), gettid(), display_file, args->line);
/* After */
static intptr_t sys_gettid(void) { return (intptr_t)pthread_self(); }
  gpr_asprintf(&prefix, "%s%s.%09d %7tu %s:%d]",
               gpr_log_severity_string(args->severity), time_buffer,
               (int)(now.tv_nsec), sys_gettid(), display_file, args->line);
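
The same rename could also be applied non-interactively instead of editing in vim. This is a sketch under assumptions: rename_gettid is a made-up helper name, and it relies on GNU sed (for -i and the \b word boundary). Because the underscore counts as a word character, __NR_gettid is left untouched and running the helper twice does not produce sys_sys_gettid.

```shell
#!/usr/bin/env bash
# rename_gettid FILE...
# Rename the standalone identifier gettid to sys_gettid in place.
# Hypothetical helper; needs GNU sed for `-i` and `\b`.
rename_gettid() {
    sed -i 's/\bgettid\b/sys_gettid/g' "$@"
}

# Example usage (commented out; paths are the two files named above):
# rename_gettid bazel/third_party/grpc/src/core/lib/gpr/log_linux.cc \
#               bazel/third_party/grpc/src/core/lib/gpr/log_posix.cc
```

The helper can be pointed at the grpc copies under Bazel's cache directory as well, which is where the same error shows up in section 4.3.2.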

       After these changes, run env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh again and Bazel builds successfully.

Note: Building Bazel requires a Java environment; JDK 1.8 is the recommended choice.

4. Building TensorFlow 1.14.0

4.1 Downloading the source

       Now for TensorFlow itself. First download the 1.14.0 source from GitHub: https://github.com/tensorflow/tensorflow/releases?q=1.14.0&expanded=true

4.2 Setting the build options

       Unpack the archive, enter the tensorflow1.14.0 directory, and run the ./configure script to set the build options.

Extracting Bazel installation...
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.24.1- (@non-git) installed.
Please specify the location of python. [Default is /usr/bin/python]: 


Found possible Python library paths:
  /usr/local/lib/python3.7/site-packages
Please input the desired Python library path to use.  Default is [/usr/local/lib/python3.7/site-packages]

Do you wish to build TensorFlow with XLA JIT support? [Y/n]: 
XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: 
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with ROCm support? [y/N]: 
No ROCm support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: 
No CUDA support will be enabled for TensorFlow.

Do you wish to download a fresh release of clang? (Experimental) [y/N]: 
Clang will not be downloaded.

Do you wish to build TensorFlow with MPI support? [y/N]: 
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]: 


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: 
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
	--config=mkl         	# Build with MKL support.
	--config=monolithic  	# Config for mostly static monolithic build.
	--config=gdr         	# Build with GDR support.
	--config=verbs       	# Build with libverbs support.
	--config=ngraph      	# Build with Intel nGraph support.
	--config=numa        	# Build with NUMA support.
	--config=dynamic_kernels	# (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
	--config=noaws       	# Disable AWS S3 filesystem support.
	--config=nogcp       	# Disable GCP support.
	--config=nohdfs      	# Disable HDFS support.
	--config=noignite    	# Disable Apache Ignite support.
	--config=nokafka     	# Disable Apache Kafka support.
	--config=nonccl      	# Disable NVIDIA NCCL support.
Configuration finished

Note: Answering no to everything is recommended; CUDA can be enabled if you have it.

4.3 Building TensorFlow

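       Run the command below to build. --local_resources sets the resources the build may use, in this order: memory (in MB), number of CPU cores, and available I/O capacity (1.0 being an average workstation).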

bazel build --conlyopt="-std=gnu99" --conlyopt="-w" --local_resources 14436,4.0,1.0 //tensorflow/tools/pip_package:build_pip_package
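
The three --local_resources values can be derived from the machine instead of being typed by hand. The sketch below is one possible approach, not the method used in this article: the helper name local_resources is made up, and leaving a quarter of the RAM to the rest of the system is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# local_resources MEM_MB CORES
# Print a value for --local_resources that uses 3/4 of MEM_MB (an arbitrary
# assumption), all CORES, and an I/O factor of 1.0. Hypothetical helper.
local_resources() {
    echo "$(( $1 * 3 / 4 )),$2.0,1.0"
}

# Example usage (commented out):
# mem_mb=$(free -m | awk '/^Mem:/ {print $2}')
# bazel build --conlyopt="-std=gnu99" --conlyopt="-w" \
#     --local_resources "$(local_resources "$mem_mb" "$(nproc)")" \
#     //tensorflow/tools/pip_package:build_pip_package
```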

4.3.1 Dependency download failures

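       Building TensorFlow needs many dependency packages. This is not an offline build, so they are fetched over the network, but some downloads fail. Those packages have to be downloaded manually, and the corresponding package paths in the following four files replaced with the paths of the downloaded files:
./WORKSPACE
./third_party/icu/workspace.bzl
./tensorflow/workspace.bzl
./third_party/flatbuffers/workspace.bzl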

Note: Failed downloads are mostly caused by network problems, so instead of downloading manually you can simply retry a few times.
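
Retrying can be automated with a small loop. The function below is a generic sketch (the name retry is made up for this example, and it is not part of Bazel); it reruns a command until it succeeds or the attempt limit is reached.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS COMMAND [ARGS...]
# Run COMMAND up to ATTEMPTS times and stop at the first success.
# Hypothetical helper for illustration.
retry() {
    local attempts=$1
    shift
    local i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        i=$((i + 1))
    done
    return 1
}

# Example usage (commented out):
# retry 5 bazel build --conlyopt="-std=gnu99" --conlyopt="-w" \
#     --local_resources 14436,4.0,1.0 //tensorflow/tools/pip_package:build_pip_package
```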

4.3.2 C++ compilation of rule ‘@grpc//:gpr_base’ failed (Exit 1):

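       This error is similar to the Bazel build failure and also arises while compiling grpc. In the same way, rename gettid to sys_gettid in the following two files and the build succeeds (the user name and hash in the path differ from machine to machine):
/home/wsn/.cache/bazel/_bazel_wsn/a7132f72a8b641c1ebbdd6a7fd1fd5fb/external/grpc/src/core/lib/gpr/log_linux.cc
/home/wsn/.cache/bazel/_bazel_wsn/a7132f72a8b641c1ebbdd6a7fd1fd5fb/external/grpc/src/core/lib/gpr/log_posix.cc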

4.3.3 depthwiseconv_uint8_3x3_filter.h:3957:58: error:

       Compiling this file fails with the error cannot convert ‘uint8x16_t {aka __vector(16) unsigned char}’ to ‘const int8x16_t {aka const __vector(16) signed char}’. Adding the lines marked with + below to ./tensorflow/lite/build_def.bzl makes the build succeed. Reference: https://github.com/tensorflow/tensorflow/pull/29515

            "/DTF_COMPILE_LIBRARY",
            "/wd4018",  # -Wno-sign-compare
        ],
+       str(Label("//tensorflow:linux_aarch64")): [
+           "-flax-vector-conversions",
+           "-fomit-frame-pointer",
+       ],
        "//conditions:default": [
            "-Wno-sign-compare",
        ],

After that, just wait for the TensorFlow build to finish. It takes a long time…

5. Building the .whl file and installing TensorFlow

       Run the command below to build the .whl file; the output goes to the /tmp/tensorflow_pkg directory.

./bazel-bin/tensorflow/tools/pip_package/build_pip_package  /tmp/tensorflow_pkg

       Change to the output directory and run the following to install TensorFlow 1.14.0:

sudo pip3.7 install tensorflow-1.14.0-cp37-cp37m-linux_aarch64.whl -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

       Testing import tensorflow raised the error TypeError: Descriptors cannot not be created directly. Following the hint in the message below, downgrade the protobuf package.

If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

sudo pip3.7 install "protobuf<3.21" -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

       Test again; the installation now works.

>>> import tensorflow
>>> print(tensorflow.__version__)
1.14.0

6. Pitfalls when installing TensorFlow 1.13.1

6.1 Build-time pitfalls

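       Besides the problems in 4.3.1 and 4.3.2 above, the aws-sdk bundled with TensorFlow 1.13.1 has poor support for linux-aarch64, which produces the error ImportError: ....undefined symbol: _ZN3Aws11Environment6GetEnvB5cxx11EPKc. The fix is to add the lines marked with + below to ./tensorflow/BUILD: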

		   visibility = ["//visibility:public"],
		)
+		config_setting(
+		    name = "linux_aarch64",
+		    values = {"cpu": "aarch64"},
+		    visibility = ["//visibility:public"],
+		)
		config_setting(
		    name = "linux_x86_64",
		    values = {"cpu": "k8"},

Then edit ./third_party/aws/BUILD.bazel and add the lines marked with + below:

		cc_library(
		    name = "aws",
		    srcs = select({
+		        "@org_tensorflow//tensorflow:linux_aarch64": glob([
+		            "aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
+		        ]),
		        "@org_tensorflow//tensorflow:linux_x86_64": glob([
		            "aws-cpp-sdk-core/source/platform/linux-shared/*.cpp",
		        ]),

Reference: https://github.com/tensorflow/tensorflow/pull/22856

6.2 Install-time pitfalls

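       When installing TensorFlow, the dependency h5py failed to install. I looked for a .whl file to install instead, but after a long search the only aarch64 build I could find was for Python 3.7. So I gave up on TensorFlow 1.13.1 and rebuilt Python 3.7, Bazel 0.24.1, and TensorFlow 1.14.0 to complete the installation. There may well be other solutions; I reinstalled everything out of frustration without thinking it through…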

Note: According to the official TensorFlow documentation, TensorFlow 1.13.1 can also be built with Python 3.7, but I did not try it.
