Get TensorFlow R0.9 running on the TX1 with Bazel 0.2.1, CUDA 8.0, cuDNN 5.1, and L4T 24.2 from the new JetPack 2.3 installation. I tested it with the core network types (MLP, conv, and LSTM) using batch norm, sigmoid, ReLU, etc., and saw no errors. I removed sparse_matmul_op; otherwise I believe the build is fully functional. Many of these steps come directly from MaxCuda's excellent blog, so many thanks to them.
I plan to keep hammering on R0.10/R0.11 (the gRPC binary is blocking Bazel 0.3.0 right now), but until then I decided to publish the recipe for R0.9. Here it is:
Get Java first:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
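Bazel's build needs a working JDK 8, so it is worth a quick sanity check before going further (plain Java tooling, nothing TX1-specific):

java -version   # should report java version "1.8.0_..."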
Install the other dependencies:
sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev maven swig
You need to build protobuf 3.0.0-beta-2 yourself:
git clone https://github.com/google/protobuf.git
cd protobuf
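The clone alone isn't enough: you need to check out the beta-2 tag, build and install protoc (so it lands in /usr/bin, where the Bazel step below expects it), and build the Java runtime jar with Maven. A minimal sketch of the standard protobuf build for that release; the tag name and install prefix are my assumptions, not spelled out in the original write-up:

git checkout v3.0.0-beta-2
./autogen.sh
./configure --prefix=/usr   # so protoc ends up at /usr/bin/protoc
make -j 4
sudo make install
# Build the Java runtime jar that the Bazel setup below copies into place
cd java
mvn package
cd ../..   # back to the parent so the bazel clone is a sibling of protobuf/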
Get Bazel. We need version 0.2.1; it does not require the gRPC binary, unlike 0.3.0, which I still can't build (maybe soon!).
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout 0.2.1
cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe
cp ../protobuf/java/target/protobuf-java-3.0.0-beta-2.jar third_party/protobuf/protobuf-java-3.0.0-beta-1.jar
You need to edit a Bazel source file so it recognizes aarch64 as ARM:
--- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
+++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
@@ -25,7 +25,7 @@ import java.util.Set;
 public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   UNKNOWN("unknown", ImmutableSet.<String>of());
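For reference, the reason this patch is needed: L4T 24.2 on the TX1 runs a 64-bit kernel that reports the CPU as aarch64, which stock Bazel 0.2.1 does not map to any known architecture. You can check what your board reports with:

uname -m   # prints aarch64 on the TX1 under L4T 24.2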
Now compile
./compile.sh
And install it:
sudo cp output/bazel /usr/local/bin
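A quick check that the freshly built binary is picked up from the PATH (the exact output of a from-source build may vary):

bazel version   # expect a build label of 0.2.1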
Get TensorFlow R0.9. Anything newer than R0.9 requires Bazel 0.3.0, which I have not yet figured out how to build due to the gRPC problems.
git clone -b r0.9 https://github.com/tensorflow/tensorflow.git
cd tensorflow
Build once. The build will fail, but you will then have a Bazel cache directory (~/.cache/bazel) into which you can put updated config.guess and config.sub files that can determine the architecture you are running on:
./configure
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
cd ~
wget -O config.guess 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD'
wget -O config.sub 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD'
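One way to get the downloaded files into the cache, sketched under the assumption that the stale copies live somewhere below ~/.cache/bazel (the hashed directory names differ per machine, so locate them rather than hardcoding paths):

# Overwrite every stale copy of the two files in the Bazel cache
find ~/.cache/bazel -name config.guess -exec cp ~/config.guess {} \;
find ~/.cache/bazel -name config.sub -exec cp ~/config.sub {} \;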
sparse_matmul_op had a couple of errors, so I took the sneaky route and removed it from the build:
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -985,7 +985,7 @@ tf_kernel_libraries(
         "reduction_ops",
         "segment_reduction_ops",
         "sequence_ops",
-        "sparse_matmul_op",
+        #DC "sparse_matmul_op",
     ],
     deps = [
         ":bounds_check",
--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -1110,7 +1110,7 @@ medium_kernel_test_list = glob([
     "kernel_tests/seq2seq_test.py",
     "kernel_tests/slice_op_test.py",
     "kernel_tests/sparse_ops_test.py",
-    "kernel_tests/sparse_matmul_op_test.py",
+    #DC "kernel_tests/sparse_matmul_op_test.py",
     "kernel_tests/sparse_tensor_dense_matmul_op_test.py",
 ])
The TX1 can't handle the fancy brace-initializer constructors in cwise_op_gpu_select.cu.cc, so build the Eigen index arrays element by element instead:
--- a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
+++ b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
@@ -43,8 +43,14 @@ struct BatchSelectFunctor<GPUDevice, T> {
     const int all_but_batch = then_flat_outer_dims.dimension(1);
 
 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
-    Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    //DC Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
+    Eigen::array<int, 2> broadcast_dims;
+    broadcast_dims[0] = 1;
+    broadcast_dims[1] = all_but_batch;
+    //DC Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    Eigen::Tensor<int, 2>::Dimensions reshape_dims;
+    reshape_dims[0] = batch;
+    reshape_dims[1] = 1;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
     broadcast_dims.set(1, all_but_batch);
The same fix applies in sparse_tensor_dense_matmul_op_gpu.cu.cc:
--- a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
@@ -104,9 +104,17 @@ struct SparseTensorDenseMatMulFunctor<GPUDevice, T, ADJ_A, ADJ_B> {
     int n = (ADJ_B) ? b.dimension(0) : b.dimension(1);
 
 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
-    Eigen::array<int, 2> n_by_1{{ n, 1 }};
-    Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    //DC Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
+    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz;
+    matrix_1_by_nnz[0] = 1;
+    matrix_1_by_nnz[1] = nnz;
+    //DC Eigen::array<int, 2> n_by_1{{ n, 1 }};
+    Eigen::array<int, 2> n_by_1;
+    n_by_1[0] = n;
+    n_by_1[1] = 1;
+    //DC Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    Eigen::array<int, 1> reduce_on_rows;
+    reduce_on_rows[0] = 0;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> matrix_1_by_nnz;
     matrix_1_by_nnz.set(1, nnz);
CUDA 8.0 requires a new macro for FP16 (CUBLAS_DATA_HALF became CUDA_R_16F). Many thanks to Kashif / Mrry for pointing out the fix!
--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -25,6 +25,12 @@ limitations under the License.
 #define EIGEN_HAS_CUDA_FP16
 #endif
 
+#if CUDA_VERSION >= 8000
+#define SE_CUDA_DATA_HALF CUDA_R_16F
+#else
+#define SE_CUDA_DATA_HALF CUBLAS_DATA_HALF
+#endif
+
 #include "tensorflow/stream_executor/cuda/cuda_blas.h"
 
 #include <dlfcn.h>
@@ -1680,10 +1686,10 @@ bool CUDABlas::DoBlasGemm(
     return DoBlasInternal(
         dynload::cublasSgemmEx, stream, true /* = pointer_mode_host */,
         CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
-        CUDAMemory(a), CUBLAS_DATA_HALF, lda,
-        CUDAMemory(b), CUBLAS_DATA_HALF, ldb,
+        CUDAMemory(a), SE_CUDA_DATA_HALF, lda,
+        CUDAMemory(b), SE_CUDA_DATA_HALF, ldb,
         &beta,
-        CUDAMemoryMutable(c), CUBLAS_DATA_HALF, ldc);
+        CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
 #else
     LOG(ERROR) << "fp16 sgemm is not implemented in this cuBLAS version "
                << "(need at least CUDA 7.5)";
And finally, ARM has no NUMA nodes, so you need to add this, or you will get an immediate crash when running tf.Session():
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -888,6 +888,9 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
 // For anything more complicated/prod-focused than this, you'll likely want to
 // turn to gsys' topology modeling.
 static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
+  // DC - make this clever later. ARM has no NUMA node, just return 0
+  LOG(INFO) << "ARM has no NUMA node, hardcoding to return zero";
+  return 0;
After these changes, build and install! I hope this is helpful for some people.
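For concreteness, here is a minimal sketch of those final steps, following the standard TensorFlow pip-packaging flow of that era. The ~/tensorflow path, the /tmp/tensorflow_pkg output directory, and the exact wheel filename are my assumptions; match them to your setup:

cd ~/tensorflow   # assuming the r0.9 clone lives in your home directory
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
sudo pip install /tmp/tensorflow_pkg/tensorflow-0.9.0*.whl
# Smoke test from outside the source tree (importing from inside it can fail)
cd ~
python -c "import tensorflow as tf; print(tf.Session().run(tf.constant('hello from TX1')))"

If the import succeeds and the Session runs without hitting the NUMA crash patched above, the build is good.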