
Automatic Generation of High-Performance Convolution Kernels on ARM CPUs for Deep Learning...

Publication Type
Journal
Journal Name
IEEE Transactions on Parallel and Distributed Systems
Publication Date
Page Numbers
1 to 12
Volume
5
Issue
1

We present FastConv, an open-source, template-based code auto-generation library that automatically generates high-performance deep learning convolution kernels for arbitrary matrix/tensor shapes. FastConv is based on the Winograd algorithm, which is reportedly the highest-performing algorithm for the time-consuming convolution layers of convolutional neural networks. ARM CPUs cover a wide range of designs and specifications, from embedded devices to HPC-grade CPUs, which leads to the dilemma of how to consistently optimize Winograd-based convolution solvers for convolution layers of different shapes. FastConv addresses this problem by using templates to auto-generate multiple tuned kernel variants suited to tall-and-skinny matrices. As a performance-portable library, FastConv transparently searches for the best combination of kernel shapes, cache tiles, loop-order schedules, packing strategies, access patterns, and online/offline computations. Auto-tuning is used to search the parameter configuration space for the best performance on a given target architecture and problem size. Layer-wise experiments on the VGG-16 model confirm that tuning the Winograd library yields a 1.25x performance gain. Integrated comparisons on the Kunpeng 920 show speedups of 1.02x to 1.40x over NNPACK, 1.14x to 2.17x over Arm NN, and 1.22x to 2.48x over FeatherCNN, apart from a few cases. Furthermore, problem-size performance-portability experiments with various convolution shapes show that FastConv achieves speedups of 1.2x to 1.7x over NNPACK and 2x to 22x over the Arm NN inference engine using Winograd on the Kunpeng 920. CPU performance-portability evaluation on VGG-16 shows average speedups over NNPACK of 1.42x, 1.21x, 1.26x, 1.37x, 2.26x, and 11.02x on the Kunpeng 920, Snapdragon 835, 855, and 888, Apple M1, and AWS Graviton2, respectively.