Tracing is a common method used to debug, analyze, and monitor various systems. Even though standard tools and tracing methodologies exist for standard and distributed environments, this is not the case for heterogeneous embedded systems. Multicore systems are complex in that multiple processes run concurrently and can interfere with each other. This paper proposes to fill this gap and discusses how efficient tracing can be achieved without having common system tools, such as the Linux Trace Toolkit (LTTng), at hand on every core. We propose a generic solution to trace embedded heterogeneous systems and overcome the challenges brought by their peculiar architectures (little available memory, bare-metal CPUs, or exotic components, for instance). The solution described in this paper focuses on a generic way of correlating traces among different kinds of processors through trace synchronization, in order to analyze the global state of the system as a whole. The proposed solution was first tested on the Adapteva Parallella board. It was then improved and thoroughly validated on TI's Keystone 2 System-on-Chip (SoC). © 2019, Springer Science+Business Media, LLC, part of Springer Nature.

In this paper, we propose a profiling and tracing method for dataflow applications with GPU acceleration. Dataflow models can be represented by graphs and are widely used in many domains like signal processing or machine learning. Within the graph, the data flows along the edges, and the nodes correspond to the computing units that process the data. To accelerate the execution, some co-processing units, like GPUs, are often used for compute-intensive nodes. The work in this paper aims at providing useful information about the execution of the dataflow graph on the available hardware, in order to understand and possibly improve the performance. The collected traces include low-level information about the CPU, from the Linux kernel (system calls), as well as mid-level and high-level information about intermediate libraries like CUDA, HIP or HSA, and about the dataflow model, respectively. This is followed by post-mortem analysis and visualization steps in order to enhance the trace and show useful information to the user. To demonstrate the effectiveness of the method, it was evaluated on TensorFlow, a well-known machine learning library that uses a dataflow computational graph to represent its algorithms. We present a few examples of machine learning applications that can be optimized with the help of the information provided by our proposed method. We suggest a better placement of the computation nodes on the available hardware components for a distributed application. For example, we reduce the execution time of a face recognition application by a factor of 5. Finally, we also enhance the memory management of an application to speed up the execution.
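The trace synchronization mentioned above amounts to mapping timestamps recorded against one core's clock onto another core's clock. Here is a minimal sketch of the idea, assuming matched event pairs (such as a message sent on one core and received on another) and a simple least-squares fit; the function names and the fitting method are illustrative and are not the algorithm used in the papers.

```python
# Sketch: two cores whose clocks differ by an offset and a drift rate.
# We fit t_B ≈ a * t_A + b from pairs of matched events, then use the
# fit to map core-A timestamps into core-B time. This is a simplified
# least-squares illustration, not a production synchronization algorithm.

def fit_clock(pairs):
    """pairs: list of (t_A, t_B) timestamps of matched events. Returns (a, b)."""
    n = len(pairs)
    sx = sum(t for t, _ in pairs)
    sy = sum(u for _, u in pairs)
    sxx = sum(t * t for t, _ in pairs)
    sxy = sum(t * u for t, u in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def to_core_b_time(t_a, a, b):
    """Project a core-A timestamp onto core B's clock."""
    return a * t_a + b

# Synthetic data: core B's clock runs 1.0001x faster than core A's,
# with a 5000 ns initial offset.
pairs = [(t, 1.0001 * t + 5000) for t in (1_000, 20_000, 150_000, 900_000)]
a, b = fit_clock(pairs)
print(to_core_b_time(500_000, a, b))  # close to 505050 ns in core-B time
```

With the clocks reconciled this way, events from different processors can be merged into a single coherent timeline for global analysis.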
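The per-node execution information described above can be illustrated with a toy dataflow pipeline, where a wrapper records the wall-clock time spent in each node. All names here are hypothetical, and this user-level timing is far simpler than the kernel- and runtime-level tracing the work relies on.

```python
# Sketch of per-node profiling for a dataflow graph: each node is a
# function, edges are the data handed from one node to the next, and a
# wrapper accumulates wall-clock time per node. Illustrative only.
import time
from collections import defaultdict

node_times = defaultdict(float)  # node name -> total seconds spent

def traced(name, fn):
    """Wrap a node function so each call is timed under `name`."""
    def wrapper(*args):
        start = time.perf_counter()
        out = fn(*args)
        node_times[name] += time.perf_counter() - start
        return out
    return wrapper

# A three-node pipeline: load -> normalize -> reduce_sum.
load = traced("load", lambda: list(range(10_000)))
normalize = traced("normalize", lambda xs: [x / 10_000 for x in xs])
reduce_sum = traced("reduce_sum", lambda xs: sum(xs))

result = reduce_sum(normalize(load()))

# Report nodes by time spent, most expensive first.
for name, t in sorted(node_times.items(), key=lambda kv: -kv[1]):
    print(f"{name:>10}: {t * 1e3:.3f} ms")
```

In a real system, the same per-node attribution would come from correlating kernel, CUDA/HIP/HSA, and framework-level trace events rather than from wrappers, but the resulting report (time per graph node, sorted by cost) is the kind of information that guides node placement decisions.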