The client's software was sped up by a factor of 6 to 7. The speedup came mainly from removing vectorization barriers and improving the memory layout of the algorithm. A subsequent training course presented and explained these techniques to the staff.

Case study: Speedup and vectorization of real-time signal processing

Photo: Danilo Cedrone, NOAA Photo Library

Porting algorithmic, compute-intensive software to new hardware often brings surprises with respect to performance, and unfortunately they are seldom good ones. This was the experience of Kiel-based L-3 Communications ELAC Nautik GmbH, a leader in underwater acoustics for marine and survey applications and navigation. The in-house C++ code for real-time processing of acoustic (sonar) signals had been optimized for specialized DSP processors. For a new project, the client ported the software to a commodity Intel server processor. However, the performance achieved was unsatisfactory compared to the theoretical peak. ELAC's investigation of the generated assembly code revealed that the code did not make use of the processor's AVX vector units.

As internal resources to solve this problem were scarce, ELAC approached me. In a first, free-of-charge meeting, we defined and delimited the problem scope and agreed on the level of my involvement in finding and implementing a solution.

Analysis

The actual work was then carried out remotely. I found several causes for the poor performance:

Indirect addressing
Unclear aliasing
In some places, insufficient data locality

Software optimization

The indirect addressing could be eliminated by transforming the algorithm so that a case distinction could be moved out of the inner loop. This made automatic vectorization possible.
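To illustrate the idea, here is a minimal sketch of such a transformation. It is not the client's code: the function names, the block-wise data layout and the two weight tables `w0` and `w1` are assumptions made purely for illustration. In the first version, the table is selected for every sample through `kind[]`, so each weight access is indirect. In the second version, the case distinction is made once per block, and the inner loop only contains unit-stride accesses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before (hypothetical): the weight table is chosen per sample via kind[],
// so every weight access in the loop goes through an index lookup.
// Assumes every entry of kind is 0 or 1.
std::vector<float> apply_weights_indirect(const std::vector<float>& in,
                                          const std::vector<int>& kind,
                                          const std::vector<float>& w0,
                                          const std::vector<float>& w1,
                                          std::size_t block_len)
{
    assert(w0.size() == block_len && w1.size() == block_len);
    assert(in.size() == kind.size() * block_len);
    const float* tables[2] = {w0.data(), w1.data()};
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * tables[kind[i / block_len]][i % block_len];
    }
    return out;
}

// After: the case distinction is made once per block, outside the inner loop.
// The inner loop reads and writes contiguous memory only.
std::vector<float> apply_weights_hoisted(const std::vector<float>& in,
                                         const std::vector<int>& kind,
                                         const std::vector<float>& w0,
                                         const std::vector<float>& w1,
                                         std::size_t block_len)
{
    assert(w0.size() == block_len && w1.size() == block_len);
    assert(in.size() == kind.size() * block_len);
    std::vector<float> out(in.size());
    for (std::size_t b = 0; b < kind.size(); ++b) {
        const float* w = (kind[b] == 0) ? w0.data() : w1.data();
        const float* src = in.data() + b * block_len;
        float* dst = out.data() + b * block_len;
        for (std::size_t i = 0; i < block_len; ++i) {
            dst[i] = src[i] * w[i];
        }
    }
    return out;
}
```

Both functions compute the same result. Whether a given compiler actually vectorizes the second inner loop depends on the compiler and its flags, so it is worth checking the vectorization report or the generated assembly.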

In the second case, I improved the data locality by reordering the loops. Automatic vectorization was then achieved by moving the loops that had thereby become the inner loops into a separate function.
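The following sketch shows what such a loop reordering can look like. It is a hypothetical example, not the client's algorithm: it assumes a row-major matrix `data` whose rows are combined with per-row coefficients `coeff`, and the names `axpy`, `weighted_sum_strided` and `weighted_sum_reordered` are invented for illustration. The `__restrict` qualifier is a non-standard compiler extension (accepted by GCC, Clang and MSVC), and its use here is an assumption about how aliasing could be made explicit at the function boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The inner loop as a separate function: y[j] += a * x[j] with unit stride.
// __restrict (compiler extension) promises that x and y do not overlap.
void axpy(float a, const float* __restrict x, float* __restrict y, std::size_t n)
{
    for (std::size_t j = 0; j < n; ++j) {
        y[j] += a * x[j];
    }
}

// Before (hypothetical): the inner loop walks down a column of the row-major
// matrix, so consecutive iterations are `cols` elements apart in memory.
std::vector<float> weighted_sum_strided(const std::vector<float>& data,
                                        const std::vector<float>& coeff,
                                        std::size_t cols)
{
    assert(data.size() == coeff.size() * cols);
    std::vector<float> out(cols, 0.0f);
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t k = 0; k < coeff.size(); ++k) {
            out[j] += coeff[k] * data[k * cols + j];
        }
    }
    return out;
}

// After: loops interchanged. The inner loop now runs along a row (contiguous
// memory) and lives in axpy(), where the aliasing situation is explicit.
std::vector<float> weighted_sum_reordered(const std::vector<float>& data,
                                          const std::vector<float>& coeff,
                                          std::size_t cols)
{
    assert(data.size() == coeff.size() * cols);
    std::vector<float> out(cols, 0.0f);
    for (std::size_t k = 0; k < coeff.size(); ++k) {
        axpy(coeff[k], data.data() + k * cols, out.data(), cols);
    }
    return out;
}
```

For each output element, both versions add the contributions in the same order of `k`, so the reordering changes the memory access pattern but not the order of summation.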

Results

The algorithm was sped up by a factor of 6 to 7. This significantly surpassed the expected speedup and allowed the customer to run this part of the application on a single core, leaving the other cores available for other tasks.

The project took about 7 person-days in total.

Training

After the project, ELAC wanted to train their staff in order to be better prepared for future tasks involving the porting and efficient implementation of algorithms. I prepared and delivered a bespoke in-house training course for the software development department, which was attended by 12 staff members. The training focused on performance, memory hierarchies, vectorization and multi-core parallelization. Because it used concrete examples from the client's software and the preceding optimization project, the developers could directly see the impact of these concepts on their day-to-day work. Questions that arose could be clarified on the spot. This link between general concepts and examples from the developers' daily practice led to interesting ideas and discussions and was very well received.

The training has fully met my expectations concerning the contents and level of detail.

Marco Hahn, Head of Application Software, L3-Communications, ELAC Nautik GmbH, Kiel