Towards faster LSTM inference on financial time series
FPGAs enable programmers to tailor the electronics directly to the application in question, bypassing the general-purpose, black-box microarchitecture of CPUs and GPUs. This specialisation can help ameliorate issues such as memory-bound codes, and FPGAs are also well known to be highly energy efficient. I therefore believe they are very interesting as a potential future target for HPC workloads. Whilst programming FPGAs has traditionally been a major drawback, recent advances in the ecosystem by the two major vendors, Xilinx and Intel, mean that we can now write code in C++ and have much of the underlying process automated.
AI in high-frequency trading
AI models are now commonplace in trading, where traders seek to reduce time-to-market for emerging algorithms, improve model quality and reduce overall costs. Financial firms must therefore integrate new AI workloads into their high-frequency, low-latency trading context to stay competitive in fast-moving markets. While FPGAs have enjoyed significant popularity in financial algorithmic trading, they have been used more to speed up the execution of simple trades through network-based solutions, operating in real time at nanosecond scales, than to execute more complex workloads. Given the real-time requirements of high-frequency trading, there is only a small window in which data manipulations can occur, so these real-time transformations are, by necessity, fairly simplistic: there is no time for more advanced workloads.
However, the past few years have seen very significant improvements in both the hardware and software ecosystem for FPGAs, which are potentially a game changer in this regard and enable the integration of AI workloads, such as natural language processing and real-time inference of trading signals from market feed data, into the low-latency context. Newer hardware, such as Xilinx's Alveo and Intel's Stratix ranges, provides far more capability than ever before, and exciting developments such as the AI engines in Xilinx's latest-generation Versal ACAP open up significant possibilities. Furthermore, investment in the software ecosystem has not only improved the programmability of these devices but also driven the growth of open-source libraries, potentially reducing programming time significantly and enabling the development of more complex codes. Since these AI models typically require significant amounts of computation, an important question is the role that novel architectures can play in accelerating them.
Working with STAC
Working with STAC greatly benefits this research, the ultimate aim of which is to further our understanding of the dataflow techniques required to exploit next-generation FPGAs for AI workloads in high-frequency trading. Working with industry-standard benchmark codes, written in a mixture of C, C++ and Python, and their associated specifications means that our FPGA-based research will be applicable to these real-world problems. Having access to the STAC community also increases our ability to disseminate research findings and to gain feedback on results from experts in the field of finance.
Towards real-time market risk analysis using FPGA
In a previous paper, we explored the acceleration of STAC's industry-standard derivatives risk analysis benchmark, STAC-A2, by porting the Heston stochastic volatility model and Longstaff-Schwartz path reduction onto a Xilinx Alveo U280 FPGA, with a focus on efficiency-driven computing. The STAC-A2 benchmark focuses on real-world market risk analysis, an important, ongoing task for investors, trading firms and regulatory authorities. Whilst computational performance is one essential aspect of effective risk analysis, energy efficiency also matters, given the dedicated infrastructure involved and the increasing frequency with which derived risk information is generated and used.
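The benchmark's kernels themselves are defined by the STAC-A2 specification, but to give a rough flavour of the computation involved, the sketch below is a minimal full-truncation Euler discretisation of the Heston model in plain Python. All parameter values and names here are illustrative assumptions, not taken from the benchmark or our FPGA implementation.

```python
import math
import random

def heston_paths(s0, v0, mu, kappa, theta, xi, rho, dt, n_steps, n_paths, seed=0):
    """Simulate terminal asset prices under the Heston model using a
    full-truncation Euler scheme (variance floored at zero in the drift
    and diffusion terms, so it can go negative but never feeds back
    negatively)."""
    rng = random.Random(seed)
    terminal = []
    for _ in range(n_paths):
        s, v = s0, v0
        for _ in range(n_steps):
            z1 = rng.gauss(0.0, 1.0)
            # Correlate the variance shock with the asset shock via rho.
            z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
            v_pos = max(v, 0.0)  # full truncation
            # Log-Euler step for the asset keeps the price strictly positive.
            s *= math.exp((mu - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z1)
            v += kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * z2
        terminal.append(s)
    return terminal
```

The inner loop is a long chain of dependent floating-point operations per path, with paths independent of one another, which is exactly the shape of computation that maps well onto a deeply pipelined FPGA dataflow design.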
Previously, Xilinx developed a proprietary implementation of this benchmark on their Alveo U250 FPGAs which, running over eight U250s, obtained a 1.48 times speed-up compared to the CPU in an official STAC audit. In our work, we consistently outperform the parallel, optimised CPU version (running on two 24-core Xeon Platinum CPUs), with performance improvements of between 1.5 and 8 times and energy-efficiency improvements of between 8 and 185 times. This not only demonstrates the clear benefit of leveraging FPGAs for low-latency, efficiency-driven quantitative finance workloads; the optimisation techniques described and the lessons learnt from our numerical experiments also make a strong case for the optimisation potential of deploying financial AI models on FPGAs.
Precision and the potential for Quantization
As part of this research, I am also interested in the role of other numerical representations, including fixed-point and arbitrary-precision arithmetic, where the reconfigurable nature of FPGAs affords considerable flexibility. Measuring calculation accuracy, time to solution and energy usage reveals interesting trade-offs, especially with Xilinx's AI engines, which provide high-performance vectorisation for fixed-point arithmetic.
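Before committing to a hardware datatype (for instance one of Xilinx's arbitrary-precision fixed-point types), a quick way to get a feel for the accuracy trade-off is to emulate the quantisation in software. The sketch below rounds values to a given number of fractional bits and compares the error of a small dot product at several widths; the weights and inputs are illustrative values, not drawn from our experiments.

```python
def to_fixed(x, frac_bits):
    """Quantise x to a signed fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

# Error of a dot product (the core LSTM operation) at different precisions.
weights = [0.123456, -0.654321, 0.333333]
inputs = [1.5, -2.25, 0.75]
exact = sum(w * x for w, x in zip(weights, inputs))
for frac_bits in (4, 8, 16):
    approx = sum(to_fixed(w, frac_bits) * to_fixed(x, frac_bits)
                 for w, x in zip(weights, inputs))
    print(frac_bits, "fractional bits -> abs error", abs(exact - approx))
```

On an FPGA, every fractional bit saved narrows the multipliers and adders, so this kind of software study directly informs how much silicon and energy a given accuracy target costs.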
Sparsity
Exploiting sparsity patterns reduces both the memory footprint and the number of required computations, as only the non-zero elements are stored and operated on. With FPGAs, we can go further and build accelerators specialised to a specific sparsity pattern.