GBDT Accelerator

Decision trees are a light and efficient machine learning technique that has proved their effectiveness in several classification problems. In the context of embedded systems, energy efficiency is as important as accuracy, so it is necessary to search for efficient algorithms liable to be accelerated. This makes the decision trees a perfect target for developing an FPGA accelerator. A single decision tree is frequently not very accurate for complicated tasks but, thanks to ensemble methods, it is possible to combine several trees in order to deal with complex problems.

Gradiant Boosting Decision Trees

Despite their advantages, individual Decision Trees often lack accuracy for complex tasks, so ensemble methods like Gradient Boosting (GBDT) are used to improve performance. GBDT trains trees sequentially, correcting previous errors to refine predictions. Designers can balance accuracy, computation, and model size by adjusting the number of trees used, without retraining. This flexibility is particularly valuable in embedded systems, where memory and power are limited. Additionally, Decision Trees rely mainly on integer comparisons rather than floating­point operations, reducing computational load and allowing execution on hardware without floating­point units.

Conventional implementations of GBDT suffer from poor scalability for large datasets or a large number of features, but recently some efficient implementations have overcome this drawback. For instance, LightGBM is a highly efficient open­ source GBDT ­based framework that offers up to 20 times higher performance over conventional GBDT. Our accelerator is capable of executing GBDT models trained with LightGBM.

Design Architecture

LightGBM follows a one­vs­all strategy for classification problems that consists of training a different estimator (i.e., a set of trees) for each class, where each estimator predicts the probability of belonging to that class. Hence, during inference, each class is independent of the others, and the trees of each class can be analyzed in parallel. We take advantage of this parallelism by including one dedicated module for each class.

Accelerator Design
Accelerator Design

One goal was to store the trees in the on­chip memory resources of the FPGA to minimize data transfers with the external memory. To achieve this, the trees of a class are stored in a dedicated RAM memory called trees_nodes, using a compact 32­bit format for each node. Non­leaf nodes store an 8­bit field for the input feature, a 16­bit field for the comparison value, and a 7­bit relative address for the right child, while the left child is implicitly determined using pre­order traversal. A flag in the least significant bit indicates whether the node is a leaf. This structure optimizes storage by avoiding absolute addresses, limiting tree depth to 128, which is sufficient for GBDT models.

Leaf nodes use the 32­bit word to store a 16­bit fixed­point output, a 14­bit address for the next tree, and two flag bits to indicate if the node is a leaf and if it is the last tree in the class. The fixed­ point representation provides similar accuracy to the original 32­bit floating­point output while reducing memory requirements.

Nodes representation
Nodes representation
Tree representation
Tree representation

The other goal was to improve the speed, while trying to keep processing one node per cycle. To reduce the clock cycle we have designed a multi­ cycle implementation. The final design executes a node in three cycles, achieving an important clock­ period reduction and, at the same time, providing a clear architecture. The issue with this scheme is that we do not know the next node until the previous node has finished its execution stage. Hence we cannot fetch it in advance. Therefore, a simple pipeline will not provide any benefit. High­performance processors alleviate this problem by including complex support for speculative execution, but it is not an efficient solution for embedded systems, since it introduces significant energy overheads.

However, this problem can be solved by combining the pipeline with a multi­threading approach. The idea is to include support to interleave the execution of three different trees. This can be done by including three @_last_node registers (that is the same as having three program counters in a processor). The trees in a class are divided into three sets, and each counter manages the execution of one of these sets.

Multi­threading class diagram
Multi­threading class diagram

Experimental results

In order to evaluate the performance of our design, we have executed the inference of different models in a high performance (HP) CPU, an Intel(R) Core(TM) i5­8400 CPU @ 2.80GHz, and also in an embedded system (ES) CPU, the Dual­core ARM Cortex­A9 @ 667 MHz, with the following results:
­- HP CPU consumes x72 energy, our design is x2 faster.
­- ES CPU consumes x23 energy, our design is x30 faster.

Github Icon