GBDT Accelerator
Decision trees are a light and efficient machine learning technique that has proved their effectiveness in several classification problems. In the context of embedded systems, energy efficiency is as important as accuracy, so it is necessary to search for efficient algorithms liable to be accelerated. This makes the decision trees a perfect target for developing an FPGA accelerator. A single decision tree is frequently not very accurate for complicated tasks but, thanks to ensemble methods, it is possible to combine several trees in order to deal with complex problems.
Gradiant Boosting Decision Trees
Despite their advantages, individual Decision Trees
often lack accuracy for complex tasks, so ensemble
methods like Gradient Boosting (GBDT) are used
to improve performance. GBDT trains trees
sequentially, correcting previous errors to refine
predictions. Designers can balance accuracy,
computation, and model size by adjusting the
number of trees used, without retraining. This
flexibility is particularly valuable in embedded
systems, where memory and power are limited.
Additionally, Decision Trees rely mainly on integer
comparisons rather than floatingpoint operations,
reducing computational load and allowing
execution on hardware without floatingpoint units.
Conventional implementations of GBDT suffer
from poor scalability for large datasets or a large
number of features, but recently some efficient
implementations have overcome this drawback.
For instance, LightGBM is a highly efficient open
source GBDT based framework that offers up to 20
times higher performance over conventional
GBDT. Our accelerator is capable of executing
GBDT models trained with LightGBM.
Design Architecture
LightGBM follows a onevsall strategy for classification problems that consists of training a different estimator (i.e., a set of trees) for each class, where each estimator predicts the probability of belonging to that class. Hence, during inference, each class is independent of the others, and the trees of each class can be analyzed in parallel. We take advantage of this parallelism by including one dedicated module for each class.

One goal was to store the trees in the onchip memory resources of the FPGA to minimize data transfers with the external memory. To achieve this, the trees of a class are stored in a dedicated RAM memory called trees_nodes, using a compact 32bit format for each node. Nonleaf nodes store an 8bit field for the input feature, a 16bit field for the comparison value, and a 7bit relative address for the right child, while the left child is implicitly determined using preorder traversal. A flag in the least significant bit indicates whether the node is a leaf. This structure optimizes storage by avoiding absolute addresses, limiting tree depth to 128, which is sufficient for GBDT models.
Leaf nodes use the 32bit word to store a 16bit fixedpoint output, a 14bit address for the next tree, and two flag bits to indicate if the node is a leaf and if it is the last tree in the class. The fixed point representation provides similar accuracy to the original 32bit floatingpoint output while reducing memory requirements.


The other goal was to improve the speed, while
trying to keep processing one node per cycle. To
reduce the clock cycle we have designed a multi
cycle implementation. The final design executes a
node in three cycles, achieving an important clock
period reduction and, at the same time, providing a
clear architecture.
The issue with this scheme is that we do not know
the next node until the previous node has finished
its execution stage. Hence we cannot fetch it in
advance. Therefore, a simple pipeline will not
provide any benefit. Highperformance processors
alleviate this problem by including complex
support for speculative execution, but it is not an
efficient solution for embedded systems, since it
introduces significant energy overheads.
However, this problem can be solved by combining
the pipeline with a multithreading approach. The
idea is to include support to interleave the
execution of three different trees. This can be done
by including three @_last_node registers (that is
the same as having three program counters in a
processor). The trees in a class are divided into
three sets, and each counter manages the execution
of one of these sets.

Experimental results
In order to evaluate the performance of our design, we have
executed the inference of different models in a high
performance (HP) CPU, an Intel(R) Core(TM) i58400 CPU
@ 2.80GHz, and also in an embedded system (ES) CPU, the
Dualcore ARM CortexA9 @ 667 MHz, with the following
results:
- HP CPU consumes x72 energy, our design is x2 faster.
- ES CPU consumes x23 energy, our design is x30 faster.