Frequent Itemset Mining using the Apriori Approach
The goal of this project is to implement the Frequent Itemset Mining algorithm (the Apriori algorithm) using Xilinx's SDAccel environment (with Vivado HLS).
Brief description of the Project
Frequent itemset mining (FIM) is a widely used data-mining technique for discovering sets of items that occur frequently together in large databases. A typical example is a large supermarket chain whose sellers want to observe correlations between products in the store, for example, which items are most often bought together. This information would enable better arrangement policies for their goods and would also help a great deal with effective marketing of products. FIM is also very useful in several other application domains, such as social networks and learning networks, as well as in supermarket applications for detecting trends in sales. These applications in important fields make frequent itemset mining an area of active research today.
FIM is a very computation-intensive application, especially when the input dataset is large or sparse, or when the frequency threshold of the mining process is small. This makes it highly necessary to optimize the algorithm so that it can produce frequent itemsets within reasonable time. As such, it is interesting to find out how the algorithm performs when implemented on parallel platforms such as the FPGA.
Therefore, this project aims to implement the FIM algorithm (specifically the Apriori algorithm) on the FPGA. We intend to use the Xilinx SDAccel development environment with its C/C++/OpenCL high-level synthesis tool, so that we can develop the algorithm as C or C++ application code and then use high-level synthesis to compile this code onto an FPGA board. Using Xilinx's high-level synthesis tool is considerably easier than developing the RTL code from scratch. The flow for the SDAccel platform is as shown below:
The SDAccel workflow.
The workflow of SDAccel can be briefly described in the diagram below (diagram from Xilinx's website, source: https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html):
Benefits of the SDAccel platform to application developers.
As shown above, SDAccel provides the developer with a GPU- and CPU-like programming experience and was fundamentally created for data-center workload acceleration. Using such a tool (incorporating HLS) brings several benefits and advantages when creating an FPGA design. Among the biggest benefits SDAccel provides to the developer are:
Relieves the programmer of many low-level details associated with handwriting VHDL/Verilog code or programming the PCIe interface: SDAccel gives FPGA programmers the ability to write and synthesize their kernel code in high-level programming languages, without caring about low-level VHDL/Verilog details. These details can make handwriting and implementing RTL (VHDL/Verilog) code and PCIe I/O on the FPGA board very cumbersome. With the SDAccel platform and Vivado HLS, however, you can implement your RTL design in a supported high-level language (C, C++, OpenCL), attach certain HLS constraints and directives for the compiler, and have the awesome Vivado HLS tool automatically generate the optimal design you desire. You can even go as far as specifying which design metric you wish to optimize (e.g. area, power, latency). Additionally, manual implementation of the I/O of an FPGA design can be a hard exercise. SDAccel greatly helps the CPU-FPGA co-design developer to easily integrate the design and make calls between heterogeneous devices (e.g. FPGA and CPU) using the SDAccel API provisions.
Provides performance verification of the design via emulation, performance estimation, and quite detailed application profiling: Another useful provision of the SDAccel platform is that it gives you detailed profiling and performance-estimation results for your application design. This can be very helpful for FPGA developers in quickly locating the bottlenecks in their design, especially as the main hurdle in the FPGA world is the very lengthy compilation time. The SDAccel platform thus helps the developer quickly learn the estimated performance metrics of the kernel architecture that Vivado HLS has built, and reshape the code to modify the design if necessary.
Uses the Vivado HLS tool to direct the compiler in building flexible kernel architectures: A very useful tool in the SDAccel platform (as mentioned earlier) is Vivado HLS. This high-level synthesis tool is responsible for converting the C/C++/OpenCL kernel design into a Xilinx-specific RTL format, which can then be synthesized onto a Xilinx FPGA. Vivado HLS allows the developer to specify several in-code optimization directives to the compiler (such as loop unrolling, pipelining, array partitioning, and many more). These direct the compiler on what architecture it should create for the FPGA.
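As a small illustration of such directives, a minimal HLS-style kernel might look like the sketch below. This is our own toy example, not the project's kernel: the `#pragma HLS` lines are hints to Vivado HLS and are ignored by an ordinary C++ compiler, so the function still behaves as plain C++.

```cpp
const int N = 64;

// Toy HLS-style kernel sketch (not the project's actual kernel).
// The pragmas request array partitioning, loop pipelining, and partial
// unrolling; a standard C++ compiler simply ignores them.
void vadd(const int a[N], const int b[N], int out[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        out[i] = a[i] + b[i];  // element-wise addition
    }
}
```

With pipelining and unrolling, Vivado HLS can overlap loop iterations and process several elements per clock cycle, which is the kind of architectural choice these directives steer.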
Easy integration of CPU and FPGA to co-execute an application: As mentioned earlier, SDAccel enables the programmer to write applications on the CPU that invoke the FPGA for compute-intensive tasks (i.e. parallelizable tasks which would otherwise have taken the CPU a long time to complete). This way, SDAccel enables CPU application developers to take advantage of the parallelization the FPGA can offer to accelerate their applications.
Description of the Algorithm to be implemented.
The algorithm to be implemented is the Frequent Itemset Mining algorithm, which allows marketing platforms (or other related commercial businesses) to find probable correlations between their products from the purchase records of their customers. A very prominent example of where this is used is the Amazon website (as shown below), where you look at a product and, in the sidebar (or elsewhere on the page), see items which are frequently bought with that particular item. The field of extracting this product-correlation data is called frequent itemset mining.
Simple example of how the algorithm works.
Several approaches to the algorithm have been proposed; in this project, we go with the oldest approach (the Apriori approach). A very simple description is given below to show how the algorithm works.
In the first stage, candidate itemsets are enumerated and support counting is done to find out how many times each itemset occurs among the database's transactions. From CPU profiling of the algorithm, we discovered that this segment of the Apriori algorithm is the most compute-intensive, because it has to make several nested iterative searches through the entire database of items. Therefore, it was ported to the FPGA for acceleration. After this, the resulting vector of itemsets is filtered using the minimum support count, removing the itemsets that occur less frequently in the list of transactions:
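The support-counting and filtering stage can be sketched in plain C++ as follows. This is a simplified software illustration, not the accelerated kernel; the name `count_and_filter` and the integer item encoding are our own assumptions.

```cpp
#include <algorithm>
#include <set>
#include <vector>

using Itemset = std::set<int>;  // items encoded as integers (assumption)

// Sketch of stage 1: count how many transactions contain each candidate
// itemset (its support), then keep only candidates whose support meets
// the minimum support threshold. `count_and_filter` is a hypothetical name.
std::vector<Itemset> count_and_filter(const std::vector<Itemset>& transactions,
                                      const std::vector<Itemset>& candidates,
                                      int min_support) {
    std::vector<Itemset> frequent;
    for (const auto& c : candidates) {
        int support = 0;
        for (const auto& t : transactions) {
            // std::includes tests subset containment on sorted ranges;
            // std::set iterates in sorted order, so this is valid.
            if (std::includes(t.begin(), t.end(), c.begin(), c.end()))
                ++support;
        }
        if (support >= min_support)
            frequent.push_back(c);
    }
    return frequent;
}
```

The doubly nested loop over candidates and transactions is exactly the part that profiling identified as the bottleneck, which motivates offloading it to the FPGA.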
In the second stage, the resulting candidate vector (from stage 1) is pruned to eliminate the candidate itemsets which have at least one subset that is not itself frequent. The pruning stage is described as shown below:
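The pruning test can be sketched as below: a candidate of size k can only be frequent if every one of its (k-1)-subsets is frequent, so any candidate with a missing subset is discarded before its support is ever counted. Again this is our own simplified illustration; `has_infrequent_subset` is a hypothetical helper name.

```cpp
#include <set>
#include <vector>

using Itemset = std::set<int>;  // items encoded as integers (assumption)

// Pruning sketch: return true if some (k-1)-subset of `candidate` is
// absent from the previous level's frequent itemsets, meaning the
// candidate can be safely eliminated.
bool has_infrequent_subset(const Itemset& candidate,
                           const std::vector<Itemset>& frequent_prev) {
    for (int item : candidate) {
        Itemset subset = candidate;
        subset.erase(item);  // the (k-1)-subset obtained by dropping one item
        bool found = false;
        for (const auto& f : frequent_prev)
            if (f == subset) { found = true; break; }
        if (!found)
            return true;  // a subset is infrequent: prune this candidate
    }
    return false;
}
```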
The algorithm continues to iterate through these major steps (support counting, candidate filtering, and pruning) until the itemset vector at one of these stages is empty. The last non-empty support-count vector then represents the frequent itemsets mined by the algorithm.
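Putting the stages together, the overall loop might be sketched as the simplified, in-memory version below. This is an illustration of the Apriori iteration under our own assumptions (integer-encoded items, a hypothetical `apriori` function), not the project's accelerated implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

using Itemset = std::set<int>;  // items encoded as integers (assumption)

// Simplified Apriori sketch: alternate support counting/filtering with
// candidate generation and pruning until no candidates survive.
std::vector<Itemset> apriori(const std::vector<Itemset>& transactions,
                             int min_support) {
    // Initial candidates: every distinct single item in the database.
    std::set<int> items;
    for (const auto& t : transactions) items.insert(t.begin(), t.end());
    std::vector<Itemset> candidates;
    for (int i : items) candidates.push_back({i});

    std::vector<Itemset> result;
    while (!candidates.empty()) {
        // Stage 1: support counting and filtering by minimum support.
        std::vector<Itemset> frequent;
        for (const auto& c : candidates) {
            int support = 0;
            for (const auto& t : transactions)
                if (std::includes(t.begin(), t.end(), c.begin(), c.end()))
                    ++support;
            if (support >= min_support) frequent.push_back(c);
        }
        result.insert(result.end(), frequent.begin(), frequent.end());

        // Stage 2: join pairs of frequent k-itemsets into (k+1)-candidates,
        // pruning any candidate with an infrequent k-subset.
        std::set<Itemset> next;  // std::set deduplicates joined candidates
        for (std::size_t i = 0; i < frequent.size(); ++i)
            for (std::size_t j = i + 1; j < frequent.size(); ++j) {
                Itemset u = frequent[i];
                u.insert(frequent[j].begin(), frequent[j].end());
                if (u.size() != frequent[i].size() + 1) continue;
                bool prune = false;
                for (int item : u) {
                    Itemset sub = u;
                    sub.erase(item);  // a k-subset of the candidate
                    if (std::find(frequent.begin(), frequent.end(), sub)
                            == frequent.end()) { prune = true; break; }
                }
                if (!prune) next.insert(u);
            }
        candidates.assign(next.begin(), next.end());
    }
    return result;
}
```

For example, over the transactions {1,2,3}, {1,2}, {1,3}, {2,3}, {1,2,3} with a minimum support of 3, every single item and every pair is frequent, but the triple {1,2,3} appears in only two transactions and is filtered out.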
A very good video source for understanding the basic idea of the algorithm in a few minutes is https://www.youtube.com/watch?v=Hk1zFOMLTrw