e l p a M

2016-09-21

LAB - Understand FPGA OpenCL


Goals

  • Patrice the emulation development environment
  • Understand Task (Single Work-Item) v.s. NDRange
  • Understand Loop Pipeline
  • * Understand Coalescing


Prerequisite


a.      PC requirement:
  • 64Bit CPU / >4G DDR
  • Win7 or Ubuntu 16.04 or other version
b.      About install PC environment
  1.  Obtain Licenses for Quartus , Altera SDK for OpenCL and PR (Partial Reconfiguration), set LM_LICENSE_FILE environment variable to the license
  2.  Install the Quartus II software v. 16.0 standard edition (including Arria10 , Stratix V and Cyclone V device families)
    1. Please install at c:\altera\16.0 (for run script later, otherwise you need modify the .bat in the test project)
  1. Download and install the Altera SDK for OpenCL version 16.0
  1. Compiler Environment
    1. Windows : Install the Visual C++ 2010 Express version of Microsoft Visual Studio with support for C++ fromhttp://www.microsoft.com/visualstudio/eng/downloads.
      • Professional version is OK if you have a license for that.
    2. Linux :
      • Use “apt-get install build-essential” to prepare build environment
  2. If you do NOT have Visual C++ Professional
    1. Open the Programs & Features control panel in Windows 7.  Manually uninstall all Microsoft Visual C++ 2010 programs that have the word “Redistributable” in them, both x64 and x86 versions
    2. Download and install Microsoft Windows SDK Version 7.1 from http://www.microsoft.com/en-us/download/details.aspx?id=8279 . This software allows the compilation of 64-bit executables that are needed for the Altera OpenCL solution
  3. Also, you do NOT need Intel SDK for OpenCL anymore
  4. Before start the Lab day. Please make sure can understand PSG OpenCL emulator mode
 



Introduction

  OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including Intel CPUs, FPGAs and GPUs. Programming OpenCL on FPGA is much different in Programming OpenCL on GPUs. By leveraging loop-pipeline features, Sequential execution is accepted in many cases to get simple performance improvement.  
 This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element, for example – by summing all the elements in the array. Consider this simple CPU-C code, which sums all the elements in an array:
float reduce_cpu_sum(float* input, int length) {
 float accumulator = input[0];
 for(int i = 1; i < length; i++)
   accumulator += input[i];
 return accumulator;
}
This code is completely sequential! There’s no way to parallelize the loop, since every iteration of the loop depends on the iteration before it. How can we loop-pipeline it? How can we parallelize this code?
Hope all attendee can understand NDRange and Task implementation through this hands-on.


Lab1 - Task and Loop-pipelined Version

CPU version

float reduce_cpu_sum(float* input, int length) {
 float accumulator = input[0];
 for(int i = 1; i < length; i++)
   accumulator += input[i];
 return accumulator;
}
 Idea - in the above code, length is undetermined variable so compiler can’t unroll the for-loop correctly. We can divide DATA into const-M parts. And wrap the original for-loop by another for-loop with const-M. Each iteration handle length/const-M data, accumulate them and save it into local memory. After that, we can accumulate the local memory again.
Since compiler knows the const-M in compiler times, Compiler can unroll this wrapper for-loop automatically (or manually).



Advanced Topics - Running on the real device.
To speed up the simulation time, we let DATA_N be a small number 1000. Please increase this number to 1000000 and test the total performance.

Hands-on - implement above algorithm

Advanced Topics -
  • Running on the real device.
  • Compare the performance with Lab1.


Lab2 - Loop-pipelined Version and Memory access patterns


 In the previous code. To speed-up the performance, we unroll the level-1 for-loop in previous code. Since we have 0….M accumulator and global memory read instructions. The memory access pattern is shown below :
The access pattern is interleave and not continuous. It will cause cache penalty.  
To improve cache performance and hit rate. We can change the access pattern from above to below -
Hands-on - implement above algorithm

Advanced Topics -
  • Running on the real device.
  • Compare the performance with Lab1.


Lab3 - NDrange Version (Optional)


At the OpenCL work-group level. We’ll take advantage of associativity to break the vector into small chunks, each of which we’ll build independent reduction trees for each chunk, and execute them independently, in parallel. We’ll make sure each of the chunks is small enough that it fits in local memory, and then we’ll assign one work-item per element.
At each stage of the reduction tree, we’ll be loading and storing partial reductions as we compute, so it’s crucial to use local memory to communicate between work-items in the work group. We’ll then execute the reduction tree by using a for loop in conjunction with OpenCL barriers. For example, see the following figure, which performs a min reduction to find the smallest element in a vector:




__kernel
void reduce(
           __global float* buffer,
           __local float* scratch,
           __const int length,
           __global float* result) {

 int global_index = get_global_id(0);
 int local_index = get_local_id(0);
 // Load data into local memory
 if (global_index < length) {
   scratch[local_index] = buffer[global_index];
 } else {
   // Infinity is the identity element for the min operation
   scratch[local_index] = INFINITY;
 }
 barrier(CLK_LOCAL_MEM_FENCE);
 for(int offset = get_local_size(0) / 2;
     offset > 0;
     offset >>= 1) {
   if (local_index < offset) {
     float other = scratch[local_index + offset];
     float mine = scratch[local_index];
     scratch[local_index] = (mine < other) ? mine : other;
   }
   barrier(CLK_LOCAL_MEM_FENCE);
 }

 if (local_index == 0) {
   result[get_group_id(0)] = scratch[0];
 }
}



LAB - modify host code (opencl_main.cpp) and execute above code correctly.



Hint for LABs


  1. Naive, serial-based and task based solution is included in the lab-material. Please modify reduce.cl, follow the document and implement your version

  2. When attendee use Linux environment, Makefile is ready. And attende can use “make run_emulator” to compile, execute and monitor the result.
  3. When attendee modify the .cl file. All .cpp and .c file don’t required to recompile


Reference


2011-02-04

notices for using llvm bitcode

If someone use ANTLR3 and LLVM 2.7 to construct a compiler, I believe some of my notices/hints written here will be very very useful.


  1. 1. function's parameter is read only. you should use "alloca" to create a "local variable" to make function parameter writable.
  2. each "basic block" should have a "terminate" instruction. -> br and ret
  3. each "function module" should have only one "ret" instruction. only one!!!!!!!!!!
  4. here's my solution: create a "entry block" and "return block". puts every "alloca" to "entry block" and create a "alloca" for return variable. Puts "ret" instruction in "return block". If other statement need to return different return value, I just store "return variable" and "br" to "return block".
  5. ANTLR3's tree walker is great to write "multiple pass" symbol table constructor. But it's a little inconvenient for LLVM IR generator.
  6.  EX) conditional block, if() :I need to use IR Builder to create a conditional jump before it's successor not created.  When antlr3 walk to this point, my program still don't know how many elsif and else success behind me.
    My Solution:Create your own IR to present statement structure. (only statement)


LLVM is ssa system.

  1.  But it's impossible to let all of your expression be ssa form (just my experience). 
  2. I just using alloca to present cross-statement variable, store these variable to "stack" instead SSA is much good choice for me. 
  3. If you want to fully utilize SSA's benefit, you should write your own IR to present an expression. Using ANTLR3's grammar tree, you don't need to worry about your IR's operator priority. Just be careful for assignment's LHF and RHS.

2011-02-03

symbol mangling (compiler)

in the past year (2010), I'm working for a "action script" like language compiler.
Integrate with LLVM and ANTLR3, It's really easy to come out a "worked and lots of buggies" compiler. My target code is not just execute on a hardware, and it's not designed for general purpose. According to above situation, design a GOOD enough environment to verify my works.

1. there's no entry point for these program. (main function). I need to design another way to invoke my function. Either integrate with "C/C++" or just using cheat mangling to create a main function.
2. Because our platform target is very specially, the function signatures are different between language level and "llvm bitcode" level. (for other purpose, will append function parameter and change parameter types). If I want to using C/C++ to call my llvm bitcode, I need to get correct script name mangling and invoke function with correct parameters.

........mmmmm just mumbling. After read NameMangling: http://en.wikipedia.org/wiki/Name_mangling#Complex_example .

I think i need to change my old/stupid mangling design.

2011-01-22

Hint for gitorious

2011, Jan.

If you want to install/construct your own gitorious server on your ubuntu 10.10LTS.
Please remember do not use ruby 1.9.1

There's too many compatiable issue need to be fixed.
If you just a newbie with RoR. just use 1.8, don't challenge it.

2009-12-12

GObject - First Vala Programming


svn checkout http://gobject-learing.googlecode.com/svn/trunk/ gobject-learing-read-only


為達到最佳效果,請使用以上指令下載範例碼並與以下投影片一起服用。

Coppy from wiki:Vala "Vala is a programming language that was created with the goal of bringing modern language features to C, without additional runtime requirements and with little overhead, by targeting the GObject object system. It was developed by Jürg Billeter and Raffaele Sandrini. The syntax borrows heavily from C#. Rather than being compiled directly to assembler or to an intermediate language, Vala is compiled to C which is then compiled with the platform's standard C compiler."

According above description, any program written by Vala could translate to standard C code. So I add this topic between GObject tutorial. Because it really save our time for development.

I rewrote the sample previous GObject example by using Vala. You will find those boring redundant code are no longer need to self-written.


/* 6-1 - maman-bar.vala */
public class MamanBar : Object {
public int a;
public int b;

public MamanBar(){
a = 1; b = 2;
}
public void dumpall() {
stdout.printf ("a = %d b = %d\n",a,b);
}

}



#> valac -o maman-bar-vala maman-bar.vala main.vala
#> ./maman-bar-vala
a = 1 b = 2


I would like to demo how to connect vala and c together. We use the same C code from previous sample.

int main (int argc, char *argv[])
{
/* this function should be executed first. Before everything */
g_type_init();

/* Create our object */
MamanBar *bar = maman_bar_new();
maman_bar_dumpall( bar );

return 0;
}

The only difference are compile commands.

#> valac -C -H maman-bar.h maman-bar.vala
#> cc `pkg-config --cflags glib-2.0 gtk+-2.0` -c -o main.o main.c
#> cc `pkg-config --cflags glib-2.0 gtk+-2.0` -c -o maman-bar.o maman-bar.c
#> cc -o maman-bar-c main.o maman-bar.o `pkg-config --libs glib-2.0 gtk+-2.0`
#> ./maman-bar-c
a = 1 b = 2

2009-12-11

GObject - Virtual Function


svn checkout http://gobject-learing.googlecode.com/svn/trunk/ gobject-learing-read-only


為達到最佳效果,請使用以上指令下載範例碼並與以下投影片一起服用。





= Member Function =
: the functions which have a self parameter

If you ever write python code to implement about CLASS. The class member function may look like bellow:


def f(self):
return 'hello world'


You can find out the first parameter always be class itself. And the function will use 'self' to access the member value. The same principle we use in GObject programming.
We always use thiz (instead this, because C++ syntax) for first parameter in each member functions. Whether public function or private function.

= Public Member Function =
: What's different between "Just Function" and "Public Member Function"?

Same as C++, in GObject programming, all the functions define in the class structure header with first self parameter are public member functions.
In example 5-1 we use a public member function to replace what we done in main.c, dump all member value.


void maman_bar_dump_all_value( MamanBar* thiz )
{
printf(" bar->a = (%d) bar->b = (%d)\n", thiz->a, thiz->b );
}


= Private Member Function =
: just simple rule, static functions (with first self parameter) are private

It's too simple. I believe we don't need to explain this idea.


= Member Function with Virtual Feature =




/* 5-2 */
/* maman-bar.h */
struct _MamanBarClass {
GObjectClass parent;
/* class members */

void (* incAll) ( MamanBar* );
};

/* maman-bar.c */
static void maman_bar_class_init ( MamanBarClass* class)
{
class->incAll = maman_bar_inc_all;
}


In this step, we need to explain what's different between MamanBar and MamanBarClass. We use g_object_new to allocate a memory size from heap, and g_object_new invoke function maman_bar_init to initialize this memory region. But different from MamanBar, MamanBarClass should only exists one copy during one time period.


maman_bar_dump_all_value( bar );
MAMAN_BAR_GET_CLASS( bar ) -> incAll(bar);
maman_bar_dump_all_value( bar );

In above sample code, you can see we use MAMAN_BAR_GET_CLASS to access CLASS structure, and use it to invoke the class member functions.
The whole idea in virtual function is child class will use it's member function replace the parent class's member function pointer.

= Virtual Function =

Base on example 4-2. We implement sub-bar class and replace the parent's class function pointer.


/* 5-3 */
/* sub-bar.c */
static void sub_bar_class_init ( SubBarClass* class)
{
MamanBarClass* parent_class = class;
parent_class->incAll = sub_bar_inc_all;
}


After constructor invoke (base_init and class_init ...), we can invoke child function by using parent's fp. It's simple and easy to use.


/* main.c */
SubBar *subbar = g_object_new (SUB_BAR_TYPE, NULL);
sub_bar_dump_all_value( subbar );
MAMAN_BAR_GET_CLASS( subbar ) -> incAll( subbar );
sub_bar_dump_all_value( subbar );

2009-11-30

Gobject - Inherit (Chinese)

svn checkout http://gobject-learing.googlecode.com/svn/trunk/ gobject-learing-read-only 

Of course, you can download the source by using above command.
I found there's some different between GLIB v2.0 and v2.2.
So I rewrote the example code. and separated the "STEPS" more detail.

I believe it will be helpfull for someone(?) who wanna study GObject in dummy.


* BTW. i give up Google Doc in this time. because it can't not upload odf file. and i don't want to twit the slide every time :( .