Skip to content

OpenCL and SYCL (Phase 1 §4 — Sub-Track 5)

Parent: C++ and Parallel Computing

Vendor-neutral parallel compute — one programming model for CPU, GPU, and FPGA.

Prerequisites: Sub-Track 3 (CUDA and SIMT). OpenCL maps directly onto CUDA concepts; SYCL then adds modern C++ on top.

Layer mapping: L1 (application — you write kernels), L3 (runtime — OpenCL/SYCL runtime, device drivers).


Why Learn This

CUDA locks you to NVIDIA. HIP gets you AMD. OpenCL and SYCL get you everything — Intel GPUs, AMD GPUs, FPGAs, CPUs, even custom accelerators. As an AI hardware engineer, you'll encounter systems where NVIDIA isn't an option: Intel Gaudi accelerators, Xilinx/AMD FPGAs, embedded SoCs, or multi-vendor cloud instances.

CUDA HIP OpenCL SYCL
Vendor NVIDIA only AMD + NVIDIA Any (Khronos standard) Any (Khronos standard)
Language CUDA C++ HIP C++ OpenCL C (separate) Standard C++
Kernel style Inline __global__ Inline __global__ Separate .cl source Lambda in host code
Maturity 18+ years 8+ years 15+ years 5+ years
AI ecosystem Dominant Growing Limited Growing (Intel)
Best for NVIDIA GPUs AMD/NVIDIA GPUs Portable legacy, FPGAs Modern portable C++

Part 1: OpenCL

1.1 Platform Model

OpenCL organises hardware into a hierarchy:

Platform (vendor driver: Intel, AMD, NVIDIA, Xilinx)
  └── Device (GPU, CPU, FPGA, accelerator)
        └── Compute Unit (CU) — SM on NVIDIA, CU on AMD, PE on FPGA
              └── Processing Element (PE) — individual ALU / thread

Context binds one or more devices together. Command queues send work to a specific device within a context.

Host (CPU)
  ├── Context ─────────── Device 0 (GPU)
  │     │                     └── Command Queue 0
  │     │
  │     └──────────────── Device 1 (FPGA)
  │                           └── Command Queue 1
  └── Buffers, Programs, Kernels (shared within context)

1.2 Execution Model

OpenCL's execution model maps directly onto CUDA concepts:

OpenCL CUDA equivalent Meaning
Work-item Thread Single execution instance
Work-group Block Group of work-items with shared memory
NDRange Grid Global problem space (1D/2D/3D)
get_global_id(0) blockIdx.x * blockDim.x + threadIdx.x Global index
get_local_id(0) threadIdx.x Local index within work-group
get_group_id(0) blockIdx.x Work-group index
barrier(CLK_LOCAL_MEM_FENCE) __syncthreads() Work-group barrier

1.3 Memory Model

OpenCL memory spaces:

┌─────────────────────────────────────────────┐
│ Global Memory (device DRAM / HBM)           │  ← CUDA: global memory
│   Accessible by all work-items              │
│   Slow (~300 cycles)                        │
├─────────────────────────────────────────────┤
│ Constant Memory                              │  ← CUDA: constant memory
│   Read-only, cached                          │
├─────────────────────────────────────────────┤
│ Local Memory (per work-group)               │  ← CUDA: shared memory (__shared__)
│   Fast (~4 cycles), programmer-managed      │
│   Shared within work-group                  │
├─────────────────────────────────────────────┤
│ Private Memory (per work-item)              │  ← CUDA: registers / local
│   Fastest, compiler-allocated               │
└─────────────────────────────────────────────┘

1.4 OpenCL C Kernel

Kernels are written in OpenCL C (a C99 dialect) and compiled at runtime from source strings.

// kernel.cl — vector addition
__kernel void vector_add(
    __global const float* a,
    __global const float* b,
    __global float* c,
    const int n)
{
    int i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

Memory qualifiers: - __global — pointer to device global memory - __local — pointer to work-group shared memory - __constant — pointer to constant memory - __private — (default) per-work-item

1.5 Host Code — The Full Setup

OpenCL host code is verbose — you must explicitly create platform, device, context, queue, buffers, program, and kernel objects. This boilerplate is the price of portability.

#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = i * 2; }

    // ── Step 1: Get platform and device ────────────────────────
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // ── Step 2: Create context and command queue ───────────────
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, 0, NULL);

    // ── Step 3: Create buffers ─────────────────────────────────
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // ── Step 4: Upload data ────────────────────────────────────
    clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, bytes, h_a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, bytes, h_b, 0, NULL, NULL);

    // ── Step 5: Build program from source ──────────────────────
    const char* src = "__kernel void vector_add("
        "__global const float* a, __global const float* b,"
        "__global float* c, const int n) {"
        "  int i = get_global_id(0);"
        "  if (i < n) c[i] = a[i] + b[i];"
        "}";

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-cl-fast-relaxed-math", NULL, NULL);

    // ── Step 6: Create kernel and set arguments ────────────────
    cl_kernel kernel = clCreateKernel(prog, "vector_add", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    clSetKernelArg(kernel, 3, sizeof(int),    &N);

    // ── Step 7: Launch kernel ──────────────────────────────────
    size_t global_size = N;
    size_t local_size  = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    // ── Step 8: Read back results ──────────────────────────────
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);

    printf("c[0] = %f, c[%d] = %f\n", h_c[0], N-1, h_c[N-1]);

    // ── Cleanup ────────────────────────────────────────────────
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_c);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(h_a); free(h_b); free(h_c);
}
# Compile (link against OpenCL ICD loader)
gcc -O2 -o vecadd vecadd.c -lOpenCL

# Run (works on any GPU/CPU with an OpenCL driver)
./vecadd

1.6 Tiled Matrix Multiply with Local Memory

This is the OpenCL equivalent of CUDA shared-memory tiled GEMM:

// matmul.cl — tiled matrix multiply using local memory
#define TILE 16

__kernel void matmul(
    __global const float* A,
    __global const float* B,
    __global float* C,
    const int N)
{
    int row = get_local_id(1);
    int col = get_local_id(0);
    int gRow = get_global_id(1);
    int gCol = get_global_id(0);

    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];

    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Load tiles into local memory
        tileA[row][col] = A[gRow * N + t * TILE + col];
        tileB[row][col] = B[(t * TILE + row) * N + gCol];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Compute partial dot product from local memory
        for (int k = 0; k < TILE; k++)
            sum += tileA[row][k] * tileB[k][col];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[gRow * N + gCol] = sum;
}

Launch with 2D NDRange:

size_t global[2] = {N, N};
size_t local[2]  = {TILE, TILE};   // 16×16 = 256 work-items per group
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

1.7 Event-Based Profiling

// Create queue with profiling enabled
cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, props, NULL);

cl_event event;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, &event);
clWaitForEvents(1, &event);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);

printf("Kernel time: %.3f ms\n", (end - start) / 1e6);

1.8 Device Query

Always query what the device supports before assuming capabilities:

char name[128];
cl_uint cu_count;
cl_ulong global_mem, local_mem;
size_t max_wg;

clGetDeviceInfo(device, CL_DEVICE_NAME,                128, name,       NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,     4, &cu_count,  NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,       8, &global_mem,NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,        8, &local_mem, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,sizeof(max_wg), &max_wg, NULL);

printf("Device: %s\n", name);
printf("CUs: %u, Global mem: %lu MB, Local mem: %lu KB, Max WG: %zu\n",
       cu_count, global_mem / (1024*1024), local_mem / 1024, max_wg);

Part 2: SYCL

2.1 What SYCL Is

SYCL is a modern C++ abstraction built on top of OpenCL (and now other backends). It eliminates OpenCL's boilerplate by expressing kernels as C++ lambdas and managing buffers with RAII accessors.

OpenCL:
  Platform → Device → Context → Queue → Buffer → Program → Kernel → SetArgs → Launch
  (~50 lines of setup)

SYCL:
  queue q;
  q.submit([&](handler& h) { h.parallel_for(..., [=](id<1> i) { ... }); });
  (~5 lines)

SYCL implementations:

Implementation Vendor Backend targets
Intel oneAPI DPC++ Intel Intel GPUs, CPUs, FPGAs, NVIDIA GPUs (via plugin)
AdaptiveCpp (formerly hipSYCL) Open-source AMD GPUs, NVIDIA GPUs, CPUs
ComputeCpp Codeplay ARM Mali, Renesas, custom
triSYCL Xilinx/AMD FPGAs (experimental)

2.2 SYCL Hello World — Vector Addition

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

int main() {
    const int N = 1024;
    std::vector<float> a(N), b(N), c(N);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * 2; }

    {
        // Create buffers that wrap host vectors
        sycl::buffer<float> buf_a(a.data(), N);
        sycl::buffer<float> buf_b(b.data(), N);
        sycl::buffer<float> buf_c(c.data(), N);

        // Select device (GPU preferred, falls back to CPU)
        sycl::queue q(sycl::gpu_selector_v);
        std::cout << "Device: "
                  << q.get_device().get_info<sycl::info::device::name>()
                  << "\n";

        q.submit([&](sycl::handler& h) {
            // Accessors: read from a,b; write to c
            auto A = buf_a.get_access<sycl::access::mode::read>(h);
            auto B = buf_b.get_access<sycl::access::mode::read>(h);
            auto C = buf_c.get_access<sycl::access::mode::write>(h);

            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // buffer destructor → implicit copy back to host vectors

    std::cout << "c[0] = " << c[0] << ", c[" << N-1 << "] = " << c[N-1] << "\n";
}
# Compile with Intel DPC++
icpx -fsycl -o vecadd vecadd.cpp

# Or with AdaptiveCpp (targeting NVIDIA)
acpp -o vecadd vecadd.cpp --acpp-targets="cuda:sm_80"

# Run
./vecadd

Compare to OpenCL: SYCL's buffer/accessor model handles data movement automatically. No explicit clCreateBuffer, clEnqueueWriteBuffer, clSetKernelArg — the SYCL runtime infers dependencies and schedules transfers.

2.3 USM (Unified Shared Memory) — CUDA-Like Pointers in SYCL

Buffers/accessors are safe but unfamiliar to CUDA programmers. USM provides raw pointers:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q(sycl::gpu_selector_v);
    const int N = 1024;

    // Allocate device memory (like cudaMalloc)
    float* d_a = sycl::malloc_device<float>(N, q);
    float* d_b = sycl::malloc_device<float>(N, q);
    float* d_c = sycl::malloc_device<float>(N, q);

    // Host data
    std::vector<float> h_a(N, 1.0f), h_b(N, 2.0f), h_c(N);

    // Copy to device (like cudaMemcpy)
    q.memcpy(d_a, h_a.data(), N * sizeof(float));
    q.memcpy(d_b, h_b.data(), N * sizeof(float));
    q.wait();

    // Launch kernel
    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        d_c[i] = d_a[i] + d_b[i];
    }).wait();

    // Copy back
    q.memcpy(h_c.data(), d_c, N * sizeof(float)).wait();

    // Free
    sycl::free(d_a, q);
    sycl::free(d_b, q);
    sycl::free(d_c, q);
}

USM allocation types:

Type Host access Device access Data movement
malloc_device No Yes Explicit memcpy
malloc_host Yes Yes (over PCIe) Implicit (slow)
malloc_shared Yes Yes Automatic migration

2.4 Local Memory and Work-Group Tiling

SYCL exposes local memory through local_accessor:

q.submit([&](sycl::handler& h) {
    auto A = buf_a.get_access<sycl::access::mode::read>(h);
    auto B = buf_b.get_access<sycl::access::mode::read>(h);
    auto C = buf_c.get_access<sycl::access::mode::write>(h);

    constexpr int TILE = 16;
    sycl::local_accessor<float, 1> tileA(sycl::range<1>(TILE * TILE), h);
    sycl::local_accessor<float, 1> tileB(sycl::range<1>(TILE * TILE), h);

    h.parallel_for(
        sycl::nd_range<2>({N, N}, {TILE, TILE}),
        [=](sycl::nd_item<2> item) {
            int row = item.get_local_id(1);
            int col = item.get_local_id(0);
            int gRow = item.get_global_id(1);
            int gCol = item.get_global_id(0);

            float sum = 0.0f;
            for (int t = 0; t < N / TILE; t++) {
                tileA[row * TILE + col] = A[gRow * N + t * TILE + col];
                tileB[row * TILE + col] = B[(t * TILE + row) * N + gCol];
                item.barrier(sycl::access::fence_space::local_space);

                for (int k = 0; k < TILE; k++)
                    sum += tileA[row * TILE + k] * tileB[k * TILE + col];
                item.barrier(sycl::access::fence_space::local_space);
            }
            C[gRow * N + gCol] = sum;
        }
    );
});

2.5 Sub-Groups (Warp/Wavefront Equivalent)

SYCL sub_group maps to CUDA warps (32) or AMD wavefronts (64):

q.parallel_for(
    sycl::nd_range<1>(N, 256),
    [=](sycl::nd_item<1> item) {
        auto sg = item.get_sub_group();
        int lane = sg.get_local_id();
        int sg_size = sg.get_local_range()[0];   // 32 on NVIDIA, 64 on AMD

        float val = data[item.get_global_id(0)];

        // Warp-level reduction (portable!)
        for (int offset = sg_size / 2; offset > 0; offset /= 2)
            val += sycl::shift_group_left(sg, val, offset);

        if (lane == 0)
            result[sg.get_group_id()] = val;
    }
);

This code runs correctly on NVIDIA (sub_group = 32), AMD (sub_group = 64), and Intel (sub_group = 8/16/32) without modification.

2.6 SYCL for FPGA (Intel oneAPI)

Intel oneAPI DPC++ can compile SYCL kernels to FPGA bitstreams. The same C++ code targets GPU or FPGA — but FPGA-optimised kernels look different (pipes, loop unrolling, banking).

// FPGA-optimised: use pipes for streaming
using PipeA = sycl::ext::intel::pipe<class pA, float, 16>;  // depth 16

// Producer kernel
q.submit([&](sycl::handler& h) {
    h.single_task([=]() {
        for (int i = 0; i < N; i++)
            PipeA::write(input[i]);
    });
});

// Consumer kernel (runs concurrently on FPGA fabric)
q.submit([&](sycl::handler& h) {
    h.single_task([=]() {
        for (int i = 0; i < N; i++)
            output[i] = PipeA::read() * 2.0f;
    });
});

FPGA compilation takes hours (full place-and-route), so use emulation during development:

# Emulation (fast, runs on CPU)
icpx -fsycl -fintelfpga -DFPGA_EMULATOR -o emu emu.cpp

# FPGA hardware compile (30min–2hrs)
icpx -fsycl -fintelfpga -Xshardware -o hw hw.cpp

Part 3: OpenCL vs SYCL Decision Guide

Criterion Choose OpenCL Choose SYCL
Existing C codebase Yes No
Need FPGA (Xilinx) Yes (Vitis) Partial (Intel only)
Modern C++ preferred No Yes
Intel GPU / FPGA Works Best (oneAPI)
AMD GPU Works Works (AdaptiveCpp)
NVIDIA GPU Works Works (DPC++ plugin, AdaptiveCpp)
Runtime kernel compilation Yes (source strings) Optional (online compiler)
Team knows CUDA Harder transition Easier transition (similar feel)
Long-term direction Maintenance mode Actively developed

Practical advice: For new projects targeting multiple vendors, use SYCL (specifically Intel DPC++ or AdaptiveCpp). For legacy code or Xilinx FPGA, OpenCL is still the right choice. For NVIDIA-only or AMD-only, just use CUDA or HIP — they're simpler and faster.


Resources

Resource Type Focus
Khronos OpenCL 3.0 specification Standard Authoritative API reference
Khronos SYCL 2020 specification Standard Authoritative SYCL reference
Intel oneAPI DPC++ documentation Docs SYCL implementation + FPGA
AdaptiveCpp (github.com/AdaptiveCpp) Open-source SYCL on AMD/NVIDIA/CPU
OpenCL Programming Guide — Munshi et al. Textbook Comprehensive OpenCL
Data Parallel C++ — Reinders, Ashbaugh, Brodman Textbook SYCL/DPC++ (free PDF from Intel)
Codeplay developer blog Blog SYCL tutorials, GPU porting guides

Projects

# Project Concepts practiced Complexity
1 OpenCL vector add Platform/device/context/queue setup, kernel compile Beginner
2 SYCL vector add (buffer + USM) Buffer/accessor vs USM, device selection Beginner
3 OpenCL tiled matmul Local memory, barriers, 2D NDRange Intermediate
4 SYCL tiled matmul local_accessor, nd_range, nd_item Intermediate
5 Cross-device benchmark Run same kernel on CPU + GPU, compare throughput Intermediate
6 SYCL sub-group reduction Warp/wavefront-portable primitives Intermediate
7 Port CUDA kernel to SYCL Compare CUDA vs SYCL for same algorithm Intermediate
8 OpenCL device query tool Enumerate all platforms/devices, print capabilities Beginner
9 SYCL FPGA pipe pipeline Intel FPGA, pipe, emulation, hardware compile Advanced
10 Multi-device task graph SYCL event dependencies, split work across GPU+CPU Advanced

Next

→ Back to C++ and Parallel Computing hub

Phase 3 — Neural Networks