OpenCL and SYCL (Phase 1 §4 — Sub-Track 5)¶

Vendor-neutral parallel compute — one programming model for CPU, GPU, and FPGA.

Prerequisites: Sub-Track 3 (CUDA and SIMT). OpenCL maps directly onto CUDA concepts; SYCL then adds modern C++ on top.

Layer mapping: L1 (application — you write kernels), L3 (runtime — OpenCL/SYCL runtime, device drivers).

Why Learn This¶

CUDA locks you to NVIDIA. HIP gets you AMD. OpenCL and SYCL get you everything — Intel GPUs, AMD GPUs, FPGAs, CPUs, even custom accelerators. As an AI hardware engineer, you'll encounter systems where NVIDIA isn't an option: Intel Gaudi accelerators, Xilinx/AMD FPGAs, embedded SoCs, or multi-vendor cloud instances.

	CUDA	HIP	OpenCL	SYCL
Vendor	NVIDIA only	AMD + NVIDIA	Any (Khronos standard)	Any (Khronos standard)
Language	CUDA C++	HIP C++	OpenCL C (separate)	Standard C++
Kernel style	Inline `__global__`	Inline `__global__`	Separate `.cl` source	Lambda in host code
Maturity	18+ years	8+ years	15+ years	5+ years
AI ecosystem	Dominant	Growing	Limited	Growing (Intel)
Best for	NVIDIA GPUs	AMD/NVIDIA GPUs	Portable legacy, FPGAs	Modern portable C++

Part 1: OpenCL¶

1.1 Platform Model¶

OpenCL organises hardware into a hierarchy:

Platform (vendor driver: Intel, AMD, NVIDIA, Xilinx)
  └── Device (GPU, CPU, FPGA, accelerator)
        └── Compute Unit (CU) — SM on NVIDIA, CU on AMD, PE on FPGA
              └── Processing Element (PE) — individual ALU / thread

Context binds one or more devices together. Command queues send work to a specific device within a context.

Host (CPU)
  │
  ├── Context ─────────── Device 0 (GPU)
  │     │                     └── Command Queue 0
  │     │
  │     └──────────────── Device 1 (FPGA)
  │                           └── Command Queue 1
  │
  └── Buffers, Programs, Kernels (shared within context)

1.2 Execution Model¶

OpenCL's execution model maps directly onto CUDA concepts:

OpenCL	CUDA equivalent	Meaning
Work-item	Thread	Single execution instance
Work-group	Block	Group of work-items with shared memory
NDRange	Grid	Global problem space (1D/2D/3D)
`get_global_id(0)`	`blockIdx.x * blockDim.x + threadIdx.x`	Global index
`get_local_id(0)`	`threadIdx.x`	Local index within work-group
`get_group_id(0)`	`blockIdx.x`	Work-group index
`barrier(CLK_LOCAL_MEM_FENCE)`	`__syncthreads()`	Work-group barrier

1.3 Memory Model¶

OpenCL memory spaces:

┌─────────────────────────────────────────────┐
│ Global Memory (device DRAM / HBM)           │  ← CUDA: global memory
│   Accessible by all work-items              │
│   Slow (~300 cycles)                        │
├─────────────────────────────────────────────┤
│ Constant Memory                              │  ← CUDA: constant memory
│   Read-only, cached                          │
├─────────────────────────────────────────────┤
│ Local Memory (per work-group)               │  ← CUDA: shared memory (__shared__)
│   Fast (~4 cycles), programmer-managed      │
│   Shared within work-group                  │
├─────────────────────────────────────────────┤
│ Private Memory (per work-item)              │  ← CUDA: registers / local
│   Fastest, compiler-allocated               │
└─────────────────────────────────────────────┘

1.4 OpenCL C Kernel¶

Kernels are written in OpenCL C (a C99 dialect) and compiled at runtime from source strings.

// kernel.cl — vector addition
__kernel void vector_add(
    __global const float* a,
    __global const float* b,
    __global float* c,
    const int n)
{
    int i = get_global_id(0);
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

Memory qualifiers: - __global — pointer to device global memory - __local — pointer to work-group shared memory - __constant — pointer to constant memory - __private — (default) per-work-item

1.5 Host Code — The Full Setup¶

OpenCL host code is verbose — you must explicitly create platform, device, context, queue, buffers, program, and kernel objects. This boilerplate is the price of portability.

#include <CL/cl.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const int N = 1024;
    size_t bytes = N * sizeof(float);
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = i * 2; }

    // ── Step 1: Get platform and device ────────────────────────
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // ── Step 2: Create context and command queue ───────────────
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, 0, NULL);

    // ── Step 3: Create buffers ─────────────────────────────────
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // ── Step 4: Upload data ────────────────────────────────────
    clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, bytes, h_a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, bytes, h_b, 0, NULL, NULL);

    // ── Step 5: Build program from source ──────────────────────
    const char* src = "__kernel void vector_add("
        "__global const float* a, __global const float* b,"
        "__global float* c, const int n) {"
        "  int i = get_global_id(0);"
        "  if (i < n) c[i] = a[i] + b[i];"
        "}";

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, "-cl-fast-relaxed-math", NULL, NULL);

    // ── Step 6: Create kernel and set arguments ────────────────
    cl_kernel kernel = clCreateKernel(prog, "vector_add", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    clSetKernelArg(kernel, 3, sizeof(int),    &N);

    // ── Step 7: Launch kernel ──────────────────────────────────
    size_t global_size = N;
    size_t local_size  = 256;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);

    // ── Step 8: Read back results ──────────────────────────────
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);

    printf("c[0] = %f, c[%d] = %f\n", h_c[0], N-1, h_c[N-1]);

    // ── Cleanup ────────────────────────────────────────────────
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_c);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    free(h_a); free(h_b); free(h_c);
}

# Compile (link against OpenCL ICD loader)
gcc -O2 -o vecadd vecadd.c -lOpenCL

# Run (works on any GPU/CPU with an OpenCL driver)
./vecadd

1.6 Tiled Matrix Multiply with Local Memory¶

This is the OpenCL equivalent of CUDA shared-memory tiled GEMM:

// matmul.cl — tiled matrix multiply using local memory
#define TILE 16

__kernel void matmul(
    __global const float* A,
    __global const float* B,
    __global float* C,
    const int N)
{
    int row = get_local_id(1);
    int col = get_local_id(0);
    int gRow = get_global_id(1);
    int gCol = get_global_id(0);

    __local float tileA[TILE][TILE];
    __local float tileB[TILE][TILE];

    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Load tiles into local memory
        tileA[row][col] = A[gRow * N + t * TILE + col];
        tileB[row][col] = B[(t * TILE + row) * N + gCol];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Compute partial dot product from local memory
        for (int k = 0; k < TILE; k++)
            sum += tileA[row][k] * tileB[k][col];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[gRow * N + gCol] = sum;
}

Launch with 2D NDRange:

size_t global[2] = {N, N};
size_t local[2]  = {TILE, TILE};   // 16×16 = 256 work-items per group
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);

1.7 Event-Based Profiling¶

// Create queue with profiling enabled
cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, props, NULL);

cl_event event;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, &event);
clWaitForEvents(1, &event);

cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);

printf("Kernel time: %.3f ms\n", (end - start) / 1e6);

1.8 Device Query¶

Always query what the device supports before assuming capabilities:

char name[128];
cl_uint cu_count;
cl_ulong global_mem, local_mem;
size_t max_wg;

clGetDeviceInfo(device, CL_DEVICE_NAME,                128, name,       NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,     4, &cu_count,  NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,       8, &global_mem,NULL);
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,        8, &local_mem, NULL);
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,sizeof(max_wg), &max_wg, NULL);

printf("Device: %s\n", name);
printf("CUs: %u, Global mem: %lu MB, Local mem: %lu KB, Max WG: %zu\n",
       cu_count, global_mem / (1024*1024), local_mem / 1024, max_wg);

Part 2: SYCL¶

2.1 What SYCL Is¶

SYCL is a modern C++ abstraction built on top of OpenCL (and now other backends). It eliminates OpenCL's boilerplate by expressing kernels as C++ lambdas and managing buffers with RAII accessors.

OpenCL:
  Platform → Device → Context → Queue → Buffer → Program → Kernel → SetArgs → Launch
  (~50 lines of setup)

SYCL:
  queue q;
  q.submit([&](handler& h) { h.parallel_for(..., [=](id<1> i) { ... }); });
  (~5 lines)

SYCL implementations:

Implementation	Vendor	Backend targets
Intel oneAPI DPC++	Intel	Intel GPUs, CPUs, FPGAs, NVIDIA GPUs (via plugin)
AdaptiveCpp (formerly hipSYCL)	Open-source	AMD GPUs, NVIDIA GPUs, CPUs
ComputeCpp	Codeplay	ARM Mali, Renesas, custom
triSYCL	Xilinx/AMD	FPGAs (experimental)

2.2 SYCL Hello World — Vector Addition¶

#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>

int main() {
    const int N = 1024;
    std::vector<float> a(N), b(N), c(N);

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = i * 2; }

    {
        // Create buffers that wrap host vectors
        sycl::buffer<float> buf_a(a.data(), N);
        sycl::buffer<float> buf_b(b.data(), N);
        sycl::buffer<float> buf_c(c.data(), N);

        // Select device (GPU preferred, falls back to CPU)
        sycl::queue q(sycl::gpu_selector_v);
        std::cout << "Device: "
                  << q.get_device().get_info<sycl::info::device::name>()
                  << "\n";

        q.submit([&](sycl::handler& h) {
            // Accessors: read from a,b; write to c
            auto A = buf_a.get_access<sycl::access::mode::read>(h);
            auto B = buf_b.get_access<sycl::access::mode::read>(h);
            auto C = buf_c.get_access<sycl::access::mode::write>(h);

            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                C[i] = A[i] + B[i];
            });
        });
    }  // buffer destructor → implicit copy back to host vectors

    std::cout << "c[0] = " << c[0] << ", c[" << N-1 << "] = " << c[N-1] << "\n";
}

# Compile with Intel DPC++
icpx -fsycl -o vecadd vecadd.cpp

# Or with AdaptiveCpp (targeting NVIDIA)
acpp -o vecadd vecadd.cpp --acpp-targets="cuda:sm_80"

# Run
./vecadd

Compare to OpenCL: SYCL's buffer/accessor model handles data movement automatically. No explicit clCreateBuffer, clEnqueueWriteBuffer, clSetKernelArg — the SYCL runtime infers dependencies and schedules transfers.

2.3 USM (Unified Shared Memory) — CUDA-Like Pointers in SYCL¶

Buffers/accessors are safe but unfamiliar to CUDA programmers. USM provides raw pointers:

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q(sycl::gpu_selector_v);
    const int N = 1024;

    // Allocate device memory (like cudaMalloc)
    float* d_a = sycl::malloc_device<float>(N, q);
    float* d_b = sycl::malloc_device<float>(N, q);
    float* d_c = sycl::malloc_device<float>(N, q);

    // Host data
    std::vector<float> h_a(N, 1.0f), h_b(N, 2.0f), h_c(N);

    // Copy to device (like cudaMemcpy)
    q.memcpy(d_a, h_a.data(), N * sizeof(float));
    q.memcpy(d_b, h_b.data(), N * sizeof(float));
    q.wait();

    // Launch kernel
    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        d_c[i] = d_a[i] + d_b[i];
    }).wait();

    // Copy back
    q.memcpy(h_c.data(), d_c, N * sizeof(float)).wait();

    // Free
    sycl::free(d_a, q);
    sycl::free(d_b, q);
    sycl::free(d_c, q);
}

USM allocation types:

Type	Host access	Device access	Data movement
`malloc_device`	No	Yes	Explicit `memcpy`
`malloc_host`	Yes	Yes (over PCIe)	Implicit (slow)
`malloc_shared`	Yes	Yes	Automatic migration

2.4 Local Memory and Work-Group Tiling¶

SYCL exposes local memory through local_accessor:

q.submit([&](sycl::handler& h) {
    auto A = buf_a.get_access<sycl::access::mode::read>(h);
    auto B = buf_b.get_access<sycl::access::mode::read>(h);
    auto C = buf_c.get_access<sycl::access::mode::write>(h);

    constexpr int TILE = 16;
    sycl::local_accessor<float, 1> tileA(sycl::range<1>(TILE * TILE), h);
    sycl::local_accessor<float, 1> tileB(sycl::range<1>(TILE * TILE), h);

    h.parallel_for(
        sycl::nd_range<2>({N, N}, {TILE, TILE}),
        [=](sycl::nd_item<2> item) {
            int row = item.get_local_id(1);
            int col = item.get_local_id(0);
            int gRow = item.get_global_id(1);
            int gCol = item.get_global_id(0);

            float sum = 0.0f;
            for (int t = 0; t < N / TILE; t++) {
                tileA[row * TILE + col] = A[gRow * N + t * TILE + col];
                tileB[row * TILE + col] = B[(t * TILE + row) * N + gCol];
                item.barrier(sycl::access::fence_space::local_space);

                for (int k = 0; k < TILE; k++)
                    sum += tileA[row * TILE + k] * tileB[k * TILE + col];
                item.barrier(sycl::access::fence_space::local_space);
            }
            C[gRow * N + gCol] = sum;
        }
    );
});

2.5 Sub-Groups (Warp/Wavefront Equivalent)¶

SYCL sub_group maps to CUDA warps (32) or AMD wavefronts (64):

q.parallel_for(
    sycl::nd_range<1>(N, 256),
    [=](sycl::nd_item<1> item) {
        auto sg = item.get_sub_group();
        int lane = sg.get_local_id();
        int sg_size = sg.get_local_range()[0];   // 32 on NVIDIA, 64 on AMD

        float val = data[item.get_global_id(0)];

        // Warp-level reduction (portable!)
        for (int offset = sg_size / 2; offset > 0; offset /= 2)
            val += sycl::shift_group_left(sg, val, offset);

        if (lane == 0)
            result[sg.get_group_id()] = val;
    }
);

This code runs correctly on NVIDIA (sub_group = 32), AMD (sub_group = 64), and Intel (sub_group = 8/16/32) without modification.

2.6 SYCL for FPGA (Intel oneAPI)¶

Intel oneAPI DPC++ can compile SYCL kernels to FPGA bitstreams. The same C++ code targets GPU or FPGA — but FPGA-optimised kernels look different (pipes, loop unrolling, banking).

// FPGA-optimised: use pipes for streaming
using PipeA = sycl::ext::intel::pipe<class pA, float, 16>;  // depth 16

// Producer kernel
q.submit([&](sycl::handler& h) {
    h.single_task([=]() {
        for (int i = 0; i < N; i++)
            PipeA::write(input[i]);
    });
});

// Consumer kernel (runs concurrently on FPGA fabric)
q.submit([&](sycl::handler& h) {
    h.single_task([=]() {
        for (int i = 0; i < N; i++)
            output[i] = PipeA::read() * 2.0f;
    });
});

FPGA compilation takes hours (full place-and-route), so use emulation during development:

# Emulation (fast, runs on CPU)
icpx -fsycl -fintelfpga -DFPGA_EMULATOR -o emu emu.cpp

# FPGA hardware compile (30min–2hrs)
icpx -fsycl -fintelfpga -Xshardware -o hw hw.cpp

Part 3: OpenCL vs SYCL Decision Guide¶

Criterion	Choose OpenCL	Choose SYCL
Existing C codebase	Yes	No
Need FPGA (Xilinx)	Yes (Vitis)	Partial (Intel only)
Modern C++ preferred	No	Yes
Intel GPU / FPGA	Works	Best (oneAPI)
AMD GPU	Works	Works (AdaptiveCpp)
NVIDIA GPU	Works	Works (DPC++ plugin, AdaptiveCpp)
Runtime kernel compilation	Yes (source strings)	Optional (online compiler)
Team knows CUDA	Harder transition	Easier transition (similar feel)
Long-term direction	Maintenance mode	Actively developed

Practical advice: For new projects targeting multiple vendors, use SYCL (specifically Intel DPC++ or AdaptiveCpp). For legacy code or Xilinx FPGA, OpenCL is still the right choice. For NVIDIA-only or AMD-only, just use CUDA or HIP — they're simpler and faster.

Resources¶

Resource	Type	Focus
Khronos OpenCL 3.0 specification	Standard	Authoritative API reference
Khronos SYCL 2020 specification	Standard	Authoritative SYCL reference
Intel oneAPI DPC++ documentation	Docs	SYCL implementation + FPGA
AdaptiveCpp (github.com/AdaptiveCpp)	Open-source	SYCL on AMD/NVIDIA/CPU
OpenCL Programming Guide — Munshi et al.	Textbook	Comprehensive OpenCL
Data Parallel C++ — Reinders, Ashbaugh, Brodman	Textbook	SYCL/DPC++ (free PDF from Intel)
Codeplay developer blog	Blog	SYCL tutorials, GPU porting guides

Projects¶

#	Project	Concepts practiced	Complexity
1	OpenCL vector add	Platform/device/context/queue setup, kernel compile	Beginner
2	SYCL vector add (buffer + USM)	Buffer/accessor vs USM, device selection	Beginner
3	OpenCL tiled matmul	Local memory, barriers, 2D NDRange	Intermediate
4	SYCL tiled matmul	local_accessor, nd_range, nd_item	Intermediate
5	Cross-device benchmark	Run same kernel on CPU + GPU, compare throughput	Intermediate
6	SYCL sub-group reduction	Warp/wavefront-portable primitives	Intermediate
7	Port CUDA kernel to SYCL	Compare CUDA vs SYCL for same algorithm	Intermediate
8	OpenCL device query tool	Enumerate all platforms/devices, print capabilities	Beginner
9	SYCL FPGA pipe pipeline	Intel FPGA, pipe, emulation, hardware compile	Advanced
10	Multi-device task graph	SYCL event dependencies, split work across GPU+CPU	Advanced

Next¶

→ Back to C++ and Parallel Computing hub

→ Phase 3 — Neural Networks