
ElementwiseOps: Complete Guide

ElementwiseOps work element-by-element on tensors. Each element is processed independently — perfect for GPU parallelization. There are three subtypes: Unary (1 input), Binary (2 inputs), Ternary (3 inputs).


Overview

UnaryOp:    [a, b, c, d]           → [f(a), f(b), f(c), f(d)]
BinaryOp:   [a,b,c,d] + [e,f,g,h] → [a+e, b+f, c+g, d+h]
TernaryOp:  [T,F,T,F].where([a,b,c,d],[e,f,g,h]) → [a, f, c, h]

Key property: Multiple elementwise ops are automatically fused into a single GPU kernel:

y = ((x + 1) * 2 - 0.5).relu()  # → 1 kernel, not 4
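
To see the laziness and fusion in action, here is a minimal runnable sketch: nothing executes until .realize() (or .numpy()) is called, and the whole chain is scheduled as one elementwise kernel. Running with DEBUG=2 set in the environment should print the kernels tinygrad launches.

from tinygrad import Tensor

x = Tensor([1.0, 2.0, 3.0, 4.0])
y = ((x + 1) * 2 - 0.5).relu()   # nothing executes yet: the graph is built lazily
y.realize()                      # scheduling fuses the whole chain into one elementwise kernel
print(y.numpy())                 # [3.5, 5.5, 7.5, 9.5]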


Part 1: UnaryOps (7 primitives)

UnaryOps apply a function to each element of a single tensor.

Primitive UnaryOps

Op     Code             Math      Example
EXP2   x.exp2()         2^x       [0,1,2] → [1,2,4]
LOG2   x.log2()         log₂(x)   [1,2,4] → [0,1,2]
SQRT   x.sqrt()         √x        [1,4,9] → [1,2,3]
RECIP  x.reciprocal()   1/x       [1,2,4] → [1,0.5,0.25]
NEG    -x               -x        [1,-2] → [-1,2]
SIN    x.sin()          sin(x)    [0,π/2] → [0,1]
CAST   x.cast(dtype)    type(x)   [1.5] → [1] (int)

Code examples

from tinygrad import Tensor, dtypes

x = Tensor([0, 1, 2, 3])
print(x.exp2().numpy())        # [1, 2, 4, 8]
print(x.log2().numpy())        # [-inf, 0, 1, 1.585]; needs x > 0 for finite output

x = Tensor([1, 4, 9, 16], dtype=dtypes.float32)
print(x.sqrt().numpy())        # [1, 2, 3, 4]
print(x.reciprocal().numpy())  # [1, 0.25, 0.111, 0.0625]

x = Tensor([1.5, 2.7, 3.9])
print(x.cast(dtypes.int32).numpy())   # [1, 2, 3]
print(x.cast(dtypes.float16).numpy()) # half precision

Derived UnaryOps (built from primitives)

Op       Code           Built From
EXP      x.exp()        (x * log₂(e)).exp2()
LOG      x.log()        x.log2() * ln(2)
ABS      x.abs()        x.maximum(-x)
RELU     x.relu()       x.maximum(0)
SIGMOID  x.sigmoid()    (1 + (-x).exp()).reciprocal()
TANH     x.tanh()       2 * (2*x).sigmoid() - 1

x = Tensor([-2, -1, 0, 1, 2])
print(x.relu().numpy())     # [0, 0, 0, 1, 2]
print(x.sigmoid().numpy())  # [0.119, 0.268, 0.5, 0.731, 0.881]
print(x.tanh().numpy())     # [-0.964, -0.762, 0, 0.762, 0.964]
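
The compositions in the table can be checked directly. A small sketch, using math.log2(math.e) for the log₂(e) constant:

import math
from tinygrad import Tensor

x = Tensor([-1.0, 0.5, 2.0])

# EXP from EXP2: e^x = 2^(x * log2(e))
print(x.exp().numpy())                         # [0.368, 1.649, 7.389]
print((x * math.log2(math.e)).exp2().numpy())  # same values

# ABS from MAX and NEG: |x| = max(x, -x)
print(x.abs().numpy())                         # [1, 0.5, 2]
print(x.maximum(-x).numpy())                   # same values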

Performance

  • Fast: NEG, CAST, RECIP (simple hardware ops)
  • Medium: SQRT, EXP2, LOG2 (one instruction)
  • Slower: SIN, EXP, TANH (transcendental or built from several ops)

Part 2: BinaryOps (7 primitives)

BinaryOps combine two tensors element-by-element with automatic broadcasting.

Primitive BinaryOps

Op     Code           Math       Example
ADD    a + b          a + b      [1,2] + [3,4] → [4,6]
SUB    a - b          a - b      [5,6] - [1,2] → [4,4]
MUL    a * b          a × b      [2,3] * [4,5] → [8,15]
DIV    a / b          a ÷ b      [10,20] / [2,4] → [5,5]
MOD    a % b          a mod b    [10,11] % [3,3] → [1,2]
MAX    a.maximum(b)   max(a,b)   [1,5] max [4,2] → [4,5]
CMPLT  a < b          a < b      [1,3] < [2,2] → [T,F]

Code examples

from tinygrad import Tensor

a = Tensor([1, 2, 3, 4])
b = Tensor([5, 6, 7, 8])
print((a + b).numpy())          # [6, 8, 10, 12]
print((a * b).numpy())          # [5, 12, 21, 32]
print(a.maximum(b).numpy())     # [5, 6, 7, 8]
print((a < b).numpy())          # [ True  True  True  True] (bool dtype)

Derived BinaryOps

Op   Code           Built From
GT   a > b          b < a
LE   a <= b         ~(a > b)
GE   a >= b         ~(a < b)
EQ   a == b         (a <= b) & (a >= b)
NE   a != b         ~(a == b)
MIN  a.minimum(b)   -((-a).maximum(-b))
POW  a ** b         (a.log() * b).exp()
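
As with the unary table, these identities can be verified element-by-element. A minimal sketch:

from tinygrad import Tensor

a = Tensor([1.0, 5.0, 3.0])
b = Tensor([4.0, 2.0, 3.0])

# MIN from MAX and NEG: min(a,b) = -max(-a,-b)
print(a.minimum(b).numpy())            # [1, 2, 3]
print((-((-a).maximum(-b))).numpy())   # same values

# EQ from the two comparisons
print((a == b).numpy())                # [False, False, True]
print(((a <= b) & (a >= b)).numpy())   # same values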

Broadcasting rules

# Scalar: broadcasts to all elements
[1,2,3] + 10 = [11,12,13]

# Vector: broadcasts along missing dims
a = Tensor([[1,2,3],[4,5,6]])  # (2,3)
b = Tensor([10,20,30])          # (3,) → broadcasts to (2,3)
print((a + b).numpy())
# [[11,22,33],[14,25,36]]

# Two 1-element dims: outer product style
a = Tensor([[1],[2],[3]])  # (3,1)
b = Tensor([10,20,30,40]) # (4,)
print((a + b).numpy())    # (3,4)

Common uses

# ReLU
def relu(x): return x.maximum(0)

# Leaky ReLU
def leaky_relu(x, alpha=0.01): return x.maximum(alpha * x)

# Residual connection
y = x + residual

# Dropout mask
mask = Tensor.rand(*x.shape) > p
out = mask * x / (1 - p)

# MSE loss
loss = ((pred - target) ** 2).mean()

Part 3: TernaryOps (2 primitives)

TernaryOps take three tensor inputs.

WHERE — Conditional Selection

result[i] = condition[i] ? if_true[i] : if_false[i]
from tinygrad import Tensor

condition = Tensor([True, False, True, False])
if_true   = Tensor([1, 2, 3, 4])
if_false  = Tensor([5, 6, 7, 8])

result = condition.where(if_true, if_false)
# Result: [1, 6, 3, 8]

Common patterns:

# ReLU via WHERE
x.relu()  # equivalent to (x > 0).where(x, 0)

# Attention masking (mask is True at positions attention is allowed to see)
scores = mask.where(scores, float('-inf'))

# Clipping
x_clipped = (x < min_val).where(min_val, (x > max_val).where(max_val, x))

# Dropout
mask = (Tensor.rand(*x.shape) > p)
out = mask.where(x / (1-p), 0)

# Chained (piecewise function)
result = (x < -1).where(-1, (x > 1).where(1, x))  # clamp to [-1, 1]

MULACC — Fused Multiply-Add

result[i] = a[i] * b[i] + c[i]  # single hardware instruction
a = Tensor([1, 2, 3])
b = Tensor([4, 5, 6])
c = Tensor([10, 20, 30])
result = a.mulacc(b, c)
# [14, 30, 48]  — faster and more numerically accurate than a*b + c

Uses: polynomial evaluation (Horner's method), linear layers, weighted sums.
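
For instance, Horner's method evaluates a polynomial as a chain of multiply-adds. The helper below is an illustrative sketch, not part of tinygrad; whichever way your version spells the fused op, the result * x + c pattern is exactly the shape MULACC targets.

from tinygrad import Tensor

def horner(x, coeffs):
    # evaluate coeffs[0]*x^n + ... + coeffs[n] with repeated multiply-add
    result = Tensor.full(x.shape, coeffs[0])
    for c in coeffs[1:]:
        result = result * x + c   # the multiply-add pattern MULACC fuses
    return result

x = Tensor([0.0, 1.0, 2.0])
print(horner(x, [1.0, -3.0, 2.0]).numpy())   # x² - 3x + 2 → [2, 0, 0]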


Quick Reference

Activation Functions

relu(x)     = x.maximum(0)
sigmoid(x)  = (1 + (-x).exp()).reciprocal()
tanh(x)     = 2 * (2*x).sigmoid() - 1
swish(x)    = x * x.sigmoid()
gelu(x)     = 0.5*x*(1 + (x * 0.7979 * (1 + 0.044715 * x * x)).tanh())
leaky(x)    = x.maximum(0.01 * x)
hard_sig(x) = (x < -2.5).where(0, (x > 2.5).where(1, 0.2*x + 0.5))

Normalization

z_score  = (x - x.mean()) / x.std()
min_max  = (x - x.min()) / (x.max() - x.min())
layer_norm = (x - x.mean(-1, True)) / (x.var(-1, True) + 1e-5).sqrt()
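
A quick sanity check of the layer_norm one-liner on a 2-D input (a sketch; the 1e-5 epsilon and last-axis statistics match the formula above):

from tinygrad import Tensor

x = Tensor([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
ln = (x - x.mean(-1, keepdim=True)) / (x.var(-1, keepdim=True) + 1e-5).sqrt()
print(ln.numpy())            # each row is centered and rescaled
print(ln.mean(-1).numpy())   # ≈ [0, 0]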

Loss Functions

mse = ((pred - target) ** 2).mean()
mae = (pred - target).abs().mean()
bce = -(target * pred.log() + (1-target) * (1-pred).log()).mean()

Clipping and Masking

x.maximum(min_val).minimum(max_val)   # clip
mask.where(x, 0)                       # apply binary mask
mask.where(scores, float('-inf'))      # attention masking

Performance Tips

Prefer                          Instead of
x * (1/scale)                   x / scale  (MUL is faster than DIV)
Chaining ops, one .realize()    intermediate .realize() calls
a.mulacc(b, c)                  a * b + c  (avoids an intermediate allocation)
x.maximum(0)                    (x > 0).where(x, 0)  (same result, simpler)

All 16 Primitives at a Glance

UnaryOps  (7): EXP2  LOG2  SQRT  RECIP  NEG  SIN  CAST
BinaryOps (7): ADD   SUB   MUL   DIV    MOD  MAX  CMPLT
TernaryOps(2): WHERE  MULACC

These 16 primitives, combined with ReduceOps and MovementOps, build all of deep learning.