Interface Specifications
Introduction
This document specifies the standardized interface that hardware vendors must implement to integrate with the Homomorphic Encryption Abstraction Layer (HEAL). HEAL abstracts the complexity of fully homomorphic encryption (FHE) computations, enabling efficient and scalable implementations across diverse hardware architectures.
The interface defined in this document includes essential functions categorized into:
Memory Management Functions: Operations responsible for allocation, initialization, and efficient transfer of tensor data between host (CPU) and hardware devices.
Shape Manipulation Functions: Provides operations to change the shape, layout, or dimension arrangement of tensors without copying data, enabling flexible transformations for downstream computations.
Tensor Value Assignments: Provides utility functions to assign constant values to all elements of a tensor without changing its shape or memory layout.
Arithmetic Operations (Modular Arithmetic): Essential modular arithmetic computations required in FHE workflows.
Modular Arithmetic Axis Operations: Performs modular arithmetic computations across a specified tensor axis, combining elements using summation or product-reduction patterns.
NTT Transforms: Implements forward and inverse Number-Theoretic Transforms (NTT/INTT) for efficient polynomial operations in the encrypted domain.
Other Compute Operations: Additional tensor computations and transformations essential to specialized FHE processes.
The core data structure managed through this interface is the tensor, a multi-dimensional array representing polynomial coefficients and associated metadata for homomorphic computations. A detailed explanation of tensors, supported data types, shapes, memory management strategies, and data flow considerations can be found in the Memory Management and Data Structures documentation.
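For orientation, the sketch below shows the kind of metadata such a tensor carries (a data pointer plus dims, strides, and an offset for views). The field names are illustrative only; the authoritative layout is defined in the Memory Management and Data Structures documentation.
#include <cstdint>
#include <vector>

// Illustrative sketch only - the real DeviceTensor<T> layout is defined in the
// Memory Management and Data Structures documentation.
template <typename T>
struct DeviceTensorSketch {
    T* data;                       // device buffer holding the elements
    std::vector<int64_t> dims;     // shape, e.g. {l, m, r, k}
    std::vector<int64_t> strides;  // per-dimension element strides
    int64_t offset;                // element offset into the buffer (used by zero-copy views)
};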
📙Memory Management Functions
This chapter defines the interface for memory management operations within the HEAL framework. These functions enable efficient allocation, initialization, and transfer of tensor data between host (CPU) memory and hardware memory. They ensure that data is correctly formatted, aligned, and accessible for hardware execution, serving as the foundation for all subsequent FHE computations.
Memory management functions include:
1. zeros - Allocates zero-initialized memory for a tensor
2. empty - Allocates uninitialized memory for a tensor
3. host_to_device - Transfers tensor data from host to device memory
4. device_to_host - Transfers tensor data from device to host memory
5. contiguous - Ensures a tensor has a contiguous memory layout; makes a copy if needed.
📑zeros
zeros
Introduced in v0.1.0
Renamed in v1.0.0 (formerly allocate_on_hardware)
The function allocates memory on the device for a tensor with the specified shape and initializes every element to zero.
Zero-initialization makes the tensor safe to use directly as an accumulator or output buffer. When the contents will be overwritten immediately, use empty instead to avoid the initialization overhead.
🧩Call Format
device_tensor = zeros<T>(dims);
T: Scalar data type of the tensor elements (e.g., int32, int64, float32, float64, complex64, complex128)
dims: A list of dimensions representing the desired shape of the tensor.
device_tensor: A smart pointer to a newly allocated tensor on the device with zero-initialized memory.
📥 Input
dims (std::vector<int64_t>): Tensor shape - list of dimension sizes. Assumes values are valid and > 0.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): A new tensor object on the device with zero-initialized memory and associated metadata.
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};
// Allocate a zero-initialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = zeros<float>(dims);
⚠️ Error Messages
The function assumes:
The shape is valid (e.g., no negative dimensions).
Memory allocation on the device succeeds.
It performs no internal validation or exception handling.
📝 Changelog
v1.0.0: Renamed from allocate_on_hardware to zeros
v0.1.0: Initial version
📑empty
empty
Since: v1.0.0
The function allocates memory on the device for a tensor with the specified shape. It does not initialize the contents of the allocated memory.
This function is useful when the memory is going to be immediately overwritten by subsequent operations, allowing for faster allocation without the overhead of zero-initialization.
Allocates memory on the device but does not initialize values.
Computes default row-major strides.
Returns a valid DeviceTensor<T> that can be used as the output of other operations.
The total number of elements is computed as the product of the dimensions in dims.
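As a point of reference, default row-major strides can be derived from the shape as in the host-side sketch below (illustrative only; the helper name is not part of the HEAL interface).
#include <cstdint>
#include <vector>

// Host-side illustration: default row-major strides for a given shape.
std::vector<int64_t> row_major_strides(const std::vector<int64_t>& dims) {
    std::vector<int64_t> strides(dims.size(), 1);
    for (int64_t i = static_cast<int64_t>(dims.size()) - 2; i >= 0; --i) {
        strides[i] = strides[i + 1] * dims[i + 1];
    }
    return strides;  // e.g. dims {8, 8} -> strides {8, 1}
}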
❗ Error Conditions
If memory allocation fails (malloc returns nullptr), the implementation should raise an error or return nullptr.
Negative or invalid dimension values may cause incorrect behavior and should be guarded by the caller.
🧩Call Format
device_tensor = empty<T>(dims);
T: Scalar data type of the tensor elements
dims: Shape of the tensor (each dimension must be ≥ 0)
device_tensor: A smart pointer to a newly allocated tensor on the device with uninitialized memory.
📥 Input
dims (std::vector<int64_t>): Tensor shape - list of dimension sizes. Assumes values are valid and > 0.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): Newly allocated device tensor with the given shape and inferred strides. Contents are uninitialized.
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};
// Allocate an uninitialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = empty<float>(dims);
📝 Changelog
v1.0.0: Initial version
📑host_to_device
host_to_device
Since: v0.1.0
Transfers data from a host-side tensor (e.g., PyTorch, NumPy) to a newly allocated device tensor suitable for computation on accelerator hardware.
🧩 Call Format
device_tensor = host_to_device<T>(host_tensor);
T: Scalar data type (e.g., int32_t, float, etc.)
host_tensor: A tensor in host memory (e.g., PyTorch, NumPy) with scalar type T.
device_tensor: A smart pointer to a device-side representation of the tensor.
📥 Input Parameters
host_tensor (TensorLike, e.g. PyTorch): Host-side tensor with shape and data accessible for transfer. The data type must match the template type T.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): Device-allocated tensor containing data copied from the host.
// Create a PyTorch (illustration only) tensor with int32 data on the host (CPU)
torch::Tensor host_tensor = torch::tensor({1, 2, 3}, torch::kInt32);
// Transfer the tensor to device memory using the HEAL interface
auto device_tensor = host_to_device<int32_t>(host_tensor);
⚠️ Error Messages
The function does not currently include explicit error handling for mismatches or null inputs. It assumes:
The host tensor has correct and accessible data for the scalar type T.
Allocation on the device succeeds.
📝 Changelog
v0.1.0 - Initial release.
📑 device_to_host
device_to_host
Since: v0.1.0
Transfers data from a device-side tensor to a host-side tensor, facilitating the retrieval of computation results from accelerator hardware to the host environment.
🧩 Call Format
host_tensor = device_to_host<T>(device_tensor);
T: Scalar data type (e.g., int32_t, float)
device_tensor: A std::shared_ptr<DeviceTensor<T>> residing on the device
host_tensor: A tensor in host memory containing the copied data
📥 Input Parameters
device_tensor (std::shared_ptr<DeviceTensor<T>>): Device-side tensor to be copied to host memory. The data type must match the template type T.
📤 Output
host_tensor (TensorLike): Host-side tensor containing data copied from the device.
// Assume this device tensor was created using host_to_device earlier
std::shared_ptr<DeviceTensor<int32_t>> device_tensor = ...;
// Transfer the tensor back to host memory as a PyTorch (implementation example) tensor
torch::Tensor host_tensor = device_to_host<int32_t>(device_tensor);
📝 Changelog
v0.1.0 - Initial release.
📑contiguous
contiguous
- Introduced in v0.1.0
- Renamed in v1.0.0 (formerly make_contiguous)
The function ensures that a tensor has a standard, contiguous memory layout. If the tensor is already contiguous, it returns immediately. If not, it creates a new memory buffer, copies the elements into contiguous layout, updates strides, and modifies the tensor in-place.
🧩Call Format
contiguous<T>(tensor);
T: Scalar data type (int32_t, int64_t, float, double)
Tensor tensor is modified in-place if needed.
📥📤 Parameters
tensor (std::shared_ptr<DeviceTensor<T>>, input/output): Input tensor to be made contiguous if necessary.
// Non-contiguous tensor example: result of transpose
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 4}, torch::kInt32).transpose(0, 1));
// Make contiguous (copying into new memory if needed)
contiguous<int32_t>(a);
// After the call, 'a' now has standard contiguous memory layout
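A tensor is contiguous when its strides equal the default row-major strides for its shape. A host-side sketch of that check (illustrative only; not part of the HEAL interface):
#include <cstdint>
#include <vector>

// Host-side illustration: contiguity means each stride equals the product of
// the dimensions to its right (size-1 axes are ignored, since any stride works there).
bool is_contiguous(const std::vector<int64_t>& dims, const std::vector<int64_t>& strides) {
    int64_t expected = 1;
    for (int64_t i = static_cast<int64_t>(dims.size()) - 1; i >= 0; --i) {
        if (dims[i] != 1 && strides[i] != expected) return false;
        expected *= dims[i];
    }
    return true;
}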
📝 Changelog
v1.0.0: Renamed from make_contiguous to contiguous
v0.1.0: Initial version
📔Tensor Value Assignments
Functions in this chapter assign values to tensor elements: setting every element to a constant (in-place, without changing shape or layout), or padding an axis with zeros (which expands the shape).
6. pad_single_axis - Appends zeros at the end of a specific axis, expanding the shape.
7. set_const_val - Sets all elements of a tensor to a constant value; in-place, no allocation.
📑pad_single_axis
pad_single_axis
Since: v1.0.0
🧩 Call Format
pad_single_axis<T>(a, pad, axis, result);
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, any shape, input, v1.0.0): Input tensor to pad.
pad (int64_t, input, v1.0.0): Number of zeros to append (must be ≥ 0).
axis (int64_t, input, v1.0.0): Axis index along which to pad (negative values allowed, e.g., -1 = last axis).
result (std::shared_ptr<DeviceTensor<T>>, padded shape, output, v1.0.0): Output tensor; same as a but with the padded axis expanded by pad.
Logic
Expands the dimension at axis by pad elements.
Copies existing values from a into result.
Fills padded positions with zero (0 of type T).
Supports negative axis indices (-1 = last, -2 = second-to-last, etc.).
❗ Throws std::invalid_argument if:
pad < 0.
Axis is out of bounds.
Input and result ranks mismatch.
Result shape does not match the expected padded dimensions.
// a: shape [2, 3]
auto a = host_to_device<int64_t>(torch::zeros({2, 3}, torch::kInt64));
// pad: 2 on axis 1 → result shape: [2, 5]
auto result = empty<int64_t>({2, 5});
pad_single_axis<int64_t>(a, 2, 1, result);
📝 Changelog
v1.0.0 - Initial release.
📑set_const_val
set_const_val
Sets all elements of a tensor to a given constant value. This is a utility function often used to initialize intermediate tensors or reset memory before computation.
Since: v1.0.0
🧩 Call Format
set_const_val<T>(tensor, val);
T: Scalar data type (int32_t, int64_t)
Tensor: std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
tensor (std::shared_ptr<DeviceTensor<T>>, input/output): Tensor to overwrite. Modified in-place.
val (T, input): Scalar value to assign to all elements.
Logic
Iterates over all elements of the tensor, replacing each with val.
Supports tensors of any rank, including scalar (0D) tensors.
Leaves the tensor shape unchanged.
Throws:
std::invalid_argument if the input tensor is null.
auto hw_tensor = host_to_device<int32_t>(torch::randint(0, 100, {10}, torch::kInt32));
// Set every element in the tensor to zero.
set_const_val<int32_t>(hw_tensor, 0);
📝 Changelog
v1.0.0 - Initial release.
📘Arithmetic Operations (Modular Arithmetic)
Functions in this chapter perform element-wise or structured computations such as modular addition and modular multiplication.
All modular arithmetic functions in HEAL accept a modulus parameter p, which can be:
A scalar (the same modulus applied to all elements), or
A 1D tensor of shape [k], where k matches the size of the result tensor's last dimension.
All results are reduced modulo p, and the outcome always lies in the range [0, p), even if intermediate values (e.g., inputs or intermediate sums/products) are negative: (-1 % 5) → 4; (-7 % 5) → 3.
This ensures correctness and consistency across all platforms and encryption schemes.
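A scalar sketch of this reduction rule (illustrative only; C++'s built-in % operator keeps the sign of the dividend, so negative values need the extra adjustment):
#include <cstdint>

// Reference reduction into [0, p) for a possibly negative value.
int64_t mod_reduce(int64_t a, int64_t p) {
    int64_t r = a % p;             // may be negative when a is negative
    return (r < 0) ? r + p : r;    // shift into [0, p)
}
// mod_reduce(-1, 5) == 4, mod_reduce(-7, 5) == 3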
✖️ Modular Multiplication Functions (modmul)
This section defines the modular multiplication functions supported by the HEAL interface. These functions compute element-wise (a * b) % p
using different combinations of tensor and scalar inputs.
ttt: all inputs are tensors
ttc: modulus is a scalar
tct: multiplier b is a scalar
tcc: both multiplier and modulus are scalars
The result is stored in a pre-allocated output tensor, which must match the expected broadcasted shape of inputs a and b.
8. modmul_ttt - Modular multiplication (tensor-tensor-tensor)
9. modmul_ttc - Modular multiplication (tensor-tensor-constant)
10. modmul_tct - Modular multiplication (tensor-constant-tensor)
11. modmul_tcc - Modular multiplication (tensor-constant-constant)
Since: v0.1.0
🧩 Call Format
// tensor * tensor % tensor
modmul_ttt<T>(a, b, p, result);
// tensor * tensor % constant
modmul_ttc<T>(a, b, p_scalar, result);
// tensor * constant % tensor
modmul_tct<T>(a, b_scalar, p, result);
// tensor * constant % constant
modmul_tcc<T>(a, b_scalar, p_scalar, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b, p: Shared pointers to DeviceTensor<T> objects
p_scalar, b_scalar: Scalar values of type T
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥 Parameters by Function Variant
modmul_ttt: a = DeviceTensor<T>, b = DeviceTensor<T>, p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modmul_ttc: a = DeviceTensor<T>, b = DeviceTensor<T>, p = T (scalar), result = DeviceTensor<T> (pre-allocated)
modmul_tct: a = DeviceTensor<T>, b = T (scalar), p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modmul_tcc: a = DeviceTensor<T>, b = T (scalar), p = T (scalar), result = DeviceTensor<T> (pre-allocated)
The result tensor must be pre-allocated and have a shape compatible with the broadcasted inputs.
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}));
auto result = zeros<int32_t>({3});
modmul_ttt<int32_t>(a, b, p, result);
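For the tensors in this example, result would hold {4, 3, 4}. As a functional reference for the element-wise contract of modmul_ttt on flat, same-shaped inputs (no broadcasting; the device-side algorithm, e.g. Barrett reduction, is vendor-specific), a host-side sketch:
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference: result[i] = (a[i] * b[i]) % p[i], inputs assumed in [0, p).
// __int128 (a GCC/Clang extension) keeps the intermediate product exact.
void modmul_ttt_reference(const std::vector<int64_t>& a,
                          const std::vector<int64_t>& b,
                          const std::vector<int64_t>& p,
                          std::vector<int64_t>& result) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        __int128 prod = static_cast<__int128>(a[i]) * b[i];
        result[i] = static_cast<int64_t>(prod % p[i]);
    }
}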
📝 Changelog
v0.1.0 - Initial release.
➕ Modular Addition Functions (modsum)
This section defines the modular addition functions supported by the HEAL interface. These functions compute element-wise modular addition: result[i] = (a[i] + b[i]) % p[i]
The input can consist of tensors or scalars, and broadcasting is supported. The result is stored in a pre-allocated output tensor that must be shape-compatible with the broadcasted inputs.
12. modsum_ttt - Modular summation (tensor-tensor-tensor)
13. modsum_ttc - Modular summation (tensor-tensor-constant)
14. modsum_tct - Modular summation (tensor-constant-tensor)
15. modsum_tcc - Modular summation (tensor-constant-constant)
Since: v0.1.0
🧩 Call Format
// (tensor + tensor) % tensor
modsum_ttt<T>(a, b, p, result);
// (tensor + tensor) % constant
modsum_ttc<T>(a, b, p_scalar, result);
// (tensor + constant) % tensor
modsum_tct<T>(a, b_scalar, p, result);
// (tensor + constant) % constant
modsum_tcc<T>(a, b_scalar, p_scalar, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b, p: Shared pointers to DeviceTensor<T> objects
p_scalar, b_scalar: Scalar values of type T
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥 Input Parameters by Function Variant
modsum_ttt: a = DeviceTensor<T>, b = DeviceTensor<T>, p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modsum_ttc: a = DeviceTensor<T>, b = DeviceTensor<T>, p = T (scalar), result = DeviceTensor<T> (pre-allocated)
modsum_tct: a = DeviceTensor<T>, b = T (scalar), p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modsum_tcc: a = DeviceTensor<T>, b = T (scalar), p = T (scalar), result = DeviceTensor<T> (pre-allocated)
All tensors must be pre-allocated and reside in device memory
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}));
auto result = zeros<int32_t>({3});
modsum_ttt<int32_t>(a, b, p, result);
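// result now holds {(1 + 4) % 7, (2 + 5) % 7, (3 + 6) % 7} = {5, 0, 2}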
📝 Changelog
v0.1.0 - Initial release.
% Modular Remainder Functions (mod)
Since: v1.0.0
These functions compute element-wise a % b using different combinations of tensor and scalar inputs.
The result is stored in a pre-allocated output tensor, which must match the shape of the input tensor(s).
16. mod_tt - tensor % tensor
17. mod_tc - tensor % scalar
18. mod_ct - scalar % tensor
🧩 Call Format
// tensor % tensor
mod_tt<T>(a, b, result);
// tensor % scalar
mod_tc<T>(a, b_scalar, result);
// scalar % tensor
mod_ct<T>(a_scalar, b, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b: std::shared_ptr<DeviceTensor<T>>
a_scalar, b_scalar: int64_t
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥📤 Parameters
mod_tt: a = DeviceTensor<T>, b = DeviceTensor<T>, result = DeviceTensor<T>
mod_tc: a = DeviceTensor<T>, b_scalar = int64_t, result = DeviceTensor<T>
mod_ct: a_scalar = int64_t, b = DeviceTensor<T>, result = DeviceTensor<T>
The result tensor must be pre-allocated and have a shape compatible with a and/or b.
▶️ Example Usage
auto a = host_to_device<int64_t>(torch::tensor({5, 10, 15}));
auto b = host_to_device<int64_t>(torch::tensor({3, 4, 5}));
auto result = empty<int64_t>({3});
mod_tt<int64_t>(a, b, result);
// Or tensor % scalar
mod_tc<int64_t>(a, 7, result);
// Or scalar % tensor
mod_ct<int64_t>(9, b, result);
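// Expected results: mod_tt → {2, 2, 0}, mod_tc(a, 7) → {5, 3, 1}, mod_ct(9, b) → {0, 1, 4}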
📝 Changelog
v1.0.0: Initial version of the mod_tt, mod_tc, and mod_ct functions.
➖ Modular Negation Functions (modneg)
Since: v1.0.0
Performs modular negation, computing:
(-a) % p → (-(a % p) + p) % p
This ensures the result is non-negative and lies in the range [0, p). For example, with a = 3 and p = 7, the result is (-(3 % 7) + 7) % 7 = 4.
19. modneg_tt - (-tensor) % tensor
20. modneg_tc - (-tensor) % scalar
🧩 Call Formats
// (-tensor) % tensor
modneg_tt<T>(a, p, result);
// (-tensor) % scalar
modneg_tc<T>(a, p_scalar, result);
📥📤 Parameters
a (DeviceTensor<T>, input): Input tensor to be negated.
p (DeviceTensor<T>, input, modneg_tt): Modulus tensor; broadcastable to a, applied element-wise.
p_scalar (T, input, modneg_tc): Scalar modulus.
result (DeviceTensor<T>, output): Pre-allocated output tensor.
Shapes must be broadcast-compatible. If not, the function throws std::invalid_argument.
▶️ Example Usage
modneg_tt<int64_t>(a_tensor, p_tensor, result);
modneg_tc<int64_t>(a_tensor, 7, result);
📝 Changelog
v1.0.0 - Initial release of modneg_tt (tensor/tensor) and modneg_tc (tensor/scalar).
📓 Modular Arithmetic Axis-wise
This chapter includes functions that perform modular arithmetic along a specific axis of a tensor. Instead of applying operations to each element one by one, these functions work across one dimension of the tensor.
21. axis_modsum - Sums values along a given axis and reduces them modulo p.
22. modmul_axis_sum - Computes a modular sum of products over a specified axis between tensors a and b, with optional permutation.
📑axis_modsum
axis_modsum
Since: v0.1.0
The function performs a modular summation along a specific axis of a tensor: it reduces values across that axis by summing them, then applies a modulus operation to each result, using a provided vector of moduli p.
This is commonly used in FHE workloads for reducing polynomials or batched data along structural axes.
🧩 Call Format
axis_modsum<T>(a, p, axis, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensor arguments are std::shared_ptr<DeviceTensor<T>>
axis is an integer index specifying the dimension to reduce
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Input tensor. Must have shape [..., k], where k = p->dims[0].
p (std::shared_ptr<DeviceTensor<T>>, input): Modulus vector of shape [k], where k matches the last dimension of a.
axis (int64_t, input): Axis to reduce over.
result (std::shared_ptr<DeviceTensor<T>>, output): Output tensor with shape equal to a with the axis dimension removed.
▶️ Example Usage
auto a = host_to_device<int32_t>(torch::tensor({{1, 2}, {3, 4}}, torch::kInt32)); // [2, 2]
auto p = host_to_device<int32_t>(torch::tensor({5, 5}, torch::kInt32)); // [2]
auto result = zeros<int32_t>({2}); // axis=0 reduced
axis_modsum<int32_t>(a, p, /*axis=*/0, result);
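// result now holds {(1 + 3) % 5, (2 + 4) % 5} = {4, 1}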
📝 Changelog
v0.1.0 - Initial release.
v1.0.0 - Repositioned the result parameter to the end of the parameters list for consistency with other functions
📑modmul_axis_sum
modmul_axis_sum
Computes a modular sum of products over a specified axis between tensors a and b, optionally applying a permutation. This function performs element-wise modular multiplication followed by summation.
Since: v1.0.0
🧩 Call Format
modmul_axis_sum<T>(a, b, p, perm, log2p_list, mu_list, axis, apply_perm, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensor arguments are std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, shape [reps, sum_size, k, n], input, v1.0.0): Left input tensor.
b (std::shared_ptr<DeviceTensor<T>>, shape [sum_size, k, n], input, v1.0.0): Right input tensor.
p (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v1.0.0): Modulus per RNS channel.
perm (std::shared_ptr<DeviceTensor<T>>, shape [n], optional input, v1.0.0): Permutation indices if apply_perm is true.
log2p_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett log2(p) values.
mu_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett mu constants.
axis (int64_t, -1 or -3, input, v1.0.0): Contraction axis.
apply_perm (bool, input, v1.0.0): Whether to apply the permutation.
result (std::shared_ptr<DeviceTensor<T>>, shape [reps, k, n] or [reps, n, k], output, v1.0.0): Accumulator tensor; updated by adding new values to existing contents.
Logic
For each output position:
result[...] = (result[...] + sum_i (a[...] * b[...]) % p) % p
Computes the sum over i of (a[...] * b[...]) % p.
Adds this sum to the existing value in result[...].
Applies % p again to keep the value within modular bounds.
Key difference from other HEAL functions: the result tensor is not cleared or zero-initialized internally. If you need to avoid accumulation, you must initialize it to zero yourself before the call (see the example and reference sketch below).
Supports two contraction modes:
axis = -1: sum over sum_size on axis 1 (a) and axis 0 (b).
axis = -3: sum over sum_size on axis 2 (a) and axis 1 (b).
If apply_perm is true, applies the perm permutation before computation.
Modular arithmetic is performed per k-channel, ensuring overflow safety.
❗ Throws std::invalid_argument on:
Shape mismatches.
Invalid axis values.
Non-positive moduli.
Invalid permutation indices.
// Initialize result with zeros if you want overwrite behavior
auto result = empty<int64_t>({reps, k, n});
set_const_val<int64_t>(result, 0);
modmul_axis_sum<int64_t>(
a, b, p, perm, nullptr, nullptr, -1, false, result);
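For vendors validating an implementation, the host-side sketch below spells out the axis = -1 contraction under the shapes listed above (flat, contiguous, row-major buffers; inputs assumed in [0, p); __int128 is a GCC/Clang extension used to keep the intermediate product exact):
#include <cstdint>
#include <vector>

// Reference for axis = -1: a is [reps, sum_size, k, n], b is [sum_size, k, n],
// p is [k], result is [reps, k, n]. result is accumulated into, not overwritten.
void modmul_axis_sum_reference(const std::vector<int64_t>& a,
                               const std::vector<int64_t>& b,
                               const std::vector<int64_t>& p,
                               std::vector<int64_t>& result,
                               int64_t reps, int64_t sum_size, int64_t k, int64_t n) {
    for (int64_t r = 0; r < reps; ++r) {
        for (int64_t c = 0; c < k; ++c) {
            for (int64_t j = 0; j < n; ++j) {
                int64_t acc = 0;  // modular sum of products over the contraction index
                for (int64_t i = 0; i < sum_size; ++i) {
                    __int128 prod =
                        static_cast<__int128>(a[((r * sum_size + i) * k + c) * n + j]) *
                        b[(i * k + c) * n + j];
                    acc = static_cast<int64_t>((acc + prod % p[c]) % p[c]);
                }
                int64_t& out = result[(r * k + c) * n + j];
                out = (out + acc) % p[c];  // accumulate into existing contents
            }
        }
    }
}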
📝Changelog
v1.0.0 - Initial release.
📒Number Theoretic Transform (NTT, INTT) functions
This section describes the forward and inverse Number Theoretic Transform operations used in modular polynomial arithmetic.
23. ntt - Applies the forward Number-Theoretic Transform on batched, multi-channel input tensors, converting them to the NTT domain for efficient polynomial multiplication.
24. intt - Applies the inverse Number-Theoretic Transform, returning the data to its original (coefficient) domain.
📑ntt
ntt
Applies the forward Number Theoretic Transform (NTT) on a batched, multi-channel tensor. This transform converts data from the coefficient domain into the NTT domain, enabling efficient modular polynomial multiplication.
- Introduced in v0.1.0
- Signature updated in v1.0.0
— Added support for axis parameter (enabling [l, r, k, m] layout)
— Added optional log2p_list and mu_list for Barrett reduction
🧩 Call Format
// Forward NTT
ntt<T>(
a, // [l, m, r, k]
p, // [k]
perm, // [m]
twiddles, // [k, m]
log2p_list, // [k] — optional (v1.0.0+)
mu_list, // [k] — optional (v1.0.0+)
axis, // required (v1.0.0+)
skip_perm,
result // [l, m, r, k]
);
T: Scalar data type
All inputs are std::shared_ptr<DeviceTensor<T>>
result is a pre-allocated output tensor
📥📤 Parameters
a (shape [l, m, r, k], input, v0.1.0): Input tensor; l = left batch, m = transform length, r = right batch, k = RNS channels.
p (shape [k], input, v0.1.0): Vector of modulus values, one per RNS channel.
perm (shape [m], input, v0.1.0): Permutation vector for final reordering.
twiddles (shape [k, m], input, v0.1.0): Twiddle factors for the forward transform.
log2p_list (shape [k], optional input, v1.0.0): Precomputed log₂(pᵢ) per modulus, used for optional Barrett reduction.
mu_list (shape [k], optional input, v1.0.0): Precomputed Barrett constants (2²ⁿ / pᵢ) per modulus.
axis (-3 or -1, input, v1.0.0): Which axis represents the transform dimension m.
skip_perm (boolean, input, v1.0.0): Whether to skip the permutation step.
result (shape [l, m, r, k], output, v0.1.0): Output tensor. Must be pre-allocated and match the shape of input a.
Logic
Executes staged butterfly operations over the specified axis (-1 or -3).
Uses twiddle factors and modulus values to perform modular arithmetic.
Writes the transformed result into the result tensor.
By default, applies a final permutation using perm. Set skip_perm = true to skip this step.
❗ Throws std::invalid_argument if the axis is invalid or shapes mismatch.
// Forward transform
ntt<int32_t>(a, p, perm, twiddles, nullptr, nullptr, -3, true, result);
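As a functional reference only (not an implementation guide), the forward transform of a single length-m channel is X[j] = sum_i a[i] * w^(i*j) mod p, where w is a primitive m-th root of unity modulo p; the exact convention (cyclic vs. negacyclic) and output ordering in HEAL are determined by the twiddles and perm inputs. A naive O(m²) sketch:
#include <cstdint>
#include <vector>

// Naive definition of a length-m forward NTT over one channel; real implementations
// use staged butterflies. __int128 (a GCC/Clang extension) avoids overflow.
std::vector<int64_t> ntt_naive(const std::vector<int64_t>& a, int64_t w, int64_t p) {
    const int64_t m = static_cast<int64_t>(a.size());
    std::vector<int64_t> X(m, 0);
    for (int64_t j = 0; j < m; ++j) {
        int64_t wj = 1;  // w^j
        for (int64_t t = 0; t < j; ++t) wj = static_cast<int64_t>((__int128)wj * w % p);
        __int128 acc = 0;
        int64_t wij = 1;  // w^(i*j) for the current i
        for (int64_t i = 0; i < m; ++i) {
            acc = (acc + static_cast<__int128>(a[i]) * wij) % p;
            wij = static_cast<int64_t>((__int128)wij * wj % p);
        }
        X[j] = static_cast<int64_t>(acc);
    }
    return X;
}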
📝 Changelog
v1.0.0
Added support for the axis = -1 layout ([l, r, k, m])
Introduced log2p_list and mu_list for optional Barrett reduction
v0.1.0
Original implementation with the fixed [l, m, r, k] layout
📑intt
intt
Applies the inverse Number Theoretic Transform (INTT) to return tensors from the NTT domain back to the coefficient domain.
- Introduced in v0.1.0
- Signature updated in v1.0.0
— Supports optional Barrett reduction parameters
🧩 Call Format
// Inverse NTT
intt<T>(
a,
p,
perm,
inv_twiddles,
m_inv,
log2p_list, // optional (v1.0.0+)
mu_list, // optional (v1.0.0+)
result
);
T: Scalar data type (e.g., int32_t, int64_t)
All inputs are std::shared_ptr<DeviceTensor<T>>
result is a pre-allocated output tensor
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, shape [l, m, r, k], input, v0.1.0): Input tensor in the NTT domain.
p (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v0.1.0): Modulus values (one per RNS channel).
perm (std::shared_ptr<DeviceTensor<T>>, shape [m], input, v0.1.0): Reordering vector to restore canonical element order.
inv_twiddles (std::shared_ptr<DeviceTensor<T>>, shape [k, m], input, v0.1.0): Inverse twiddle factors.
m_inv (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v0.1.0): Modular inverse of the transform size m.
log2p_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): ⌊log₂(pᵢ)⌋ values for Barrett reduction (not yet used in the default implementation).
mu_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett constants (2²ⁿ / pᵢ).
result (std::shared_ptr<DeviceTensor<T>>, shape [l, m, r, k], output, v0.1.0): Must be pre-allocated and match the input shape.
// Inverse transform
intt<int32_t>(
transformed, p, perm, inv_twiddles, m_inv,
nullptr, nullptr, restored
);
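// Round trip: with matching p, perm, inv_twiddles, and m_inv, 'restored' should
// reproduce the coefficient-domain tensor that was originally passed to ntt.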
📝 Changelog
v1.0.0
Introduced log2p_list and mu_list for optional Barrett reduction
v0.1.0
Original implementation with the fixed [l, m, r, k] layout
📕Other Compute Operations
This chapter includes additional computational functions that are not strictly arithmetic or shape-related but are essential to support specialized FHE workloads.
Other compute operations include:
25. apply_g_decomp - Applies gadget decomposition (HE-specific operation)
26. take_along_axis - Selects values from a tensor along a specified axis using provided indices.
📑apply_g_decomp
apply_g_decomp
Introduced in v0.1.0
Renamed in v1.0.0 (formerly g_decomposition)
This function performs a positional radix decomposition of each integer value in a tensor. Each element is expressed as a sum of digits in base 2^base_bits, spread over power digits.
🧩 Call Format
apply_g_decomp<T>(a, power, base_bits, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensors are std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Input tensor of arbitrary shape to decompose.
power (size_t, input): Number of base digits to extract.
base_bits (size_t, input): Bit width of each base digit (i.e., log₂ of the base used for decomposition).
result (std::shared_ptr<DeviceTensor<T>>, output): Output tensor of shape a.shape + [power] to hold the digit decompositions.
using namespace lattica_hw_api;
// Input tensor
auto a = std::make_shared<DeviceTensor<int32_t>>(std::vector<int32_t>{13, 7}); // shape: [2]
// Parameters
size_t power = 3;
size_t base_bits = 2; // base = 2^2 = 4
// Output tensor shape = [2, 3]
auto result = std::make_shared<DeviceTensor<int32_t>>(Shape{2, 3});
// Decompose
apply_g_decomp<int32_t>(a, power, base_bits, result);
// Expected output:
// 13 = 1 + 3*4 + 0*16 → [1, 3, 0]
// 7 = 3 + 1*4 + 0*16 → [3, 1, 0]
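A host-side sketch of the per-element decomposition (least-significant digit first, non-negative inputs assumed; it reproduces the expected output shown above):
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: split one value into `power` digits of `base_bits` bits each.
std::vector<int32_t> g_decomp_reference(int32_t value, size_t power, size_t base_bits) {
    std::vector<int32_t> digits(power);
    const int32_t mask = (1 << base_bits) - 1;
    for (size_t d = 0; d < power; ++d) {
        digits[d] = value & mask;  // next base-2^base_bits digit
        value >>= base_bits;
    }
    return digits;
}
// g_decomp_reference(13, 3, 2) -> {1, 3, 0}; g_decomp_reference(7, 3, 2) -> {3, 1, 0}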
📝 Changelog
v1.0.0: Renamed from g_decomposition to apply_g_decomp. Repositioned the result parameter to the end of the parameter list for consistency with other functions.
v0.1.0: Initial version
📑take_along_axis
take_along_axis
Selects values from a tensor along a specified axis using provided indices. This function performs an axis-wise gather operation and writes the selected values to a result tensor.
Since: v1.0.0
🧩 Call Format
take_along_axis<T>(a, indices, axis, result);
axis tells which dimension you are indexing into. You can use negative numbers to count from the end; for example, -1 means the last axis.
indices.shape must match a.shape, except along axis.
The resulting shape is always the same as indices.
This does not sort; it selects values based on your index map.
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Source tensor to gather from.
indices (std::shared_ptr<DeviceTensor<int64_t>>, broadcast-compatible with a except along axis, input): Indices to select along axis; broadcastable to a.
axis (int64_t, input): Axis along which to take values (can be negative).
result (std::shared_ptr<DeviceTensor<T>>, same shape as a, output): Output tensor (pre-allocated).
Logic
For each coordinate in indices, selects an element from a along axis.
Supports negative axis values.
Supports negative indices (-1 means the last element).
Requires:
indices.shape broadcast-compatible with a.
result.shape == a.shape (new in v1.1.0, simplified API).
❗ Throws:
std::invalid_argument if shapes mismatch.
std::out_of_range if indices are out of bounds or the axis is invalid.
auto a = torch::tensor({5, 10, 15, 20}, torch::kInt64);
auto indices = torch::tensor({2, 0, 3, 1}, torch::kInt64);
auto a_hw = host_to_device<int64_t>(a);
auto indices_hw = host_to_device<int64_t>(indices);
auto result_hw = empty<int64_t>({4});
take_along_axis<int64_t>(a_hw, indices_hw, 0, result_hw);
// Result: [15, 5, 20, 10]
📝 Changelog
v0.1.0 - Introduced function permute, which rearranged elements of a tensor along a specified axis according to a batch-wise permutation pattern.
v1.0.0 - permute was replaced by take_along_axis, which generalizes the behavior and aligns more closely with established tensor APIs.
📗Shape Manipulation Functions
This chapter outlines operations used to manipulate the shape and structure of tensors without unnecessary data duplication. These functions are critical for enabling memory-efficient transformations during FHE program execution.
Shape manipulation functions include:
27. flatten - Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order.
28. expand - Expands tensor dimensions without copying data.
29. unsqueeze - Adds a dimension of size 1 at a specified position.
30. squeeze - Removes dimensions of size 1 from a tensor.
31. reshape - Changes tensor shape while preserving data.
32. moveaxis - Updates dims and strides metadata so that the tensor appears to have the same data but with one axis relocated.
33. get_slice - Produces a zero-copy sliced view using an index or range.
📑flatten
flatten
Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order. Commonly used to reduce rank before linear processing or output.
Since: v1.0.0
🧩 Call Format
result = flatten<T>(a, start_axis, end_axis);
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): Input tensor to flatten. Metadata is updated in-place.
start_axis (int64_t, input): Start of the axis range to flatten (inclusive). Supports negative indexing.
end_axis (int64_t, input): End of the axis range to flatten (inclusive). Must be ≥ start_axis.
📤 Returns
std::shared_ptr<DeviceTensor<T>>: the same tensor as the input, with updated shape and strides.
Logic
Flattens dimensions [start_axis, end_axis] into a single dimension.
All other dimensions remain unchanged.
Operates in-place: modifies tensor metadata but not the data buffer.
The input tensor must be contiguous; non-contiguous tensors will throw an error.
Negative axes are normalized (-1 = last axis, etc.).
Throws on invalid ranges (e.g., start > end, or axes out of bounds).
// From [2, 3, 4, 5]:
flatten(a, 1, 2) // shape becomes [2, 12, 5]
flatten(a, 0, -1) // shape becomes [120]
📝 Changelog
v1.0.0 - Initial release.
📑expand
expand
Since: v0.1.0
The expand function virtually replicates a singleton dimension of a tensor along a specified axis, modifying its shape and stride metadata without duplicating memory.
🧩 Call Format
expand<T>(a, axis, repeats);
T: Scalar data type (int32_t, int64_t, float, double)
Tensor a is modified in-place.
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): Tensor whose dimension will be expanded in-place.
axis (int64_t, input): Axis to expand (can be negative to count from the end).
repeats (int64_t, input): Number of times to replicate the dimension; must be positive.
Warning: this function modifies the input tensor a in-place by changing its dimensions and strides.
// a has shape [2, 1, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {2, 1, 4}, torch::kInt32));
// Expand along axis 1 (currently size 1) to make it size 3
auto expanded = expand<int32_t>(a, /*axis=*/1, /*repeats=*/3);
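Zero-copy expansion is typically realized by setting the stride of the expanded axis to 0, so every index along that axis aliases the same underlying elements. A sketch of the metadata update (field names mirror the illustrative struct in the introduction; the real DeviceTensor layout is vendor-defined):
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative metadata update for a zero-copy expand of a size-1 axis.
void expand_metadata(std::vector<int64_t>& dims, std::vector<int64_t>& strides,
                     int64_t axis, int64_t repeats) {
    if (axis < 0) axis += static_cast<int64_t>(dims.size());
    if (axis < 0 || axis >= static_cast<int64_t>(dims.size()) ||
        dims[axis] != 1 || repeats <= 0) {
        throw std::invalid_argument("expand: invalid axis or repeats");
    }
    dims[axis] = repeats;
    strides[axis] = 0;  // every index along this axis aliases the same elements
}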
📝 Changelog
v0.1.0 - Initial release.
📑unsqueeze
unsqueeze
Since: v0.1.0
The function inserts a new axis of size 1 into a tensor's shape.
This is a metadata-only operation: no data is changed, copied, or moved.
It is commonly used to align tensor shapes for broadcasting or to explicitly add batch, channel, or dimension markers.
🧩 Call Format
unsqueeze<T>(a, axis) → result
T: Scalar data type (int32_t, int64_t, float, double)
Returns: std::shared_ptr<DeviceTensor<T>>
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>): Input tensor to be reshaped. This tensor is modified in-place.
axis (int64_t): The axis at which to insert a new dimension of size 1. Supports negative indexing.
📤 Output
result (std::shared_ptr<DeviceTensor<T>>): A reference to the same tensor a, with updated shape and stride metadata reflecting the added dimension.
// Input: shape [3, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 4}, torch::kInt32));
// Insert new dimension at axis 1 → shape becomes [3, 1, 4]
auto result = unsqueeze<int32_t>(a, 1);
📝 Changelog
v0.1.0 - Initial release.
📑squeeze
squeeze
Since: v0.1.0
The function removes a dimension of size 1 at the specified axis. This is a metadata-only operation — no data is copied or moved.
It is often used after broadcasting or slicing to clean up unnecessary singleton dimensions.
🧩 Call Format
squeeze<T>(a, axis) → result
T: Scalar data type (int32_t, int64_t, float, double)
Returns: std::shared_ptr<DeviceTensor<T>>
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>): Input tensor to be reshaped. Modified in-place.
axis (int64_t): Axis to remove. Must be within the valid range and must point to a dimension of size 1. Supports negative indexing.
📤 Output
result (std::shared_ptr<DeviceTensor<T>>): The same tensor as the input, with one fewer dimension. Shape and stride metadata are updated.
// Input: shape [3, 1, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 1, 4}, torch::kInt32));
// Remove axis 1 → shape becomes [3, 4]
auto result = squeeze<int32_t>(a, 1);
📝 Changelog
v0.1.0 - Initial release.
📑reshape
reshape
Since: v0.1.0
The reshape method updates a tensor's shape and stride metadata to match a new specified shape, as long as the total number of elements remains unchanged (excluding broadcasted dimensions).
🧩 Call Format
a->reshape(new_dims)
Operates in-place: modifies the current tensor's shape and stride metadata
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): The tensor to reshape. Shape and strides are modified in-place.
new_dims (std::vector<int64_t>, input): Desired new shape. The total element count must match the current tensor.
auto a = host_to_device<int32_t>(torch::arange(24, torch::kInt32).reshape({2, 3, 4}));
// Reshape from [2, 3, 4] to [6, 4]
a->reshape({6, 4});
📝 Changelog
v0.1.0 - Initial release.
📑moveaxis
moveaxis
Since: v1.0.0
The function updates the internal metadata (dims and strides) of a tensor to simulate movement of one axis to a new position, without modifying the underlying memory.
🧩 Call Format
moveaxis<T>(tensor, axis_src, axis_dst)
tensor: Tensor to update (metadata modified in-place)
axis_src: Axis to move (may be negative)
axis_dst: Target position (may be negative)
📥 Input Parameters
tensor (std::shared_ptr<DeviceTensor<T>>): Tensor to be modified in-place.
axis_src (int64_t): Source axis index (supports negative indexing).
axis_dst (int64_t): Destination axis index (supports negative indexing).
Logic
Modifies the tensor's dims and strides vectors to simulate a move of one axis.
Negative axis values are normalized using the tensor's rank.
If axis_src == axis_dst, the operation is a no-op.
Invalid axis indices raise std::invalid_argument.
❗ Error Conditions
Null pointer input → throws std::invalid_argument.
Axis indices outside the valid range → throws std::invalid_argument.
auto a_hw = host_to_device<int64_t>(torch::randint(0, 10, {2, 3, 4}, torch::kInt64));
// Move axis 2 to position 0 → shape becomes [4, 2, 3]
moveaxis<int64_t>(a_hw, /*src=*/2, /*dst=*/0);
📝 Changelog
v1.0.0 - Initial release.
📑get_slice
get_slice
Since: v1.0.0
This function produces a zero-copy view into the input tensor by modifying the metadata (shape, strides, and pointer offset) based on a slicing specification.
🧩 Call Format
get_slice<T>(input, slices) -> result;
T: Scalar data type
input: Input tensor whose metadata is modified
slices: Slice specification per axis, each entry giving either a fixed index or a range
📥 Input Parameters
input (std::shared_ptr<DeviceTensor<T>>): Tensor to slice (metadata modified in-place)
slices (std::vector<SliceArg>): Slice specification per axis (see below)
Each SliceArg can be:
int64_t: a single index to take, which collapses that axis
Slice: a struct of (start, end, step) with default step = 1, where start is inclusive, end is exclusive, and step > 0
📤 Output
std::shared_ptr<DeviceTensor<T>>: A new view of the input tensor with updated shape, strides, and offset. No memory is copied.
Logic
Performs slicing without allocating a new buffer (zero-copy).
May collapse axes when single index is selected.
All slicing rules follow PyTorch-style semantics.
Negative indices are not currently supported.
❗ Error Conditions
slices.size()
≠input.rank()
→ throwsstd::invalid_argument
Index out of bounds → throws
std::out_of_range
Invalid range (e.g.
end ≤ start
, orstep ≤ 0
) → throwsstd::invalid_argument
auto a = torch::tensor({{10,20,30,40},{50,60,70,80}}, torch::kInt32);
std::vector<SliceArg> slices = {
Slice(0, 2),
Slice(1, 3)
};
auto a_hw = host_to_device<int32_t>(a);
auto out_hw = get_slice<int32_t>(a_hw, slices);
auto out = device_to_host<int32_t>(out_hw);
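// out is a 2x2 view of rows 0-1 and columns 1-2: {{20, 30}, {60, 70}}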
📝 Changelog
v1.0.0 - Initial release.