Interface Specifications
Introduction
This document specifies the standardized interface that hardware vendors must implement to integrate with the Homomorphic Encryption Abstraction Layer (HEAL). HEAL abstracts the complexity of fully homomorphic encryption (FHE) computations, enabling efficient and scalable implementations across diverse hardware architectures.
The interface defined in this document includes essential functions categorized into:
Memory Management Functions: Operations responsible for allocation, initialization, and efficient transfer of tensor data between host (CPU) and hardware devices.
Shape Manipulation Functions: Provides operations to change the shape, layout, or dimension arrangement of tensors without copying data, enabling flexible transformations for downstream computations.
Tensor Value Assignments: Provides utility functions to assign constant values to all elements of a tensor without changing its shape or memory layout.
Arithmetic Operations (Modular Arithmetic): Essential modular arithmetic computations required in FHE workflows.
Modular Arithmetic Axis Operations: Performs modular arithmetic computations across a specified tensor axis, combining elements using summation or product-reduction patterns.
NTT Transforms: Implements forward and inverse Number-Theoretic Transforms (NTT/INTT) for efficient polynomial operations in the encrypted domain.
Other Compute Operations: Additional tensor computations and transformations essential to specialized FHE processes.
The core data structure managed through this interface is the tensor, a multi-dimensional array representing polynomial coefficients and associated metadata for homomorphic computations. A detailed explanation of tensors, supported data types, shapes, memory management strategies, and data flow considerations can be found in the Memory Management and Data Structures documentation.
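For orientation, the sketch below shows the kind of metadata such a tensor carries (a data pointer plus dims, strides, and an offset for views). The field names are illustrative only; the authoritative layout is defined in the Memory Management and Data Structures documentation.
#include <cstdint>
#include <vector>

// Illustrative sketch only - the real DeviceTensor<T> layout is defined in the
// Memory Management and Data Structures documentation.
template <typename T>
struct DeviceTensorSketch {
    T* data;                       // device buffer holding the elements
    std::vector<int64_t> dims;     // shape, e.g. {l, m, r, k}
    std::vector<int64_t> strides;  // per-dimension element strides
    int64_t offset;                // element offset into the buffer (used by zero-copy views)
};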
📙Memory Management Functions
This chapter defines the interface for memory management operations within the HEAL framework. These functions enable efficient allocation, initialization, and transfer of tensor data between host (CPU) memory and hardware memory. They ensure that data is correctly formatted, aligned, and accessible for hardware execution, serving as the foundation for all subsequent FHE computations.
Memory management functions include:
1. zeros - Allocates zero-initialized memory for a tensor
2. empty - Allocates uninitialized memory for a tensor
3. host_to_device - Transfers tensor data from host to device memory
4. device_to_host - Transfers tensor data from device to host memory
5. contiguous - Ensures a tensor has a contiguous memory layout; makes a copy if needed.
📑zeros
zeros
Introduced in v0.1.0
Renamed in v1.0.0 (formerly allocate_on_hardware)
The function allocates memory on the device for a tensor with the specified shape and initializes every element to zero.
Zero-initialization makes the tensor safe to use directly as an accumulator or output buffer. When the contents will be overwritten immediately, use empty instead to avoid the initialization overhead.
🧩Call Format
device_tensor = zeros<T>(dims);
T: Scalar data type of the tensor elements (e.g., int32, int64, float32, float64, complex64, complex128)
dims: A list of dimensions representing the desired shape of the tensor.
device_tensor: A smart pointer to a newly allocated tensor on the device with zero-initialized memory.
📥 Input
dims (std::vector<int64_t>): Tensor shape - list of dimension sizes. Assumes values are valid and > 0.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): A new tensor object on the device with zero-initialized memory and associated metadata.
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};
// Allocate a zero-initialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = zeros<float>(dims);
⚠️ Error Messages
The function assumes:
The shape is valid (e.g., no negative dimensions).
Memory allocation on the device succeeds.
It performs no internal validation or exception handling.
📝 Changelog
v1.0.0: Renamed from allocate_on_hardware to zeros
v0.1.0: Initial version
📑empty
empty
Since: v1.0.0
The function allocates memory on the device for a tensor with the specified shape. It does not initialize the contents of the allocated memory.
This function is useful when the memory is going to be immediately overwritten by subsequent operations, allowing for faster allocation without the overhead of zero-initialization.
Allocates memory on the device but does not initialize values.
Computes default row-major strides.
Returns a valid DeviceTensor<T> that can be used as the output of other operations.
The total number of elements is computed as the product of the dimensions in dims.
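As a point of reference, default row-major strides can be derived from the shape as in the host-side sketch below (illustrative only; the helper name is not part of the HEAL interface).
#include <cstdint>
#include <vector>

// Host-side illustration: default row-major strides for a given shape.
std::vector<int64_t> row_major_strides(const std::vector<int64_t>& dims) {
    std::vector<int64_t> strides(dims.size(), 1);
    for (int64_t i = static_cast<int64_t>(dims.size()) - 2; i >= 0; --i) {
        strides[i] = strides[i + 1] * dims[i + 1];
    }
    return strides;  // e.g. dims {8, 8} -> strides {8, 1}
}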
❗ Error Conditions
If memory allocation fails (malloc returns nullptr), the implementation should raise an error or return nullptr.
Negative or invalid dimension values may cause incorrect behavior and should be guarded by the caller.
🧩Call Format
device_tensor = empty<T>(dims);
T: Scalar data type of the tensor elements
dims: Shape of the tensor (each dimension must be ≥ 0)
device_tensor: A smart pointer to a newly allocated tensor on the device with uninitialized memory.
📥 Input
dims (std::vector<int64_t>): Tensor shape - list of dimension sizes. Assumes values are valid and > 0.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): Newly allocated device tensor with the given shape and inferred strides. Contents are uninitialized.
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};
// Allocate an uninitialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = empty<float>(dims);
📝 Changelog
v1.0.0: Initial version
📑host_to_device
host_to_device
Since: v0.1.0
Transfers data from a host-side tensor (e.g., PyTorch, NumPy) to a newly allocated device tensor suitable for computation on accelerator hardware.
🧩 Call Format
device_tensor = host_to_device<T>(host_tensor);
T: Scalar data type (e.g., int32_t, float, etc.)
host_tensor: A tensor in host memory (e.g., PyTorch, NumPy) with scalar type T.
device_tensor: A smart pointer to a device-side representation of the tensor.
📥 Input Parameters
host_tensor (TensorLike, e.g. PyTorch): Host-side tensor with shape and data accessible for transfer. The data type must match the template type T.
📤 Output
device_tensor (std::shared_ptr<DeviceTensor<T>>): Device-allocated tensor containing data copied from the host.
// Create a PyTorch (illustration only) tensor with int32 data on the host (CPU)
torch::Tensor host_tensor = torch::tensor({1, 2, 3}, torch::kInt32);
// Transfer the tensor to device memory using the HEAL interface
auto device_tensor = host_to_device<int32_t>(host_tensor);
⚠️ Error Messages
The function does not currently include explicit error handling for mismatches or null inputs. It assumes:
The host tensor has correct and accessible data for the scalar type T.
Allocation on the device succeeds.
📝 Changelog
v0.1.0 - Initial release.
📑 device_to_host
device_to_host
Since: v0.1.0
Transfers data from a device-side tensor to a host-side tensor, facilitating the retrieval of computation results from accelerator hardware to the host environment.
🧩 Call Format
host_tensor = device_to_host<T>(device_tensor);
T: Scalar data type (e.g., int32_t, float)
device_tensor: A std::shared_ptr<DeviceTensor<T>> residing on the device
host_tensor: A tensor in host memory containing the copied data
📥 Input Parameters
device_tensor (std::shared_ptr<DeviceTensor<T>>): Device-side tensor to be copied to host memory. The data type must match the template type T.
📤 Output
host_tensor (TensorLike): Host-side tensor containing data copied from the device.
// Assume this device tensor was created using host_to_device earlier
std::shared_ptr<DeviceTensor<int32_t>> device_tensor = ...;
// Transfer the tensor back to host memory as a PyTorch (implementation example) tensor
torch::Tensor host_tensor = device_to_host<int32_t>(device_tensor);
📝 Changelog
v0.1.0 - Initial release.
📑contiguous
contiguous
- Introduced in v0.1.0
- Renamed in v1.0.0 (formerly make_contiguous)
The function ensures that a tensor has a standard, contiguous memory layout. If the tensor is already contiguous, it returns immediately. If not, it creates a new memory buffer, copies the elements into contiguous layout, updates strides, and modifies the tensor in-place.
🧩Call Format
contiguous<T>(tensor);
T: Scalar data type (int32_t, int64_t, float, double)
Tensor tensor is modified in-place if needed.
📥📤 Parameters
tensor (std::shared_ptr<DeviceTensor<T>>, input/output): Input tensor to be made contiguous if necessary.
// Non-contiguous tensor example: result of transpose
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 4}, torch::kInt32).transpose(0, 1));
// Make contiguous (copying into new memory if needed)
contiguous<int32_t>(a);
// After the call, 'a' now has standard contiguous memory layout
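A tensor is contiguous when its strides equal the default row-major strides for its shape. A host-side sketch of that check (illustrative only; not part of the HEAL interface):
#include <cstdint>
#include <vector>

// Host-side illustration: contiguity means each stride equals the product of
// the dimensions to its right (size-1 axes are ignored, since any stride works there).
bool is_contiguous(const std::vector<int64_t>& dims, const std::vector<int64_t>& strides) {
    int64_t expected = 1;
    for (int64_t i = static_cast<int64_t>(dims.size()) - 1; i >= 0; --i) {
        if (dims[i] != 1 && strides[i] != expected) return false;
        expected *= dims[i];
    }
    return true;
}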
📝 Changelog
v1.0.0: Renamed from make_contiguous to contiguous
v0.1.0: Initial version
📔Tensor Value Assignments
Functions in this chapter assign values to tensor elements: setting every element to a constant (in-place, without changing shape or layout), or padding an axis with zeros (which expands the shape).
6. pad_single_axis - Appends zeros at the end of a specific axis, expanding the shape.
7. set_const_val - Sets all elements of a tensor to a constant value; in-place, no allocation.
📑pad_single_axis
pad_single_axis
Since: v1.0.0
🧩 Call Format
pad_single_axis<T>(a, pad, axis, result);
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, any shape, input, v1.0.0): Input tensor to pad.
pad (int64_t, input, v1.0.0): Number of zeros to append (must be ≥ 0).
axis (int64_t, input, v1.0.0): Axis index along which to pad (negative values allowed, e.g., -1 = last axis).
result (std::shared_ptr<DeviceTensor<T>>, padded shape, output, v1.0.0): Output tensor; same as a but with the padded axis expanded by pad.
Logic
Expands the dimension at axis by pad elements.
Copies existing values from a into result.
Fills padded positions with zero (0 of type T).
Supports negative axis indices (-1 = last, -2 = second-to-last, etc.).
❗ Throws std::invalid_argument if:
pad < 0.
Axis is out of bounds.
Input and result ranks mismatch.
Result shape does not match the expected padded dimensions.
// a: shape [2, 3]
auto a = host_to_device<int64_t>(torch::zeros({2, 3}, torch::kInt64));
// pad: 2 on axis 1 → result shape: [2, 5]
auto result = empty<int64_t>({2, 5});
pad_single_axis<int64_t>(a, 2, 1, result);
📝 Changelog
v1.0.0 - Initial release.
📑set_const_val
set_const_val
Sets all elements of a tensor to a given constant value. This is a utility function often used to initialize intermediate tensors or reset memory before computation.
Since: v1.0.0
🧩 Call Format
set_const_val<T>(tensor, val);
T: Scalar data type (int32_t, int64_t)
Tensor: std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
tensor (std::shared_ptr<DeviceTensor<T>>, input/output): Tensor to overwrite. Modified in-place.
val (T, input): Scalar value to assign to all elements.
Logic
Iterates over all elements of the tensor, replacing each with val.
Supports tensors of any rank, including scalar (0D) tensors.
Leaves the tensor shape unchanged.
Throws:
std::invalid_argument if the input tensor is null.
auto hw_tensor = host_to_device<int32_t>(torch::randint(0, 100, {10}, torch::kInt32));
// Set every element in the tensor to zero.
set_const_val<int32_t>(hw_tensor, 0);
📝 Changelog
v1.0.0 - Initial release.
📘Arithmetic Operations (Modular Arithmetic)
Functions in this chapter perform element-wise or structured computations such as modular addition and modular multiplication.
All modular arithmetic functions in HEAL accept a modulus parameter p, which can be:
A scalar (the same modulus applied to all elements), or
A 1D tensor of shape [k], where k matches the size of the result tensor's last dimension.
All results are reduced modulo p, and the outcome always lies in the range [0, p), even if intermediate values (e.g., inputs or intermediate sums/products) are negative: (-1 % 5) → 4; (-7 % 5) → 3.
This ensures correctness and consistency across all platforms and encryption schemes.
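A scalar sketch of this reduction rule (illustrative only; C++'s built-in % operator keeps the sign of the dividend, so negative values need the extra adjustment):
#include <cstdint>

// Reference reduction into [0, p) for a possibly negative value.
int64_t mod_reduce(int64_t a, int64_t p) {
    int64_t r = a % p;             // may be negative when a is negative
    return (r < 0) ? r + p : r;    // shift into [0, p)
}
// mod_reduce(-1, 5) == 4, mod_reduce(-7, 5) == 3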
✖️ Modular Multiplication Functions (modmul)
This section defines the modular multiplication functions supported by the HEAL interface. These functions compute element-wise (a * b) % p
using different combinations of tensor and scalar inputs.
ttt: all inputs are tensors
ttc: modulus is a scalar
tct: multiplier b is a scalar
tcc: both multiplier and modulus are scalars
The result is stored in a pre-allocated output tensor, which must match the expected broadcasted shape of inputs a and b.
8. modmul_ttt - Modular multiplication (tensor-tensor-tensor)
9. modmul_ttc - Modular multiplication (tensor-tensor-constant)
10. modmul_tct - Modular multiplication (tensor-constant-tensor)
11. modmul_tcc - Modular multiplication (tensor-constant-constant)
Since: v0.1.0
🧩 Call Format
// tensor * tensor % tensor
modmul_ttt<T>(a, b, p, result);
// tensor * tensor % constant
modmul_ttc<T>(a, b, p_scalar, result);
// tensor * constant % tensor
modmul_tct<T>(a, b_scalar, p, result);
// tensor * constant % constant
modmul_tcc<T>(a, b_scalar, p_scalar, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b, p: Shared pointers to DeviceTensor<T> objects
p_scalar, b_scalar: Scalar values of type T
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥 Parameters by Function Variant
modmul_ttt: a = DeviceTensor<T>, b = DeviceTensor<T>, p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modmul_ttc: a = DeviceTensor<T>, b = DeviceTensor<T>, p = T (scalar), result = DeviceTensor<T> (pre-allocated)
modmul_tct: a = DeviceTensor<T>, b = T (scalar), p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modmul_tcc: a = DeviceTensor<T>, b = T (scalar), p = T (scalar), result = DeviceTensor<T> (pre-allocated)
The result tensor must be pre-allocated and have a shape compatible with the broadcasted inputs.
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}));
auto result = zeros<int32_t>({3});
modmul_ttt<int32_t>(a, b, p, result);
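For the tensors in this example, result would hold {4, 3, 4}. As a functional reference for the element-wise contract of modmul_ttt on flat, same-shaped inputs (no broadcasting; the device-side algorithm, e.g. Barrett reduction, is vendor-specific), a host-side sketch:
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference: result[i] = (a[i] * b[i]) % p[i], inputs assumed in [0, p).
// __int128 (a GCC/Clang extension) keeps the intermediate product exact.
void modmul_ttt_reference(const std::vector<int64_t>& a,
                          const std::vector<int64_t>& b,
                          const std::vector<int64_t>& p,
                          std::vector<int64_t>& result) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        __int128 prod = static_cast<__int128>(a[i]) * b[i];
        result[i] = static_cast<int64_t>(prod % p[i]);
    }
}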
📝 Changelog
v0.1.0 - Initial release.
➕ Modular Addition Functions (modsum)
This section defines the modular addition functions supported by the HEAL interface. These functions compute element-wise modular addition: result[i] = (a[i] + b[i]) % p[i]
The input can consist of tensors or scalars, and broadcasting is supported. The result is stored in a pre-allocated output tensor that must be shape-compatible with the broadcasted inputs.
12. modsum_ttt - Modular summation (tensor-tensor-tensor)
13. modsum_ttc - Modular summation (tensor-tensor-constant)
14. modsum_tct - Modular summation (tensor-constant-tensor)
15. modsum_tcc - Modular summation (tensor-constant-constant)
Since: v0.1.0
🧩 Call Format
// (tensor + tensor) % tensor
modsum_ttt<T>(a, b, p, result);
// (tensor + tensor) % constant
modsum_ttc<T>(a, b, p_scalar, result);
// (tensor + constant) % tensor
modsum_tct<T>(a, b_scalar, p, result);
// (tensor + constant) % constant
modsum_tcc<T>(a, b_scalar, p_scalar, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b, p: Shared pointers to DeviceTensor<T> objects
p_scalar, b_scalar: Scalar values of type T
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥 Input Parameters by Function Variant
modsum_ttt: a = DeviceTensor<T>, b = DeviceTensor<T>, p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modsum_ttc: a = DeviceTensor<T>, b = DeviceTensor<T>, p = T (scalar), result = DeviceTensor<T> (pre-allocated)
modsum_tct: a = DeviceTensor<T>, b = T (scalar), p = DeviceTensor<T>, result = DeviceTensor<T> (pre-allocated)
modsum_tcc: a = DeviceTensor<T>, b = T (scalar), p = T (scalar), result = DeviceTensor<T> (pre-allocated)
All tensors must be pre-allocated and reside in device memory
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}));
auto result = zeros<int32_t>({3});
modsum_ttt<int32_t>(a, b, p, result);
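// result now holds {(1 + 4) % 7, (2 + 5) % 7, (3 + 6) % 7} = {5, 0, 2}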
📝 Changelog
v0.1.0 - Initial release.
% Modular Remainder Functions (mod)
Since: v1.0.0
These functions compute element-wise a % b using different combinations of tensor and scalar inputs.
The result is stored in a pre-allocated output tensor, which must match the shape of the input tensor(s).
16. mod_tt - tensor % tensor
17. mod_tc - tensor % scalar
18. mod_ct - scalar % tensor
🧩 Call Format
// tensor % tensor
mod_tt<T>(a, b, result);
// tensor % scalar
mod_tc<T>(a, b_scalar, result);
// scalar % tensor
mod_ct<T>(a_scalar, b, result);
T: Scalar data type (int32_t, int64_t, etc.)
a, b: std::shared_ptr<DeviceTensor<T>>
a_scalar, b_scalar: int64_t
result: Pre-allocated output tensor (std::shared_ptr<DeviceTensor<T>>) on the device
📥📤 Parameters
mod_tt: a = DeviceTensor<T>, b = DeviceTensor<T>, result = DeviceTensor<T>
mod_tc: a = DeviceTensor<T>, b_scalar = int64_t, result = DeviceTensor<T>
mod_ct: a_scalar = int64_t, b = DeviceTensor<T>, result = DeviceTensor<T>
The result tensor must be pre-allocated and have a shape compatible with a and/or b.
▶️ Example Usage
auto a = host_to_device<int64_t>(torch::tensor({5, 10, 15}));
auto b = host_to_device<int64_t>(torch::tensor({3, 4, 5}));
auto result = empty<int64_t>({3});
mod_tt<int64_t>(a, b, result);
// Or tensor % scalar
mod_tc<int64_t>(a, 7, result);
// Or scalar % tensor
mod_ct<int64_t>(9, b, result);
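// Expected results: mod_tt → {2, 2, 0}, mod_tc(a, 7) → {5, 3, 1}, mod_ct(9, b) → {0, 1, 4}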
📝 Changelog
v1.0.0: Initial version of the mod_tt, mod_tc, and mod_ct functions.
➖ Modular Negation Functions (modneg)
Since: v1.0.0
Performs modular negation, computing:
(-a) % p → (-(a % p) + p) % p
This ensures the result is non-negative and lies in the range [0, p). For example, with a = 3 and p = 7, the result is (-(3 % 7) + 7) % 7 = 4.
19. modneg_tt - (-tensor) % tensor
20. modneg_tc - (-tensor) % scalar
🧩 Call Formats
// (-tensor) % tensor
modneg_tt<T>(a, p, result);
// (-tensor) % scalar
modneg_tc<T>(a, p_scalar, result);
📥📤 Parameters
a (DeviceTensor<T>, input): Input tensor to be negated.
p (DeviceTensor<T>, input, modneg_tt): Modulus tensor; broadcastable to a, applied element-wise.
p_scalar (T, input, modneg_tc): Scalar modulus.
result (DeviceTensor<T>, output): Pre-allocated output tensor.
Shapes must be broadcast-compatible. If not, the function throws std::invalid_argument.
▶️ Example Usage
modneg_tt<int64_t>(a_tensor, p_tensor, result);
modneg_tc<int64_t>(a_tensor, 7, result);
📝 Changelog
v1.0.0 - Initial release of modneg_tt (tensor/tensor) and modneg_tc (tensor/scalar).
📓 Modular Arithmetic Axis-wise
This chapter includes functions that perform modular arithmetic along a specific axis of a tensor. Instead of applying operations to each element one by one, these functions work across one dimension of the tensor.
21. axis_modsum - Sums values along a given axis and reduces them modulo p.
22. modmul_axis_sum - Computes a modular sum of products over a specified axis between tensors a and b, with optional permutation.
📑axis_modsum
axis_modsum
Since: v0.1.0
The function performs a modular summation along a specific axis of a tensor: it reduces values across that axis by summing them, then applies a modulus operation to each result, using a provided vector of moduli p.
This is commonly used in FHE workloads for reducing polynomials or batched data along structural axes.
🧩 Call Format
axis_modsum<T>(a, p, axis, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensor arguments are std::shared_ptr<DeviceTensor<T>>
axis is an integer index specifying the dimension to reduce
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Input tensor. Must have shape [..., k], where k = p->dims[0].
p (std::shared_ptr<DeviceTensor<T>>, input): Modulus vector of shape [k], where k matches the last dimension of a.
axis (int64_t, input): Axis to reduce over.
result (std::shared_ptr<DeviceTensor<T>>, output): Output tensor with shape equal to a with the axis dimension removed.
▶️ Example Usage
auto a = host_to_device<int32_t>(torch::tensor({{1, 2}, {3, 4}}, torch::kInt32)); // [2, 2]
auto p = host_to_device<int32_t>(torch::tensor({5, 5}, torch::kInt32)); // [2]
auto result = zeros<int32_t>({2}); // axis=0 reduced
axis_modsum<int32_t>(a, p, /*axis=*/0, result);
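// result now holds {(1 + 3) % 5, (2 + 4) % 5} = {4, 1}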
📝 Changelog
v0.1.0 - Initial release.
v1.0.0 - Repositioned the result parameter to the end of the parameters list for consistency with other functions
📑modmul_axis_sum
modmul_axis_sum
Computes a modular sum of products over a specified axis between tensors a and b, optionally applying a permutation. This function performs element-wise modular multiplication followed by summation.
Since: v1.0.0
🧩 Call Format
modmul_axis_sum<T>(a, b, p, perm, log2p_list, mu_list, axis, apply_perm, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensor arguments are std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, shape [reps, sum_size, k, n], input, v1.0.0): Left input tensor.
b (std::shared_ptr<DeviceTensor<T>>, shape [sum_size, k, n], input, v1.0.0): Right input tensor.
p (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v1.0.0): Modulus per RNS channel.
perm (std::shared_ptr<DeviceTensor<T>>, shape [n], optional input, v1.0.0): Permutation indices if apply_perm is true.
log2p_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett log2(p) values.
mu_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett mu constants.
axis (int64_t, -1 or -3, input, v1.0.0): Contraction axis.
apply_perm (bool, input, v1.0.0): Whether to apply the permutation.
result (std::shared_ptr<DeviceTensor<T>>, shape [reps, k, n] or [reps, n, k], output, v1.0.0): Accumulator tensor; updated by adding new values to existing contents.
Logic
For each output position:
result[...] = (result[...] + sum_i (a[...] * b[...]) % p) % p
Computes the sum over i of (a[...] * b[...]) % p.
Adds this sum to the existing value in result[...].
Applies % p again to keep the value within modular bounds.
Key difference from other HEAL functions: the result tensor is not cleared or zero-initialized internally. If you need to avoid accumulation, you must initialize it to zero yourself before the call (see the example and reference sketch below).
Supports two contraction modes:
axis = -1: sum over sum_size on axis 1 (a) and axis 0 (b).
axis = -3: sum over sum_size on axis 2 (a) and axis 1 (b).
If apply_perm is true, applies the perm permutation before computation.
Modular arithmetic is performed per k-channel, ensuring overflow safety.
❗ Throws std::invalid_argument on:
Shape mismatches.
Invalid axis values.
Non-positive moduli.
Invalid permutation indices.
// Initialize result with zeros if you want overwrite behavior
auto result = empty<int64_t>({reps, k, n});
set_const_val<int64_t>(result, 0);
modmul_axis_sum<int64_t>(
a, b, p, perm, nullptr, nullptr, -1, false, result);
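For vendors validating an implementation, the host-side sketch below spells out the axis = -1 contraction under the shapes listed above (flat, contiguous, row-major buffers; inputs assumed in [0, p); __int128 is a GCC/Clang extension used to keep the intermediate product exact):
#include <cstdint>
#include <vector>

// Reference for axis = -1: a is [reps, sum_size, k, n], b is [sum_size, k, n],
// p is [k], result is [reps, k, n]. result is accumulated into, not overwritten.
void modmul_axis_sum_reference(const std::vector<int64_t>& a,
                               const std::vector<int64_t>& b,
                               const std::vector<int64_t>& p,
                               std::vector<int64_t>& result,
                               int64_t reps, int64_t sum_size, int64_t k, int64_t n) {
    for (int64_t r = 0; r < reps; ++r) {
        for (int64_t c = 0; c < k; ++c) {
            for (int64_t j = 0; j < n; ++j) {
                int64_t acc = 0;  // modular sum of products over the contraction index
                for (int64_t i = 0; i < sum_size; ++i) {
                    __int128 prod =
                        static_cast<__int128>(a[((r * sum_size + i) * k + c) * n + j]) *
                        b[(i * k + c) * n + j];
                    acc = static_cast<int64_t>((acc + prod % p[c]) % p[c]);
                }
                int64_t& out = result[(r * k + c) * n + j];
                out = (out + acc) % p[c];  // accumulate into existing contents
            }
        }
    }
}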
📝Changelog
v1.0.0 - Initial release.
📒Number Theoretic Transform (NTT, INTT) functions
This section describes the forward and inverse Number Theoretic Transform operations used in modular polynomial arithmetic.
23. ntt - Applies the forward Number-Theoretic Transform on batched, multi-channel input tensors, converting them to the NTT domain for efficient polynomial multiplication.
24. intt - Applies the inverse Number-Theoretic Transform, returning the data to its original (coefficient) domain.
📑ntt
ntt
Applies the forward Number Theoretic Transform (NTT) on a batched, multi-channel tensor. This transform converts data from the coefficient domain into the NTT domain, enabling efficient modular polynomial multiplication.
- Introduced in v0.1.0
- Signature updated in v1.0.0
— Added support for axis parameter (enabling [l, r, k, m] layout)
— Added optional log2p_list and mu_list for Barrett reduction
🧩 Call Format
// Forward NTT
ntt<T>(
a, // [l, m, r, k]
p, // [k]
perm, // [m]
twiddles, // [k, m]
log2p_list, // [k] — optional (v1.0.0+)
mu_list, // [k] — optional (v1.0.0+)
axis, // required (v1.0.0+)
skip_perm,
result // [l, m, r, k]
);
T: Scalar data type
All inputs are std::shared_ptr<DeviceTensor<T>>
result is a pre-allocated output tensor
📥📤 Parameters
a (shape [l, m, r, k], input, v0.1.0): Input tensor; l = left batch, m = transform length, r = right batch, k = RNS channels.
p (shape [k], input, v0.1.0): Vector of modulus values, one per RNS channel.
perm (shape [m], input, v0.1.0): Permutation vector for final reordering.
twiddles (shape [k, m], input, v0.1.0): Twiddle factors for the forward transform.
log2p_list (shape [k], optional input, v1.0.0): Precomputed log₂(pᵢ) per modulus, used for optional Barrett reduction.
mu_list (shape [k], optional input, v1.0.0): Precomputed Barrett constants (2²ⁿ / pᵢ) per modulus.
axis (-3 or -1, input, v1.0.0): Which axis represents the transform dimension m.
skip_perm (boolean, input, v1.0.0): Whether to skip the permutation step.
result (shape [l, m, r, k], output, v0.1.0): Output tensor. Must be pre-allocated and match the shape of input a.
Logic
Executes staged butterfly operations over the specified axis (-1 or -3).
Uses twiddle factors and modulus values to perform modular arithmetic.
Writes the transformed result into the result tensor.
By default, applies a final permutation using perm. Set skip_perm = true to skip this step.
❗ Throws std::invalid_argument if the axis is invalid or shapes mismatch.
// Forward transform
ntt<int32_t>(a, p, perm, twiddles, nullptr, nullptr, -3, true, result);
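As a functional reference only (not an implementation guide), the forward transform of a single length-m channel is X[j] = sum_i a[i] * w^(i*j) mod p, where w is a primitive m-th root of unity modulo p; the exact convention (cyclic vs. negacyclic) and output ordering in HEAL are determined by the twiddles and perm inputs. A naive O(m²) sketch:
#include <cstdint>
#include <vector>

// Naive definition of a length-m forward NTT over one channel; real implementations
// use staged butterflies. __int128 (a GCC/Clang extension) avoids overflow.
std::vector<int64_t> ntt_naive(const std::vector<int64_t>& a, int64_t w, int64_t p) {
    const int64_t m = static_cast<int64_t>(a.size());
    std::vector<int64_t> X(m, 0);
    for (int64_t j = 0; j < m; ++j) {
        int64_t wj = 1;  // w^j
        for (int64_t t = 0; t < j; ++t) wj = static_cast<int64_t>((__int128)wj * w % p);
        __int128 acc = 0;
        int64_t wij = 1;  // w^(i*j) for the current i
        for (int64_t i = 0; i < m; ++i) {
            acc = (acc + static_cast<__int128>(a[i]) * wij) % p;
            wij = static_cast<int64_t>((__int128)wij * wj % p);
        }
        X[j] = static_cast<int64_t>(acc);
    }
    return X;
}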
📝 Changelog
v1.0.0
Added support for the axis = -1 layout ([l, r, k, m])
Introduced log2p_list and mu_list for optional Barrett reduction
v0.1.0
Original implementation with the fixed [l, m, r, k] layout
📑intt
intt
Applies the inverse Number Theoretic Transform (INTT) to return tensors from the NTT domain back to the coefficient domain.
- Introduced in v0.1.0
- Signature updated in v1.0.0
— Supports optional Barrett reduction parameters
🧩 Call Format
// Inverse NTT
intt<T>(
a,
p,
perm,
inv_twiddles,
m_inv,
log2p_list, // optional (v1.0.0+)
mu_list, // optional (v1.0.0+)
result
);
T: Scalar data type (e.g., int32_t, int64_t)
All inputs are std::shared_ptr<DeviceTensor<T>>
result is a pre-allocated output tensor
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, shape [l, m, r, k], input, v0.1.0): Input tensor in the NTT domain.
p (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v0.1.0): Modulus values (one per RNS channel).
perm (std::shared_ptr<DeviceTensor<T>>, shape [m], input, v0.1.0): Reordering vector to restore canonical element order.
inv_twiddles (std::shared_ptr<DeviceTensor<T>>, shape [k, m], input, v0.1.0): Inverse twiddle factors.
m_inv (std::shared_ptr<DeviceTensor<T>>, shape [k], input, v0.1.0): Modular inverse of the transform size m.
log2p_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): ⌊log₂(pᵢ)⌋ values for Barrett reduction (not yet used in the default implementation).
mu_list (std::shared_ptr<DeviceTensor<T>>, shape [k], optional input, v1.0.0): Barrett constants (2²ⁿ / pᵢ).
result (std::shared_ptr<DeviceTensor<T>>, shape [l, m, r, k], output, v0.1.0): Must be pre-allocated and match the input shape.
// Inverse transform
intt<int32_t>(
transformed, p, perm, inv_twiddles, m_inv,
nullptr, nullptr, restored
);
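// Round trip: with matching p, perm, inv_twiddles, and m_inv, 'restored' should
// reproduce the coefficient-domain tensor that was originally passed to ntt.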
📝 Changelog
v1.0.0
Introduced log2p_list and mu_list for optional Barrett reduction
v0.1.0
Original implementation with the fixed [l, m, r, k] layout
📕Other Compute Operations
This chapter includes additional computational functions that are not strictly arithmetic or shape-related but are essential to support specialized FHE workloads.
Other compute operations include:
25. apply_g_decomp - Applies gadget decomposition (HE-specific operation)
26. take_along_axis - Selects values from a tensor along a specified axis using provided indices.
📑apply_g_decomp
apply_g_decomp
Introduced in v0.1.0
Renamed in v1.0.0 (formerly g_decomposition)
This function performs a positional radix decomposition of each integer value in a tensor. Each element is expressed as a sum of digits in base 2^base_bits, spread over power digits.
🧩 Call Format
apply_g_decomp<T>(a, power, base_bits, result);
T: Scalar data type (int32_t, int64_t, etc.)
All tensors are std::shared_ptr<DeviceTensor<T>>
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Input tensor of arbitrary shape to decompose.
power (size_t, input): Number of base digits to extract.
base_bits (size_t, input): Bit width of each base digit (i.e., log₂ of the base used for decomposition).
result (std::shared_ptr<DeviceTensor<T>>, output): Output tensor of shape a.shape + [power] to hold the digit decompositions.
using namespace lattica_hw_api;
// Input tensor
auto a = std::make_shared<DeviceTensor<int32_t>>(std::vector<int32_t>{13, 7}); // shape: [2]
// Parameters
size_t power = 3;
size_t base_bits = 2; // base = 2^2 = 4
// Output tensor shape = [2, 3]
auto result = std::make_shared<DeviceTensor<int32_t>>(Shape{2, 3});
// Decompose
apply_g_decomp<int32_t>(a, power, base_bits, result);
// Expected output:
// 13 = 1 + 3*4 + 0*16 → [1, 3, 0]
// 7 = 3 + 1*4 + 0*16 → [3, 1, 0]
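A host-side sketch of the per-element decomposition (least-significant digit first, non-negative inputs assumed; it reproduces the expected output shown above):
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: split one value into `power` digits of `base_bits` bits each.
std::vector<int32_t> g_decomp_reference(int32_t value, size_t power, size_t base_bits) {
    std::vector<int32_t> digits(power);
    const int32_t mask = (1 << base_bits) - 1;
    for (size_t d = 0; d < power; ++d) {
        digits[d] = value & mask;  // next base-2^base_bits digit
        value >>= base_bits;
    }
    return digits;
}
// g_decomp_reference(13, 3, 2) -> {1, 3, 0}; g_decomp_reference(7, 3, 2) -> {3, 1, 0}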
📝 Changelog
v1.0.0: Renamed from g_decomposition to apply_g_decomp. Repositioned the result parameter to the end of the parameter list for consistency with other functions.
v0.1.0: Initial version
📑take_along_axis
take_along_axis
Selects values from a tensor along a specified axis using provided indices. This function performs an axis-wise gather operation and writes the selected values to a result tensor.
Since: v1.0.0
🧩 Call Format
take_along_axis<T>(a, indices, axis, result);
axis tells which dimension you are indexing into. You can use negative numbers to count from the end; for example, -1 means the last axis.
indices.shape must match a.shape, except along axis.
The resulting shape is always the same as indices.
This does not sort; it selects values based on your index map.
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input): Source tensor to gather from.
indices (std::shared_ptr<DeviceTensor<int64_t>>, broadcast-compatible with a except along axis, input): Indices to select along axis; broadcastable to a.
axis (int64_t, input): Axis along which to take values (can be negative).
result (std::shared_ptr<DeviceTensor<T>>, same shape as a, output): Output tensor (pre-allocated).
Logic
For each coordinate in indices, selects an element from a along axis.
Supports negative axis values.
Supports negative indices (-1 means the last element).
Requires:
indices.shape broadcast-compatible with a.
result.shape == a.shape (new in v1.1.0, simplified API).
❗ Throws:
std::invalid_argument if shapes mismatch.
std::out_of_range if indices are out of bounds or the axis is invalid.
auto a = torch::tensor({5, 10, 15, 20}, torch::kInt64);
auto indices = torch::tensor({2, 0, 3, 1}, torch::kInt64);
auto a_hw = host_to_device<int64_t>(a);
auto indices_hw = host_to_device<int64_t>(indices);
auto result_hw = empty<int64_t>({4});
take_along_axis<int64_t>(a_hw, indices_hw, 0, result_hw);
// Result: [15, 5, 20, 10]
📝 Changelog
v0.1.0 - Introduced function permute, which rearranged elements of a tensor along a specified axis according to a batch-wise permutation pattern.
v1.0.0 - permute was replaced by take_along_axis, which generalizes the behavior and aligns more closely with established tensor APIs.
📗Shape Manipulation Functions
This chapter outlines operations used to manipulate the shape and structure of tensors without unnecessary data duplication. These functions are critical for enabling memory-efficient transformations during FHE program execution.
Shape manipulation functions include:
27. flatten - Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order.
28. expand - Expands tensor dimensions without copying data.
29. unsqueeze - Adds a dimension of size 1 at a specified position.
30. squeeze - Removes dimensions of size 1 from a tensor.
31. reshape - Changes tensor shape while preserving data.
32. moveaxis - Updates dims and strides metadata so that the tensor appears to have the same data but with one axis relocated.
33. get_slice - Produces a zero-copy sliced view using an index or range.
📑flatten
flatten
Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order. Commonly used to reduce rank before linear processing or output.
Since: v1.0.0
🧩 Call Format
result = flatten<T>(a, start_axis, end_axis);
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): Input tensor to flatten. Metadata is updated in-place.
start_axis (int64_t, input): Start of the axis range to flatten (inclusive). Supports negative indexing.
end_axis (int64_t, input): End of the axis range to flatten (inclusive). Must be ≥ start_axis.
📤 Returns
std::shared_ptr<DeviceTensor<T>>: the same tensor as the input, with updated shape and strides.
Logic
Flattens dimensions [start_axis, end_axis] into a single dimension.
All other dimensions remain unchanged.
Operates in-place: modifies tensor metadata but not the data buffer.
The input tensor must be contiguous; non-contiguous tensors will throw an error.
Negative axes are normalized (-1 = last axis, etc.).
Throws on invalid ranges (e.g., start > end, or axes out of bounds).
// From [2, 3, 4, 5]:
flatten(a, 1, 2) // shape becomes [2, 12, 5]
flatten(a, 0, -1) // shape becomes [120]
📝 Changelog
v1.0.0 - Initial release.
📑expand
expand
Since: v0.1.0
The expand function virtually replicates a singleton dimension of a tensor along a specified axis, modifying its shape and stride metadata without duplicating memory.
🧩 Call Format
expand<T>(a, axis, repeats);
T: Scalar data type (int32_t, int64_t, float, double)
Tensor a is modified in-place.
📥📤 Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): Tensor whose dimension will be expanded in-place.
axis (int64_t, input): Axis to expand (can be negative to count from the end).
repeats (int64_t, input): Number of times to replicate the dimension; must be positive.
Warning: this function modifies the input tensor a in-place by changing its dimensions and strides.
// a has shape [2, 1, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {2, 1, 4}, torch::kInt32));
// Expand along axis 1 (currently size 1) to make it size 3
auto expanded = expand<int32_t>(a, /*axis=*/1, /*repeats=*/3);
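Zero-copy expansion is typically realized by setting the stride of the expanded axis to 0, so every index along that axis aliases the same underlying elements. A sketch of the metadata update (field names mirror the illustrative struct in the introduction; the real DeviceTensor layout is vendor-defined):
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative metadata update for a zero-copy expand of a size-1 axis.
void expand_metadata(std::vector<int64_t>& dims, std::vector<int64_t>& strides,
                     int64_t axis, int64_t repeats) {
    if (axis < 0) axis += static_cast<int64_t>(dims.size());
    if (axis < 0 || axis >= static_cast<int64_t>(dims.size()) ||
        dims[axis] != 1 || repeats <= 0) {
        throw std::invalid_argument("expand: invalid axis or repeats");
    }
    dims[axis] = repeats;
    strides[axis] = 0;  // every index along this axis aliases the same elements
}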
📝 Changelog
v0.1.0 - Initial release.
📑unsqueeze
unsqueeze
Since: v0.1.0
The function inserts a new axis of size 1 into a tensor's shape.
This is a metadata-only operation: no data is changed, copied, or moved.
It is commonly used to align tensor shapes for broadcasting or to explicitly add batch, channel, or dimension markers.
🧩 Call Format
unsqueeze<T>(a, axis) → result
T: Scalar data type (int32_t, int64_t, float, double)
Returns: std::shared_ptr<DeviceTensor<T>>
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>): Input tensor to be reshaped. This tensor is modified in-place.
axis (int64_t): The axis at which to insert a new dimension of size 1. Supports negative indexing.
📤 Output
result (std::shared_ptr<DeviceTensor<T>>): A reference to the same tensor a, with updated shape and stride metadata reflecting the added dimension.
// Input: shape [3, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 4}, torch::kInt32));
// Insert new dimension at axis 1 → shape becomes [3, 1, 4]
auto result = unsqueeze<int32_t>(a, 1);
📝 Changelog
v0.1.0 - Initial release.
📑squeeze
squeeze
Since: v0.1.0
The function removes a dimension of size 1 at the specified axis. This is a metadata-only operation — no data is copied or moved.
It is often used after broadcasting or slicing to clean up unnecessary singleton dimensions.
🧩 Call Format
squeeze<T>(a, axis) → result
T: Scalar data type (int32_t, int64_t, float, double)
Returns: std::shared_ptr<DeviceTensor<T>>
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>): Input tensor to be reshaped. Modified in-place.
axis (int64_t): Axis to remove. Must be within the valid range and must point to a dimension of size 1. Supports negative indexing.
📤 Output
result (std::shared_ptr<DeviceTensor<T>>): The same tensor as the input, with one fewer dimension. Shape and stride metadata are updated.
// Input: shape [3, 1, 4]
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 1, 4}, torch::kInt32));
// Remove axis 1 → shape becomes [3, 4]
auto result = squeeze<int32_t>(a, 1);
📝 Changelog
v0.1.0 - Initial release.
📑reshape
reshape
Since: v0.1.0
The reshape method updates a tensor's shape and stride metadata to match a new specified shape, as long as the total number of elements remains unchanged (excluding broadcasted dimensions).
🧩 Call Format
a->reshape(new_dims)
Operates in-place: modifies the current tensor's shape and stride metadata
📥 Input Parameters
a (std::shared_ptr<DeviceTensor<T>>, input/output): The tensor to reshape. Shape and strides are modified in-place.
new_dims (std::vector<int64_t>, input): Desired new shape. The total element count must match the current tensor.
auto a = host_to_device<int32_t>(torch::arange(24, torch::kInt32).reshape({2, 3, 4}));
// Reshape from [2, 3, 4] to [6, 4]
a->reshape({6, 4});
📝 Changelog
v0.1.0 - Initial release.
📑moveaxis
moveaxis
Since: v1.0.0
The function updates the internal metadata (dims and strides) of a tensor to simulate movement of one axis to a new position, without modifying the underlying memory.
🧩 Call Format
moveaxis<T>(tensor, axis_src, axis_dst)
tensor: Tensor to update (metadata modified in-place)
axis_src: Axis to move (may be negative)
axis_dst: Target position (may be negative)
📥 Input Parameters
tensor (std::shared_ptr<DeviceTensor<T>>): Tensor to be modified in-place.
axis_src (int64_t): Source axis index (supports negative indexing).
axis_dst (int64_t): Destination axis index (supports negative indexing).
Logic
Modifies the tensor's dims and strides vectors to simulate a move of one axis.
Negative axis values are normalized using the tensor's rank.
If axis_src == axis_dst, the operation is a no-op.
Invalid axis indices raise std::invalid_argument.
❗ Error Conditions
Null pointer input → throws std::invalid_argument.
Axis indices outside the valid range → throws std::invalid_argument.
auto a_hw = host_to_device<int64_t>(torch::randint(0, 10, {2, 3, 4}, torch::kInt64));
// Move axis 2 to position 0 → shape becomes [4, 2, 3]
moveaxis<int64_t>(a_hw, /*src=*/2, /*dst=*/0);
📝 Changelog
v1.0.0 - Initial release.
📑get_slice
get_slice
Since: v1.0.0
This function produces a zero-copy view into the input tensor by modifying the metadata (shape, strides, and pointer offset) based on a slicing specification.
🧩 Call Format
get_slice<T>(input, slices) -> result;
T: Scalar data type
input: Input tensor whose metadata is modified
slices: Slice specification per axis, each entry giving either a fixed index or a range
📥 Input Parameters
input (std::shared_ptr<DeviceTensor<T>>): Tensor to slice (metadata modified in-place)
slices (std::vector<SliceArg>): Slice specification per axis (see below)
Each SliceArg can be:
int64_t: a single index to take, which collapses that axis
Slice: a struct of (start, end, step) with default step = 1, where start is inclusive, end is exclusive, and step > 0
📤 Output
std::shared_ptr<DeviceTensor<T>>: A new view of the input tensor with updated shape, strides, and offset. No memory is copied.
Logic
Performs slicing without allocating a new buffer (zero-copy).
May collapse axes when single index is selected.
All slicing rules follow PyTorch-style semantics.
Negative indices are not currently supported.
❗ Error Conditions
slices.size()
≠input.rank()
→ throwsstd::invalid_argument
Index out of bounds → throws
std::out_of_range
Invalid range (e.g.
end ≤ start
, orstep ≤ 0
) → throwsstd::invalid_argument
auto a = torch::tensor({{10,20,30,40},{50,60,70,80}}, torch::kInt32);
std::vector<SliceArg> slices = {
Slice(0, 2),
Slice(1, 3)
};
auto a_hw = host_to_device<int32_t>(a);
auto out_hw = get_slice<int32_t>(a_hw, slices);
auto out = device_to_host<int32_t>(out_hw);
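// out is a 2x2 view of rows 0-1 and columns 1-2: {{20, 30}, {60, 70}}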
📝 Changelog
v1.0.0 - Initial release.