# Interface Specifications

## Introduction

This document specifies the standardized interface required by hardware vendors integrating with the Homomorphic Encryption Abstraction Layer (HEAL). HEAL is designed to abstract the complexity of fully homomorphic encryption (FHE) computations, enabling efficient and scalable implementations across diverse hardware architectures.

The interface defined in this document includes essential functions categorized into:

* **Memory Management Functions:** Operations responsible for allocation, initialization, and efficient transfer of tensor data between host (CPU) and hardware devices.
* **Shape Manipulation Functions:** Operations that change the shape, layout, or dimension arrangement of tensors without copying data, enabling flexible transformations for downstream computations.
* **Tensor Value Assignments:** Utility functions that assign constant values to all elements of a tensor without changing its shape or memory layout.
* **Arithmetic Operations (Modular Arithmetic):** Essential modular arithmetic computations required in FHE workflows.
* **Modular Arithmetic Axis Operations:** Modular arithmetic computations across a specified tensor axis, combining elements using summation or product-reduction patterns.
* **NTT Transforms:** Forward and inverse Number-Theoretic Transforms (NTT/INTT) for efficient polynomial operations in the encrypted domain.
* **Other Compute Operations:** Additional tensor computations and transformations essential to specialized FHE processes.

The core data structure managed through this interface is the **tensor**, a multi-dimensional array representing polynomial coefficients and associated metadata for homomorphic computations. A detailed explanation of tensors, supported data types, shapes, memory management strategies, and data flow considerations can be found in the Memory Management and Data Structures documentation.

## 📙Memory Management Functions

This chapter defines the interface for memory management operations within the HEAL framework. These functions enable efficient allocation, initialization, and transfer of tensor data between host (CPU) memory and hardware memory.\
They ensure that data is correctly formatted, aligned, and accessible for hardware execution, serving as the foundation for all subsequent FHE computations.

Memory management functions include:

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Description</th></tr></thead><tbody><tr><td>1</td><td>zeros</td><td>Allocates <strong>initialized</strong> memory for a tensor</td></tr><tr><td>2</td><td>empty</td><td>Allocates <strong>uninitialized</strong> memory for a tensor</td></tr><tr><td>3</td><td>host_to_device</td><td>Transfers tensor data from host to device memory</td></tr><tr><td>4</td><td>device_to_host</td><td>Transfers tensor data from device to host memory</td></tr><tr><td>5</td><td>contiguous</td><td>Ensures tensor has contiguous memory layout; makes copy if needed.</td></tr></tbody></table>

***

### 📑`zeros`

<pre><code><strong>Introduced in v0.1.0
</strong>Renamed in v1.0.0 (formerly allocate_on_hardware)
</code></pre>

The function allocates memory on the device for a tensor with the specified shape. It **initializes the contents** of the allocated memory with zero values.

#### 🧩Call Format

```cpp
device_tensor = zeros<T>(dims);
```

* `T`: Scalar data type of the tensor elements (e.g., int32, int64, float32, float64, complex64, complex128)
* `dims`: A list of dimensions representing the desired shape of the tensor.
* `device_tensor`: A smart pointer to a newly allocated tensor on the device with **initialized memory**.

#### 📥 Input

| Name   | Type                   | Description                                                                                                                                                                                                                                                                                                 |
| ------ | ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `dims` | `std::vector<int64_t>` | <p> Tensor shape - list of dimension sizes. Assumes values are valid and > 0.</p><div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p>This parameter defines the tensor shape, not the data type. The data type is set by the template parameter <code>T</code> </p></div> |

#### 📤 Output

| Name            | Type                               | Description                                                                            |
| --------------- | ---------------------------------- | -------------------------------------------------------------------------------------- |
| `device_tensor` | `std::shared_ptr<DeviceTensor<T>>` | A new tensor object on the device with **initialized memory** and associated metadata. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};

// Allocate an initialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = zeros<float>(dims);
```

{% endtab %}
{% endtabs %}

#### **⚠️ Error Messages**

The function assumes:

* The shape is valid (e.g., not negative).
* Memory allocation on the device succeeds.

It performs no internal validation or exception handling.

#### 📝 Changelog

* v1.0.0: Renamed from `allocate_on_hardware` to `zeros`
* v0.1.0: Initial version

***

### 📑`empty`

*`Since: v1.0.0`*

The function allocates memory on the device for a tensor with the specified shape. It does **not initialize the contents** of the allocated memory.

This function is useful when the memory is going to be immediately overwritten by subsequent operations, allowing for faster allocation without the overhead of zero-initialization.

* Allocates memory on the device but does **not** initialize values.
* Computes default row-major strides.
* Returns a valid `DeviceTensor<T>` that can be used as the output of other operations.
* The total number of elements is computed as the product of dimensions in `dims`.
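
The element-count and stride rules above can be sketched on the host as follows. This is a minimal illustration, not the HEAL implementation: the helper names `num_elements` and `row_major_strides` are hypothetical, and strides are assumed to be counted in elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Total number of elements: the product of all dimension sizes.
// An empty dims list (0-D tensor) yields 1.
std::int64_t num_elements(const std::vector<std::int64_t>& dims) {
    std::int64_t n = 1;
    for (std::int64_t d : dims) {
        assert(d >= 0 && "dimension sizes must be non-negative");
        n *= d;
    }
    return n;
}

// Default row-major strides (assumed to be in elements): the last axis has
// stride 1, and each earlier axis steps over everything to its right.
std::vector<std::int64_t> row_major_strides(const std::vector<std::int64_t>& dims) {
    std::vector<std::int64_t> strides(dims.size(), 1);
    for (std::size_t i = dims.size(); i-- > 1;) {
        strides[i - 1] = strides[i] * dims[i];
    }
    return strides;
}
```

For `dims = {2, 3, 4}` this sketch yields 24 elements and strides `{12, 4, 1}`.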

❗ **Error Conditions**

* If memory allocation fails (`malloc` returns `nullptr`), the implementation should raise an error or return `nullptr`.
* Negative or invalid dimension values may cause incorrect behavior and should be guarded by the caller.

#### 🧩Call Format

```cpp
device_tensor = empty<T>(dims);
```

* `T`: Scalar data type of the tensor elements
* `dims`: Shape of the tensor (each dimension must be ≥ 0)
* `device_tensor`: A smart pointer to a newly allocated tensor on the device with **uninitialized memory**.

#### 📥 Input

| Name   | Type                   | Description                                                               |
| ------ | ---------------------- | ------------------------------------------------------------------------- |
| `dims` | `std::vector<int64_t>` | Tensor shape - list of dimension sizes. Assumes values are valid and > 0. |

#### 📤 Output

| Name            | Type                               | Description                                                                                          |
| --------------- | ---------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `device_tensor` | `std::shared_ptr<DeviceTensor<T>>` | Newly allocated device tensor with the given shape and inferred strides. Contents are uninitialized. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Define a 2D tensor shape
std::vector<int64_t> dims = {8, 8};

// Allocate an uninitialized float32 tensor on the device
std::shared_ptr<DeviceTensor<float>> device_tensor = empty<float>(dims);
```

{% endtab %}
{% endtabs %}

#### **📝 Changelog**

* **v1.0.0**: Initial version

***

### 📑`host_to_device`

*`Since: v0.1.0`*

Transfers data from a host-side tensor (e.g., PyTorch, NumPy) to a newly allocated device tensor suitable for computation on accelerator hardware.

<details>

<summary>What does <code>host_to_device</code> do? (Click to expand)</summary>

The function copies data from a tensor located in host (CPU) memory to a device-specific tensor allocated on accelerator hardware (e.g., GPU, ASIC, FPGA). The actual type of host and device tensors is backend-dependent:

* The **host tensor** may be any type that exposes shape and raw data access (e.g., PyTorch, NumPy).
* The **device tensor** is created as an instance of the backend-defined `DeviceTensor<T>`, where the scalar type `T` matches that of the host tensor.

The function performs the following steps:

* Allocate a new device tensor of the same shape and type.
* Copy the host tensor's data to device memory.
* Return a pointer to the newly allocated device tensor.

</details>

#### **🧩 Call Format**

```cpp
device_tensor = host_to_device<T>(host_tensor);
```

* `T`: Scalar data type (e.g., `int32_t`, `float`, etc.)
* `host_tensor`: A tensor in host memory (e.g., PyTorch, NumPy) with scalar type `T`.
* `device_tensor`: A smart pointer to a device-side representation of the tensor.

#### **📥 Input Parameters**

| Name          | Type                         | Description                                                                                                                   |
| ------------- | ---------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `host_tensor` | `TensorLike` (e.g., PyTorch) | <p>Host-side tensor with shape and data accessible for transfer<br>The data type must match template type <code>T</code>.</p> |

#### **📤 Output**

| Name            | Type                               | Description                                                  |
| --------------- | ---------------------------------- | ------------------------------------------------------------ |
| `device_tensor` | `std::shared_ptr<DeviceTensor<T>>` | Device-allocated tensor containing copied data from the host |

{% tabs %}
{% tab title="▶️ Example Usage" %}

<pre class="language-cpp" data-overflow="wrap"><code class="lang-cpp"><strong>// Create a PyTorch (illustration only) tensor with int32 data on the host (CPU)
</strong>torch::Tensor host_tensor = torch::tensor({1, 2, 3}, torch::kInt32);

// Transfer the tensor to device memory using the HEAL interface
auto device_tensor = host_to_device&#x3C;int32_t>(host_tensor);
</code></pre>

{% endtab %}
{% endtabs %}

#### **⚠️ Error Messages**

The function does not currently include explicit error handling for mismatches or null inputs. It assumes:

* The host tensor has correct and accessible data for the scalar type `T`.
* Allocation on the device succeeds.

#### 📝 Changelog

* **v0.1.0 -** Initial release.

***

### 📑 `device_to_host`

*`Since: v0.1.0`*

Transfers data from a device-side tensor to a host-side tensor, facilitating the retrieval of computation results from accelerator hardware to the host environment.

<details>

<summary>What does <code>device_to_host</code> do? (Click to expand)</summary>

The function enables the copying of data from a tensor residing in device memory (e.g., GPU, FPGA) back to a tensor in host memory (e.g., CPU). This operation is essential for accessing and utilizing the results of computations performed on accelerator hardware within the host application.

The function performs the following steps:

1. Allocate a new host tensor of the same shape and type.
2. Copy the device tensor's data to host memory.
3. Return the newly allocated host tensor containing the copied data.

</details>

#### **🧩 Call Format**

```cpp
host_tensor = device_to_host<T>(device_tensor);
```

* `T`: Scalar data type (e.g., `int32_t`, `float`)
* `device_tensor`: A `std::shared_ptr<DeviceTensor<T>>` residing on the device
* `host_tensor`: A tensor in host memory containing the copied data

#### **📥 Input Parameters**

| Name            | Type                               | Description                                                                                            |
| --------------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------ |
| `device_tensor` | `std::shared_ptr<DeviceTensor<T>>` | <p>Device-side tensor to be copied to host memory<br>Data type must match template <code>T</code>.</p> |

#### **📤 Output**

| Name          | Type         | Description                                         |
| ------------- | ------------ | --------------------------------------------------- |
| `host_tensor` | `TensorLike` | Host-side tensor containing data copied from device |

{% hint style="info" %}
*The specific type of the returned host tensor depends on the host tensor library in use (e.g., PyTorch, NumPy).*
{% endhint %}

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Assume this device tensor was created using host_to_device earlier
std::shared_ptr<DeviceTensor<int32_t>> device_tensor = ...;

// Transfer the tensor back to host memory as a PyTorch (implementation example) tensor
torch::Tensor host_tensor = device_to_host<int32_t>(device_tensor);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v0.1.0 -** Initial release.

***

### 📑`contiguous`

```
- Introduced in v0.1.0
- Renamed in v1.0.0 (formerly make_contiguous)
```

The function ensures that a tensor has a standard, contiguous memory layout.\
If the tensor is already contiguous, it returns immediately.\
If not, it creates a new memory buffer, copies the elements into contiguous layout, updates strides, and modifies the tensor in-place.

<details>

<summary>Why do we need <code>contiguous</code>? (Click to expand)</summary>

Some tensor operations, like transposing or slicing, can change the **memory layout** of a tensor without changing its shape.\
This can make the tensor **non-contiguous**, meaning the elements are not laid out sequentially in memory.

A **contiguous tensor** has elements stored in standard row-major order, without gaps or jumps in memory.

Hardware accelerators and many algorithms expect **contiguous tensors** for best performance (and sometimes for correctness).

The `contiguous` function checks whether a tensor is already contiguous:

✔ If it is, nothing changes.

✘ If not, it **creates a contiguous copy** and updates the tensor’s memory and strides.

</details>
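
The contiguity check described above can be sketched as a comparison between a tensor's strides and the strides a dense row-major layout would have. The helper name `is_row_major_contiguous` is hypothetical, and the sketch assumes shape and strides are available as plain vectors with strides counted in elements; how `DeviceTensor<T>` actually exposes them is backend-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when `strides` describe a dense row-major layout for `dims`.
// Axes of size 1 are skipped because their stride never affects addressing.
bool is_row_major_contiguous(const std::vector<std::int64_t>& dims,
                             const std::vector<std::int64_t>& strides) {
    assert(dims.size() == strides.size());
    std::int64_t expected = 1;  // stride a dense layout would have at this axis
    for (std::size_t i = dims.size(); i-- > 0;) {
        if (dims[i] == 1) continue;
        if (strides[i] != expected) return false;
        expected *= dims[i];
    }
    return true;
}
```

Under these assumptions, a `[3, 4]` tensor with strides `{4, 1}` passes the check, while its transpose (shape `[4, 3]`, strides `{1, 4}`) fails and would need to be copied into a new buffer.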

#### **🧩**Call Format

```cpp
contiguous<T>(tensor);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, `float`, `double`)
* `tensor`: `std::shared_ptr<DeviceTensor<T>>`; modified in place if it is not already contiguous.

***

#### 📥📤 Parameters

| Name     | Type                               | Direction    | Description                                      |
| -------- | ---------------------------------- | ------------ | ------------------------------------------------ |
| `tensor` | `std::shared_ptr<DeviceTensor<T>>` | Input/Output | Input tensor to be made contiguous if necessary. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Non-contiguous tensor example: result of transpose
auto a = host_to_device<int32_t>(torch::randint(0, 10, {3, 4}, torch::kInt32).transpose(0, 1));

// Make contiguous (copying into new memory if needed)
contiguous<int32_t>(a);

// After the call, 'a' now has standard contiguous memory layout
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* v1.0.0: Renamed from `make_contiguous` to `contiguous`
* v0.1.0: Initial version

***

## 📔Tensor Value Assignments

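Functions in this chapter write constant values into tensors. `set_const_val` overwrites every element of an existing tensor in place, leaving its shape and memory layout unchanged, while `pad_single_axis` writes zero-padded data into a pre-allocated output tensor whose shape is expanded along one axis.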

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>6</td><td>pad_single_axis</td><td>Appends zeros at the end of a specific axis, expanding shape.</td></tr><tr><td>7</td><td>set_const_val</td><td>Sets all elements of a tensor to a constant value; in-place, no allocation.</td></tr></tbody></table>

### 📑`pad_single_axis`

*`Since: v1.0.0`*

<details>

<summary>What does <code>pad_single_axis</code> do? (Click to expand)</summary>

The `pad_single_axis` function takes a tensor and **adds zeros at the end of a chosen axis**.

For example:

* Input: `[1, 2, 3]`, pad = 2, axis = 0 → Result: `[1, 2, 3, 0, 0]`
* Input: `[[1, 2, 3], [4, 5, 6]]`, pad = 2, axis = 1 → Result: `[[1, 2, 3, 0, 0], [4, 5, 6, 0, 0]]`

It doesn’t change the other dimensions — only the one you specify.

Negative axis values count from the end:

* `axis = -1` → last axis,
* `axis = -2` → second-to-last, etc.

**Summary:**\
Copy existing data → Add zeros at end → Output padded tensor.

</details>

#### **🧩** Call Format

```cpp
pad_single_axis<T>(a, pad, axis, result);
```

#### 📥📤Parameters

| Name     | Type                               | Shape        | Role   | Since  | Description                                                                      |
| -------- | ---------------------------------- | ------------ | ------ | ------ | -------------------------------------------------------------------------------- |
| `a`      | `std::shared_ptr<DeviceTensor<T>>` | Any shape    | Input  | v1.0.0 | Input tensor to pad.                                                             |
| `pad`    | `int64_t`                          | —            | Input  | v1.0.0 | Number of zeros to append (must be ≥ 0).                                         |
| `axis`   | `int64_t`                          | —            | Input  | v1.0.0 | Axis index along which to pad (negative values allowed, e.g., `-1` = last axis). |
| `result` | `std::shared_ptr<DeviceTensor<T>>` | Padded shape | Output | v1.0.0 | Output tensor; same as `a` but with padded axis expanded by `pad`.               |

#### Logic

* Expands the dimension at `axis` by `pad` elements.
* Copies existing values from `a` into `result`.
* Fills padded positions with zero (`0` of type `T`).
* Supports negative axis indices (`-1` = last, `-2` = second-to-last, etc.).

❗ Throws `std::invalid_argument` if:

* `pad < 0`
* Axis out of bounds.
* Input and result ranks mismatch.
* Result shape does not match expected padded dimensions.
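
The padding logic above can be sketched on a dense row-major host buffer by viewing the tensor as `[outer, dims[axis], inner]`. This is an illustrative reference only: `pad_axis_reference` is a hypothetical helper that operates on `std::vector` rather than `DeviceTensor<T>`, and it returns a new buffer instead of writing into a pre-allocated `result`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pads a dense row-major buffer `a` of shape `dims` with `pad` zeros at the
// end of `axis`. Returns the padded buffer and rewrites `dims` to the new shape.
template <typename T>
std::vector<T> pad_axis_reference(const std::vector<T>& a,
                                  std::vector<std::int64_t>& dims,
                                  std::int64_t pad, std::int64_t axis) {
    const auto rank = static_cast<std::int64_t>(dims.size());
    if (pad < 0) throw std::invalid_argument("pad must be >= 0");
    if (axis < -rank || axis >= rank) throw std::invalid_argument("axis out of bounds");
    if (axis < 0) axis += rank;  // -1 = last axis, -2 = second-to-last, ...

    // View the tensor as [outer, dims[axis], inner].
    std::int64_t outer = 1, inner = 1;
    for (std::int64_t i = 0; i < axis; ++i) outer *= dims[i];
    for (std::int64_t i = axis + 1; i < rank; ++i) inner *= dims[i];
    const std::int64_t old_len = dims[axis];
    const std::int64_t new_len = old_len + pad;
    assert(static_cast<std::int64_t>(a.size()) == outer * old_len * inner);

    // Start from all zeros, then copy each outer block's existing values;
    // the trailing pad * inner slots of every block stay zero.
    std::vector<T> out(static_cast<std::size_t>(outer * new_len * inner), T{0});
    for (std::int64_t o = 0; o < outer; ++o)
        for (std::int64_t k = 0; k < old_len * inner; ++k)
            out[o * new_len * inner + k] = a[o * old_len * inner + k];
    dims[axis] = new_len;
    return out;
}
```

With the `[[1, 2, 3], [4, 5, 6]]` input from the expandable example, `pad = 2` and `axis = -1` produce the flat buffer `1, 2, 3, 0, 0, 4, 5, 6, 0, 0` with shape `[2, 5]`.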

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// a: shape [2, 3]
// pad: 2 on axis 1 → result shape: [2, 5]
auto result = empty<int64_t>({2, 5});
pad_single_axis<int64_t>(a, 2, 1, result);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`set_const_val`

Sets **all elements of a tensor** to a given constant value. This is a utility function often used to initialize intermediate tensors or reset memory before computation.

*`Since: v1.0.0`*

#### **🧩** Call Format

```cpp
set_const_val<T>(a, val);
```

* `T`: Scalar data type (`int32_t`, `int64_t`)
* `a`: `std::shared_ptr<DeviceTensor<T>>`, modified in place
* `val`: Scalar value of type `T`

#### 📥 📤Parameters

| Name  | Type                               | Role         | Description                             |
| ----- | ---------------------------------- | ------------ | --------------------------------------- |
| `a`   | `std::shared_ptr<DeviceTensor<T>>` | Input/Output | Tensor to overwrite. Modified in-place. |
| `val` | `T`                                | Input        | Scalar value to assign to all elements. |

#### Logic

* Iterates over all elements of the tensor, replacing each with `val`.
* Supports tensors of any rank, including scalar (0D) tensors.
* Leaves the tensor shape unchanged.
* Throws:

  `std::invalid_argument` if the input tensor `a` is null.

{% tabs %}
{% tab title="▶️ Example Usage" %}

<pre class="language-cpp"><code class="lang-cpp"><strong>// This sets every element in the tensor to zero.
</strong><strong>auto hw_tensor = host_to_device&#x3C;int32_t>(torch::randint(0, 10, {10}, torch::kInt32));
</strong>set_const_val&#x3C;int32_t>(hw_tensor, 0);
</code></pre>

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

## 📘Arithmetic Operations (Modular Arithmetic)

Functions in this chapter perform element-wise or structured computations such as modular addition and modular multiplication.

All modular arithmetic functions in HEAL accept a modulus parameter `p`, which can be:

* A **scalar** (same modulus applied to all elements), or
* A **1D tensor** of shape `[k]`, where `k` matches the size of the result tensor’s last dimension.

All results are reduced **modulo `p`**, and the outcome is always in the range **`[0, p)`**, even if intermediate values (e.g., inputs or intermediate sums/products) are negative: `(-1 % 5) → 4`; `(-7 % 5) → 3`

This ensures correctness and consistency across all platforms and encryption schemes.
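
The `[0, p)` convention matters in C++ because the built-in `%` operator truncates toward zero and can return negative remainders. A minimal sketch of the normalization, using a hypothetical helper name `mod_floor` and assuming a positive modulus:

```cpp
#include <cassert>
#include <cstdint>

// Reduces `a` into [0, p). C++ `%` truncates toward zero, so a negative `a`
// gives a remainder in (-p, 0]; adding p once moves it into range.
std::int64_t mod_floor(std::int64_t a, std::int64_t p) {
    assert(p > 0 && "modulus must be positive");
    const std::int64_t r = a % p;
    return r < 0 ? r + p : r;
}
```

Here `mod_floor(-1, 5)` returns `4` and `mod_floor(-7, 5)` returns `3`, matching the examples above, whereas the raw expressions `-1 % 5` and `-7 % 5` evaluate to `-1` and `-2` in C++.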

***

### ✖️ Modular Multiplication Functions (modmul)

This section defines the modular multiplication functions supported by the HEAL interface. These functions compute element-wise `(a * b) % p` using different combinations of tensor and scalar inputs.

* `ttt`: all inputs are tensors
* `ttc`: modulus is a scalar
* `tct`: multiplier `b` is scalar
* `tcc`: both multiplier and modulus are scalars

The result is stored in a pre-allocated output tensor, which must match the expected broadcasted shape of inputs `a` and `b`.

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>8</td><td>modmul_ttt</td><td>Modular multiplication (tensor-tensor-tensor)</td></tr><tr><td>9</td><td>modmul_ttc</td><td>Modular multiplication (tensor-tensor-constant)</td></tr><tr><td>10</td><td>modmul_tct</td><td>Modular multiplication (tensor-constant-tensor)</td></tr><tr><td>11</td><td>modmul_tcc</td><td>Modular multiplication (tensor-constant-constant)</td></tr></tbody></table>

*`Since: v0.1.0`*

#### 🧩 Call Format

```cpp
// (tensor * tensor) % tensor
modmul_ttt<T>(a, b, p, result);

// (tensor * tensor) % constant
modmul_ttc<T>(a, b, p_scalar, result);

// (tensor * constant) % tensor
modmul_tct<T>(a, b_scalar, p, result);

// (tensor * constant) % constant
modmul_tcc<T>(a, b_scalar, p_scalar, result);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, etc.)
* `a`, `b`, `p`: Shared pointers to `DeviceTensor<T>` objects
* `p_scalar`, `b_scalar`: Scalar values of type `T`
* `result`: Pre-allocated output tensor (`std::shared_ptr<DeviceTensor<T>>`) on the device

#### 📥 Parameters by Function Variant

| Function     | Input A (tensor)  | Input B           | Modulus P         | Output                            |
| ------------ | ----------------- | ----------------- | ----------------- | --------------------------------- |
| `modmul_ttt` | `DeviceTensor<T>` | `DeviceTensor<T>` | `DeviceTensor<T>` | `DeviceTensor<T>` (pre-allocated) |
| `modmul_ttc` | `DeviceTensor<T>` | `DeviceTensor<T>` | `T` (scalar)      | `DeviceTensor<T>` (pre-allocated) |
| `modmul_tct` | `DeviceTensor<T>` | `T` (scalar)      | `DeviceTensor<T>` | `DeviceTensor<T>` (pre-allocated) |
| `modmul_tcc` | `DeviceTensor<T>` | `T` (scalar)      | `T` (scalar)      | `DeviceTensor<T>` (pre-allocated) |

* The `result` tensor must be pre-allocated and have a shape compatible with broadcasted inputs
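
As a reference for the expected semantics, the tensor-tensor-tensor case can be sketched on dense row-major host buffers. This is an assumption-laden illustration, not the HEAL implementation: `modmul_reference` is a hypothetical name, `a` and `b` are assumed to share the shape `[..., k]`, and `p` is assumed to be a vector of `k` positive moduli applied along the last axis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise (a * b) mod p on int32 buffers, with p indexed by the
// position along the last axis. Results are normalized into [0, p).
std::vector<std::int32_t> modmul_reference(const std::vector<std::int32_t>& a,
                                           const std::vector<std::int32_t>& b,
                                           const std::vector<std::int32_t>& p) {
    assert(a.size() == b.size());
    assert(!p.empty() && a.size() % p.size() == 0);
    std::vector<std::int32_t> result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::int64_t m = p[i % p.size()];  // modulus for this last-axis slot
        assert(m > 0);
        // Widen to 64 bits so the 32-bit product cannot overflow.
        std::int64_t r = (static_cast<std::int64_t>(a[i]) * b[i]) % m;
        if (r < 0) r += m;  // keep the result in [0, m)
        result[i] = static_cast<std::int32_t>(r);
    }
    return result;
}
```

The widening step works here only because the inputs are 32-bit. For `int64_t` tensors the product no longer fits in 64 bits, so a backend would need a wider intermediate or a dedicated reduction technique.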

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}, torch::kInt32));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}, torch::kInt32));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}, torch::kInt32));
auto result = zeros<int32_t>({3});

modmul_ttt<int32_t>(a, b, p, result);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### ➕ Modular Addition Functions (modsum)

This section defines the modular addition functions supported by the HEAL interface. These functions compute element-wise modular addition: `result[i] = (a[i] + b[i]) % p[i]`

The input can consist of tensors or scalars, and broadcasting is supported. The result is stored in a pre-allocated output tensor that must be shape-compatible with the broadcasted inputs.

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>12</td><td>modsum_ttt</td><td>Modular summation (tensor-tensor-tensor)</td></tr><tr><td>13</td><td>modsum_ttc</td><td>Modular summation (tensor-tensor-constant)</td></tr><tr><td>14</td><td>modsum_tct</td><td>Modular summation (tensor-constant-tensor)</td></tr><tr><td>15</td><td>modsum_tcc</td><td>Modular summation (tensor-constant-constant)</td></tr></tbody></table>

*`Since: v0.1.0`*

#### 🧩 Call Format

```cpp
// (tensor + tensor) % tensor
modsum_ttt<T>(a, b, p, result);

// (tensor + tensor) % constant
modsum_ttc<T>(a, b, p_scalar, result);

// (tensor + constant) % tensor
modsum_tct<T>(a, b_scalar, p, result);

// (tensor + constant) % constant
modsum_tcc<T>(a, b_scalar, p_scalar, result);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, etc.)
* `a`, `b`, `p`: Shared pointers to `DeviceTensor<T>` objects
* `p_scalar`, `b_scalar`: Scalar values of type `T`
* `result`: Pre-allocated output tensor (`std::shared_ptr<DeviceTensor<T>>`) on the device

#### 📥 Input Parameters by Function Variant

| Function     | Input A (tensor)  | Input B           | Modulus P         | Output Result                     |
| ------------ | ----------------- | ----------------- | ----------------- | --------------------------------- |
| `modsum_ttt` | `DeviceTensor<T>` | `DeviceTensor<T>` | `DeviceTensor<T>` | `DeviceTensor<T>` (pre-allocated) |
| `modsum_ttc` | `DeviceTensor<T>` | `DeviceTensor<T>` | `T` (scalar)      | `DeviceTensor<T>` (pre-allocated) |
| `modsum_tct` | `DeviceTensor<T>` | `T` (scalar)      | `DeviceTensor<T>` | `DeviceTensor<T>` (pre-allocated) |
| `modsum_tcc` | `DeviceTensor<T>` | `T` (scalar)      | `T` (scalar)      | `DeviceTensor<T>` (pre-allocated) |

* All tensors must be pre-allocated and reside in device memory

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
auto a = host_to_device<int32_t>(torch::tensor({1, 2, 3}, torch::kInt32));
auto b = host_to_device<int32_t>(torch::tensor({4, 5, 6}, torch::kInt32));
auto p = host_to_device<int32_t>(torch::tensor({7, 7, 7}, torch::kInt32));
auto result = zeros<int32_t>({3});

modsum_ttt<int32_t>(a, b, p, result);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v0.1.0 -** Initial release.

***

### ％ Modular Remainder Functions (mod)

*`Since: v1.0.0`*

These functions compute element-wise `a % b` using different combinations of tensor and scalar inputs.

The result is stored in a pre-allocated output tensor, which must match the shape of the input tensor(s).

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>16</td><td><code>mod_tt</code></td><td>tensor % tensor</td></tr><tr><td>17</td><td><code>mod_tc</code></td><td>tensor % scalar</td></tr><tr><td>18</td><td><code>mod_ct</code></td><td>scalar % tensor</td></tr></tbody></table>

#### 🧩 Call Format

```cpp
// tensor % tensor
mod_tt<T>(a, b, result);

// tensor % scalar
mod_tc<T>(a, b_scalar, result);

// scalar % tensor
mod_ct<T>(a_scalar, b, result);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, etc.)
* `a`, `b`: `std::shared_ptr<DeviceTensor<T>>`
* `a_scalar`, `b_scalar`: `int64_t`
* `result`: Pre-allocated output tensor `std::shared_ptr<DeviceTensor<T>>` on the device

#### **📥📤 Parameters**

| Function | a                 | b                 | result            |
| -------- | ----------------- | ----------------- | ----------------- |
| `mod_tt` | `DeviceTensor<T>` | `DeviceTensor<T>` | `DeviceTensor<T>` |
| `mod_tc` | `DeviceTensor<T>` | `int64_t`         | `DeviceTensor<T>` |
| `mod_ct` | `int64_t`         | `DeviceTensor<T>` | `DeviceTensor<T>` |

The result tensor must be pre-allocated and have a shape compatible with `a` and/or `b`.

#### ▶️ Example Usage

```cpp
auto a = host_to_device<int64_t>(torch::tensor({5, 10, 15}));
auto b = host_to_device<int64_t>(torch::tensor({3, 4, 5}));
auto result = empty<int64_t>({3});

mod_tt<int64_t>(a, b, result);

// Or tensor % scalar
mod_tc<int64_t>(a, 7, result);

// Or scalar % tensor
mod_ct<int64_t>(9, b, result);
```

#### **📝 Changelog**

* **v1.0.0**: Initial version of `mod_tt`, `mod_tc`, `mod_ct` functions.

***

### ➖ Modular Negation Functions (modneg)

*`Since: v1.0.0`*

Performs **modular negation**, computing:

```
(-a) % p   →   (-(a % p) + p) % p
```

This ensures the result is non-negative and lies in the range `[0, p)`.
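
For a single element, the negation rule above can be sketched as follows. `modneg_reference` is a hypothetical scalar helper for illustration; it assumes a positive modulus and is not taken from the HEAL sources.

```cpp
#include <cassert>
#include <cstdint>

// Computes (-a) mod p with the result in [0, p).
std::int64_t modneg_reference(std::int64_t a, std::int64_t p) {
    assert(p > 0 && "modulus must be positive");
    std::int64_t r = a % p;      // in (-p, p) under C++ truncation
    if (r < 0) r += p;           // a mod p, now in [0, p)
    return r == 0 ? 0 : p - r;   // additive inverse of r modulo p
}
```

For example, `modneg_reference(3, 7)` returns `4`, since `3 + 4` is a multiple of `7`, and `modneg_reference(0, 7)` returns `0` rather than `7`.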

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>19</td><td><code>modneg_tt</code></td><td> (- tensor) % tensor</td></tr><tr><td>20</td><td><code>modneg_tc</code></td><td>(- tensor) % scalar</td></tr></tbody></table>

#### 🧩 Call Formats

```cpp
// (-tensor) % tensor
modneg_tt<T>(a, p, result);

// (-tensor) % scalar
modneg_tc<T>(a, p_scalar, result);
```

#### **📥📤 Parameters**

| Name       | Type              | Role                | Description                                                                                 |
| ---------- | ----------------- | ------------------- | ------------------------------------------------------------------------------------------- |
| `a`        | `DeviceTensor<T>` | Input               | Input tensor to be negated.                                                                 |
| `p`        | `DeviceTensor<T>` | Input (`modneg_tt`) | Modulus tensor. Must be broadcastable to <code>a</code>; the modulus is applied element-wise. |
| `p_scalar` | `T`               | Input (`modneg_tc`) | Scalar modulus.                                                                             |
| `result`   | `DeviceTensor<T>` | Output              | Preallocated output tensor.                                                                 |

{% hint style="warning" %}
Shapes must be broadcast-compatible. If they are not, the function throws `std::invalid_argument`.
{% endhint %}

#### ▶️ Example Usage

```cpp
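// a_tensor, p_tensor, and result are existing int64_t device tensors
modneg_tt<int64_t>(a_tensor, p_tensor, result);  // per-element modulus
modneg_tc<int64_t>(a_tensor, 7, result);         // same modulus for all elements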
```

#### 📝 Changelog

* **v1.0.0 -** Initial release of `modneg_tt` (tensor/tensor) and `modneg_tc` (tensor/scalar)

***

## 📓 Modular Arithmetic Axis-wise

This chapter includes functions that perform **modular arithmetic along a specific axis** of a tensor. Instead of applying operations to each element one by one, these functions work across one dimension of the tensor.

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>21</td><td>axis_modsum</td><td>Sums values along a given axis and reduces them modulo <code>p</code></td></tr><tr><td>22</td><td>modmul_axis_sum</td><td>Computes a modular sum of products over a specified axis between tensors <code>a</code> and <code>b</code>, with optional permutation</td></tr></tbody></table>

### 📑`axis_modsum`

*`Since: v0.1.0`*

The function performs a modular summation along a specific axis of a tensor. This means it reduces values across that axis by summing them, then applies a modulus operation on each result, using a provided vector of moduli `p`.

This is commonly used in FHE workloads for reducing polynomials or batched data along structural axes.

<details>

<summary>What does <code>axis_modsum</code> do? (Click to expand)</summary>

For each "slice" of the tensor along the selected axis:

* It adds all values in that slice
* Then applies `% p[i]` for each element in the last dimension
* The result is written into the output tensor

#### Example

Assume you have an input tensor `a` with shape `[2, 3, 4]`:

```
a = [
  [[ 1,  2,  3,  4],
   [ 5,  6,  7,  8],
   [ 9, 10, 11, 12]],
  
  [[13, 14, 15, 16],
   [17, 18, 19, 20],
   [21, 22, 23, 24]]
]
```

And a modulus tensor `p = [11, 13, 17, 19]` (shape `[4]`, matching the last dimension).

Calling:

```cpp
axis_modsum<T>(a, p, /*axis=*/1, result);
```

Will reduce across axis 1 (i.e., over the second dimension — the rows). The output will be:

```
[
  [(1+5+9)%11, (2+6+10)%13, (3+7+11)%17, (4+8+12)%19],
  [(13+17+21)%11, (14+18+22)%13, ...]
]
```

The output shape is `[2, 4]`, same as the input shape with axis 1 removed.

#### Summary

* The reduction is performed across the selected axis
* The modulus is applied element-wise across the last dimension
* The result tensor has one fewer dimension than the input

</details>
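
The reduction described above can be sketched as a host-side reference for the rank-3, `axis = 1` case. This is a minimal illustration rather than the HEAL implementation: the function name `axis_modsum_axis1_ref` and the flat row-major `std::vector` layout are assumptions made for the example, and inputs are assumed to be non-negative with positive moduli.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side reference (not part of HEAL).
// Input a has shape [d0, d1, d2], stored row-major in a flat vector.
// Reduces over axis 1; p holds one modulus per position of the last dimension.
// Returns a flat tensor of shape [d0, d2].
std::vector<int64_t> axis_modsum_axis1_ref(const std::vector<int64_t>& a,
                                           int64_t d0, int64_t d1, int64_t d2,
                                           const std::vector<int64_t>& p) {
    if (static_cast<int64_t>(a.size()) != d0 * d1 * d2 ||
        static_cast<int64_t>(p.size()) != d2) {
        throw std::invalid_argument("shape mismatch");
    }
    std::vector<int64_t> out(d0 * d2, 0);
    for (int64_t i = 0; i < d0; ++i) {
        for (int64_t k = 0; k < d2; ++k) {
            assert(p[k] > 0);
            int64_t acc = 0;
            for (int64_t j = 0; j < d1; ++j) {
                // Reduce after every addition so the accumulator stays below p[k].
                acc = (acc + a[(i * d1 + j) * d2 + k] % p[k]) % p[k];
            }
            out[i * d2 + k] = acc;
        }
    }
    return out;
}
```

Reducing after every addition gives the same result as summing first and reducing once, while keeping the intermediate value bounded by the modulus.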

**🧩 Call Format**

```cpp
axis_modsum<T>(a, p, axis, result);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, etc.)
* All tensor arguments are `std::shared_ptr<DeviceTensor<T>>`
* `axis` is an integer index specifying the dimension to reduce

**📥📤 Parameters**

| Name     | Type                               | Direction | Description                                                                 |
| -------- | ---------------------------------- | --------- | --------------------------------------------------------------------------- |
| `a`      | `std::shared_ptr<DeviceTensor<T>>` | Input     | Input tensor. Must have shape `[..., k]` where `k = p->dims[0]`.            |
| `p`      | `std::shared_ptr<DeviceTensor<T>>` | Input     | Modulus vector of shape `[k]`, where `k` matches the last dimension of `a`. |
| `axis`   | `int64_t`                          | Input     | Axis to reduce over.                                                        |
| `result` | `std::shared_ptr<DeviceTensor<T>>` | Output    | Output tensor with shape equal to `a` with the `axis` dimension removed.    |

**▶️ Example Usage**

```cpp
auto a = host_to_device<int32_t>(torch::tensor({{1, 2}, {3, 4}}, torch::kInt32));  // [2, 2]
auto p = host_to_device<int32_t>(torch::tensor({5, 5}, torch::kInt32));           // [2]
auto result = zeros<int32_t>({2});                                 // axis=0 reduced

axis_modsum<int32_t>(a, p, /*axis=*/0, result);
```

#### 📝 Changelog

* **v0.1.0 -** Initial release.
* **v1.0.0 -** Repositioned the `result` parameter to the end of the parameter list for consistency with other functions.

***

### 📑`modmul_axis_sum`

Computes a modular sum of products over a specified axis between tensors `a` and `b`, optionally applying a permutation. This function performs elementwise modular multiplication followed by summation.

{% hint style="info" %}
Unlike most other HEAL functions, this function **reads and updates the existing values** in the `result` tensor, performing an incremental accumulation.
{% endhint %}

<details>

<summary>What does <code>modmul_axis_sum</code> do? (click to expand)</summary>

The `modmul_axis_sum` function **multiplies two tensors together along one axis, sums the results, and applies modular reduction**.

In simpler words:

* For each output position, it:
  1. Multiplies matching elements from `a` and `b`.
  2. Adds up all those multiplied values along a chosen axis.
  3. Applies a modulo (`% p`) operation to keep the result within `[0, p)`.

It’s like doing:

```
sum over i (a[...] * b[...]) % p
```

#### Example

```
a = [ [1, 2, 3],  [4, 5, 6] ]   # shape [2, 3]
b = [ [7, 8, 9],  [10, 11, 12] ] # shape [2, 3]
p = [13]                        # modulus
```

If we sum over axis 1 (columns), we do:

```
For row 0: (1*7 + 2*8 + 3*9) % 13
For row 1: (4*10 + 5*11 + 6*12) % 13
```

The result is a 1D tensor with one value per row, each reduced modulo 13.

***

#### What does `apply_perm` do?

If `apply_perm = true`, you **reorder** the positions along that axis **before multiplying and summing**, using a `perm` vector.

Example:

* If `perm = [2, 0, 1]`, it means:
  * “Position 0 becomes 2, 1 becomes 0, 2 becomes 1”
  * You apply this shuffle before doing the calculations.

</details>

*`Since: v1.0.0`*

#### **🧩** Call Format

```cpp
modmul_axis_sum<T>(a, b, p, perm, log2p_list, mu_list, apply_perm, result);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, etc.)
* All tensor arguments are `std::shared_ptr<DeviceTensor<T>>`

#### 📥📤 Parameters

| Name                                             | Type                               | Shape (axis = -1)                | Role             | Since                      | Description                                                                      |
| ------------------------------------------------ | ---------------------------------- | -------------------------------- | ---------------- | -------------------------- | -------------------------------------------------------------------------------- |
| `a`                                              | `std::shared_ptr<DeviceTensor<T>>` | `[reps, sum_size, k, n]`         | Input            | v1.0.0                     | Left input tensor.                                                               |
| `b`                                              | `std::shared_ptr<DeviceTensor<T>>` | `[sum_size, k, n]`               | Input            | v1.0.0                     | Right input tensor.                                                              |
| `p`                                              | `std::shared_ptr<DeviceTensor<T>>` | `[k]`                            | Input            | v1.0.0                     | Modulus per RNS channel.                                                         |
| `perm`                                           | `std::shared_ptr<DeviceTensor<T>>` | `[n]`                            | Input (optional) | v1.0.0                     | Permutation indices if `apply_perm` is true.                                     |
| `log2p_list`                                     | `std::shared_ptr<DeviceTensor<T>>` | `[k]`                            | Input (optional) | v1.0.0                     | Barrett log2(p) values.                                                          |
| `mu_list`                                        | `std::shared_ptr<DeviceTensor<T>>` | `[k]`                            | Input (optional) | v1.0.0                     | Barrett mu constants.                                                            |
| ~~`axis`~~                                       | ~~`int64_t`~~                      | ~~`-1` or `-3`~~                 | ~~Input~~        | v1.0.0 (removed in v1.1.0) | Previously indicated which axis (`-1` or `-3`) represented transform dimension `m`. Now the operation always uses `axis = -1`. |
| `apply_perm`                                     | `bool`                             | —                                | Input            | v1.0.0                     | Whether to apply permutation.                                                    |
| `result`                                         | `std::shared_ptr<DeviceTensor<T>>` | `[reps, k, n]` or `[reps, n, k]` | Output           | v1.0.0                     | Accumulator tensor; **updated by adding new values to existing contents.**       |

#### Logic

* For output: `result[...] = (result[...] + sum_i (a[...] * b[...]) % p) % p`

  1. Computes the sum over `i`:\
     `(a[...] * b[...]) % p`
  2. **Adds this sum to the existing value** in `result[...]`.
  3. Applies `% p` again to keep the value within modular bounds.

  **Key difference from other HEAL functions:**\
  The `result` tensor is not cleared or zero-initialized internally.\
  If you need to avoid accumulation, you **must initialize it to zero yourself** before the call.
* If `apply_perm` is true, applies the `perm` permutation before computation.
* Modular arithmetic is performed per `k`-channel, ensuring overflow safety.
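
The accumulate-into-`result` behavior can be illustrated with a small host-side reference, given here as a sketch under stated assumptions and not as the HEAL implementation. The name `modmul_axis_sum_ref` and the flat row-major vectors are hypothetical, the permutation step is omitted, inputs are assumed non-negative, and moduli are assumed to be below 2^31 so that each product fits in `int64_t`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not part of HEAL), without permutation.
// Shapes (row-major, flat): a [reps, sum_size, k, n], b [sum_size, k, n],
// p [k], result [reps, k, n]. result is read, accumulated into, and rewritten.
void modmul_axis_sum_ref(const std::vector<int64_t>& a,
                         const std::vector<int64_t>& b,
                         const std::vector<int64_t>& p,
                         int64_t reps, int64_t sum_size, int64_t k, int64_t n,
                         std::vector<int64_t>& result) {
    assert(static_cast<int64_t>(a.size()) == reps * sum_size * k * n);
    assert(static_cast<int64_t>(b.size()) == sum_size * k * n);
    assert(static_cast<int64_t>(p.size()) == k);
    assert(static_cast<int64_t>(result.size()) == reps * k * n);
    for (int64_t r = 0; r < reps; ++r) {
        for (int64_t c = 0; c < k; ++c) {
            const int64_t mod = p[c];  // one modulus per RNS channel
            for (int64_t j = 0; j < n; ++j) {
                // Start from the existing contents of result (accumulation).
                int64_t acc = result[(r * k + c) * n + j] % mod;
                for (int64_t s = 0; s < sum_size; ++s) {
                    const int64_t lhs = a[((r * sum_size + s) * k + c) * n + j] % mod;
                    const int64_t rhs = b[(s * k + c) * n + j] % mod;
                    acc = (acc + (lhs * rhs) % mod) % mod;
                }
                result[(r * k + c) * n + j] = acc;
            }
        }
    }
}
```

Calling the function twice on the same `result` adds the sum of products twice, which is why a zero-initialized `result` is needed when overwrite semantics are wanted.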

❗ Throws `std::invalid_argument` on:

* Shape mismatches.
* Non-positive moduli.
* Invalid permutation indices.

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Initialize result with zeros if you want overwrite behavior
auto result = empty<int64_t>({reps, k, n});
set_const_val<int64_t>(result, 0);

modmul_axis_sum<int64_t>(
    a, b, p, perm, nullptr, nullptr, false, result);
```

{% endtab %}
{% endtabs %}

#### 📝Changelog

* **v1.1.0 -** Removed the `axis` parameter (`-1` or `-3`), which represented transform dimension `m`. The operation now always uses `axis = -1`.
* **v1.0.0 -** Initial release.

***

## 📒**Number Theoretic Transform (NTT, INTT)** functions

This section describes the forward and inverse **Number Theoretic Transform** operations used in modular polynomial arithmetic.

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>23</td><td>ntt</td><td>Applies Number-Theoretic Transform - performs a forward transform on batched, multi-channel input tensors, converting them to the NTT domain for efficient polynomial multiplication.</td></tr><tr><td>24</td><td>intt</td><td>Applies Inverse Number-Theoretic Transform, returning the data to its original (coefficient) domain.</td></tr></tbody></table>

### 📑`ntt`

Applies the **forward Number Theoretic Transform (NTT)** on a batched, multi-channel tensor. This transform converts data from the coefficient domain into the NTT domain, enabling efficient modular polynomial multiplication.

```
- Introduced in v0.1.0
- Signature updated in v1.0.0
  — Added support for axis parameter (enabling [l, r, k, m] layout)
  — Added optional log2p_list and mu_list for Barrett reduction
```

🧩 **Call Format**

```cpp
// Forward NTT
ntt<T>(
    a,          // [l, m, r, k]
    p,          // [k]
    perm,       // [m]
    twiddles,   // [k, m]
    log2p_list, // [k] — optional (v1.0.0+)
    mu_list,    // [k] — optional (v1.0.0+)
    axis,       // required (v1.0.0+)
    skip_perm,
    result      // [l, m, r, k]
);
```

* `T`: Scalar data type (e.g., `int32_t`, `int64_t`)
* All tensor inputs are `std::shared_ptr<DeviceTensor<T>>`; `axis` is an integer and `skip_perm` is a boolean
* `result` is a pre-allocated output tensor

📥 📤**Parameters**

| Name         | Shape          | Direction        | Since  | Description                                                                           |
| ------------ | -------------- | ---------------- | ------ | ------------------------------------------------------------------------------------- |
| `a`          | `[l, m, r, k]` | Input            | v0.1.0 | Input tensor: l = left batch, m = transform length, r = right batch, k = RNS channels |
| `p`          | `[k]`          | Input            | v0.1.0 | Vector of modulus values for each RNS channel                                         |
| `perm`       | `[m]`          | Input            | v0.1.0 | Permutation vector for final reordering                                               |
| `twiddles`   | `[k, m]`       | Input            | v0.1.0 | Twiddle factors for forward transform                                                 |
| `log2p_list` | `[k]`          | Input (optional) | v1.0.0 | Precomputed log₂(pᵢ) per modulus, used for optional Barrett reduction.                |
| `mu_list`    | `[k]`          | Input (optional) | v1.0.0 | Precomputed Barrett constants (2²ⁿ / pᵢ) per modulus.                                 |
| `axis`       | `-3` or `-1`   | Input            | v1.0.0 | Which axis represents transform dimension `m`.                                        |
| `skip_perm`  | boolean        | Input            | v1.0.0 | Indicates whether to skip the permutation step.                                       |
| `result`     | `[l, m, r, k]` | Output           | v0.1.0 | Output tensor. Must be pre-allocated and match shape of input `a`.                    |
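
For readers unfamiliar with the role of `log2p_list` and `mu_list`, the following sketch shows one common form of Barrett reduction. It is an illustration only: the names `BarrettParams`, `make_barrett`, and `barrett_reduce` are hypothetical, the bit-length convention (`bits` chosen so that `2^bits > p`) is an assumption that may differ from the convention HEAL uses for its precomputed values, and the modulus is restricted to below 2^20 so that all intermediates fit in 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types (not part of HEAL).
struct BarrettParams {
    uint64_t p;     // modulus
    unsigned bits;  // bit length of p, so that 2^bits > p
    uint64_t mu;    // floor(2^(2*bits) / p), precomputed once per modulus
};

// Precomputes the constants for a modulus 1 < p < 2^20.
BarrettParams make_barrett(uint64_t p) {
    assert(p > 1 && p < (1ULL << 20));
    unsigned bits = 0;
    while ((1ULL << bits) <= p) ++bits;
    return {p, bits, (1ULL << (2 * bits)) / p};
}

// Reduces x modulo p for 0 <= x < p*p using a multiply and a shift
// in place of a division.
uint64_t barrett_reduce(uint64_t x, const BarrettParams& bp) {
    // q estimates floor(x / p); it is never larger than the true quotient.
    const uint64_t q = (x * bp.mu) >> (2 * bp.bits);
    uint64_t r = x - q * bp.p;
    // The estimate can be one too small, so correct by subtracting p.
    while (r >= bp.p) r -= bp.p;
    return r;
}
```

The benefit is that the per-modulus constants are computed once, after which every reduction inside the butterfly loops avoids a hardware division.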

#### Logic

* Executes staged butterfly operations over the specified `axis` (`-1` or `-3`).
* Uses twiddle factors and modulus values to perform modular arithmetic.
* Writes the transformed result into the `result` tensor.
* By default, it applies a final permutation using `perm`. Set `skip_perm = true` to skip this step.

❗ Throws:

* `std::invalid_argument` if axis is invalid or shapes mismatch.

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Forward transform
ntt<int32_t>(a, p, perm, twiddles, nullptr, nullptr, -3, true, result);
```

{% endtab %}
{% endtabs %}

📝 **Changelog**

* **v1.0.0**
  * Added support for `axis = -1` layout (`[l, r, k, m]`)
  * Introduced `log2p_list` and `mu_list` for optional Barrett reduction
* **v0.1.0**
  * Original implementation with fixed `[l, m, r, k]` layout

***

### 📑`intt`

Applies the **inverse Number Theoretic Transform (INTT)** to return tensors from the NTT domain back to the coefficient domain.

```
- Introduced in v0.1.0
- Signature updated in v1.0.0
  — Supports optional Barrett reduction parameters
```

**🧩 Call Format**

```cpp
// Inverse NTT
intt<T>(
    a,
    p,
    perm,
    inv_twiddles,
    m_inv,
    log2p_list,  // optional (v1.0.0+)
    mu_list,     // optional (v1.0.0+)
    result
);
```

* `T`: Scalar data type (e.g., `int32_t`, `int64_t`)
* All inputs are `std::shared_ptr<DeviceTensor<T>>`
* `result` is a pre-allocated output tensor

📥 📤**Parameters**

| Name           | Type                               | Shape          | Role             | Since  | Description                                                             |
| -------------- | ---------------------------------- | -------------- | ---------------- | ------ | ----------------------------------------------------------------------- |
| `a`            | `std::shared_ptr<DeviceTensor<T>>` | `[l, m, r, k]` | Input            | v0.1.0 | Input tensor in NTT domain.                                             |
| `p`            | `std::shared_ptr<DeviceTensor<T>>` | `[k]`          | Input            | v0.1.0 | Modulus values (one per RNS channel).                                   |
| `perm`         | `std::shared_ptr<DeviceTensor<T>>` | `[m]`          | Input            | v0.1.0 | Reordering vector to restore canonical element order.                   |
| `inv_twiddles` | `std::shared_ptr<DeviceTensor<T>>` | `[k, m]`       | Input            | v0.1.0 | Inverse twiddle factors.                                                |
| `m_inv`        | `std::shared_ptr<DeviceTensor<T>>` | `[k]`          | Input            | v0.1.0 | Modular inverse of transform size `m`.                                  |
| `log2p_list`   | `std::shared_ptr<DeviceTensor<T>>` | `[k]`          | Input (optional) | v1.0.0 | ⌊log₂(pᵢ)⌋ values for Barrett reduction (not used yet in default impl). |
| `mu_list`      | `std::shared_ptr<DeviceTensor<T>>` | `[k]`          | Input (optional) | v1.0.0 | Barrett constants (2²ⁿ / pᵢ).                                           |
| `result`       | `std::shared_ptr<DeviceTensor<T>>` | `[l, m, r, k]` | Output           | v0.1.0 | Must be preallocated and match input shape.                             |
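
To make the roles of the modulus, the twiddle factors, and `m_inv` concrete, here is a naive O(m²) sketch of a forward and inverse transform for a single channel. It is a mathematical illustration under several assumptions and not the HEAL algorithm: it uses the plain cyclic definition with a primitive `m`-th root of unity, it does not reproduce HEAL's staged butterflies, twiddle layout, or permutation step, and the names `pow_mod`, `dft_mod`, `ntt_ref`, and `intt_ref` are hypothetical. The modulus is assumed prime and below 2^31.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Modular exponentiation by squaring; assumes mod < 2^31 so products fit.
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1 % mod;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Naive cyclic transform: out[j] = sum_i in[i] * root^(i*j) mod p,
// where root is a primitive m-th root of unity modulo p.
std::vector<uint64_t> dft_mod(const std::vector<uint64_t>& in,
                              uint64_t root, uint64_t p) {
    const uint64_t m = in.size();
    std::vector<uint64_t> out(m, 0);
    for (uint64_t j = 0; j < m; ++j) {
        for (uint64_t i = 0; i < m; ++i) {
            const uint64_t tw = pow_mod(root, (i * j) % m, p);  // twiddle factor
            out[j] = (out[j] + (in[i] % p) * tw) % p;
        }
    }
    return out;
}

// Forward transform.
std::vector<uint64_t> ntt_ref(const std::vector<uint64_t>& a,
                              uint64_t root, uint64_t p) {
    return dft_mod(a, root, p);
}

// Inverse transform: uses the inverse root, then scales by m^{-1} mod p.
std::vector<uint64_t> intt_ref(const std::vector<uint64_t>& a,
                               uint64_t root, uint64_t p) {
    const uint64_t m = a.size();
    const uint64_t inv_root = pow_mod(root, p - 2, p);  // Fermat inverse, p prime
    const uint64_t m_inv = pow_mod(m % p, p - 2, p);
    std::vector<uint64_t> out = dft_mod(a, inv_root, p);
    for (auto& v : out) v = v * m_inv % p;
    return out;
}
```

For example, with `p = 17` and `m = 4`, the value `4` is a primitive fourth root of unity, and applying `intt_ref` to the output of `ntt_ref` restores the original coefficients.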

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Inverse transform
intt<int32_t>(
    transformed, p, perm, inv_twiddles, m_inv,
    nullptr, nullptr, restored
);
```

{% endtab %}
{% endtabs %}

📝 **Changelog**

* **v1.0.0**
  * Introduced `log2p_list` and `mu_list` for optional Barrett reduction
* **v0.1.0**
  * Original implementation with fixed `[l, m, r, k]` layout

***

## 📕Other Compute Operations

This chapter includes additional computational functions that are not strictly arithmetic or shape-related but are essential to support **specialized FHE workloads**.

Other compute operations include:

<table><thead><tr><th width="93">Index</th><th>Group</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>25</td><td>Other Compute Operations</td><td>apply_g_decomp_relative_to_full_q</td><td>Applies gadget decomposition relative to the full modulus product (HE-specific operation)</td></tr><tr><td>26</td><td>Other Compute Operations</td><td>take_along_axis</td><td>Selects values from a tensor along a specified axis using provided indices.</td></tr></tbody></table>

***

### 📑`apply_g_decomp_relative_to_full_q`

*`Since: v1.1.0`*

```
Replaced apply_g_decomp (removed in v1.1.0)
```

This function performs **gadget decomposition** of RNS-represented tensors into base-g digit form.\
Unlike the previous version, the decomposition is performed **relative to the full modulus product** $$Q = \prod\_i q\_i$$.

Each element is first reconstructed from its CRT residues into an integer, then decomposed into `g_exp` digits in base `2^g_base_bits`.

<details>

<summary>What does <code>apply_g_decomp_relative_to_full_q</code> do? (Click to expand)</summary>

* Input tensor `a` of shape `[reps_l, q_list_len, reps_r]` contains residues in RNS/CRT form.
* The function reconstructs each element modulo $$Q$$, where $$Q$$ is the product of the moduli $$q\_i$$ for `i` in `[0, ..., q_list_len - 1]`.
* Each element is expressed as a sum of digits in base $$2^{g\_base\_bits}$$:

$$x = d\_0 \cdot 2^{0 \cdot g\_base\_bits} + d\_1 \cdot 2^{1 \cdot g\_base\_bits} + \dots + d\_{g\_exp-1} \cdot 2^{(g\_exp-1) \cdot g\_base\_bits}$$

* The result is written into a new tensor `out` with shape `[reps_l, g_exp, reps_r]`,\
  where the second axis enumerates the digits of the decomposition.

</details>
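
The two stages described above (CRT reconstruction, then digit extraction) can be sketched for a single element as follows. This is a minimal illustration under stated assumptions, not the HEAL implementation: the names `crt_reconstruct` and `g_decomp_digits` are hypothetical, the moduli are assumed pairwise coprime with a product that fits in 64 bits, and the digits are assumed to be unsigned and ordered least significant first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Reconstructs x in [0, Q) from its residues x mod q_i by incremental sieving.
// Assumes pairwise coprime moduli and Q = prod(q_i) < 2^64.
uint64_t crt_reconstruct(const std::vector<uint64_t>& residues,
                         const std::vector<uint64_t>& q_list) {
    if (residues.size() != q_list.size()) {
        throw std::invalid_argument("length mismatch");
    }
    uint64_t x = 0;
    uint64_t Q = 1;  // product of the moduli handled so far
    for (std::size_t i = 0; i < q_list.size(); ++i) {
        // Stepping by Q preserves all earlier residues; stop once residue i matches.
        while (x % q_list[i] != residues[i] % q_list[i]) x += Q;
        Q *= q_list[i];
    }
    return x;
}

// Splits x into g_exp digits of g_base_bits bits each, least significant first,
// so that x = sum_d digits[d] * 2^(d * g_base_bits) when g_exp digits suffice.
std::vector<uint64_t> g_decomp_digits(uint64_t x, int g_exp, int g_base_bits) {
    assert(g_exp >= 0 && g_base_bits > 0 && g_base_bits < 64);
    const uint64_t mask = (1ULL << g_base_bits) - 1;
    std::vector<uint64_t> digits(g_exp);
    for (int d = 0; d < g_exp; ++d) {
        digits[d] = x & mask;
        x >>= g_base_bits;
    }
    return digits;
}
```

For instance, residues `(2, 1)` with moduli `(3, 5)` reconstruct to `11`, which splits into the base-4 digits `3` and `2` because `11 = 3 + 2 * 4`.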

#### **🧩** Call Format

```cpp
apply_g_decomp_relative_to_full_q<T, U>(a, q_list, g_exp, g_base_bits, out);
```

* `T`: scalar type of input residues (e.g., `int32_t`, `int64_t`)
* `U`: scalar type of output digits (can be same or different from `T`)
* All tensors are `std::shared_ptr<DeviceTensor<...>>`

#### 📥📤 Parameters

| Name          | Type                               | Direction | Since  | Description                                                                       |
| ------------- | ---------------------------------- | --------- | ------ | --------------------------------------------------------------------------------- |
| `a`           | `std::shared_ptr<DeviceTensor<T>>` | Input     | v1.1.0 | Input tensor of shape `[reps_l, q_list_len, reps_r]` containing RNS residues.     |
| `q_list`      | `std::shared_ptr<DeviceTensor<T>>` | Input     | v1.1.0 | Vector of RNS moduli `[q_list_len]`.                                              |
| `g_exp`       | `int`                              | Input     | v1.1.0 | Number of digits to extract in the decomposition.                                 |
| `g_base_bits` | `int`                              | Input     | v1.1.0 | Bit-width of each digit (defines base = 2^g\_base\_bits).                         |
| `out`         | `std::shared_ptr<DeviceTensor<U>>` | Output    | v1.1.0 | Output tensor of shape `[reps_l, g_exp, reps_r]` containing decomposition digits. |

***

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp

// Run gadget decomposition relative to full Q
apply_g_decomp_relative_to_full_q<int32_t, int32_t>(a, q_list, g_exp, g_base_bits, out);

// 'out' now holds the base-2^g_base_bits digits of each integer reconstructed from 'a'
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.1.0**: Introduced `apply_g_decomp_relative_to_full_q`. Replaces `apply_g_decomp` by performing decomposition relative to the full modulus Q.
* **v1.0.0**: Renamed from `g_decomposition` to `apply_g_decomp`. Repositioned the `result` parameter to the end of the parameter list for consistency with other functions.
* **v0.1.0**: Initial version

***

### 📑`take_along_axis`

Selects values from a tensor along a specified axis using provided indices. This function performs an axis-wise gather operation and writes the selected values to a result tensor.

*`Since: v1.0.0`*

<details>

<summary>What does <code>take_along_axis</code> do? (Click to expand)</summary>

This function allows you to select specific values from a tensor by specifying the exact positions (indices) you want along a chosen axis.

It’s like requesting *“From each row (or column, or depth slice), give me the element at position X.”*

***

### **1D Example**

Suppose you have a 1D tensor:

```
a = [5, 10, 15, 20]
```

And you want to select items in this order: `[2, 0, 3, 1]` → which means:\
pick the 3rd, then 1st, then 4th, then 2nd element.

You call:

```cpp
take_along_axis(a, indices = [2, 0, 3, 1], axis = 0)
```

You get:

```
[15, 5, 20, 10]
```

***

### **2D Example**

If you have a 2D tensor:

```
a = [[10, 20, 30],
     [40, 50, 60],
     [70, 80, 90]]
```

And you want to select:

```
indices = [[2, 1, 0],
           [1, 0, 2],
           [0, 2, 1]]
```

#### 🔹 Case 1: `axis = 0` → Gather **down rows**

* At each column, we choose from the **vertical direction (axis 0),** so rows are selected.
* Shape of `indices` must match `a`, and each index points to a row.

We apply:

```cpp
take_along_axis(a, indices, axis = 0)
```

Result:

```
[[70, 50, 30],  // from rows [2,1,0]
 [40, 20, 90],  // from rows [1,0,2]
 [10, 80, 60]]  // from rows [0,2,1]
```

Each column's value is selected from the specified row in that column.

#### 🔹 Case 2: `axis = 1` → Gather **across columns**

* At each row, we choose from the **horizontal direction (axis 1)** — so columns are selected.

We apply:

```cpp
take_along_axis(a, indices, axis = 1)
```

Result:

```
[[30, 20, 10],  // from columns [2,1,0]
 [50, 40, 60],  // from columns [1,0,2]
 [70, 90, 80]]  // from columns [0,2,1]
```

Each value is selected from the **same row**, just pulling the column specified by `indices`.

***

### **Another example**

Input tensor's shape = `[3, 4]` with 3 rows (axis 0) and 4 columns (axis 1):

```
a =  [[10, 20, 30, 40],
      [50, 60, 70, 80],
      [90,100,110,120]]
```

We want to **gather 2 values from each row**:

```
indices = [[2, 0],
           [1, 3],
           [3, 1]]
```

```cpp
take_along_axis(a, indices, axis=1)
```

Returns:

```
[[ 30,  10],
 [ 60,  80],
 [120, 100]]
```

The result shape matches `indices.shape`, because the **indices are driving the size along the axis** you’re gathering from.

***

</details>
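
The gather rule used in the examples above can be written out as a small host-side reference for the 2D case. This is a sketch for illustration and not the HEAL implementation: the name `take_along_axis_2d` and the nested-vector `Matrix` alias are hypothetical, and broadcasting of `indices` is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<int64_t>>;

// Hypothetical 2D reference (not part of HEAL).
// axis = 0: out[i][j] = a[indices[i][j]][j]
// axis = 1: out[i][j] = a[i][indices[i][j]]
Matrix take_along_axis_2d(const Matrix& a, const Matrix& indices, int axis) {
    if (axis < 0) axis += 2;  // normalize negative axis
    if (axis != 0 && axis != 1) throw std::out_of_range("axis out of range");
    const int64_t rows = static_cast<int64_t>(a.size());
    const int64_t cols = rows > 0 ? static_cast<int64_t>(a[0].size()) : 0;
    const int64_t extent = (axis == 0) ? rows : cols;
    if (axis == 1 && static_cast<int64_t>(indices.size()) != rows) {
        throw std::invalid_argument("indices must have one row per row of a");
    }
    Matrix out(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i) {
        if (axis == 0 && static_cast<int64_t>(indices[i].size()) != cols) {
            throw std::invalid_argument("indices must have one column per column of a");
        }
        out[i].resize(indices[i].size());
        for (std::size_t j = 0; j < indices[i].size(); ++j) {
            int64_t idx = indices[i][j];
            if (idx < 0) idx += extent;  // negative index counts from the end
            if (idx < 0 || idx >= extent) throw std::out_of_range("index out of bounds");
            out[i][j] = (axis == 0) ? a[idx][j] : a[i][idx];
        }
    }
    return out;
}
```

The output always takes the shape of `indices`; only the position along `axis` is looked up, while the other coordinate is carried over unchanged.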

#### **🧩** Call Format

```cpp
take_along_axis<T>(a, indices, axis, result);
```

* `axis` tells which dimension you’re indexing into.
* You can use negative numbers to count from the end. For example, `-1` means the last axis.
* `indices.shape` must match `a.shape`, except along `axis`.
* The resulting shape is **always the same as `indices`**.
* This does **not** sort; it **selects** values based on your index map.

#### 📥📤 Parameters

| Name      | Type                                     | Shape                                               | Role   | Description                                           |
| --------- | ---------------------------------------- | --------------------------------------------------- | ------ | ----------------------------------------------------- |
| `a`       | `std::shared_ptr<DeviceTensor<T>>`       |                                                     | Input  | Source tensor to gather from.                         |
| `indices` | `std::shared_ptr<DeviceTensor<int64_t>>` | Broadcast-compatible with `a` (except along `axis`) | Input  | Indices to select along `axis`; broadcastable to `a`. |
| `axis`    | `int64_t`                                | —                                                   | Input  | Axis along which to take values (can be negative).    |
| `result`  | `std::shared_ptr<DeviceTensor<T>>`       | Same as `a`                                         | Output | Output tensor (preallocated).                         |

#### Logic

* For each coordinate in `indices`, selects an element from `a` along `axis`.
* Supports negative `axis` values.
* Supports negative indices (`-1` means last element).
* Requires:

  * `indices.shape` broadcast-compatible with `a`.
  * `result.shape == a.shape` (new in v1.1.0, simplified API).

  ❗ Throws:

  * `std::invalid_argument` if shapes mismatch.
  * `std::out_of_range` if indices are out of bounds or axis is invalid.

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
auto a_hw = host_to_device<int64_t>(torch::tensor({5, 10, 15, 20}, torch::kInt64));
auto indices_hw = host_to_device<int64_t>(torch::tensor({2, 0, 3, 1}, torch::kInt64));
auto result_hw = empty<int64_t>({4});

take_along_axis<int64_t>(a_hw, indices_hw, 0, result_hw);
// Result: [15, 5, 20, 10]
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v0.1.0 -** Introduced function `permute`. This function rearranged elements of a tensor along a specified axis according to a batch-wise permutation pattern.
* **v1.0.0 -** Function `permute` was replaced by `take_along_axis`, which generalizes the behavior and aligns more closely with established tensor APIs.

***

## 📗Shape Manipulation Functions

This chapter outlines operations used to manipulate the shape and structure of tensors without unnecessary data duplication. These functions are critical for enabling memory-efficient transformations during FHE program execution.

Shape manipulation functions include:

<table><thead><tr><th width="93">Index</th><th width="197">Name</th><th>Short Description</th></tr></thead><tbody><tr><td>27</td><td>flatten</td><td>Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order.</td></tr><tr><td>28</td><td>expand</td><td>Expands tensor dimensions without copying data</td></tr><tr><td>29</td><td>unsqueeze</td><td>Adds a dimension of size 1 at a specified position</td></tr><tr><td>30</td><td>squeeze</td><td>Removes dimensions of size 1 from tensor</td></tr><tr><td>31</td><td>reshape</td><td>Changes tensor shape while preserving data</td></tr><tr><td>32</td><td>moveaxis</td><td>Updates dims and strides metadata so that the tensor appears to have the same data but with one axis relocated.</td></tr><tr><td>33</td><td>get_slice</td><td>Produces a zero-copy sliced view using index or range.</td></tr></tbody></table>

***

### 📑`flatten`

Collapses a range of dimensions into a single axis, reshaping the tensor while preserving element order. Commonly used to reduce rank before linear processing or output.

*`Since: v1.0.0`*

<details>

<summary>What does <code>flatten</code> do? (click to expand)</summary>

The function **collapses several dimensions of a tensor into a single dimension**, without changing the actual data, just how it’s shaped.

You tell it:

* which dimensions to flatten together (from `start_axis` to `end_axis`),
* and it will **replace that range with one combined dimension**.

#### Simple Analogy

Imagine your tensor is a box of LEGO bricks organized by color, shape, and size:

* `[2, 3, 4]` = 2 colors × 3 shapes × 4 sizes = 24 bricks.

If you flatten from axis `0` to `1`, you mix colors and shapes into one group:

* → `[6, 4]` = 6 combined color-shape combos, still 4 sizes each.

***

### Examples

#### **Flatten middle dimensions**

```cpp
// Tensor shape: [2, 3, 4, 5]
flatten(a, 1, 2) 
// Result shape: [2, 12, 5]
```

We flatten `[3, 4]` into `12`.

#### **Flatten all dimensions**

```cpp
// Tensor shape: [2, 3, 4, 5]
flatten(a, 0, -1)
// Result shape: [120]
```

All axes are collapsed into one long row.

#### **Flatten with negative indices**

```cpp
// Tensor shape: [4, 5, 6]
flatten(a, -3, -2)
// Result shape: [20, 6]
```

***

#### ⚠️ Important Notes

* **Does not change the data**, just reshapes the view of it.
* **Input must be contiguous** in memory (no transposes before flatten).
* Axes are **inclusive**, so `flatten(a, 1, 3)` flattens 3 axes, not 2.

</details>

#### **🧩** Call Format

<pre class="language-cpp"><code class="lang-cpp"><strong>a = flatten&#x3C;T>(a, start_axis, end_axis);
</strong></code></pre>

#### 📥Input Parameters

| Name         | Type                               | Role         | Description                                                             |
| ------------ | ---------------------------------- | ------------ | ----------------------------------------------------------------------- |
| `a`          | `std::shared_ptr<DeviceTensor<T>>` | Input/Output | Input tensor to flatten. Metadata is updated in-place.                  |
| `start_axis` | `int64_t`                          | Input        | Start of axis range to flatten (inclusive). Supports negative indexing. |
| `end_axis`   | `int64_t`                          | Input        | End of axis range to flatten (inclusive). Must be ≥ `start_axis`.       |

#### 📤 Returns

| Type                               | Description                                               |
| ---------------------------------- | --------------------------------------------------------- |
| `std::shared_ptr<DeviceTensor<T>>` | **Same** tensor as input, with updated shape and strides. |

#### Logic

* Flattens dimensions `[start_axis, end_axis]` into a single dimension.
* All other dimensions remain unchanged.
* Operates in-place: modifies tensor metadata but not the data buffer.
* Input tensor must be **contiguous**. Non-contiguous tensors will throw an error.
* Negative axes are normalized (`-1` = last axis, etc.).
* Throws on invalid ranges (e.g., `start > end`, or axes out of bounds).
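
The shape rule in the list above can be sketched as a pure function on the dimension list. This is an illustration only: the name `flatten_dims` is hypothetical, and the sketch covers only the dimension arithmetic, not the stride update or the contiguity check that the real function performs.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of HEAL): computes the dims after flattening
// the inclusive range [start_axis, end_axis] into a single dimension.
std::vector<int64_t> flatten_dims(const std::vector<int64_t>& dims,
                                  int64_t start_axis, int64_t end_axis) {
    const int64_t rank = static_cast<int64_t>(dims.size());
    // Normalize negative axes: -1 is the last axis.
    if (start_axis < 0) start_axis += rank;
    if (end_axis < 0) end_axis += rank;
    if (start_axis < 0 || end_axis >= rank || start_axis > end_axis) {
        throw std::invalid_argument("invalid flatten range");
    }
    // Dimensions before the range are kept as they are.
    std::vector<int64_t> out(dims.begin(), dims.begin() + start_axis);
    // The range collapses into the product of its sizes.
    int64_t merged = 1;
    for (int64_t i = start_axis; i <= end_axis; ++i) merged *= dims[i];
    out.push_back(merged);
    // Dimensions after the range are kept as they are.
    out.insert(out.end(), dims.begin() + end_axis + 1, dims.end());
    return out;
}
```

Because the element count is unchanged and the input is contiguous, only this metadata needs to change; the data buffer is untouched.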

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// From [2, 3, 4, 5]:
flatten(a, 1, 2)   // shape becomes [2, 12, 5]
flatten(a, 0, -1)  // shape becomes [120]
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`expand`

*`Since: v0.1.0`*

The `expand` function virtually replicates a singleton dimension of a tensor along a specified axis, modifying its shape and stride metadata without duplicating memory.

<details>

<summary>What does <code>expand</code> do? (Click to expand)</summary>

**Operation**

You specify:

* Which axis you want to expand (`axis`)
* How many times to repeat the dimension (`repeats`)

The selected axis must originally have size **1**, because only dimensions of size 1 can be "stretched" safely by broadcasting.

Internally, the stride along that axis becomes `0`, meaning all repeated positions point to the same memory location.

**Example**

Suppose you have a tensor with shape `[2, 1, 3]`:

```cpp
[
 [[1, 2, 3]],
 [[4, 5, 6]]
]
```

If you call:

```cpp
expand(a, axis=-2, repeats=4);
```

The shape becomes `[2, 4, 3]`.

Every value along the second axis is repeated without copying:

```cpp
[
 [[1, 2, 3],
  [1, 2, 3],
  [1, 2, 3],
  [1, 2, 3]],

 [[4, 5, 6],
  [4, 5, 6],
  [4, 5, 6],
  [4, 5, 6]]
]
```

</details>
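
The zero-stride mechanism described above can be demonstrated with a minimal view type. This is a sketch under stated assumptions and not HEAL's `DeviceTensor`: the `TensorView` struct, its `at` accessor, and `expand_view` are hypothetical names, and strides are counted in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical minimal view over a buffer it does not own (not part of HEAL).
struct TensorView {
    const int64_t* data;
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;  // in elements, one per dimension

    // Reads the element at a multi-dimensional index via the strides.
    int64_t at(const std::vector<int64_t>& index) const {
        int64_t offset = 0;
        for (std::size_t d = 0; d < dims.size(); ++d) offset += index[d] * strides[d];
        return data[offset];
    }
};

// Expands a size-1 axis to `repeats` entries by editing metadata only.
void expand_view(TensorView& t, int64_t axis, int64_t repeats) {
    const int64_t rank = static_cast<int64_t>(t.dims.size());
    if (axis < 0) axis += rank;  // normalize negative axis
    if (axis < 0 || axis >= rank || t.dims[axis] != 1 || repeats <= 0) {
        throw std::invalid_argument("expand needs a valid size-1 axis and positive repeats");
    }
    t.dims[axis] = repeats;
    t.strides[axis] = 0;  // every position along this axis maps to the same memory
}
```

With a stride of `0`, any index along the expanded axis contributes nothing to the offset, so all repeated positions read the same underlying element.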

#### **🧩** Call Format

```cpp
expand<T>(a, axis, repeats);
```

* `T`: Scalar data type (`int32_t`, `int64_t`, `float`, `double`)
* Tensor `a` is modified in-place.

#### 📥📤 Parameters

| Name      | Type                               | Direction    | Description                                                   |
| --------- | ---------------------------------- | ------------ | ------------------------------------------------------------- |
| `a`       | `std::shared_ptr<DeviceTensor<T>>` | Input/Output | Tensor whose dimension will be expanded in-place.             |
| `axis`    | `int64_t`                          | Input        | Axis to expand (can be negative to count from the end).       |
| `repeats` | `int64_t`                          | Input        | Number of times to replicate the dimension; must be positive. |

{% hint style="danger" %}
**Warning:** This function modifies the input tensor `a` in-place by changing its dimensions and strides.
{% endhint %}

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Expand along axis 1 (currently size 1) to make it size 3
auto expanded = expand<int32_t>(a, /*axis=*/1, /*repeats=*/3);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`unsqueeze`

*`Since: v0.1.0`*

The function inserts a new axis of size `1` into a tensor’s shape.\
This is a metadata-only operation: no data is changed, copied, or moved.

It is commonly used to align tensor shapes for broadcasting or to explicitly add batch, channel, or dimension markers.

<details>

<summary>How Unsqueeze Works (Click to expand)</summary>

The `unsqueeze` function **adds a new dimension of size 1** into the tensor.

Imagine a tensor of shape `[5, 10]`.\
If you call:

```cpp
unsqueeze(a, 0)
```

You insert a new leading dimension → new shape is `[1, 5, 10]`.

If you instead call:

```cpp
unsqueeze(a, -1)
```

You insert a new trailing dimension → new shape is `[5, 10, 1]`.

</details>
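
The metadata update behind this can be sketched as follows. This is an illustrative model only: `TensorMeta` and `unsqueeze_meta` are hypothetical names, and the stride chosen for the new axis is an assumption (a size-1 axis is never stepped along, so any value would address the same data).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical metadata holder for illustration; not a HEAL type.
struct TensorMeta {
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;  // measured in elements
};

// Inserts a size-1 axis at `axis`; valid positions are [-(rank+1), rank].
void unsqueeze_meta(TensorMeta& t, int64_t axis) {
    assert(t.dims.size() == t.strides.size());
    const int64_t rank = static_cast<int64_t>(t.dims.size());
    if (axis < 0) axis += rank + 1;  // -1 maps to the new trailing position
    if (axis < 0 || axis > rank) throw std::invalid_argument("axis out of range");
    // Assumed convention: reuse the extent of the axis that follows, or 1 at the end.
    const int64_t stride = (axis < rank) ? t.dims[axis] * t.strides[axis] : 1;
    t.dims.insert(t.dims.begin() + axis, 1);
    t.strides.insert(t.strides.begin() + axis, stride);
}
```

Note that negative axes are normalized against `rank + 1` rather than `rank`, because the result has one more dimension than the input.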

#### **🧩** Call Format

```cpp
unsqueeze<T>(a, axis) → result
```

* `T`: Scalar data type (`int32_t`, `int64_t`, `float`, `double`)
* Returns: `std::shared_ptr<DeviceTensor<T>>`

#### 📥 Input Parameters

| Name   | Type                               | Description                                                                        |
| ------ | ---------------------------------- | ---------------------------------------------------------------------------------- |
| `a`    | `std::shared_ptr<DeviceTensor<T>>` | Input tensor to be reshaped. This tensor is modified in-place.                     |
| `axis` | `int64_t`                          | The axis at which to insert a new dimension of size 1. Supports negative indexing. |

#### 📤 Output

| Name     | Type                               | Description                                                                                                   |
| -------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `result` | `std::shared_ptr<DeviceTensor<T>>` | A reference to the same tensor `a`, with an updated shape and stride metadata reflecting the added dimension. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Input: shape [3, 4]
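// torch::rand only supports floating-point dtypes, so use randint for int32 data.
auto a = host_to_device<int32_t>(torch::randint(/*low=*/0, /*high=*/100, {3, 4}, torch::kInt32));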

// Insert new dimension at axis 1 → shape becomes [3, 1, 4]
auto result = unsqueeze<int32_t>(a, 1);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`squeeze`&#x20;

*`Since: v0.1.0`*

The function removes a dimension of size 1 at the specified axis.\
This is a metadata-only operation — no data is copied or moved.

It is often used after broadcasting or slicing to clean up unnecessary singleton dimensions.

<details>

<summary>How Squeeze Works (Click to expand)</summary>

The `squeeze` function removes a dimension of size 1 at a specific axis. This is helpful when tensors have extra "empty" dimensions from operations like broadcasting or slicing.

For example:

If you have a tensor of shape `[3, 1, 4]` and you call:

```cpp
squeeze(a, 1)
```

The output will have shape `[3, 4]`.

Nothing is copied — the underlying data stays in place. Only shape and stride metadata is adjusted.

✔ Saves memory\
✔ Keeps tensors clean\
✔ Makes broadcasting more predictable

</details>
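
As a rough illustration of the metadata adjustment described above, the sketch below removes one entry from the shape and stride vectors. `TensorMeta` and `squeeze_meta` are hypothetical names used only for this example, and the choice of `std::invalid_argument` for every failure case is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical metadata holder for illustration; not a HEAL type.
struct TensorMeta {
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;  // measured in elements
};

// Removes the axis at `axis`, which must have size 1.
void squeeze_meta(TensorMeta& t, int64_t axis) {
    assert(t.dims.size() == t.strides.size());
    const int64_t rank = static_cast<int64_t>(t.dims.size());
    if (axis < 0) axis += rank;  // negative axes count from the end
    if (axis < 0 || axis >= rank) throw std::invalid_argument("axis out of range");
    if (t.dims[axis] != 1) throw std::invalid_argument("axis must have size 1");
    t.dims.erase(t.dims.begin() + axis);
    t.strides.erase(t.strides.begin() + axis);
}
```

Applied to dims `[3, 1, 4]` with strides `[4, 4, 1]`, `squeeze_meta(t, 1)` leaves dims `[3, 4]` and strides `[4, 1]`.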

#### **🧩** Call Format

```cpp
squeeze<T>(a, axis) → result
```

* `T`: Scalar data type (`int32_t`, `int64_t`, `float`, `double`)
* Returns: `std::shared_ptr<DeviceTensor<T>>`

#### 📥 Input Parameters

| Name   | Type                               | Description                                                                                                     |
| ------ | ---------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| `a`    | `std::shared_ptr<DeviceTensor<T>>` | Input tensor to be reshaped. Modified in-place.                                                                 |
| `axis` | `int64_t`                          | Axis to remove. Must be within valid range and must point to a dimension of size 1. Supports negative indexing. |

#### 📤 Output

| Name     | Type                               | Description                                                                            |
| -------- | ---------------------------------- | -------------------------------------------------------------------------------------- |
| `result` | `std::shared_ptr<DeviceTensor<T>>` | Same tensor as input, with one fewer dimension. Shape and stride metadata are updated. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Input: shape [3, 1, 4]
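// torch::rand only supports floating-point dtypes, so use randint for int32 data.
auto a = host_to_device<int32_t>(torch::randint(/*low=*/0, /*high=*/100, {3, 1, 4}, torch::kInt32));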

// Remove axis 1 → shape becomes [3, 4]
auto result = squeeze<int32_t>(a, 1);
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`reshape`&#x20;

*`Since: v0.1.0`*

The `reshape` method updates a tensor’s shape and stride metadata to match a new specified shape, as long as the total number of elements remains unchanged (excluding broadcasted dimensions).

<details>

<summary>How Reshape Works (Click to expand)</summary>

The `reshape` function changes how a tensor’s data is interpreted — without changing the data itself.

**Example**

Suppose you have a tensor with shape `[2, 3, 4]`:

```
[
  [[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9,10,11]],
  [[12,13,14,15], [16,17,18,19], [20,21,22,23]]
]
```

This tensor has 24 elements.\
Now call:

```cpp
a->reshape({6, 4});
```

The shape becomes `[6, 4]`, and the data is interpreted as:

```
[
 [ 0, 1, 2, 3],
 [ 4, 5, 6, 7],
 ...
 [20, 21, 22, 23]
]
```

{% hint style="warning" %}
You must **preserve the number of elements**.\
If the original had 24 elements, so must the new shape.
{% endhint %}

</details>
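
The element-count rule and the stride update can be modeled as below. This is a sketch under the assumption that the tensor is contiguous in row-major order; `TensorMeta`, `numel`, and `reshape_meta` are hypothetical names, and broadcasted (stride-0) dimensions are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical metadata holder for illustration; not a HEAL type.
struct TensorMeta {
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;  // measured in elements
};

// Total number of elements described by a shape.
int64_t numel(const std::vector<int64_t>& dims) {
    return std::accumulate(dims.begin(), dims.end(), int64_t{1},
                           std::multiplies<int64_t>());
}

// Replaces the shape and recomputes row-major strides (contiguous input assumed).
void reshape_meta(TensorMeta& t, const std::vector<int64_t>& new_dims) {
    assert(t.dims.size() == t.strides.size());
    if (numel(new_dims) != numel(t.dims))
        throw std::invalid_argument("element count must be preserved");
    t.dims = new_dims;
    t.strides.assign(new_dims.size(), 1);
    for (int64_t i = static_cast<int64_t>(new_dims.size()) - 2; i >= 0; --i)
        t.strides[i] = t.strides[i + 1] * t.dims[i + 1];
}
```

Reshaping `[2, 3, 4]` to `[6, 4]` gives strides `[4, 1]`, while a request for `[5, 5]` is rejected because 25 differs from 24.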

#### **🧩** Call Format

```cpp
a->reshape(new_dims)
```

* Operates in-place: modifies the current tensor's shape and stride metadata

#### 📥Input Parameters

| Name       | Type                               | Direction    | Description                                                           |
| ---------- | ---------------------------------- | ------------ | --------------------------------------------------------------------- |
| `a`        | `std::shared_ptr<DeviceTensor<T>>` | Input/Output | The tensor to reshape. Shape and strides will be modified in-place.   |
| `new_dims` | `std::vector<int64_t>`             | Input        | Desired new shape. Total element count must match the current tensor. |

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
// Request int32 explicitly; torch::arange(24) would otherwise produce int64.
auto a = host_to_device<int32_t>(torch::arange(24, torch::kInt32).reshape({2, 3, 4}));

// Reshape from [2, 3, 4] to [6, 4]
a->reshape({6, 4});
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`moveaxis`&#x20;

*`Since: v1.0.0`*

The function updates the internal metadata (dims and strides) of a tensor to simulate movement of one axis to a new position, without modifying the underlying memory.

<details>

<summary>What does <code>moveaxis</code> do? (Click to expand)</summary>

* `moveaxis` changes the **order of axes** in a tensor **without moving any actual data in memory**.
* It updates the tensor’s **shape (`dims`) and stride (`strides`) metadata** so that one axis appears in a new position.
* This is equivalent to **reordering dimensions**, like how PyTorch's `movedim()` or NumPy's `moveaxis()` works.

#### **How it works**

* You specify:
  * Which axis to move: `axis_src`
  * Where to move it: `axis_dst`
* Both axes can be negative — e.g., `-1` means the last axis, `-2` the second-to-last, etc.
* Internally, the function:
  * Removes the source axis from the `dims` and `strides` vectors
  * Reinserts it at the target position
* The underlying data buffer stays unchanged — only how the tensor *interprets* that data is updated.

{% hint style="info" %}
**🧠 Intuition**\
It’s like **cutting one column from a spreadsheet and pasting it in a different position**, without changing the actual cell contents.
{% endhint %}

***

### **Example**

Suppose we have a tensor with shape `[2, 3, 4]`:

```
a.shape = [2, 3, 4]
```

Now we call:

```cpp
moveaxis(a, /*axis_src=*/2, /*axis_dst=*/0)
```

This means:

* Take axis 2 (which had size 4 — the last dimension)
* Move it to the front (position 0)

The result:

```
a.shape = [4, 2, 3]
```

So the dimensions are now rearranged: what used to be the last axis is now the first.

***

</details>

#### **🧩** Call Format

```cpp
moveaxis<T>(tensor, axis_src, axis_dst)
```

* `tensor`: Tensor to update (metadata modified in-place)
* `axis_src`: Axis to move (may be negative)
* `axis_dst`: Target position (may be negative)

#### 📥Input Parameters

| Name       | Type                               | Description                                         |
| ---------- | ---------------------------------- | --------------------------------------------------- |
| `tensor`   | `std::shared_ptr<DeviceTensor<T>>` | Tensor to be modified in-place                      |
| `axis_src` | `int64_t`                          | Source axis index (supports negative indexing)      |
| `axis_dst` | `int64_t`                          | Destination axis index (supports negative indexing) |

#### Logic&#x20;

* Modifies the tensor’s `dims` and `strides` vectors to simulate a move of one axis.
* Negative axis values are normalized using the tensor’s rank.
* If `axis_src == axis_dst`, the operation is a no-op.
* Invalid axis indices will raise `std::invalid_argument`.

❗ **Error Conditions**

* Null pointer input → throws `std::invalid_argument`.
* Axis indices outside valid range → throws `std::invalid_argument`.
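
The remove-and-reinsert logic described above can be sketched on plain shape and stride vectors. `TensorMeta` and `moveaxis_meta` are hypothetical names for illustration; the sketch takes a reference, so the null-pointer check does not apply here.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical metadata holder for illustration; not a HEAL type.
struct TensorMeta {
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;  // measured in elements
};

// Moves one axis to a new position by reordering dims and strides together.
void moveaxis_meta(TensorMeta& t, int64_t src, int64_t dst) {
    assert(t.dims.size() == t.strides.size());
    const int64_t rank = static_cast<int64_t>(t.dims.size());
    if (src < 0) src += rank;  // normalize negative indices using the rank
    if (dst < 0) dst += rank;
    if (src < 0 || src >= rank || dst < 0 || dst >= rank)
        throw std::invalid_argument("axis out of range");
    if (src == dst) return;  // no-op
    const int64_t dim = t.dims[src];
    const int64_t stride = t.strides[src];
    t.dims.erase(t.dims.begin() + src);
    t.strides.erase(t.strides.begin() + src);
    t.dims.insert(t.dims.begin() + dst, dim);
    t.strides.insert(t.strides.begin() + dst, stride);
}
```

For dims `[2, 3, 4]` with strides `[12, 4, 1]`, moving axis 2 to position 0 gives dims `[4, 2, 3]` and strides `[1, 12, 4]`. The strides travel with their axes, which is why no data needs to move.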

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
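// Input: shape [2, 3, 4]
auto a = torch::arange(24, torch::kInt64).reshape({2, 3, 4});
auto a_hw = host_to_device<int64_t>(a);

// Move the last axis to the front → shape becomes [4, 2, 3]
moveaxis<int64_t>(a_hw, /*src=*/2, /*dst=*/0);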
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

### 📑`get_slice`&#x20;

*`Since: v1.0.0`*

This function produces a zero-copy **view** into the input tensor by modifying the metadata (shape, strides, and pointer offset) based on a slicing specification.

<details>

<summary>What does <code>get_slice</code> do? (Click to expand)</summary>

`get_slice` lets you select a **portion of a tensor,** like cutting out a smaller block from a larger one, without copying any data.

It works by adjusting how the tensor is *viewed*:

* No new memory is created.
* The function just updates shape and stride metadata to make it **look like a smaller tensor**.

You control the slicing with one instruction **per axis**:

* You can either **pick a single index** (removes that axis), or
* **Select a range** of elements using a `start`, `end`, and optional `step`.

***

### Types of Slice Instructions

You can use:

1. **Single Index**\
   Select one specific element along the axis and remove that axis from the shape (**virtually**, via metadata only).

   ```
   SliceArg = int64_t(2)  // pick index 2 only
   ```
2. **Range (Slice)**\
   Select multiple elements using a start, end, and optional step (default is 1).

   ```
   SliceArg = Slice(start=1, end=4, step=1)  // pick indices 1, 2, 3
   ```

***

### Examples

#### **Example 1: Slice 1D Tensor**

Suppose your tensor is:

```
a = [10, 20, 30, 40, 50]
```

```cpp
get_slice(a, { Slice(1, 4) })
```

This means: keep items from index 1 to 3 → `[20, 30, 40]`.

#### **Example 2: Use Step**

Same tensor:

```
a = [10, 20, 30, 40, 50]
```

```cpp
get_slice(a, { Slice(0, 5, 2) })
```

This picks every 2nd element → `[10, 30, 50]`

#### **Example 3: Pick a Row in 2D**

```
a = [[ 1,  2,  3],
     [ 4,  5,  6]]
```

```cpp
get_slice(a, { int64_t(1), Slice(0, 3) })  // one SliceArg per axis: row 1, all 3 columns
```

This picks **row 1** → `[4, 5, 6]`\
(The output is now 1D - the row axis is collapsed.)

#### **Example 4: Select Sub-Block**

```
a = [[10, 20, 30, 40],
     [50, 60, 70, 80]]
```

```cpp
get_slice(a, { Slice(0,2), Slice(1,3) })
```

* Rows 0 and 1 → keep both rows
* Columns 1 and 2 → keep 20, 30 and 60, 70

Result:

```
[[20, 30],
 [60, 70]]
```

#### **Example 5: Complex Case (3D)**

Imagine a 3D tensor shaped `[2, 3, 4]`, like 2 blocks of 3 rows × 4 columns:

```cpp
get_slice(a, {
    int64_t(1),          // Pick block 1 → shape becomes [3, 4]
    Slice(0, 3, 2),      // Rows: take indices 0 and 2 → now shape is [2, 4]
    Slice(1, 4)          // Columns: take indices 1 to 3 → final shape is [2, 3]
})
```

Final result:

* Block 1
* Rows: 0 and 2
* Columns: 1, 2, 3

{% hint style="info" %}
💡 This is like selecting a submatrix or zoomed-in region of a larger tensor - no memory is moved, but the tensor *behaves* like a smaller view.
{% endhint %}

</details>

#### **🧩** Call Format

```cpp
get_slice<T>(input, slices) -> result;
```

* `T`: Scalar data type
* `input`: Input tensor whose metadata is modified
* `slices`: One slice instruction per axis, each specifying either a fixed index or a range

#### 📥Input Parameters

| Name     | Type                               | Description                                  |
| -------- | ---------------------------------- | -------------------------------------------- |
| `input`  | `std::shared_ptr<DeviceTensor<T>>` | Tensor to slice (metadata modified in-place) |
| `slices` | `std::vector<SliceArg>`            | Slice specification per axis (see below)     |

Each `SliceArg` can be:

* `int64_t`: Take a single index → collapses that axis
* `Slice`: A struct of `(start, end, step)` (default `step=1`), with:
  * `start` (inclusive)
  * `end` (exclusive)
  * `step` > 0

#### 📤 Output

| Type                               | Description                                                                                  |
| ---------------------------------- | -------------------------------------------------------------------------------------------- |
| `std::shared_ptr<DeviceTensor<T>>` | A new view of the input tensor with updated shape, strides, and offset. No memory is copied. |

#### Logic

* Performs slicing without allocating a new buffer (zero-copy).
* Collapses an axis when a single index is selected for it.
* All slicing rules follow PyTorch-style semantics.
* Negative indices are not currently supported.

❗ **Error Conditions**

* `slices.size()` ≠ `input.rank()` → throws `std::invalid_argument`
* Index out of bounds → throws `std::out_of_range`
* Invalid range (e.g. `end ≤ start`, or `step ≤ 0`) → throws `std::invalid_argument`
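
To make the rules above concrete, the following sketch computes the shape, strides, and offset of a sliced view on plain metadata. It is an illustration only: `View` and `slice_meta` are hypothetical names, and the `Slice` and `SliceArg` definitions here are simplified stand-ins that may differ from the real HEAL types.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Simplified stand-ins for illustration; the real HEAL types may differ.
struct Slice {
    int64_t start, end, step;
    Slice(int64_t s, int64_t e, int64_t st = 1) : start(s), end(e), step(st) {}
};
struct SliceArg {
    bool is_index;
    int64_t index;
    Slice range;
    SliceArg(int64_t i) : is_index(true), index(i), range(0, 0, 1) {}
    SliceArg(const Slice& s) : is_index(false), index(0), range(s) {}
};
// Hypothetical view metadata: shape, strides, and starting offset (in elements).
struct View {
    std::vector<int64_t> dims;
    std::vector<int64_t> strides;
    int64_t offset;
};

View slice_meta(const View& in, const std::vector<SliceArg>& slices) {
    if (slices.size() != in.dims.size())
        throw std::invalid_argument("one SliceArg per axis is required");
    View out{{}, {}, in.offset};
    for (std::size_t ax = 0; ax < slices.size(); ++ax) {
        const SliceArg& s = slices[ax];
        if (s.is_index) {
            if (s.index < 0 || s.index >= in.dims[ax])
                throw std::out_of_range("index out of bounds");
            out.offset += s.index * in.strides[ax];  // axis is collapsed
        } else {
            const Slice& r = s.range;
            if (r.step <= 0 || r.end <= r.start)
                throw std::invalid_argument("invalid range");
            if (r.start < 0 || r.end > in.dims[ax])
                throw std::out_of_range("range out of bounds");
            out.offset += r.start * in.strides[ax];
            out.dims.push_back((r.end - r.start + r.step - 1) / r.step);  // ceiling division
            out.strides.push_back(in.strides[ax] * r.step);
        }
    }
    return out;
}
```

For a contiguous `[2, 3, 4]` input (strides `[12, 4, 1]`), the arguments `{ int64_t(1), Slice(0, 3, 2), Slice(1, 4) }` produce a view with dims `[2, 3]`, strides `[8, 1]`, and offset `13`.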

{% tabs %}
{% tab title="▶️ Example Usage" %}

```cpp
auto a = torch::tensor({{10, 20, 30, 40}, {50, 60, 70, 80}}, torch::kInt32);
std::vector<SliceArg> slices = {
    Slice(0, 2),  // rows 0 and 1
    Slice(1, 3)   // columns 1 and 2
};
auto a_hw   = host_to_device<int32_t>(a);
auto out_hw = get_slice<int32_t>(a_hw, slices);
auto out    = device_to_host<int32_t>(out_hw);  // [[20, 30], [60, 70]]
```

{% endtab %}
{% endtabs %}

#### 📝 Changelog

* **v1.0.0 -** Initial release.

***

