Transcript
In Lattica, AI inference is executed through a mechanism called a transcript - a structured JSON file that describes a step-by-step sequence of operations needed to evaluate an AI model. This transcript serves as a deterministic execution plan, processed by the HEAL Runtime.
We generate a transcript from an AI model. During this generation process, we determine exactly which operations are needed and which implementation to call for each operation. For instance:
- If a hardware backend implements a function such as `key_switch`, the transcript calls the hardware-accelerated version of that function.
- If a hardware backend does not implement the function, the transcript instead points to our implementation, built from a minimal set of low-level operations.
This dynamic generation process enables each transcript to be customized for the specific capabilities of the hardware it's targeting.
To support testing and benchmarking, we also provide in HEAL a special "sandbox" transcript. Instead of retrieving real query data from a client, the sandbox transcript contains pre-encrypted input data embedded directly inside the file. This design enables:
Functionality testing of any hardware-accelerated implementation.
Performance evaluation on real AI model workloads.
Repeatable experiments without dependency on live inputs.
Transcript Structure
Each entry in the transcript is a tuple:
Where:

- `ExecutionTranscriptOpType`: Specifies the type of instruction:
  - `DEVICE_OP`
  - `SEGMENT_START`
  - `SEGMENT_END`
  - `FREE_DEVICE_TENSOR`
- `payload`: Provides the operation's full definition, including input arguments and outputs.
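As a minimal sketch, a transcript entry can be modeled as a pair. The payload field names below are assumptions for illustration, not the exact HEAL JSON schema:

```python
# Hypothetical model of a transcript entry: an (op_type, payload) pair.
# The payload field names are illustrative, not the exact HEAL schema.
entry = (
    "DEVICE_OP",                               # ExecutionTranscriptOpType
    {"name": "ntt", "args": [], "out": None},  # payload: full op definition
)

op_type, payload = entry
print(op_type)  # -> DEVICE_OP
```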
Operation Types
1. `DEVICE_OP`: Hardware-Targeted Functions

Each `DEVICE_OP` represents a function call defined in the HEAL interface specification - these are the operations that hardware vendors are expected to implement.
It consists of:
- A function `name`, such as `"ntt"`, `"host_to_device"`, or `"modmul"`
- A list of input arguments (`args`)
- A designated output (`out`)
Each function call in the transcript includes input and output arguments, all wrapped in `DeviceOpArg` objects. These arguments carry both the data and information about what type of value they represent.
The type of each argument is specified using the `DeviceOpArgType` enum. This tells the runtime how to interpret the associated value.
Below is a reference table for the possible argument types:
| Argument type | Description |
| --- | --- |
| `HOST_TENSOR` | A tensor that resides on the host (CPU). Typically used when transferring data from host to device. |
| `DEVICE_TENSOR` | A pointer (`inf_name`) to a tensor stored on the device (e.g., GPU, FPGA). Used for device-resident operations. |
| `SHAPE` | Shape information for tensors (e.g., dimensions of a matrix). |
| `INT` | A scalar integer value - often used for specifying axes, sizes, or indices. |
| `NONE` | Represents a null or intentionally missing argument. |
| `TENSOR_TYPE` | Metadata that describes the kind of tensor (e.g., int32, int64). |
| `SLICE` | A slicing specification to extract a portion of a tensor. |
| `ELLIPSIS` | Used for slicing across multiple dimensions (similar to Python's `...`). |
These functions operate on tensors and are executed directly on the hardware accelerator.
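To make the argument typing concrete, here is a small Python sketch of how a runtime might dispatch on `DeviceOpArgType` values. The enum members come from the table above; the dataclass layout and the `describe` helper are assumptions for illustration, not HEAL's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any

class DeviceOpArgType(Enum):
    # Members taken from the reference table above.
    HOST_TENSOR = auto()
    DEVICE_TENSOR = auto()
    SHAPE = auto()
    INT = auto()
    NONE = auto()
    TENSOR_TYPE = auto()
    SLICE = auto()
    ELLIPSIS = auto()

@dataclass
class DeviceOpArg:
    # Illustrative wrapper: carries the value plus its declared type.
    type: DeviceOpArgType
    value: Any

def describe(arg: DeviceOpArg) -> str:
    """Hypothetical helper: summarize how a runtime would read the arg."""
    if arg.type is DeviceOpArgType.DEVICE_TENSOR:
        return f"device tensor at inf_name={arg.value}"
    if arg.type is DeviceOpArgType.INT:
        return f"scalar int {arg.value}"
    return arg.type.name.lower()

args = [DeviceOpArg(DeviceOpArgType.DEVICE_TENSOR, 55),
        DeviceOpArg(DeviceOpArgType.INT, 2)]
print([describe(a) for a in args])
# -> ['device tensor at inf_name=55', 'scalar int 2']
```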
2. `SEGMENT_START` / `SEGMENT_END`: Device Execution Boundaries

A segment marks a block of operations that are intended to run in sequence and may benefit from hardware-level optimization. Real-world AI models include a mix of:
- Low-level, hardware-accelerated tensor operations (e.g., `ntt`, `modmul`)
- Higher-level operations or control logic that may run on the host (CPU)
By enclosing a sequence of operations between `SEGMENT_START` and `SEGMENT_END`, the transcript defines a logical execution block.
This block allows the hardware backend to analyze and optimize the segment as a whole.
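A minimal sketch of how a backend might collect the operations inside one segment for whole-block analysis. The marker names come from the description above; the entry layout and collection logic are assumptions:

```python
# Hypothetical transcript slice: the ops between SEGMENT_START and
# SEGMENT_END form one logical block the backend may optimize as a whole.
transcript = [
    ("SEGMENT_START", {}),
    ("DEVICE_OP", {"name": "ntt"}),
    ("DEVICE_OP", {"name": "modmul"}),
    ("SEGMENT_END", {}),
]

def collect_segment(entries):
    """Gather the names of the DEVICE_OPs inside the first segment."""
    block, inside = [], False
    for op_type, payload in entries:
        if op_type == "SEGMENT_START":
            inside = True
        elif op_type == "SEGMENT_END":
            break
        elif inside and op_type == "DEVICE_OP":
            block.append(payload["name"])
    return block

print(collect_segment(transcript))  # -> ['ntt', 'modmul']
```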
3. `FREE_DEVICE_TENSOR`: Explicit Memory Release

In HEAL, we take responsibility for managing device memory explicitly.
The `FREE_DEVICE_TENSOR` operation is used to indicate when a tensor's memory should be released.
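A short sketch of how a backend might honor these explicit frees. The memory-pool structure is hypothetical; only the operation name comes from the spec described above:

```python
# Hypothetical device-memory pool keyed by inf_name pointers.
pool = {55: b"...ciphertext bytes..."}

def handle_free(op_type, payload, pool):
    """Release the tensor named in a FREE_DEVICE_TENSOR entry."""
    if op_type == "FREE_DEVICE_TENSOR":
        pool.pop(payload["inf_name"], None)

handle_free("FREE_DEVICE_TENSOR", {"inf_name": 55}, pool)
print(pool)  # -> {}
```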
Encrypted Data for Testing
To enable closed-loop testing, the "sandbox" transcript includes the complete contents of tensors expected to be on the host. This includes both data sent to the device via `host_to_device` and data expected to be received back from the device via `device_to_host`.
This setup allows comparison between the actual device computation outputs and the expected results.
Since it is embedded directly in the transcript:
Hardware engineers can test functionality and performance without any additional setup.
All parties run on identical inputs, which enhances reproducibility.
Example snippet:
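As a hedged sketch of what such an embedded entry could contain (the field names, layout, and values below are assumptions, not the exact HEAL schema):

```python
# Hypothetical sandbox entry: the host-side tensor contents are embedded
# directly in the transcript, both the data to upload and the expected
# result. Field names and values here are illustrative only.
sandbox_entry = {
    "op_type": "DEVICE_OP",
    "payload": {
        "name": "host_to_device",
        "args": [{"type": "HOST_TENSOR", "value": [12345, 67890]}],
        "out": {"type": "DEVICE_TENSOR", "value": {"inf_name": 55}},
    },
    # Expected host-side result for the matching device_to_host call,
    # used to compare actual device outputs against known-good values.
    "expected": {"type": "HOST_TENSOR", "value": [11111, 22222]},
}
```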
This ensures that performance and correctness are validated using deterministic input.
How a Transcript Uses Tensors and Pointers
When the transcript runs, each tensor on the device is identified using a unique pointer called `inf_name`. This lets the runtime (and your hardware) track where each tensor lives in memory.
Here’s how it works in a typical flow:
1. `host_to_device`: The transcript begins by uploading a tensor from the host (CPU) to the device. This function allocates memory on the device and returns a pointer to it - for example, `inf_name: 55`.
2. Operations on device: Next, we perform operations (e.g., `ntt`, `reshape`, `modmul`) using that pointer. Each function receives the pointer as input so it can access the right memory.
3. `device_to_host`: When we've finished processing, we call `device_to_host` with the same pointer. This reads the result from device memory and returns it as a host tensor.
4. `free_device_tensor`: Finally, we free the device memory associated with the pointer. This helps avoid memory leaks and lets your hardware manage resources efficiently.
So in the transcript, you'll often see the same pointer (e.g., `inf_name: 55`) used across several steps, from memory allocation to final cleanup. This is how we track data through the model.
Example: Pointer Lifecycle in a Transcript
Below is a simplified excerpt from a real transcript. It shows how a tensor is uploaded to the device, used in operations, sent back to the host, and then freed:
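A hedged reconstruction of that flow in Python form (the entry layout and payload fields are assumptions; the function names and the `inf_name: 55` pointer follow the steps above):

```python
# Hypothetical lifecycle of one device tensor, tracked by inf_name 55.
lifecycle = [
    ("DEVICE_OP", {"name": "host_to_device",
                   "out": {"inf_name": 55}}),     # allocate + upload
    ("DEVICE_OP", {"name": "ntt",
                   "args": [{"inf_name": 55}]}),  # compute on device
    ("DEVICE_OP", {"name": "device_to_host",
                   "args": [{"inf_name": 55}]}),  # read result back
    ("FREE_DEVICE_TENSOR", {"inf_name": 55}),     # release memory
]

# Summarize the steps: the op name for DEVICE_OPs, the op type otherwise.
steps = [payload.get("name", op_type) for op_type, payload in lifecycle]
print(steps)
# -> ['host_to_device', 'ntt', 'device_to_host', 'FREE_DEVICE_TENSOR']
```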
This example shows how one tensor (`inf_name: 55`) is allocated, used across multiple steps, and then properly released. Every operation that interacts with the tensor simply references the pointer, making memory handling straightforward and efficient for the hardware.
Ready-to-Use Transcript for Testing
To help hardware teams get started quickly, we provide a ready-made transcript in our GitHub repository. It’s designed for sandbox testing and includes:
A real AI model execution sequence.
Encrypted input data already embedded in the file.
Calls to a wide range of HEAL functions.
This transcript allows you to test functionality and measure performance of your hardware implementation without needing to connect a client or handle live input. It’s a self-contained example that can run from start to finish using the HEAL Runtime.