
The dtype contract

The Neat framework's pipelines have a deliberately small dtype vocabulary at their public boundaries:

  • Inputs to preprocess: FP32 (or images that the framework converts to FP32 before feeding the model).
  • Inside the MLA: either INT8 (with quant/dequant at the boundary) or BF16 (no quant/dequant — straight through).
  • Outputs from postprocess: FP32, ready for the application.
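Here is a minimal sketch of the quantize/dequantize pair that the INT8 case implies at the MLA boundary. It assumes symmetric per-tensor quantization with a single `scale` and a zero point of 0. The real quantization parameters come from the model's MPK manifest and may differ, for example with per-channel scales or a nonzero zero point. `quantize_int8` and `dequantize_int8` are hypothetical names, not framework API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: symmetric per-tensor INT8 quantization (FP32 -> INT8).
// Assumptions: one scale for the whole tensor, zero point fixed at 0,
// finite inputs.
std::vector<int8_t> quantize_int8(const std::vector<float>& x, float scale) {
    assert(scale > 0.0f);
    std::vector<int8_t> q;
    q.reserve(x.size());
    for (float v : x) {
        // Round to nearest (default rounding mode), then saturate to INT8.
        float r = std::nearbyint(v / scale);
        r = std::clamp(r, -128.0f, 127.0f);
        q.push_back(static_cast<int8_t>(r));
    }
    return q;
}

// Hypothetical inverse (INT8 -> FP32) under the same assumptions.
std::vector<float> dequantize_int8(const std::vector<int8_t>& q, float scale) {
    std::vector<float> x;
    x.reserve(q.size());
    for (int8_t v : q) x.push_back(static_cast<float>(v) * scale);
    return x;
}
```

With `scale = 0.5f`, the input `{0.0f, 0.5f, -1.0f, 100.0f}` quantizes to `{0, 1, -2, 127}`. The last value saturates, so it dequantizes to `63.5f` rather than `100.0f`. Under the contract above, application code never calls anything like this. The quantize and dequantize steps stay inside the framework.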

Everything else — quantization, tessellation, layout conversion — is an internal transformation the framework inserts when the model's MPK manifest demands it. This page explains the four corners of that contract and how the planner picks the right preprocess graph family.

The four cases

A model's MPK contract tells the framework two things about its first MLA stage: the MLA input dtype (BF16 or INT8) and whether MLA-side tessellation is part of the compiled kernel. That gives four combinations:

MLA dtype   MLA tess   Preprocess graph family   What the framework inserts before the MLA
---------   --------   -----------------------   -----------------------------------------
BF16        yes        Preproc                   Resize, color convert, normalize. The MLA stage tessellates internally.
BF16        no         Tess                      Resize, color convert, normalize, tessellate.
INT8        yes        Quant                     Resize, color convert, normalize, quantize. The MLA stage tessellates internally.
INT8        no         QuantTess                 Resize, color convert, normalize, quantize, tessellate.

The planner picks one of these four PreprocessGraphFamily values when it builds the preprocess Node. See PreprocessGraphFamily in the C++ reference for the values, and ResolvedPreprocessPlan for the field that holds the chosen family.
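The table comes down to two independent yes/no decisions. The planner inserts a quantize step when the MLA input is INT8, and a tess step when the compiled kernel does not tessellate. The sketch below shows that selection logic. `MlaDtype` and `pick_preprocess_family` are hypothetical names. The enum copies only the four family names from the table and does not reproduce the framework's actual declaration.

```cpp
#include <cassert>

// Hypothetical types mirroring the four-case table. Only the four family
// names come from this page; the real enum is in the C++ reference.
enum class MlaDtype { BF16, INT8 };
enum class PreprocessGraphFamily { Preproc, Tess, Quant, QuantTess };

// mla_tessellates: true when the MLA's compiled kernel tessellates
// internally (the "MLA tess" column).
constexpr PreprocessGraphFamily pick_preprocess_family(MlaDtype mla_dtype,
                                                       bool mla_tessellates) {
    const bool insert_quantize = (mla_dtype == MlaDtype::INT8);
    const bool insert_tess = !mla_tessellates;  // only if the MLA won't do it
    if (insert_quantize && insert_tess) return PreprocessGraphFamily::QuantTess;
    if (insert_quantize) return PreprocessGraphFamily::Quant;
    if (insert_tess) return PreprocessGraphFamily::Tess;
    return PreprocessGraphFamily::Preproc;
}

// Each row of the table, checked at compile time.
static_assert(pick_preprocess_family(MlaDtype::BF16, true) == PreprocessGraphFamily::Preproc);
static_assert(pick_preprocess_family(MlaDtype::BF16, false) == PreprocessGraphFamily::Tess);
static_assert(pick_preprocess_family(MlaDtype::INT8, true) == PreprocessGraphFamily::Quant);
static_assert(pick_preprocess_family(MlaDtype::INT8, false) == PreprocessGraphFamily::QuantTess);
```

Because the two decisions are independent, there are exactly four families. Each combination of "insert quantize?" and "insert tess?" maps to one graph.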

What "tessellation" means here

Tessellation is the tile-shuffle that arranges a tensor into the geometry the MLA's input scratchpad expects. It's a pure layout transformation — same bytes, different order. The planner inserts a tess node only when the MLA's compiled kernel does not include tessellation in its first op (the "MLA tess" column above).
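Tessellation only reorders elements, so you can model it as an index permutation with an exact inverse. The sketch below assumes a single-channel H×W tensor, stored row-major, split into fixed TH×TW tiles that are laid out one after another. The real scratchpad geometry is defined by the MLA kernel and is not described on this page, so `tessellate`, `detessellate`, and the tile shape are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative tile-shuffle, not the MLA's actual layout: make each
// th x tw tile of a row-major h x w tensor contiguous, with tiles in
// row-major order. Assumes h % th == 0 and w % tw == 0.
template <typename T>
std::vector<T> tessellate(const std::vector<T>& src, std::size_t h, std::size_t w,
                          std::size_t th, std::size_t tw) {
    assert(src.size() == h * w && h % th == 0 && w % tw == 0);
    std::vector<T> dst;
    dst.reserve(src.size());
    for (std::size_t ty = 0; ty < h / th; ++ty)          // tile row
        for (std::size_t tx = 0; tx < w / tw; ++tx)      // tile column
            for (std::size_t y = 0; y < th; ++y)         // row inside the tile
                for (std::size_t x = 0; x < tw; ++x)     // column inside the tile
                    dst.push_back(src[(ty * th + y) * w + (tx * tw + x)]);
    return dst;
}

// Inverse permutation (detessellation): scatter each tile element back
// to its natural row-major position.
template <typename T>
std::vector<T> detessellate(const std::vector<T>& tiled, std::size_t h, std::size_t w,
                            std::size_t th, std::size_t tw) {
    assert(tiled.size() == h * w && h % th == 0 && w % tw == 0);
    std::vector<T> dst(tiled.size());
    std::size_t i = 0;
    for (std::size_t ty = 0; ty < h / th; ++ty)
        for (std::size_t tx = 0; tx < w / tw; ++tx)
            for (std::size_t y = 0; y < th; ++y)
                for (std::size_t x = 0; x < tw; ++x)
                    dst[(ty * th + y) * w + (tx * tw + x)] = tiled[i++];
    return dst;
}
```

For a 4×4 tensor holding 0..15 with 2×2 tiles, `tessellate` produces `{0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15}`. The values are the same and only their order changes. `detessellate` then restores the original order.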

Detessellation is the mirror step. The planner inserts it after the MLA stage only if the MLA's compiled output kernel doesn't include detess. The same four-case table applies on the output side, with Detess, Dequant, DetessDequant, and passthrough as the dual operations of Tess, Quant, QuantTess, and Preproc.
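The output side makes the same two decisions in mirror form: dequantize instead of quantize, and detess instead of tess. Below is a minimal sketch, assuming the dequantize decision depends on whether the MLA output is INT8. `OutputBoundary` and `pick_output_boundary` are hypothetical names for the four cases named above.

```cpp
#include <cassert>

// Hypothetical names for the output-side duals.
enum class MlaDtype { BF16, INT8 };
enum class OutputBoundary { Passthrough, Detess, Dequant, DetessDequant };

// mla_detessellates: true when the MLA's compiled output kernel already
// includes detess.
constexpr OutputBoundary pick_output_boundary(MlaDtype mla_out_dtype,
                                              bool mla_detessellates) {
    const bool insert_dequant = (mla_out_dtype == MlaDtype::INT8);
    const bool insert_detess = !mla_detessellates;
    if (insert_dequant && insert_detess) return OutputBoundary::DetessDequant;
    if (insert_dequant) return OutputBoundary::Dequant;
    if (insert_detess) return OutputBoundary::Detess;
    return OutputBoundary::Passthrough;
}
```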

Boundary upgrades — Generic Preproc and BoxDecode

Two upgrades the planner can apply on top of the four-case decision:

  • Generic Preproc: when the application supplies arbitrary user-defined transforms (PreprocessOptions::transforms), the planner upgrades the chosen graph family to a "generic" variant that fuses those transforms with the standard preprocess. The contract at the MLA boundary doesn't change; the upgrade only affects what runs before the boundary.
  • BoxDecode: a postprocess upgrade that fuses decode and NMS steps for detection models. The application sees it as a BoxDecodeType and a DetectionMeta on the output sample. See BoxDecodeType.h and the BoxDecode how-to.

Both upgrades preserve the FP32-in / FP32-out vocabulary at the application boundary; they only change which kernels run inside the framework.
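The Generic Preproc upgrade sits on top of the four-case decision rather than replacing it. The hypothetical sketch below shows that layering. It models the upgrade as a flag next to an unchanged family, and it models a user transform as a `std::function` over an FP32 buffer. Both are illustration choices. Neither is the actual shape of PreprocessOptions::transforms or of the planner's "generic" variants.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class PreprocessGraphFamily { Preproc, Tess, Quant, QuantTess };

// Hypothetical stand-in for a user-defined transform on an FP32 buffer.
using Transform = std::function<void(std::vector<float>&)>;

// Hypothetical plan record: the family from the four-case table plus a
// flag recording whether the Generic Preproc upgrade was applied.
struct PlannedPreprocess {
    PreprocessGraphFamily family;
    bool generic = false;
};

// The upgrade only affects work before the MLA boundary, so the family
// chosen by the four-case table is passed through unchanged.
PlannedPreprocess apply_generic_upgrade(PreprocessGraphFamily family,
                                        const std::vector<Transform>& user_transforms) {
    return PlannedPreprocess{family, !user_transforms.empty()};
}
```

In this model, supplying any user transform flips `generic` to true, while `family` stays whatever the MLA contract dictated. That matches the point above: the boundary contract doesn't change.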

Application-visible consequences

What the dtype contract means in practice for application code:

  • Sample-level dtype is FP32 at every public boundary. The application writes FP32 inputs into samples and reads FP32 outputs from them. INT8/BF16 only ever exists inside the framework.
  • The application never sees tessellated tensors at the public API. Tessellation/detess is an internal layout — Tensor objects you push or pull are always in their natural layout (HWC, CHW, etc.).
  • Conversion costs are visible only via tracing. If you want to know whether the planner inserted a quantize or tess kernel, enable a ConversionTraceCollector. Each insertion shows up as a ConversionTrace entry with its ConversionKind.
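To show what kind of question the trace answers, here is a self-contained stand-in for it. It is a list of entries, each tagged with a kind, and you can query it to see whether the planner inserted a quantize or tess kernel. `TraceLog`, `TraceEntry`, the `node` field, and the enumerators of this local `ConversionKind` are all assumptions made for illustration. Consult the ConversionTraceCollector and ConversionTrace reference for the real interface.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Assumed enumerators; the framework's ConversionKind may differ.
enum class ConversionKind { Quantize, Dequantize, Tessellate, Detessellate };

// Hypothetical stand-in for a ConversionTrace entry.
struct TraceEntry {
    ConversionKind kind;
    std::string node;  // hypothetical: where the kernel was inserted
};

// Hypothetical stand-in for a ConversionTraceCollector.
class TraceLog {
public:
    void record(ConversionKind kind, std::string node) {
        entries_.push_back({kind, std::move(node)});
    }
    // True if any recorded insertion has the given kind.
    bool inserted(ConversionKind kind) const {
        return std::any_of(entries_.begin(), entries_.end(),
                           [kind](const TraceEntry& e) { return e.kind == kind; });
    }
    const std::vector<TraceEntry>& entries() const { return entries_; }

private:
    std::vector<TraceEntry> entries_;
};
```

Reading the four-case table this way, an INT8 model whose kernel tessellates internally (the Quant family) should leave a quantize entry on the input side but no tess entry.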

Further reading

  • "Tessellation, quant, and cast" — §17 of the design deep dive (Architecture).
  • "Input planner" — §82 of the design deep dive.