
LatencyProfiler.h File

Per-kernel latency profiler (V1 surface).

Included Headers

#include "pipeline/Run.h"
#include <cstdint>
#include <string>
#include <vector>

Namespaces Index

namespace simaai
namespace neat

Classes Index

struct ProfilerKernelInvocation

One kernel-invocation telemetry event.

struct ProfilerMemcpySite

Aggregate counters for one instrumented memcpy site.

struct ProfilerKernelAggregate

Aggregated timings for one (backend, kernel, stage, slot) tuple.

struct ProfilerReport

Snapshot bundle returned by LatencyProfiler::finalize().

struct LatencyProfilerOptions

Construction options for LatencyProfiler.

class LatencyProfiler

Per-sample latency tracker; attach to a Run to capture timing telemetry.

Description

Per-kernel latency profiler (V1 surface).

Attach a LatencyProfiler to a simaai::neat::Run (or to a Session that has produced one) BEFORE you start pushing frames. After your run loop, call finalize() to get a ProfilerReport and pass it to to_text() / to_chrome_trace() to dump a human-readable summary or a JSON trace file loadable in chrome://tracing or Perfetto.
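
To make the trace output more concrete, the sketch below shows how one kernel invocation could be written as a "complete" event (`"ph":"X"`) in the Chrome Trace Event Format, which expects `ts` and `dur` in microseconds. The `Invocation` struct and the `to_trace_event()` function are hypothetical stand-ins that reuse a few field names from this header; the exact JSON that to_chrome_trace() produces is not documented here and may differ.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in carrying a subset of ProfilerKernelInvocation's fields.
struct Invocation {
  std::uint64_t start_ns = 0;
  std::uint64_t end_ns = 0;
  std::string backend;
  std::string kernel_name;
};

// Format one invocation as a Trace Event Format "complete" event.
// Nanoseconds are divided by 1000 because the format uses microseconds;
// integer division drops any sub-microsecond remainder.
// Assumes the names contain no characters that need JSON escaping.
std::string to_trace_event(const Invocation& ev) {
  // Clamp to zero so a malformed event cannot underflow the unsigned subtraction.
  const std::uint64_t dur_ns =
      ev.end_ns >= ev.start_ns ? ev.end_ns - ev.start_ns : 0;
  std::ostringstream os;
  os << "{\"name\":\"" << ev.kernel_name << "\",\"cat\":\"" << ev.backend
     << "\",\"ph\":\"X\",\"ts\":" << ev.start_ns / 1000
     << ",\"dur\":" << dur_ns / 1000 << ",\"pid\":0,\"tid\":0}";
  return os.str();
}
```

A trace file would wrap a comma-separated list of such events in a JSON array. The fixed `pid` and `tid` values are placeholders; a real exporter would more likely map backends or stages to separate tracks.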

The profiler aggregates four classes of telemetry:

  1. Per-kernel-invocation events (MLA, A65, EV74, BoxDecode, Memcpy) drained from libsimaaineatprofiler.so's cross-shared-library ring. Each event carries (start_ns, end_ns, backend, phase, physical_input_index, output_slot, frame_id, request_id, kernel_name, stage_name, in/out segment names, bytes).
  2. Per-element aggregate timings (existing Run::diag_snapshot()).
  3. End-to-end per-frame stats (existing Run::stats()).
  4. Per-site memcpy totals (calls / total_ns / total_bytes) for the five hot copy sites the runtime instruments.
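
The relationship between per-invocation events and per-tuple aggregates can be sketched as a fold over the event list. This is a minimal illustration, assuming the aggregate key is the (backend, kernel, stage, slot) tuple named in the ProfilerKernelAggregate description. `Event`, `Aggregate`, `Key`, and `aggregate()` are hypothetical names, not part of this header, and the library's own aggregation may work differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for a subset of the header's event and aggregate fields.
struct Event {
  std::string backend, kernel_name, stage_name;
  std::int32_t output_slot = -1;
  std::uint64_t start_ns = 0, end_ns = 0;
};

struct Aggregate {
  std::uint64_t count = 0;
  double total_ms = 0.0, min_ms = 0.0, max_ms = 0.0;
  double avg_ms() const {
    return count > 0 ? total_ms / static_cast<double>(count) : 0.0;
  }
};

using Key = std::tuple<std::string, std::string, std::string, std::int32_t>;

// Fold events into one Aggregate per (backend, kernel, stage, slot) key.
std::map<Key, Aggregate> aggregate(const std::vector<Event>& events) {
  std::map<Key, Aggregate> out;
  for (const Event& ev : events) {
    const double ms = static_cast<double>(ev.end_ns - ev.start_ns) / 1.0e6;
    Aggregate& agg =
        out[Key{ev.backend, ev.kernel_name, ev.stage_name, ev.output_slot}];
    // Seed min/max from the first sample so the 0.0 defaults do not skew them.
    agg.min_ms = agg.count == 0 ? ms : std::min(agg.min_ms, ms);
    agg.max_ms = agg.count == 0 ? ms : std::max(agg.max_ms, ms);
    agg.total_ms += ms;
    ++agg.count;
  }
  return out;
}
```

An ordered map keeps the output deterministic, which is convenient for text reports; a hash map would also work if ordering does not matter.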

Overhead when profiling is off is gated by sima_neat_profiler_enabled(): with no profiler attached, every emit site costs one atomic load plus one branch.
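
The gating pattern described above can be sketched as follows. This is an illustration only: `g_profiler_enabled`, `g_emitted`, `profiler_enabled()`, and `emit_kernel_event()` are hypothetical names, and the memory ordering the runtime actually uses behind sima_neat_profiler_enabled() is not stated in this header.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the runtime's global gate and event sink.
std::atomic<bool> g_profiler_enabled{false};
std::atomic<std::uint64_t> g_emitted{0};

inline bool profiler_enabled() {
  // Assumption: a relaxed load suffices because the flag only switches
  // emission on or off and does not publish other data.
  return g_profiler_enabled.load(std::memory_order_relaxed);
}

// Emit site: with the profiler off, the cost is the load plus one branch.
inline void emit_kernel_event(std::uint64_t start_ns, std::uint64_t end_ns) {
  if (!profiler_enabled()) {
    return;
  }
  // Slow path: stand-in for writing the event into the profiler ring.
  (void)start_ns;
  (void)end_ns;
  g_emitted.fetch_add(1, std::memory_order_relaxed);
}
```

Keeping the check at the top of the emit function means timestamps and strings for the event only need to be gathered on the slow path.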

See Also

Run.h for RunStats, InputStreamStats, RunDiagSnapshot.

File Listing

The file content with the documentation metadata removed is:

// Per-kernel latency profiler library (V1 surface).
//
// Attach a `LatencyProfiler` to a `simaai::neat::Run` (or to a Session that
// has produced one) BEFORE you start pushing frames. After your run loop,
// call `finalize()` to get a `Report` and pass it to `to_text()` /
// `to_chrome_trace()` to dump a human-readable summary or a JSON trace file
// loadable in chrome://tracing or Perfetto.
//
// The profiler aggregates four classes of telemetry:
//   1. Per-kernel-invocation events (MLA, A65, EV74, BoxDecode, Memcpy),
//      drained from libsimaaineatprofiler.so's cross-shared-library ring.
//      Each event carries (start_ns, end_ns, backend, phase,
//      physical_input_index, output_slot, frame_id, request_id, kernel_name,
//      stage_name, in/out segment names, bytes).
//   2. Per-element aggregate timings (existing `Run::diag_snapshot()`).
//   3. End-to-end per-frame stats (existing `Run::stats()`).
//   4. Per-site memcpy totals (calls / total_ns / total_bytes) for the five
//      hot copy sites the runtime instruments.
//
// Off-path overhead is gated by `sima_neat_profiler_enabled()`: when no
// profiler is attached, every emit site is one atomic-load + branch.

#ifndef SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_
#define SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_

#include "pipeline/Run.h"

#include <cstdint>
#include <string>
#include <vector>

namespace simaai::neat {

class Session; // forward

struct ProfilerKernelInvocation {
  std::uint64_t start_ns = 0;
  std::uint64_t end_ns = 0;
  std::string backend;
  std::string phase;
  std::int32_t physical_input_index = -1;
  std::int32_t output_slot = -1;
  std::int64_t frame_id = -1;
  std::uint32_t request_id = 0;
  std::uint32_t bytes = 0;
  std::string kernel_name;
  std::string stage_name;
  std::string in_segment;
  std::string out_segment;

  double duration_ms() const {
    return static_cast<double>(end_ns - start_ns) / 1.0e6;
  }
};

struct ProfilerMemcpySite {
  std::string site_name;
  std::uint64_t calls = 0;
  std::uint64_t total_ns = 0;
  std::uint64_t total_bytes = 0;
  std::uint64_t max_ns = 0;

  double total_ms() const {
    return static_cast<double>(total_ns) / 1.0e6;
  }
  double avg_ms() const {
    return calls > 0 ? (static_cast<double>(total_ns) / 1.0e6) / static_cast<double>(calls) : 0.0;
  }
};

struct ProfilerKernelAggregate {
  std::string backend;
  std::string kernel_name;
  std::string stage_name;
  std::int32_t physical_input_index = -1;
  std::int32_t output_slot = -1;
  std::uint64_t count = 0;
  double total_ms = 0.0;
  double min_ms = 0.0;
  double max_ms = 0.0;
  double avg_ms() const {
    return count > 0 ? (total_ms / static_cast<double>(count)) : 0.0;
  }
};

struct ProfilerReport {
  // Reused snapshots

  // New
  std::vector<ProfilerKernelInvocation> kernel_invocations;
  std::vector<ProfilerKernelAggregate> kernel_aggregates;
  std::vector<ProfilerMemcpySite> memcpy_sites;

  std::uint64_t profiler_emits = 0;
  std::uint64_t profiler_dropped = 0;

  std::string mpk_path;
  std::string description;
  std::int64_t frames_total = 0;
  std::int64_t warmup_frames = 0;
};

struct LatencyProfilerOptions {
  bool capture_kernels = true;
  bool capture_memcpy = true;
  std::size_t ring_capacity = 8192;
  std::int64_t warmup_frames = 0;
};

class LatencyProfiler {
public:

  // Attach to a Run (Session-level path). After this call, every kernel
  // event emitted by the runtime is captured in the profiler's ring.
  void attach(Run& run);

  // Optional: attach a Session directly so per-output frame_id stamping can
  // hook the existing tensor_callback. No-op for V1 (placeholder for V2).
  void attach(Session& session);

  // Reset all counters/event ring to mark the boundary between warmup and
  // measured frames. Call after pushing `Options::warmup_frames` inputs.

  // Drain the event ring and snapshot every reused telemetry source into a
  // Report. Safe to call multiple times; each call drains incremental
  // events since the previous drain (use mark_warmup_done() to discard).

  // Convenience helpers for serialization.
  static std::string to_text(const ProfilerReport& report);
  static std::string to_chrome_trace(const ProfilerReport& report);

private:
  Options options_;
  Run* attached_run_ = nullptr;
  Session* attached_session_ = nullptr;
  bool enabled_at_attach_ = false;
};

} // namespace simaai::neat

#endif // SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_
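
The listing's `ring_capacity`, `profiler_emits`, and `profiler_dropped` fields, together with the comments on incremental draining and the warmup reset, suggest a bounded buffer with drop accounting. The sketch below models that behaviour in a single thread under stated assumptions: that a full buffer drops the newest event, and that the emit counter includes dropped events. Neither policy is confirmed by the header, and `BoundedEventBuffer` is a hypothetical class, not the runtime's cross-library ring.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model of a bounded event buffer.
class BoundedEventBuffer {
public:
  explicit BoundedEventBuffer(std::size_t capacity) : capacity_(capacity) {}

  // Record one event; when the buffer is full, count it as dropped
  // (assumed policy: the newest event is the one discarded).
  void emit(std::uint64_t event) {
    ++emits_;
    if (pending_.size() >= capacity_) {
      ++dropped_;
      return;
    }
    pending_.push_back(event);
  }

  // Return the events recorded since the previous drain and free their space.
  std::vector<std::uint64_t> drain() {
    std::vector<std::uint64_t> out;
    out.swap(pending_);
    return out;
  }

  // Discard pending events and zero the counters, as at a warmup boundary.
  void reset() {
    pending_.clear();
    emits_ = 0;
    dropped_ = 0;
  }

  std::uint64_t emits() const { return emits_; }
  std::uint64_t dropped() const { return dropped_; }

private:
  std::size_t capacity_;
  std::vector<std::uint64_t> pending_;
  std::uint64_t emits_ = 0;
  std::uint64_t dropped_ = 0;
};
```

In this model a non-zero drop count means the capacity was too small for the interval between drains, so either a larger capacity or more frequent draining would be needed to capture every event.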

Generated via doxygen2docusaurus 2.0.0 by Doxygen 1.9.8.