`LatencyProfiler.h` File

Per-kernel latency profiler — V1 surface.

Included Headers

#include "pipeline/Run.h"
#include <cstdint>
#include <string>
#include <vector>

Namespaces Index

namespace	simaai


namespace	neat

Classes Index

struct	ProfilerKernelInvocation
	One kernel-invocation telemetry event.

struct	ProfilerMemcpySite
	Aggregate counters for one instrumented memcpy site.

struct	ProfilerKernelAggregate
	Aggregated timings for one (backend, kernel, stage, slot) tuple.

struct	ProfilerReport
	Snapshot bundle returned by LatencyProfiler::finalize().

struct	LatencyProfilerOptions
	Construction options for LatencyProfiler.

class	LatencyProfiler
	Per-sample latency tracker; attach to a Run to capture timing telemetry.

Description

Per-kernel latency profiler — V1 surface.

Attach a LatencyProfiler to a simaai::neat::Run (or to a Session that has produced one) BEFORE you start pushing frames. After your run loop, call finalize() to get a ProfilerReport and pass it to to_text() / to_chrome_trace() to dump a human-readable summary or a JSON trace file loadable in chrome://tracing or Perfetto.
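The call order described above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not code from the library: `profile_run` and the `push_frame` callback are hypothetical names, and the helper is a template so it compiles without the SiMa headers. Only `attach()`, `mark_warmup_done()`, `finalize()`, `to_text()` and `to_chrome_trace()` come from LatencyProfiler.h; how frames are actually pushed into a Run is not covered by this header.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: drives one profiled run in the order the header
// comment prescribes. `Profiler` is expected to look like
// simaai::neat::LatencyProfiler and `RunT` like simaai::neat::Run; both are
// template parameters so this sketch compiles without the real headers.
// `push_frame` is an assumed callback that feeds one input into the run.
template <class Profiler, class RunT, class PushFrame>
std::string profile_run(Profiler& profiler, RunT& run, PushFrame push_frame,
                        int warmup_frames, int measured_frames) {
  assert(warmup_frames >= 0 && measured_frames >= 0);

  profiler.attach(run);  // must happen BEFORE the first frame is pushed

  for (int i = 0; i < warmup_frames; ++i) push_frame(run);
  profiler.mark_warmup_done();  // discard everything recorded during warmup

  for (int i = 0; i < measured_frames; ++i) push_frame(run);

  auto report = profiler.finalize();  // drain the ring, snapshot Run telemetry
  return Profiler::to_text(report);   // or Profiler::to_chrome_trace(report)
}
```

With the real types, the call would presumably pass a `LatencyProfiler`, a `Run`, and a lambda that pushes one frame. Returning the text summary keeps the sketch short; a caller that needs both formats could return the report itself instead.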

The profiler aggregates four classes of telemetry:

1. Per-kernel-invocation events (MLA, A65, EV74, BoxDecode, Memcpy), drained from libsimaaineatprofiler.so's cross-shared-library ring. Each event carries (start_ns, end_ns, backend, phase, physical_input_index, output_slot, frame_id, request_id, kernel_name, stage_name, in/out segment names, bytes).
2. Per-element aggregate timings (existing Run::diag_snapshot()).
3. End-to-end per-frame stats (existing Run::stats()).
4. Per-site memcpy totals (calls / total_ns / total_bytes) for the five hot copy sites the runtime instruments.
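
The per-site memcpy totals listed above are enough to derive an average copy size and a sustained throughput figure. The following is a minimal sketch, assuming a local stand-in struct (`MemcpySiteTotals`) that mirrors the three counters; the helper names are hypothetical and are not part of the profiler API.

```cpp
#include <cstdint>

// Local stand-in for the three per-site counters named above; the real
// struct is simaai::neat::ProfilerMemcpySite.
struct MemcpySiteTotals {
  std::uint64_t calls = 0;
  std::uint64_t total_ns = 0;
  std::uint64_t total_bytes = 0;
};

// Average payload per call in bytes; 0 when the site was never hit.
inline double avg_bytes_per_call(const MemcpySiteTotals& s) {
  return s.calls > 0
             ? static_cast<double>(s.total_bytes) / static_cast<double>(s.calls)
             : 0.0;
}

// Sustained copy throughput in GB/s (1 GB = 1e9 bytes). Bytes per nanosecond
// is numerically equal to GB per second, so no extra scaling is needed.
inline double throughput_gb_per_s(const MemcpySiteTotals& s) {
  return s.total_ns > 0
             ? static_cast<double>(s.total_bytes) / static_cast<double>(s.total_ns)
             : 0.0;
}
```

Both helpers guard against a zero denominator in the same way the header's own `avg_ms()` accessors do, so a site that was instrumented but never executed reports 0 rather than NaN.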

Off-path overhead is gated by sima_neat_profiler_enabled() — when no profiler is attached, every emit site is one atomic-load + branch.
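
The "one atomic-load + branch" cost can be illustrated with a generic gating pattern. This is only a sketch of the technique: the implementation of `sima_neat_profiler_enabled()` is not shown in this header, every name in `gate_sketch` is hypothetical, and the relaxed memory order is an assumption about what an on/off gate needs.

```cpp
#include <atomic>
#include <cstdint>

namespace gate_sketch {

// Hypothetical process-wide flag; set while a profiler is attached.
std::atomic<bool> g_enabled{false};
// Hypothetical counter standing in for "write an event into the ring".
std::atomic<std::uint64_t> g_emitted{0};

inline bool profiler_enabled() {
  // A relaxed load is assumed to be sufficient for a simple on/off gate.
  return g_enabled.load(std::memory_order_relaxed);
}

// Emit site: when disabled, the only work done is the load and the branch.
inline void emit_kernel_event(std::uint64_t start_ns, std::uint64_t end_ns) {
  if (!profiler_enabled()) return;  // fast path: nothing is recorded
  (void)start_ns;                   // a real emit site would store these
  (void)end_ns;
  g_emitted.fetch_add(1, std::memory_order_relaxed);  // slow-path stand-in
}

}  // namespace gate_sketch
```

The point of the pattern is that timestamps and strings are only gathered after the branch, so an unprofiled run pays for neither.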

See Also: Run.h for RunStats, InputStreamStats, RunDiagSnapshot.

File Listing

The file content with the documentation metadata removed is:

// Per-kernel latency profiler library — V1 surface.
//
// Attach a `LatencyProfiler` to a `simaai::neat::Run` (or to a Session that
// has produced one) BEFORE you start pushing frames. After your run loop,
// call `finalize()` to get a `Report` and pass it to `to_text()` /
// `to_chrome_trace()` to dump a human-readable summary or a JSON trace file
// loadable in chrome://tracing or Perfetto.
//
// The profiler aggregates four classes of telemetry:
//   1. Per-kernel-invocation events (MLA, A65, EV74, BoxDecode, Memcpy) —
//      drained from libsimaaineatprofiler.so's cross-shared-library ring.
//      Each event carries (start_ns, end_ns, backend, phase,
//      physical_input_index, output_slot, frame_id, request_id, kernel_name,
//      stage_name, in/out segment names, bytes).
//   2. Per-element aggregate timings (existing `Run::diag_snapshot()`).
//   3. End-to-end per-frame stats (existing `Run::stats()`).
//   4. Per-site memcpy totals (calls / total_ns / total_bytes) for the five
//      hot copy sites the runtime instruments.
//
// Off-path overhead is gated by `sima_neat_profiler_enabled()` — when no
// profiler is attached, every emit site is one atomic-load + branch.

#ifndef SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_
#define SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_

#include "pipeline/Run.h"

#include <cstdint>
#include <string>
#include <vector>

namespace simaai::neat {

class Session; // forward

struct ProfilerKernelInvocation {
  std::uint64_t start_ns = 0;
  std::uint64_t end_ns = 0;
  std::string backend;
  std::string phase;
  std::int32_t physical_input_index = -1;
  std::int32_t output_slot = -1;
  std::int64_t frame_id = -1;
  std::uint32_t request_id = 0;
  std::uint32_t bytes = 0;
  std::string kernel_name;
  std::string stage_name;
  std::string in_segment;
  std::string out_segment;

  double duration_ms() const {
    return static_cast<double>(end_ns - start_ns) / 1.0e6;
  }
};

struct ProfilerMemcpySite {
  std::string site_name;
  std::uint64_t calls = 0;
  std::uint64_t total_ns = 0;
  std::uint64_t total_bytes = 0;
  std::uint64_t max_ns = 0;

  double total_ms() const {
    return static_cast<double>(total_ns) / 1.0e6;
  }

  double avg_ms() const {
    return calls > 0 ? (static_cast<double>(total_ns) / 1.0e6) / static_cast<double>(calls) : 0.0;
  }
};

struct ProfilerKernelAggregate {
  std::string backend;
  std::string kernel_name;
  std::string stage_name;
  std::int32_t physical_input_index = -1;
  std::int32_t output_slot = -1;
  std::uint64_t count = 0;
  double total_ms = 0.0;
  double min_ms = 0.0;
  double max_ms = 0.0;

  double avg_ms() const {
    return count > 0 ? (total_ms / static_cast<double>(count)) : 0.0;
  }
};

struct ProfilerReport {
  // Reused snapshots
  RunStats end_to_end{};
  InputStreamStats input_stream{};
  RunDiagSnapshot diag{};

  // New
  std::vector<ProfilerKernelInvocation> kernel_invocations;
  std::vector<ProfilerKernelAggregate> kernel_aggregates;
  std::vector<ProfilerMemcpySite> memcpy_sites;

  std::uint64_t profiler_emits = 0;
  std::uint64_t profiler_dropped = 0;

  std::string mpk_path;
  std::string description;
  std::int64_t frames_total = 0;
  std::int64_t warmup_frames = 0;
};

struct LatencyProfilerOptions {
  bool capture_kernels = true;
  bool capture_memcpy = true;
  std::size_t ring_capacity = 8192;
  std::int64_t warmup_frames = 0;
};

class LatencyProfiler {
public:
  using Options = LatencyProfilerOptions;

  explicit LatencyProfiler(Options o = Options());
  ~LatencyProfiler();

  LatencyProfiler(const LatencyProfiler&) = delete;
  LatencyProfiler& operator=(const LatencyProfiler&) = delete;

  // Attach to a Run (Session-level path). After this call, every kernel
  // event emitted by the runtime is captured in the profiler's ring.
  void attach(Run& run);

  // Optional: attach a Session directly so per-output frame_id stamping can
  // hook the existing tensor_callback. No-op for V1 (placeholder for V2).
  void attach(Session& session);

  // Reset all counters/event ring to mark the boundary between warmup and
  // measured frames. Call after pushing `Options::warmup_frames` inputs.
  void mark_warmup_done();

  // Drain the event ring and snapshot every reused telemetry source into a
  // Report. Safe to call multiple times; each call drains incremental
  // events since the previous drain (use mark_warmup_done() to discard).
  ProfilerReport finalize();

  // Convenience helpers for serialization.
  static std::string to_text(const ProfilerReport& report);
  static std::string to_chrome_trace(const ProfilerReport& report);

private:
  Options options_;
  Run* attached_run_ = nullptr;
  Session* attached_session_ = nullptr;
  bool enabled_at_attach_ = false;
};

} // namespace simaai::neat

#endif // SIMAAI_NEAT_PIPELINE_LATENCY_PROFILER_H_
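
To make the relationship between `ProfilerKernelInvocation` and `ProfilerKernelAggregate` concrete, the sketch below rolls raw events up into per-key aggregates. It rests on several assumptions: `Invocation`, `Aggregate` and `roll_up` are hypothetical stand-ins that carry only a subset of the real fields, the grouping key is (backend, kernel_name, stage_name, output_slot) as the class index describes it, and the header does not show how the library itself fills `kernel_aggregates`.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Trimmed stand-ins for ProfilerKernelInvocation / ProfilerKernelAggregate.
struct Invocation {
  std::uint64_t start_ns = 0;
  std::uint64_t end_ns = 0;
  std::string backend, kernel_name, stage_name;
  std::int32_t output_slot = -1;
  double duration_ms() const {
    return static_cast<double>(end_ns - start_ns) / 1.0e6;
  }
};

struct Aggregate {
  std::string backend, kernel_name, stage_name;
  std::int32_t output_slot = -1;
  std::uint64_t count = 0;
  double total_ms = 0.0, min_ms = 0.0, max_ms = 0.0;
  double avg_ms() const {
    return count > 0 ? total_ms / static_cast<double>(count) : 0.0;
  }
};

// Groups events by (backend, kernel, stage, slot); output is sorted by key.
inline std::vector<Aggregate> roll_up(const std::vector<Invocation>& events) {
  using Key = std::tuple<std::string, std::string, std::string, std::int32_t>;
  std::map<Key, Aggregate> by_key;
  for (const Invocation& e : events) {
    Aggregate& a =
        by_key[Key(e.backend, e.kernel_name, e.stage_name, e.output_slot)];
    const double ms = e.duration_ms();
    if (a.count == 0) {  // first event for this key seeds identity and min/max
      a.backend = e.backend;
      a.kernel_name = e.kernel_name;
      a.stage_name = e.stage_name;
      a.output_slot = e.output_slot;
      a.min_ms = a.max_ms = ms;
    } else {
      a.min_ms = std::min(a.min_ms, ms);
      a.max_ms = std::max(a.max_ms, ms);
    }
    ++a.count;
    a.total_ms += ms;
  }
  std::vector<Aggregate> out;
  out.reserve(by_key.size());
  for (const auto& kv : by_key) out.push_back(kv.second);
  return out;
}
```

Seeding `min_ms` from the first event rather than from the default 0.0 matters: otherwise the minimum would stay at zero for every key.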

Generated via doxygen2docusaurus 2.0.0 by Doxygen 1.9.8.
