Getting Started¶
Define information structure to be monitored¶
Before using opmonlib it is important to understand and define what needs to be monitored.
Monitorable objects can then be captured in a schema file to create C++ structs using ProtoBuf.
Documentation and instructions on generating schema data structures and using ProtoBuf can be found on the ProtoBuf website.
Relevant pages also include the description of the C++ API.
In general, each .proto file contains definitions of blocks that are published as single units.
Each schema file will generate a C++ header file containing the structures which hold the monitoring data, as defined in the .proto file.
Typically each module may only need one struct to hold its monitoring information; however it is possible to create multiple nested structs within a schema, which are filled by the same module.
It is preferred to organise the protobuf schemas in the following way:
- put the opmon related schemas in schema/opmon inside your repository
- try to group the schemas so that the schemas used by the same module are in the same file; the name of the file should match the name of the module
- as protobuf generates a lot of code, there might be conflicts with our code; protect the generated code with an additional opmon namespace
Here is an example, taken from DFOModule.proto, which contains the schemas used by the DFOModule plugin in dfmodules.
syntax = "proto3";
package dunedaq.dfmodules.opmon;
// regular metric published by the DFO
message DFOInfo {
  // counters
  uint64 tokens_received = 1;
  uint64 decisions_received = 2;
  uint64 decisions_sent = 3;
  // time management of the decision thread
  uint64 waiting_for_decision = 10; // Time spent waiting on Trigger Decisions, in microseconds
  uint64 deciding_destination = 11; // Time spent making a decision on the receiving DF app, in microseconds
  uint64 forwarding_decision = 12;  // Time spent sending the Trigger Decision to TRBs, in microseconds
  // time management of the token thread
  uint64 waiting_for_token = 15;    // Time spent waiting in token thread for tokens, in microseconds
  uint64 processing_token = 16;     // Time spent in token thread updating data structure, in microseconds
}
// these counters are published separately for each trigger type
message TriggerInfo {
  uint64 received = 1;
  uint64 completed = 2;
}
Valid types¶
As a generic schema language, ProtoBuf allows you to use simple types, but also lists, maps, etc.
Be aware that apart from basic types and nested messages, other quantities are ignored by the monitoring system.
An OpMonEntry message is generated whenever a structure with at least one publishable field is passed to the publish method, see next section.
Filling and collecting structures¶
The ProtoBuf C++ API guide describes how to fill the structures you created.
In order to publish the metric, the object has to be created from within a MonitorableObject, see the header file.
In particular, a DAQModule is a MonitorableObject.
Two main functions are relevant for publishing:
void publish( google::protobuf::Message &&,
CustomOrigin && co = {},
OpMonLevel l = to_level(EntryOpMonLevel::kDefault) ) const noexcept ;
virtual void generate_opmon_data(opmon_level) {return;}
- publish takes a ProtoBuf schema object, timestamps it with the time of the function call, serializes it (synchronously) and publishes it (asynchronously) via one of the configured OpMonFacilities. This function can be called at any time.
- generate_opmon_data is a function which the monitoring system calls regularly (order of seconds). Its default behaviour is 'null'. Every developer can freely implement this in their MonitorableObject in order to avoid setting up a thread to generate information regularly. Specific implementations are expected to call the publish function to actually publish the metric.
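The division of labour between the two functions can be sketched without the library. The following is a minimal stand-in, not the opmonlib implementation: MockMonitorable, TokenCounter and the simplified publish(name, value) signature are all hypothetical, and the real publish additionally timestamps, serializes and dispatches the entry.

```cpp
#include <atomic>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a MonitorableObject: it only records what is
// published, whereas the real class timestamps, serializes and ships the entry.
class MockMonitorable {
public:
  virtual ~MockMonitorable() = default;

  // Hook called by the monitoring system at regular intervals.
  // The default implementation does nothing.
  virtual void generate_opmon_data() {}

  // Entries recorded so far, as (metric name, value) pairs.
  const std::vector<std::pair<std::string, std::uint64_t>>& published() const {
    return m_published;
  }

protected:
  // Simplified stand-in for publish(): may be called at any time,
  // not only from within the hook.
  void publish(std::string name, std::uint64_t value) {
    m_published.emplace_back(std::move(name), value);
  }

private:
  std::vector<std::pair<std::string, std::uint64_t>> m_published;
};

// Hypothetical module that counts tokens and reports them from the hook.
class TokenCounter : public MockMonitorable {
public:
  void receive_token() { ++m_tokens; }

  void generate_opmon_data() override {
    // Read and reset the counter in one atomic step, then publish the value.
    publish("tokens_received", m_tokens.exchange(0));
  }

private:
  std::atomic<std::uint64_t> m_tokens{0};
};
```

In this sketch the module never starts a thread of its own: whoever owns the object calls generate_opmon_data periodically, and each call reports the tokens counted since the previous one.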
Example¶
An example of metric publication is
void DFOModule::generate_opmon_data() {
  opmon::DFOInfo info;
  info.set_tokens_received(m_received_tokens.exchange(0));
  info.set_decisions_sent(m_sent_decisions.exchange(0));
  info.set_decisions_received(m_received_decisions.exchange(0));
  info.set_waiting_for_decision(m_waiting_for_decision.exchange(0));
  info.set_deciding_destination(m_deciding_destination.exchange(0));
  info.set_forwarding_decision(m_forwarding_decision.exchange(0));
  info.set_waiting_for_token(m_waiting_for_token.exchange(0));
  info.set_processing_token(m_processing_token.exchange(0));
  publish(std::move(info));
  std::lock_guard<std::mutex> guard(m_trigger_mutex);
  for (auto& [type, counts] : m_trigger_counters) {
    opmon::TriggerInfo ti;
    ti.set_received(counts.received.exchange(0));
    ti.set_completed(counts.completed.exchange(0));
    auto name = dunedaq::trgdataformats::get_trigger_candidate_type_names()[type];
    publish(std::move(ti), {{"type", name}});
  }
}
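The per-type loop in the example above relies on two ideas: exchange(0) reads and resets a counter in a single atomic step, so no increment is lost between the read and the reset, and each type produces its own entry tagged with the type name. A self-contained sketch of that pattern follows; TriggerBook, CollectedEntry and the string keys are hypothetical, and the entries are returned in a vector rather than published.

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Per-type counters, mirroring the fields of the TriggerInfo message.
struct TriggerCounts {
  std::atomic<std::uint64_t> received{0};
  std::atomic<std::uint64_t> completed{0};
};

// One collected entry: the counter values plus the tag identifying the type.
struct CollectedEntry {
  std::map<std::string, std::string> origin;
  std::uint64_t received = 0;
  std::uint64_t completed = 0;
};

// Hypothetical bookkeeping class: one set of counters per trigger type.
class TriggerBook {
public:
  void on_received(const std::string& type) {
    std::lock_guard<std::mutex> guard(m_mutex);
    ++m_counters[type].received;
  }

  void on_completed(const std::string& type) {
    std::lock_guard<std::mutex> guard(m_mutex);
    ++m_counters[type].completed;
  }

  // Snapshots and resets every counter, producing one entry per type.
  std::vector<CollectedEntry> collect() {
    std::vector<CollectedEntry> out;
    // The mutex protects the layout of the map while iterating over it.
    std::lock_guard<std::mutex> guard(m_mutex);
    for (auto& [type, counts] : m_counters) {
      CollectedEntry entry;
      entry.origin["type"] = type;
      entry.received = counts.received.exchange(0);
      entry.completed = counts.completed.exchange(0);
      out.push_back(std::move(entry));
    }
    return out;
  }

private:
  std::mutex m_mutex;
  std::map<std::string, TriggerCounts> m_counters;
};
```

Calling collect() twice in a row returns the accumulated counts the first time and zeros the second time, which is the behaviour expected of counters that are reset at every publication.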
Details and good practices about the optional arguments¶
Optional arguments of the publish function allow you to:
- specify a level of priority associated to the metric
- add additional custom information on the source of the metric, in the form of a map<string, string>, where the key is the type of the source, e.g. channel, and the mapped value is the value of the source, e.g. 4345
- extend the opmon_id of the caller MonitorableObject for the specific metric with more detailed information on the source of this metric
The OpMonLevel is a priority level designed to control the quantity of metrics generated by a tree. By default, all messages are published. The lower the level, the higher the priority.
The system can decide to entirely disable the metric publication regardless of the OpMonLevel.
The system already provides some values to specify the OpMonLevel via an enum:
enum class EntryOpMonLevel : OpMonLevel {
  kTopPriority    = std::numeric_limits<OpMonLevel>::min(),
  kEventDriven    = std::numeric_limits<OpMonLevel>::max()/4,
  kDefault        = std::numeric_limits<OpMonLevel>::max()/2,
  kLowestPrioriry = std::numeric_limits<OpMonLevel>::max()-1
};
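To make the ordering of these values concrete, the sketch below reproduces the enum and adds a threshold filter. Several parts are assumptions rather than library facts: the underlying type of OpMonLevel is taken to be a 32-bit unsigned integer, to_level is reimplemented as a plain cast, and is_published is a hypothetical rule (an entry passes when its level is strictly below the configured threshold) chosen only to illustrate that lower levels mean higher priority.

```cpp
#include <cstdint>
#include <limits>

// Assumption: OpMonLevel is an unsigned 32-bit integer; the underlying
// type used by the library may differ.
using OpMonLevel = std::uint32_t;

// Same entries as the enum shown above, spelling included.
enum class EntryOpMonLevel : OpMonLevel {
  kTopPriority    = std::numeric_limits<OpMonLevel>::min(),
  kEventDriven    = std::numeric_limits<OpMonLevel>::max() / 4,
  kDefault        = std::numeric_limits<OpMonLevel>::max() / 2,
  kLowestPrioriry = std::numeric_limits<OpMonLevel>::max() - 1
};

// Assumed behaviour of to_level: a plain conversion to the numeric level.
constexpr OpMonLevel to_level(EntryOpMonLevel l) {
  return static_cast<OpMonLevel>(l);
}

// Hypothetical filter: an entry is published when its level is strictly
// below the threshold configured for the monitoring tree.
constexpr bool is_published(OpMonLevel entry, OpMonLevel threshold) {
  return entry < threshold;
}
```

With a threshold equal to the maximum value of OpMonLevel, every predefined level passes this filter, matching the statement that all messages are published by default. Lowering the threshold to to_level(EntryOpMonLevel::kDefault) would keep kTopPriority and kEventDriven entries and drop the rest.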
The custom origin is designed to carry information that is unrelated to the software stack.
While the software stack might change (e.g., the name of an application or of a module can change because of configuration), information such as a crate number or a channel is hardware related and independent of the software stack that provides it.
Examples of valid tags to be used in the custom origins are: server name, channel, links, etc.
For retrieval efficiency in the database, the set of values a tag can take should not grow indefinitely. So, things like run number should not become a custom origin.
Adding information like application name or session in the custom origin is discouraged because it would be redundant.
The example above shows this usage: TriggerInfo contains counters grouped by trigger type, and the type name is passed as a custom origin.
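The guidelines on custom origins can be illustrated with a small sketch. It assumes that CustomOrigin behaves like a map from tag name to tag value, as described earlier; make_channel_origin, has_discouraged_tag and the exact key spellings ("server", "channel", "application", "session", "run_number") are hypothetical and not part of the library.

```cpp
#include <initializer_list>
#include <map>
#include <string>

// Assumption: CustomOrigin behaves like a map from tag name to tag value.
using CustomOrigin = std::map<std::string, std::string>;

// Hypothetical helper: tags a metric with the hardware it refers to.
CustomOrigin make_channel_origin(const std::string& server, unsigned channel) {
  return {{"server", server}, {"channel", std::to_string(channel)}};
}

// Hypothetical check for tags the guidelines advise against: values that
// grow without bound (run number) or that duplicate information already
// implied by the software stack (application, session).
bool has_discouraged_tag(const CustomOrigin& origin) {
  for (const char* key : {"application", "session", "run_number"}) {
    if (origin.count(key) != 0) {
      return true;
    }
  }
  return false;
}
```

A helper of this kind keeps the tag names consistent across the metrics of a module, which matters because the tags are later used as keys when retrieving the data.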
Registering sub components for your DAQModule¶
In order to work correctly, each MonitorableObject has to be part of a monitoring tree, i.e. every MonitorableObject has to be registered to the chain.
This is done via the method
void register_node( std::string name, new_node_ptr ) ;
The metrics generated by the child will have an opmon_id in the form parent_opmon_id.child_name.
The registration does not imply ownership of the child by the parent, as internally only weak pointers are utilised.
If the child is destroyed, its pointer will eventually be removed from the chain.
DAQModules will be automatically registered by the application framework and developers have to write their code assuming that the module is registered in the monitoring tree from the moment of its creation.
On the other hand, developers have to take care of the registration of subcomponents living inside their modules.
An example of registration is:
void DFOModule::receive_trigger_complete_token(const dfmessages::TriggerDecisionToken& token) {
  if (m_dataflow_availability.count(token.decision_destination) == 0) {
    TLOG_DEBUG(TLVL_CONFIG) << "Creating dataflow availability struct for uid " << token.decision_destination;
    auto entry = m_dataflow_availability[token.decision_destination] =
      std::make_shared<TriggerRecordBuilderData>(token.decision_destination, m_busy_threshold, m_free_threshold);
    register_node(token.decision_destination, entry);
  }
}
Since the registration does not imply ownership, unregistering an object only requires releasing the shared pointers that own it.
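The consequences of storing only weak pointers can be sketched with a toy tree. The Node class below is a hypothetical stand-in, not the MonitorableObject implementation: it builds the child id as parent_opmon_id.child_name and prunes expired children the next time the tree is walked, which is one possible reading of "eventually removed".

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a node of the monitoring tree.
class Node {
public:
  explicit Node(std::string id = "") : m_id(std::move(id)) {}

  const std::string& opmon_id() const { return m_id; }

  // Registers a child under the given name. Only a weak pointer is kept,
  // so the parent does not extend the lifetime of the child.
  void register_node(const std::string& name, const std::shared_ptr<Node>& child) {
    child->m_id = m_id.empty() ? name : m_id + "." + name;
    m_children[name] = child;
  }

  // Returns the ids of the children that are still alive and removes
  // the entries whose object has been destroyed.
  std::vector<std::string> live_children() {
    std::vector<std::string> ids;
    for (auto it = m_children.begin(); it != m_children.end();) {
      if (auto child = it->second.lock()) {
        ids.push_back(child->opmon_id());
        ++it;
      } else {
        it = m_children.erase(it);
      }
    }
    return ids;
  }

  // Number of registered entries, including expired ones not yet pruned.
  std::size_t registered() const { return m_children.size(); }

private:
  std::string m_id;
  std::map<std::string, std::weak_ptr<Node>> m_children;
};
```

In this sketch, resetting the last shared pointer to a child is enough to make it disappear from the tree: the stale entry stays in the map until the next call to live_children, and no explicit unregister call is needed.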
Configuration¶
Each confmodel::Application has a relationship with an OpMonConf object that specifies the level and the period of the monitoring.
Last git commit to the markdown source of this page:
Author: Marco Roda
Date: Thu Nov 6 11:18:51 2025 +0000
If you see a problem with the documentation on this page, please file an Issue at https://github.com/DUNE-DAQ/opmonlib/issues