Add solver summary info to diagnostics #208

Merged
@@ -175,6 +175,7 @@ class FixedLagSmoother : public Optimizer
ros::Time lag_expiration_; //!< The oldest stamp that is inside the fixed-lag smoother window
fuse_core::Transaction marginal_transaction_; //!< The marginals to add during the next optimization cycle
VariableStampIndex timestamp_tracking_; //!< Object that tracks the timestamp associated with each variable
ceres::Solver::Summary summary_; //!< Optimization summary, written by optimizationLoop and read by setDiagnostics

// Guarded by optimization_requested_mutex_
std::mutex optimization_requested_mutex_; //!< Required condition variable mutex
111 changes: 105 additions & 6 deletions fuse_optimizers/src/fixed_lag_smoother.cpp
@@ -202,9 +202,22 @@ void FixedLagSmoother::optimizationLoop()
// Update the graph
graph_->update(*new_transaction);
// Optimize the entire graph
-      graph_->optimize(params_.solver_options);
+      summary_ = graph_->optimize(params_.solver_options);
Collaborator Author:
I wonder if we should do something when the solver doesn't converge. I'd like to discuss the options and listen to your thoughts, so maybe we can do something on another PR.

One option is to check if (summary_.IsSolutionUsable()): http://ceres-solver.org/nnls_solving.html#_CPPv4NK5ceres6Solver7Summary16IsSolutionUsableEv

If it's not usable we could do one of the following:

  • Abort
  • Don't notify, but still do the rest, i.e. prepare the marginalization for the next cycle
  • Roll back the graph to the state before adding the transaction, but this isn't trivial

Note that IsSolutionUsable() also returns true if the solver doesn't converge and finishes because it reached the max time or iterations limit. In that case, the diagnostic warning should be enough, and we can simply continue as normal.

Contributor:

I agree we should be checking if the output is usable before using it. Though I'm not sure what should be done when it is not...which is probably why I didn't bother checking in the first place. 🙂

We could try "rolling back" the transaction. For the graph itself, it would not actually be that hard. We clone the graph anyway a few lines down (https://github.com/locusrobotics/fuse/blob/devel/fuse_optimizers/src/fixed_lag_smoother.cpp#L207). With a small modification, we could clone the graph before merging in the transaction, optimize the clone, and swap it in for the "real" graph if the optimization succeeds, or drop the clone if the optimization fails.

However, that doesn't fix any of the supporting infrastructure. The motion models generally track which constraint segments have been sent to the optimizer, and only send new or modified constraints in the future. If the optimizer throws away a transaction without somehow telling the motion models about it, then the two could get out of sync. That would lead to missing constraints and potentially a disconnected graph and future failures.

It would definitely be possible to modify the motion model interface to require some sort of ACK from the optimizer for each motion model segment that was successfully incorporated...but that will be a big change.

And then there are things like visual landmark observations, where the sensor model may create special constraints when a visual landmark is first added to the system (e.g. a prior on the landmark position, or some sort of 3-view constraint to ensure the landmark position is solvable). If such a constraint was thrown out without also informing the sensor models, we again get out of sync and set ourselves up for future optimization failures.

Continuing on is basically what we do now. But based on earlier errors, once the graph cannot be optimized correctly, I'm not sure anything in the future will ever fix that. I suspect you will get a constant stream of failures moving forward.

I'm inclined to go with the "abort" option. Log a descriptive message of the failure and exit(), throw an exception, or similar. If the optimization fails, we should make this as noticeable as possible. This is likely caused by a configuration or usage error, and nothing we can do in code will fix it.

But I'm not sure how you feel about it. I can be convinced otherwise.

Collaborator Author (@efernandez, Dec 1, 2020):

The sensor and motion models going out of sync if we roll back to the graph before updating it with the last transaction is a very good point. I admit I didn't think much about that, although I already anticipated it wouldn't be trivial to roll back the graph.

I agree aborting is the best (and simplest) thing we can do. In that case, I could print the last transaction, blaming it as the likely culprit.
Indeed, if the optimization failed, it should be because of something related to one or more constraints in the transaction, the initial values provided for the variables, or the implementation of the constraints' cost functions.
We would still have to narrow that down, but it sounds like we should be able to do that afterwards if we record the graph and transactions.
Actually, we need to print the transaction, because the last transaction hasn't been published yet. It sounds like we should still notify the publishers, so the bad graph and transaction are published. Then, in the optimization loop we only need to print the transaction timestamp when aborting.

I've updated this PR with a commit that does that.

Does this sound like a good plan to you?

Contributor:

Just thinking about this from a "robot in the real world" stance, if the node in question goes down due to the exception or exit call, then even if the node respawns, won't it be completely lost? Just wondering if logging (loudly) and rolling back the graph is a good idea. Won't it be the case that, depending on how things are configured, the "bad" constraint will eventually not be included in the optimization? I guess I'm just wondering if crashing is better/worse than attempting to get back on track and yelling. Obviously the inability to tell the models which constraint/transaction is faulty makes this difficult.

Collaborator Author:

I thought about that, but I believe a failure here only happens if a cost function yields a bad result, like NaN. TBH I've not been able to test a failure mode because my current configuration just works. 😄

I think when a failure happens, the solver itself prints some diagnostic messages, and it usually crashes after, if NaNs pollute the state. This is speculation though. It'd be great if I could test a failure mode. Then, maybe what we can do is be more informative when it happens, in terms of transaction and graph. Ideally, rolling back the graph would be great, but it'd be quite difficult to do that atm, as already pointed out.

Any suggestion to force a failure? Maybe adding a division by zero in a cost function, or something like that. Or just always returning false, so the cost function always fails to evaluate. 🤔

Contributor:

If this is just a one-off test with throwaway code, then maybe throw a quick static counter variable into one of the cost functions, and after it reaches N iterations, we do what you suggested and divide by 0 or something?

Contributor:

Mind if I move this conversation to an "issue", so we can discuss outside of this PR?

Collaborator Author:

No, go ahead. I'll try to see what happens on failure soon, but I think this PR shouldn't wait on that. A more elaborate error handling could be discussed in that issue and implemented on another PR.

Contributor:

Issue #213


efernandez marked this conversation as resolved.
// Optimization is complete. Notify all the things about the graph changes.
const auto new_transaction_stamp = new_transaction->stamp();
notify(std::move(new_transaction), graph_->clone());

// Abort if optimization failed. Not converging is not a failure because the solution found is usable.
if (!summary_.IsSolutionUsable())
{
ROS_FATAL_STREAM("Optimization failed after updating the graph with the transaction with timestamp "
<< new_transaction_stamp << ". Leaving optimization loop and requesting node shutdown...");
ROS_INFO_STREAM(summary_.FullReport());
ros::requestShutdown();
break;
}

// Compute a transaction that marginalizes out those variables.
lag_expiration_ = computeLagExpirationTime();
marginal_transaction_ = fuse_constraints::marginalizeVariables(
@@ -495,20 +508,106 @@ void FixedLagSmoother::transactionCallback(
}
}

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
Collaborator Author:

The diff is a bit weird here. I simply added some helper functions that are used below.

/**
* @brief Make a diagnostic_msgs::DiagnosticStatus message filling in the level and message
*
* @param[in] level The diagnostic status level
* @param[in] message The diagnostic status message
*/
diagnostic_msgs::DiagnosticStatus makeDiagnosticStatus(const int8_t level, const std::string& message)
{
  diagnostic_msgs::DiagnosticStatus status;

  status.level = level;
  status.message = message;

  return status;
}

/**
 * @brief Helper function to generate the diagnostic status for each optimization termination type
 *
 * The termination type -> diagnostic status mapping is as follows:
 *
 * - CONVERGENCE, USER_SUCCESS -> OK
 * - NO_CONVERGENCE -> WARN
 * - FAILURE, USER_FAILURE -> ERROR (default)
 *
 * @param[in] termination_type The optimization termination type
 * @return The diagnostic status with the level and message corresponding to the optimization termination type
 */
diagnostic_msgs::DiagnosticStatus terminationTypeToDiagnosticStatus(const ceres::TerminationType termination_type)
{
  switch (termination_type)
  {
    case ceres::TerminationType::CONVERGENCE:
    case ceres::TerminationType::USER_SUCCESS:
      return makeDiagnosticStatus(diagnostic_msgs::DiagnosticStatus::OK, "Optimization converged");
    case ceres::TerminationType::NO_CONVERGENCE:
      return makeDiagnosticStatus(diagnostic_msgs::DiagnosticStatus::WARN, "Optimization didn't converge");
    default:
      return makeDiagnosticStatus(diagnostic_msgs::DiagnosticStatus::ERROR, "Optimization failed");
  }
}

Removed lines from the old setDiagnostics() implementation (interleaved in the rendered diff):

-  Optimizer::setDiagnostics(status);
-  if (status.level == diagnostic_msgs::DiagnosticStatus::OK)
-  {
-    status.message = "FixedLagSmoother " + status.message;
-  }

void FixedLagSmoother::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& status)
{
Optimizer::setDiagnostics(status);

-  status.add("Started", started_);
+  // Load std::atomic<bool> flag that indicates whether the optimizer has started or not
+  const bool started = started_;
+
+  status.add("Started", started);
{
std::lock_guard<std::mutex> lock(pending_transactions_mutex_);
status.add("Pending Transactions", pending_transactions_.size());
}

if (started)
{
// Add some optimization summary report fields to the diagnostics status if the optimizer has started
auto summary = decltype(summary_)();
{
const std::unique_lock<std::mutex> lock(optimization_mutex_, std::try_to_lock);
if (lock)
{
summary = summary_;
}
else
{
status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimization running");
}
}

if (summary.total_time_in_seconds >= 0.0) // This is -1 for the default-constructed summary object
{
status.add("Optimization Termination Type", ceres::TerminationTypeToString(summary.termination_type));
status.add("Optimization Total Time [s]", summary.total_time_in_seconds);
status.add("Optimization Iterations", summary.iterations.size());
status.add("Initial Cost", summary.initial_cost);
status.add("Final Cost", summary.final_cost);

status.mergeSummary(terminationTypeToDiagnosticStatus(summary.termination_type));
}

// Add time since the last optimization request time. This is useful to detect if no transactions are received for
// too long
auto optimization_deadline = decltype(optimization_deadline_)();
{
const std::unique_lock<std::mutex> lock(optimization_requested_mutex_, std::try_to_lock);
if (lock)
{
optimization_deadline = optimization_deadline_;
}
}

if (!optimization_deadline.isZero()) // This is zero for the default-constructed optimization_deadline object
{
const auto optimization_request_time = optimization_deadline - params_.optimization_period;
const auto time_since_last_optimization_request = ros::Time::now() - optimization_request_time;
status.add("Time Since Last Optimization Request [s]", time_since_last_optimization_request.toSec());
}
}
}

} // namespace fuse_optimizers
4 changes: 2 additions & 2 deletions fuse_optimizers/src/optimizer.cpp
@@ -99,7 +99,7 @@ Optimizer::Optimizer(
private_node_handle_.createTimer(ros::Duration(diagnostic_updater_timer_period_),
boost::bind(&diagnostic_updater::Updater::update, &diagnostic_updater_));

-  diagnostic_updater_.add("Optimizer", this, &Optimizer::setDiagnostics);
+  diagnostic_updater_.add(private_node_handle_.getNamespace(), this, &Optimizer::setDiagnostics);
Collaborator Author:
This changes the name, and it simplifies the code removed from the child setDiagnostics() method (the part of the diff that doesn't render well).

diagnostic_updater_.setHardwareID("fuse");

// Wait for a valid time before loading any of the plugins
@@ -482,7 +482,7 @@ void Optimizer::setDiagnostics(diagnostic_updater::DiagnosticStatusWrapper& stat
return;
}

-  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimizer");
+  status.summary(diagnostic_msgs::DiagnosticStatus::OK, "Optimization converged");
Collaborator Author (@efernandez, Nov 24, 2020):

This message matches the one we get now in normal conditions. Maybe not too elegant, but we can't overwrite it with mergeSummary.

Another option would be to change the status.message directly in the children setDiagnostics() method.


auto print_key = [](const std::string& result, const auto& entry) { return result + entry.first + ' '; };
