Saturday, 15 October 2022

Error and exception handling design patterns across distributed systems

I'm beginning an investigation into how we might better handle errors across our many distributed services at work.

Some services are very isolated and others much less so. The front end often needs to know about errors so that it can display something predictable. These are all common requirements.

Right now errors are created per service/repo, and each one is tracked and passed up individually. I have discovered that this causes its own problems: trying to show an error on the front end has had me creating three or more separate errors in three separate repos and passing each one up. Each service has its own error code style, and even if I wanted to reuse the same code, it might already be taken. The scope for mess is high.

Is there a pattern or service where errors are created centrally, and that service is then imported (or language-specific clients/libraries are generated from it) into each service that needs it? The goal would be to share errors, their codes, and validated front-end error text, so that copy changes, and maybe more, happen in one place.
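To make the idea concrete, here is a minimal sketch of what a shared error catalog library could look like. Everything in it is an assumption for illustration: the names `ErrorDefinition`, `ErrorCatalog`, and `CatalogError`, the example code `PAYMENT_DECLINED`, and the payload shape are all hypothetical, not an existing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorDefinition:
    """One centrally defined error (hypothetical shape)."""
    code: str          # stable, globally unique identifier
    http_status: int   # status an edge service could return
    user_message: str  # reviewed copy the front end may display


class ErrorCatalog:
    """In-memory registry that rejects duplicate codes."""

    def __init__(self):
        self._definitions = {}

    def register(self, definition):
        # Refusing duplicates is what prevents two teams from
        # claiming the same code for different errors.
        if definition.code in self._definitions:
            raise ValueError(f"duplicate error code: {definition.code}")
        self._definitions[definition.code] = definition
        return definition

    def get(self, code):
        return self._definitions[code]


class CatalogError(Exception):
    """Exception that carries a catalog definition with it."""

    def __init__(self, definition, detail=""):
        super().__init__(f"{definition.code}: {detail or definition.user_message}")
        self.definition = definition
        self.detail = detail  # internal detail, kept out of the payload

    def to_payload(self):
        # Only the reviewed copy is exposed to the front end.
        return {
            "code": self.definition.code,
            "status": self.definition.http_status,
            "message": self.definition.user_message,
        }


# Hypothetical usage inside one service.
catalog = ErrorCatalog()
PAYMENT_DECLINED = catalog.register(
    ErrorDefinition("PAYMENT_DECLINED", 402, "Your payment was declined.")
)
payload = CatalogError(PAYMENT_DECLINED, detail="card expired").to_payload()
```

In a multi-language setup the definitions would more likely live in a data file (JSON or YAML, say) from which each language's client is generated, but the duplicate check and the single payload shape are the parts that matter.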

One option I envision is a separate service, maybe even with a front end that allows easy creation of errors. This service could also scan our repos and report error logs or exceptions that don't contain a property linking back to the error service, so we can slowly migrate over. This, however, sounds high in complexity.
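The reporting part does not have to be complex to start with. As a rough sketch, assuming logs are emitted as one JSON object per line with a `level` field, and assuming a hypothetical `error_code` field is the property that links back to the catalog, a scanner could look like this:

```python
import json


def find_unlinked_errors(log_lines, known_codes, link_field="error_code"):
    """Return (line_number, message) for error records with no known code.

    The field names "level", "message" and "error_code" are assumptions
    about the log format, not an existing standard.
    """
    unlinked = []
    for line_number, line in enumerate(log_lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON lines are skipped, not reported
        if not isinstance(record, dict):
            continue
        if str(record.get("level", "")).lower() != "error":
            continue
        code = record.get(link_field)
        # A missing code and an unknown code are both migration candidates.
        if not isinstance(code, str) or code not in known_codes:
            unlinked.append((line_number, record.get("message", "")))
    return unlinked


# Hypothetical sample input.
sample_lines = [
    '{"level": "info", "message": "started"}',
    '{"level": "error", "message": "db timeout", "error_code": "DB_TIMEOUT"}',
    '{"level": "error", "message": "something broke"}',
    'plain text line',
    '{"level": "ERROR", "message": "bad code", "error_code": "NOT_IN_CATALOG"}',
]
report = find_unlinked_errors(sample_lines, known_codes={"DB_TIMEOUT"})
```

Run periodically per service, a report like this gives a migration backlog without requiring every repo to change at once.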

Another option is a pub/sub-based service that collects error messages and adds to them along the way. That sounds like it could become a single point of failure for all of this, though, as opposed to a client/library imported at build time.
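For comparison, the "adds to error messages along the way" behaviour can also be done without a central broker, by passing a small envelope along with the error response and letting each service append to it. The sketch below is only an illustration of that alternative; `ErrorEnvelope`, its fields, and the service names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorEnvelope:
    """Error plus the context each service added on the way up."""
    code: str
    origin_service: str
    trail: list = field(default_factory=list)

    def add_hop(self, service, note):
        # Each service appends its own context instead of
        # replacing the original error with a new one.
        self.trail.append({"service": service, "note": note})
        return self

    def to_dict(self):
        return {
            "code": self.code,
            "origin_service": self.origin_service,
            "trail": list(self.trail),
        }


# Hypothetical flow: billing raises, orders and the gateway add context.
envelope = ErrorEnvelope(code="PAYMENT_DECLINED", origin_service="billing")
envelope.add_hop("orders", "order 123 left in pending state")
envelope.add_hop("gateway", "returned to client as HTTP 402")
serialized = envelope.to_dict()
```

Because the envelope travels in the response itself, no extra service has to be available for the error to reach the front end.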

I feel like this should have been solved before and I'm not doing a very good job of searching. I'm willing to follow a sensible pattern for distributed systems, or to use off-the-shelf libraries or services that work across multiple languages.

I write this because we use Datadog, and when I'm writing monitors/alerts it's clear that we lack a coherent strategy across the board. It's most obvious when I'm trying to write a monitor that reads log output across multiple services. It would be nice if I could refer to a source of truth somewhere.
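What I have in mind for the monitoring side is every service logging errors in the same structured shape, so one monitor query can match on a single attribute. Below is a minimal sketch using only Python's standard `logging` module. The attribute names `service` and `error_code` are my assumption, and it also assumes the log pipeline parses JSON lines into searchable attributes.

```python
import json
import logging


class JsonErrorFormatter(logging.Formatter):
    """Render each record as one JSON line with a stable error_code."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            # "service" and "error_code" arrive via logging's `extra`
            # argument; both names are assumptions for this sketch.
            "service": getattr(record, "service", "unknown"),
            "error_code": getattr(record, "error_code", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonErrorFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


def log_catalog_error(logger, service, error_code, message):
    """Log an error with the attributes a cross-service monitor would match."""
    logger.error(message, extra={"service": service, "error_code": error_code})
```

A service would call something like `log_catalog_error(build_logger("billing"), "billing", "PAYMENT_DECLINED", "card expired")`, and a monitor could then filter on `error_code` regardless of which service emitted the line.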
