Epic Programming: Writing Clear and Maintainable Sagas

In the world of microservices, there is often a need to make coordinated changes across several services. One reliable way to achieve this is the Saga pattern, which helps to perform distributed transactions and roll back changes gracefully in case of failures. But, as always, there are nuances: what the idealized materials on this topic describe often differs from real-world experience.

Choreography and orchestration in sagas: how to use them

Let’s imagine that we want to get together with friends and play board games. To come to an agreement, we create a common chat. In sagas, this is called “choreography”:

All services understand that they are participants in some common process.
Each service generates events and sends out messages.
The other participants see these events, react, and the result is a common “dance”.

The second approach is “orchestration”. This is when one person takes on the organization: writes to everyone, calls, and negotiates with each person individually. This approach is reminiscent of an orchestra conductor: there is a plan, and the conductor strictly follows it.

To implement this task, we chose orchestration, and here’s why:

Less rework.

In choreography, services must work in a certain way: send events, listen to them, and so on. But we already have a lot of gRPC services that follow the “request-response” scheme. That is a completely different model, so we would have to change everything.

Orchestration also requires some improvements, but fewer of them. The main thing is to ensure idempotency. Sometimes targeted changes are also required, for example adding a two-phase commit or tweaking something else.

It’s easier to debug in the future.

When there is an orchestrator logging everything that happens, it becomes easier to find what you need in the logs. In a shared chat where everyone is writing at once, by contrast, it can be hard to work out what happened.

What does the Saga consist of?

To give everyone a general idea, especially if you’ve never encountered a saga, I’ll cover the main components. In orchestration, there is a set of steps that the orchestrator must perform. In Go, this can be expressed as:

type Step struct {
    Name string
    Do   StepFn
}
type StepFn func(ctx sagaexec.Context) StepResult

A step has a name that identifies it and a function that describes how to perform it. The function takes one parameter, the saga context: its current state, internal variables, and everything that happens “under the hood.” It returns an abstract result.

Here’s an important point. The saga saves the result of each step so that it can then continue execution from the next step. This is especially important if the application is running in multiple instances. Then, if one node fails, the saga will continue to execute on another.
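
The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the Store type, the Resume function, and the string-valued results are all assumptions made for the example, and a real store would be a database shared by all instances.

```go
package main

import "fmt"

// Step is a simplified step: Do returns a plain string result (assumption).
type Step struct {
	Name string
	Do   func() (string, error)
}

// Store stands in for durable storage shared by all instances (hypothetical).
type Store map[string]string

// Resume runs only the steps that have no saved result yet and saves each
// new result right away, so another instance can pick up from the same point.
func Resume(steps []Step, saved Store) error {
	for _, step := range steps {
		if _, done := saved[step.Name]; done {
			continue // already completed, possibly by another node
		}
		result, err := step.Do()
		if err != nil {
			return fmt.Errorf("step %s: %w", step.Name, err)
		}
		saved[step.Name] = result
	}
	return nil
}

func main() {
	// Pretend another node finished "lookup_place" before it failed.
	saved := Store{"lookup_place": "north"}
	steps := []Step{
		{Name: "lookup_place", Do: func() (string, error) { return "north", nil }},
		{Name: "ask_misha", Do: func() (string, error) { return "ok", nil }},
	}
	if err := Resume(steps, saved); err != nil {
		fmt.Println("saga failed:", err)
		return
	}
	fmt.Println("saved results:", len(saved))
}
```

Because the result is written after every step, the worst case after a crash is that one step is executed twice, which is one reason why idempotent steps matter.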

Working with the result of one step from another might look like this:

{
    Name: "lookup_place",
    Do: func(sagaexec.Context) sagaexec.StepResult {
        // ...
    },
},
{
    Name: "ask_misha",
    Do: func(sCtx sagaexec.Context) sagaexec.StepResult {
        place := StepResultAs[PlaceInfo](sCtx, "lookup_place")
        // ...
    },
},

In this pseudo-code, the saga performed the “Find a place” step, saved the result, and then used it in another step, “Ask Misha”.

To make it easier to work with the results, there are special functions like StepResultAs, which use a generic type parameter to convert the generalized result to the required type. You can then use this information to write a message to Misha saying that the gathering will be at such-and-such a place, and wait for his response.
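
A helper like StepResultAs could look roughly like this. The article does not show its body, so treat this as a sketch under assumptions: the Context struct here is a stand-in for sagaexec.Context that only keeps results in a map, PlaceInfo is a hypothetical result type, and panicking on a missing or mistyped result is one possible design choice.

```go
package main

import "fmt"

// Context is a stand-in for sagaexec.Context: it only stores results by step name.
type Context struct {
	results map[string]any
}

// PlaceInfo is a hypothetical result type of the "lookup_place" step.
type PlaceInfo struct {
	Region string
}

// StepResultAs returns the saved result of the named step converted to T.
// It panics when the step has no result or the result has a different type,
// treating both cases as programming errors.
func StepResultAs[T any](sCtx *Context, name string) T {
	raw, ok := sCtx.results[name]
	if !ok {
		panic(fmt.Sprintf("saga: no result for step %q", name))
	}
	typed, ok := raw.(T)
	if !ok {
		panic(fmt.Sprintf("saga: result of step %q has unexpected type %T", name, raw))
	}
	return typed
}

func main() {
	sCtx := &Context{results: map[string]any{
		"lookup_place": PlaceInfo{Region: "north"},
	}}
	place := StepResultAs[PlaceInfo](sCtx, "lookup_place")
	fmt.Println(place.Region)
}
```

The type parameter keeps the call site short: the caller names the expected type once and gets a typed value back without writing a type assertion in every step.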

The next important aspect of the saga is compensation. Imagine: we found a place, agreed with Misha, wrote to Vasya, and he says: “Sorry, I can’t.” What now? We have already told Misha, so we’ll have to tell him that the meeting is cancelled so that everyone is free to make other plans.

To implement this, we add one more function at each step. We have:

A function that describes how to perform an action.
A function that describes how to undo an action.

type Step struct {
    Name       string
    Do         StepFn
    Compensate StepFn
}

The last aspect is the workflow. All the steps described can be combined into a sequence. For example: find a room, write to everyone, book, and so on.

In Go, we usually express such a sequence with a simple slice - that is, we create a slice of steps and describe the necessary parameters and logic for each.

func NewGatherFriends() []sagaexec.Step {
    return []sagaexec.Step{
        {Name: "lookup_place", Do: ...},
        {Name: "ask_misha",    Do: ..., Compensate: ...},
        {Name: "ask_vasya",    Do: ..., Compensate: ...},
        {Name: "ask_kolyan",   Do: ..., Compensate: ...},
        {Name: "book_place",   Do: ..., Compensate: ...},
    }
}

A short recap of the main concepts:

A saga workflow is necessarily a sequential set of steps.
A saga step is a transactional action: we know how to do it and how to cancel it if necessary.
A step result is the result of an action that needs to be saved for use in other steps.

Such a workflow either completely ends when all the steps are completed, or if one of the steps fails, we roll everything back. In essence, this is a transaction, and a saga is one way to implement it in a distributed form. If you have studied sagas, then you have definitely come across the term “distributed transactions”. And if you have understood distributed transactions, then you have probably heard of sagas.
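
The “all steps complete, or everything is rolled back” behaviour can be sketched as a small runner. This is an illustration only: the function types are simplified compared to the article's StepFn, and Run and ErrBusy are names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Step mirrors the shape from the article, with simplified function types.
type Step struct {
	Name       string
	Do         func() error
	Compensate func() error // nil when there is nothing to undo
}

// ErrBusy is a sample failure used in the usage example below.
var ErrBusy = errors.New("Vasya is busy")

// Run executes the steps in order. On the first failure it compensates the
// already completed steps in reverse order and returns the collected errors.
func Run(steps []Step) error {
	for i, step := range steps {
		err := step.Do()
		if err == nil {
			continue
		}
		failure := fmt.Errorf("step %s: %w", step.Name, err)
		for j := i - 1; j >= 0; j-- {
			undo := steps[j].Compensate
			if undo == nil {
				continue
			}
			if cerr := undo(); cerr != nil {
				failure = errors.Join(failure, fmt.Errorf("compensate %s: %w", steps[j].Name, cerr))
			}
		}
		return failure
	}
	return nil
}

func main() {
	say := func(msg string) func() error {
		return func() error { fmt.Println(msg); return nil }
	}
	err := Run([]Step{
		{Name: "lookup_place", Do: say("found a place")},
		{Name: "ask_misha", Do: say("Misha agreed"), Compensate: say("told Misha it is cancelled")},
		{Name: "ask_vasya", Do: func() error { return ErrBusy }},
	})
	fmt.Println("saga result:", err)
}
```

The failed step itself is not compensated here, on the assumption that a step that returned an error changed nothing.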

Saga and the need for branching

We’ve already said that a saga workflow is a sequential set of steps. But in reality things rarely go just one way: usually you want to add some branching. Then a saga workflow looks something like this:

lookup_place
if place is in the north of the city
- ask_misha
- ask_vasya
else
- ask_petya
- ask_nikita
ask_kolyan
book_place

If the place is in the north, Misha and Vasya agree to go there, but they will not go anywhere else because they are not mobile. Therefore, if the place is not in the north, we need to invite Petya and Nikita instead. Kolyan will go anywhere, so he does not need a separate branch.

In this case, the structure is no longer flat, and it cannot be described by a regular slice of steps. The question arises: what to do?

Let’s look at some example code:

{
    Name: "ask_misha",
    Do: func(sCtx sagaexec.Context) sagaexec.StepResult {
        place := StepResultAs[PlaceInfo](sCtx, "lookup_place")
        if place.GetRegion() != "north" {
            // If the place is not in the north, skip the step
            return sagaexec.EmptyStepResult()
        }
        // Logic for inviting Misha goes here
        // ...
    },
}

There is a step “Ask Misha”. We take the result of the step “Search for a place” and look at where this place is. If we understand that Misha will not go there, then we simply return an empty result and essentially skip the action. The same logic can be set for each person, and then we get a flat list:

lookup_place
ask_misha
ask_vasya
ask_petya
ask_nikita
ask_kolyan
book_place

From the saga’s point of view, all these steps are executed, but in fact some of them may do nothing. The approach works and is sometimes quite appropriate, but it has a downside: at first glance it seems that we are inviting five people, although in fact we invite only three. To see that, you have to dig into the code and study all the logic, because the list of steps does not reflect what actually happens.

When there are many steps, you want to evaluate at a glance who is participating where. That is why we introduced a new step type and called it Select. It takes the results of previous steps, analyzes them and returns a slice of steps, i.e., essentially, launches a new workflow.

{
    Name: "select_friends",
    Select: func(sCtx sagaexec.Context) []sagaexec.Step {
        place := StepResultAs[PlaceInfo](sCtx, "lookup_place")
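        if place.GetRegion() != "north" {
            // Not in the north: invite Petya and Nikita
            return []sagaexec.Step{
                {Name: "ask_petya",  Do: ..., Compensate: ...},
                {Name: "ask_nikita", Do: ..., Compensate: ...},
            }
        }
        // In the north: invite Misha and Vasya
        return []sagaexec.Step{
            {Name: "ask_misha", Do: ..., Compensate: ...},
            {Name: "ask_vasya", Do: ..., Compensate: ...},
        }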
    },
}

We ran the “Find a place” step and then reached the Select step, which decides which workflow to run next. The result is no longer a flat slice of steps but a tree of lists. This set of workflows, although limited, gives an understanding of the entire logic. If we ever get around to visualization, we can build beautiful diagrams, and then everything will become even clearer.
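
How an executor might expand a Select step at run time can be sketched like this. The Context with a Region field, the Executed list, and the recursive Run function are assumptions for illustration; the article does not show how its executor handles Select.

```go
package main

import "fmt"

// Context is a stand-in for the saga context (assumption).
type Context struct {
	Region   string   // pretend this came from the "lookup_place" result
	Executed []string // names of the steps that actually ran
}

type Step struct {
	Name   string
	Do     func(*Context) error
	Select func(*Context) []Step // when set, the step expands into a nested workflow
}

// Run walks the workflow depth-first: a Select step is replaced at run time
// by the steps it returns, so the definition stays a readable tree.
func Run(sCtx *Context, steps []Step) error {
	for _, step := range steps {
		if step.Select != nil {
			if err := Run(sCtx, step.Select(sCtx)); err != nil {
				return err
			}
			continue
		}
		if err := step.Do(sCtx); err != nil {
			return fmt.Errorf("step %s: %w", step.Name, err)
		}
		sCtx.Executed = append(sCtx.Executed, step.Name)
	}
	return nil
}

func noop(*Context) error { return nil }

func NewGatherFriends() []Step {
	return []Step{
		{Name: "lookup_place", Do: noop},
		{Name: "select_friends", Select: func(sCtx *Context) []Step {
			if sCtx.Region != "north" {
				return []Step{{Name: "ask_petya", Do: noop}, {Name: "ask_nikita", Do: noop}}
			}
			return []Step{{Name: "ask_misha", Do: noop}, {Name: "ask_vasya", Do: noop}}
		}},
		{Name: "ask_kolyan", Do: noop},
		{Name: "book_place", Do: noop},
	}
}

func main() {
	sCtx := &Context{Region: "south"}
	if err := Run(sCtx, NewGatherFriends()); err != nil {
		fmt.Println("saga failed:", err)
		return
	}
	fmt.Println(sCtx.Executed)
}
```

Only the steps of the chosen branch appear in the executed list, so both the definition and the trace reflect who was really invited.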

Plain Go or Structured Go: Choosing an Approach for a Saga

When I solve a problem, I always ask myself: “What exactly do I need to do?” Initially, I needed a saga. But for some reason I introduced the concept of a step as a struct abstraction and began to combine these steps into slices. Then, when I needed an if, I had to solve that problem within the framework of the existing solution, although branching has nothing to do with sagas as such. If I had simply written plain Go, I would have used a regular if and switch without any problems.

We tried to do that. Let’s compare what the Go code looks like and the “structured” version. The Go code:

place, err := Do(sCtx, "lookup_place", ...)
if err != nil {
    return nil, fmt.Errorf("lookup place: %w", err)
}
_, err = Do(sCtx, "ask_misha", ...)
if err != nil {
    return nil, fmt.Errorf("ask misha: %w", err)
}
defer func() {
    _err := Compensate(sCtx, "ask_misha", ...)
    if _err != nil {
        // Handle or log the compensation error here
    }
}()

In the Go version there is a function Do that performs a step and returns an error. You have to write the typical if err != nil code and so on. In the next step, we call Do again and check the error. The second step also has a compensation in case something goes wrong later. Compensations need to run in reverse order, and in Go we usually do such things with defer. The result is a rather long piece of code made of if and defer blocks, and it is not very convenient to read.

For comparison, the structural version:

{
    {
        Name: "lookup_place",
        Do: ...,
    },
    {
        Name: "ask_misha",
        Do: ...,
        Compensate: ...,
    },
}

Here everything is visible at once. Code editors have a fold function: fold a block and you are left with a compact list of steps, which makes it easy to understand what is happening.

Let’s compare two options:

Plain Go	Structured Go
Doesn’t limit in any way	It puts you in a frame
• Write simply and quickly	• Writing is more difficult
• Flow is not obvious	• Flow is immediately clear

“Plain Go” does not impose restrictions: we write what we want, and quickly, without unnecessary problems. In “Structured Go” you have to follow the rules. Sometimes these restrictions lead to strange or non-obvious steps that exist only to fit into the structure. Therefore, writing can be noticeably more difficult.

However, there is a plus. In “Plain Go” it can be difficult to understand the workflow and to pick out the names of the steps, but in “Structured Go” everything is immediately “on the surface”: it is enough to look at the names of the steps to understand the logic of the process without reading the code in detail.

Another argument in favor of the structured option is how difficult it is to use defer correctly for compensation. Let’s imagine that we have the following code:

defer func() {
    err := Compensate(sCtx, "ask_misha", ...)
    retErr = multierr.Append(retErr, err)
}()

Compensate is called inside defer. It takes sCtx, the saga context, and checks whether the saga is “broken”. If it is, the compensation runs. But how do we know that the saga is broken? To signal this, some step has to return a special error, for example AbortError, which switches the context to this state.

This must happen inside Do so that the saga coordinator can change the state of sCtx:

_, err = Do(sCtx, "ask_vasya", func() {
    if smth {
        return nil, sagaexec.NewAbortError(...)
    }
})

But when we write plain Go code, nothing restricts us, so we can just as well write this outside of Do:

if smth {
    return nil, sagaexec.NewAbortError(...)
}

It looks as if AbortError is returned just as before, but the saga coordinator does not see it, and compensations will not be launched. This is a real bug from our code. Coupled with all the other problems of “Plain Go”, it is very easy to make a mistake, which is why we decided that the structured approach is more reliable for us.

In a structured approach, we do a lot of different things to avoid writing code that will break, because we’re talking about an API proxy - the “face” of the product and the entry point for all the data. We need it to be stable and work well.

Structured code has another advantage: it is easier to test. If you have separate steps, you can test each one separately. If these steps are combined into one large function, you will have to test everything at once, which will lead to an “explosion” of the number of test cases.

Plain Go	Structured Go
Doesn’t limit in any way	It puts you in a frame
• Write simply and quickly	• Writing is more difficult
• Flow is not obvious	• The flow is immediately clear
• It’s easier to screw up	• It’s harder to screw up
	• Much easier to test

The structural approach in practice

We need to constantly evolve the workflow: write new sagas, add new scenarios or make changes to existing ones.

How well does the described solution fit this?

Workflow development

Adding a new workflow is easy. It does not touch the code that is already written and does not affect existing scenarios. Moreover, we can reuse ready-made steps, because they are defined as separate structures.

Changing an existing workflow is not that easy. Imagine a workflow with several steps that are distributed across two servers: one performs the first step and the other performs the second. If the servers run different versions of the code, we lose consistency. We either need to ensure backward compatibility and understand how different versions can work together, or, if that is not possible, create a new workflow. This is similar to API versioning with v1, v2, and so on: all the new logic goes into the second version, and the old one continues to work as before until we gradually migrate to the new version.
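
One way to keep old and new workflow versions side by side is a registry keyed by name and version. The article does not describe how its versions are stored, so the Registry type, its methods, and the "name@vN" key format below are purely illustrative assumptions.

```go
package main

import "fmt"

// Step is reduced to a name; only the registry logic matters here.
type Step struct{ Name string }

// Registry maps "name@vN" to a workflow definition. A running saga would
// store its key, so a node with newer code still resumes it with the old steps.
type Registry map[string][]Step

func key(name string, version int) string {
	return fmt.Sprintf("%s@v%d", name, version)
}

// Register adds a workflow definition under the given name and version.
func (r Registry) Register(name string, version int, steps []Step) {
	r[key(name, version)] = steps
}

// Lookup returns the exact version a saga was started with.
func (r Registry) Lookup(name string, version int) ([]Step, error) {
	steps, ok := r[key(name, version)]
	if !ok {
		return nil, fmt.Errorf("workflow %s is not registered", key(name, version))
	}
	return steps, nil
}

func main() {
	reg := Registry{}
	reg.Register("gather_friends", 1, []Step{{Name: "lookup_place"}, {Name: "book_place"}})
	reg.Register("gather_friends", 2, []Step{{Name: "lookup_place"}, {Name: "ask_everyone"}, {Name: "book_place"}})

	// A saga started under v1 keeps using the v1 definition after a deploy.
	steps, err := reg.Lookup("gather_friends", 1)
	fmt.Println(len(steps), err)
}
```

New sagas start with the latest version, while the old definition stays registered until the last saga that uses it has finished.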

In addition to developing the workflow, we want to develop the tool itself, add new features to it. In our example, we need to write to three people. We can write to each in turn, waiting for an answer, or we can write to everyone at once, that is, do it in parallel.

The simplest way is to support a DoParallel function that takes a saga context but returns a set of steps instead of a result. Using the same abstractions that we already use to describe sequential steps, we can describe their parallel execution.

{
    Name: "ask_everyone",
    DoParallel: func(sCtx sagaexec.Context) []sagaexec.Step {
        ...
    },
},
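
What the executor might do with the steps returned by DoParallel can be sketched with goroutines and a sync.WaitGroup. RunParallel, ErrNoAnswer, and the simplified Step type are assumptions for this example; persistence of the results and compensation are left out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type Step struct {
	Name string
	Do   func() error
}

// ErrNoAnswer is a sample failure for experiments with this sketch.
var ErrNoAnswer = errors.New("no answer")

// RunParallel starts every step in its own goroutine, waits for all of them,
// and joins the errors of the failed ones (nil when every step succeeded).
func RunParallel(steps []Step) error {
	errs := make([]error, len(steps)) // one slot per step, so no mutex is needed
	var wg sync.WaitGroup
	for i, step := range steps {
		wg.Add(1)
		go func(i int, step Step) {
			defer wg.Done()
			if err := step.Do(); err != nil {
				errs[i] = fmt.Errorf("step %s: %w", step.Name, err)
			}
		}(i, step)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	ask := func(name string) Step {
		return Step{Name: "ask_" + name, Do: func() error { return nil }}
	}
	err := RunParallel([]Step{ask("misha"), ask("vasya"), ask("kolyan")})
	fmt.Println("everyone answered:", err == nil)
}
```

The sketch waits for every step even when one of them fails early, which keeps the example simple. Cancelling the remaining steps through a context would be a natural extension.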

Another important term from the world of sagas is Pivot, the point of no return. The principle is this: the saga runs and reaches a certain step. Once this step is completed, it is no longer possible to roll back.

{
    Name: "book_place",
    Pivot: true,
    Do: ...
}

Let’s say we’ve booked a place and paid a non-refundable deposit. After that, there’s no point in canceling the reservation, so we want all subsequent steps to run and the saga to finish. This is easy to implement thanks to the structured approach: we simply add a new field to the step struct and indicate that this is the point of no return.
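
A runner that respects the Pivot flag could behave roughly as in the sketch below. The Outcome type and the decision to report a failure after the pivot as “paused” are assumptions made for illustration; the types are simplified and compensation errors are ignored for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

type Step struct {
	Name       string
	Pivot      bool
	Do         func() error
	Compensate func() error
}

// Outcome describes how a run ended (hypothetical type for this sketch).
type Outcome int

const (
	Completed  Outcome = iota // every step succeeded
	RolledBack                // failed before the pivot, completed steps were undone
	Paused                    // failed after the pivot, needs a developer
)

// ErrNoAnswer is a sample failure used in the usage example.
var ErrNoAnswer = errors.New("no answer")

// Run compensates in reverse order only while the pivot has not been passed.
// After the pivot nothing is undone and the saga is reported as paused.
func Run(steps []Step) (Outcome, error) {
	pivotPassed := false
	for i, step := range steps {
		if err := step.Do(); err != nil {
			failure := fmt.Errorf("step %s: %w", step.Name, err)
			if pivotPassed {
				return Paused, failure
			}
			for j := i - 1; j >= 0; j-- {
				if undo := steps[j].Compensate; undo != nil {
					_ = undo() // compensation errors are ignored to keep the sketch short
				}
			}
			return RolledBack, failure
		}
		if step.Pivot {
			pivotPassed = true
		}
	}
	return Completed, nil
}

func main() {
	ok := func() error { return nil }
	outcome, err := Run([]Step{
		{Name: "ask_misha", Do: ok, Compensate: ok},
		{Name: "book_place", Pivot: true, Do: ok},
		{Name: "send_reminder", Do: func() error { return ErrNoAnswer }},
	})
	fmt.Println(outcome == Paused, err)
}
```

A failure of the pivot step itself still triggers a rollback here, because the point of no return is only reached once that step has completed.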

Another very important thing in a saga is the retry mechanism. Let’s say we’ve performed some step, say, sent a gRPC request, but the server didn’t respond. Then we try again after some time. Again, thanks to the structured approach, you can extend this as much as you want. For example, you can add a simple back-off like this:

{
    Name: "ask_misha",
    RetryIn: []time.Duration{time.Minute, 5 * time.Minute, 10 * time.Minute},
    Do: ...
}
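
How an executor might apply such a RetryIn schedule can be sketched as follows. DoWithRetry, NoSleep, and ErrRetriesExhausted are hypothetical names; the sleep function is passed in so that the example and its tests do not have to wait for minutes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrRetriesExhausted tells the caller that the schedule ran out.
var ErrRetriesExhausted = errors.New("retries exhausted")

// ErrNoAnswer is a sample failure used in the usage example.
var ErrNoAnswer = errors.New("no answer")

// RetryIn mirrors the schedule from the step definition above.
var RetryIn = []time.Duration{time.Minute, 5 * time.Minute, 10 * time.Minute}

// NoSleep skips the waiting; real code would pass time.Sleep instead.
func NoSleep(time.Duration) {}

// DoWithRetry calls do once and then again after each delay in retryIn
// until it succeeds or the schedule is exhausted.
func DoWithRetry(do func() error, retryIn []time.Duration, sleep func(time.Duration)) error {
	err := do()
	for _, delay := range retryIn {
		if err == nil {
			return nil
		}
		sleep(delay)
		err = do()
	}
	if err != nil {
		return fmt.Errorf("%w: last error: %v", ErrRetriesExhausted, err)
	}
	return nil
}

func main() {
	attempts := 0
	flaky := func() error {
		attempts++
		if attempts < 3 {
			return ErrNoAnswer // the first two attempts fail
		}
		return nil
	}
	err := DoWithRetry(flaky, RetryIn, NoSleep)
	fmt.Println(attempts, err)
}
```

When the schedule runs out, the sketch returns ErrRetriesExhausted and leaves the decision about what to do next to the caller.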

I mentioned at the beginning of the article that idempotency is important for the services in a saga. The reason is precisely retries. Imagine that we call an external system and tell it: “Add 100 rubles to the account”, but we do not receive confirmation that the operation has been completed. We do not know whether we need to compensate for the action, because the money may already have been credited and we simply did not receive a response. If we run the compensation “Subtract 100 rubles” when the money was never credited, we take away funds that the person never received.
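
On the receiving side, idempotency is often achieved with an operation key. The toy Account service below is an assumption made for illustration, as is the idea of building the key from a saga ID and a step name; it only shows why a retried request becomes safe.

```go
package main

import "fmt"

// Account is a toy external service. It remembers the keys of applied
// operations, so a retried request with the same key changes nothing.
type Account struct {
	Balance int
	applied map[string]bool
}

func NewAccount() *Account {
	return &Account{applied: make(map[string]bool)}
}

// Credit adds amount at most once per idempotency key and reports whether
// the balance actually changed.
func (a *Account) Credit(key string, amount int) bool {
	if a.applied[key] {
		return false // duplicate request: already applied
	}
	a.applied[key] = true
	a.Balance += amount
	return true
}

func main() {
	acc := NewAccount()
	// The key could be derived from the saga ID and the step name (assumption).
	acc.Credit("saga-42/add_money", 100)
	acc.Credit("saga-42/add_money", 100) // retry after a lost response
	fmt.Println(acc.Balance)             // prints 100
}
```

With such a key the orchestrator can repeat the request until it gets a definite answer, instead of guessing whether a compensation is needed.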

Therefore, when a step needs to be retried, the usual advice in saga materials is: “Retry until you succeed”, that is, repeat the attempt endlessly until the step is completed. But this is an unrealistic situation.

What other unrealistic situations can arise? Here are some examples:

Retrying doesn’t help, and it’s impossible to keep trying endlessly.
Abort happened after Pivot: we indicated that “we can’t go back further”, and then suddenly we have to roll back.
Step not found by name or no result found.
The step result is of the wrong type: we expected a specific enum, but suddenly an unknown value came.

What to do in such cases? It’s similar to the problem with retries: something needs to be done, but it’s not clear what. We decided to pause the saga so that the developers could figure it out.

Panic in Go: Why is it really needed?

Most of the time in Go we follow the rule “Don’t panic!” But in truly exceptional situations panic is quite appropriate.

Panic in truly exceptional situations!

The examples described above are exactly such situations. They are “unrealistic” because they should not arise at all during normal development.

This approach frees us from having to invent handling for something that is theoretically impossible. We just panic, and the developers will figure it out later if it ever happens. In practice such panics are not uncommon in tests and on staging, while in production I have seen such a panic only once, when I made a slight mistake with a deployment.
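
One possible way to connect “just panic” with “pause the saga” is to recover at the executor boundary. The article does not show this part, so RunStep and the pause reason string below are assumptions for a minimal sketch.

```go
package main

import "fmt"

type Step struct {
	Name string
	Do   func()
}

// RunStep executes one step and turns a panic into a pause reason instead of
// crashing the process, so that a developer can inspect the saga later.
// An empty string means that the step finished normally.
func RunStep(step Step) (pauseReason string) {
	defer func() {
		if r := recover(); r != nil {
			pauseReason = fmt.Sprintf("step %q panicked: %v", step.Name, r)
		}
	}()
	step.Do()
	return ""
}

func main() {
	broken := Step{Name: "book_place", Do: func() {
		panic("no result found for step lookup_place")
	}}
	if reason := RunStep(broken); reason != "" {
		// A real executor would persist the paused state and alert developers.
		fmt.Println("[PAUS]", reason)
	}
}
```

Step code stays free of handling for impossible cases, while the executor remains the single place that decides what a panic means for the saga.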

Debugging Sagas in a Distributed System

We have a distributed system, so you would expect everything to be very complex. But thanks to orchestration, debugging a saga is not that difficult, and personally I like how we implemented it.

The orchestrator records everything. We have an API proxy: every incoming request is wrapped in a saga, and the saga always starts with some request. In the log, it looks like this:

[SAGA] request=[platform.api.GatherFriendsRequest]:{}
[STEP] id=lookup_place result=[geo.api.PlaceInfo]:{region:"north"}
[STEP] id=ask_misha result=[google.protobuf.Empty]:{}
[STEP] id=ask_vasya result=[google.protobuf.Empty]:{}
[STEP] id=ask_kolyan result=[google.protobuf.Empty]:{}
[STEP] id=book_place result=[geo.api.BookPlaceResponse]:{}
[RESP] [platform.api.GatherFriendsResponse]:{}

I’ve shortened the data here for simplicity, but typically all the request parameters, headers, and other information will be listed in brackets, as well as the full results of each step. Finally, the final response to the API request is generated.

If something went wrong and we need to interrupt the saga, then instead of the usual response, an error with text will be returned. Then the log of the interrupted saga will look like this:

[SAGA] request=[platform.api.GatherFriendsRequest]:{}
[STEP] id=lookup_place result=[geo.api.PlaceInfo]:{region:"north"}
[STEP] id=ask_misha result=[google.protobuf.Empty]:{}
[STEP] id=ask_vasya result=[google.protobuf.Empty]:{}
[ABRT] message="Vasya is busy at sunday."

If the saga needs to be paused, the user will get a message like “Something inside is broken, developers are running to fix it.” The logs will store the internal cause of the error, for example, that some step was missed:

[SAGA] request=[platform.api.GatherFriendsRequest]:{}
[STEP] id=lookup_place result=[geo.api.PlaceInfo]:{region:"north"}
[STEP] id=ask_misha result=[google.protobuf.Empty]:{}
[STEP] id=ask_vasya result=[google.protobuf.Empty]:{}
[STEP] id=ask_kolyan result=[google.protobuf.Empty]:{}
[PAUS] reason="step 'book_place' is missed"

This gives a high-level picture. You can see right away what happened. And if necessary, you can then drill down into the specific service where the saga went.

Results

Saga is much more than a pattern from the Internet. It is interesting to study and convenient to use.

Sagas can be used for GET requests

Originally, a saga is about a distributed transaction: making changes and being able to roll them back. In GET requests we only request data. Why wrap them in a saga, then?

We use the retry mechanism. If something went wrong during the execution of the request, the saga allows you to easily repeat the request.
We request data in parallel and glue together the response from different pieces. When you need to collect a response from several services, you can go to each in parallel and then combine the result.

Saga is great for making asynchronous APIs

The very nature of sagas implies asynchronous behavior: we start a saga, make entries in the DB, and gradually perform steps. If we need a synchronous API, we simply wait for the end of the saga. If we want an asynchronous one, we return the saga identifier to the client, and then give the ability to check the execution status by this identifier.
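
The synchronous and asynchronous modes can be sketched on top of the same starter. The Sagas type below keeps statuses in memory purely for illustration, whereas the article stores saga state in a database; all names here are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

type Status string

const (
	Running Status = "running"
	Done    Status = "done"
)

// Sagas tracks saga statuses in memory (a stand-in for the DB).
type Sagas struct {
	mu     sync.Mutex
	nextID int
	status map[int]Status
	wg     sync.WaitGroup
}

func NewSagas() *Sagas {
	return &Sagas{status: make(map[int]Status)}
}

// Start launches the workflow in the background and returns its ID at once.
func (s *Sagas) Start(workflow func()) int {
	s.mu.Lock()
	s.nextID++
	id := s.nextID
	s.status[id] = Running
	s.mu.Unlock()

	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		workflow()
		s.mu.Lock()
		s.status[id] = Done
		s.mu.Unlock()
	}()
	return id
}

// Status is what an asynchronous API would expose for polling by ID.
func (s *Sagas) Status(id int) Status {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status[id]
}

// Wait blocks until all started sagas finish; a synchronous API would wait like this.
func (s *Sagas) Wait() { s.wg.Wait() }

func main() {
	sagas := NewSagas()
	release := make(chan struct{})
	id := sagas.Start(func() { <-release }) // the workflow blocks until released

	fmt.Println(sagas.Status(id)) // asynchronous client polls: still running
	close(release)
	sagas.Wait() // synchronous client simply waits
	fmt.Println(sagas.Status(id))
}
```

The only difference between the two API styles is who waits: the server before it responds, or the client polling with the returned identifier.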

Saga develops skills in architecture and development

Writing the MVP of the saga yourself is an interesting weekend task that helps you improve your architectural solutions.

Don’t get into vendor lock-in. There are ready-made tools like Temporal, but when you use them you have to put up with other people’s bugs and approaches. We decided to control everything ourselves.
Consider the specifics of the project. Each project has its own nuances. Often, it is much easier to add and support them in your solution than in an existing one that was created for slightly different tasks.
This is the Go-Way. Temporal can do a lot, but we needed short and narrow functionality that is easier to maintain ourselves. This is the Go-approach: solve simple problems ourselves.