One Exporter to Rule Them All: Exploring Camunda Exporter

Achieve a more streamlined architecture and better performance and stability with the new Camunda Exporter.

When using Camunda 8, you might encounter the concept of an exporter. An exporter is used to push out historic data, generated by our processing engine, to a secondary storage.

Our web applications have historically used importers and archivers to consume, aggregate, and archive historical data provided by our Elasticsearch (ES) or OpenSearch (OS) Exporter.

In the past year, we’ve engineered a new Camunda Exporter, which brings the importer and archiving logic of web components (Tasklist and Operate) closer to our distributed platform (Zeebe). This allows us to simplify our installation, enable scalability for web apps, reduce the latency to show runtime and historical data, and reduce data duplication (resource consumption).

In this blog post, we want to share more details about this project and the related architecture changes.

Challenges with the current architecture

Before we introduce the Camunda Exporter, we want to go into more detail about the challenges with the current Camunda 8 architecture.

A simplified view of the architecture of Camunda 8.7, highlighting six process steps

When a user sends a command to the Zeebe cluster (1), it is acknowledged (2) and processed by the Zeebe engine. The engine will confirm the processing with an event.

The engine has its own primary data store for runtime data. The primary data store is optimized for low-latency local access. It contains the execution state that allows the engine to execute process instances and move its data along in their corresponding processes.

Our users need a way to search and visualize process data (running and historical data), so Camunda 8 makes use of Elasticsearch or OpenSearch (RDBMS in the future) as secondary storage. It allows the separation of concerns between runtime data for process execution and history data for querying.

Camunda’s exporters are a bridge between primary and secondary data stores. The exporters allow the Zeebe system to stream out data (events) (3). Within the Camunda 8 architecture, we support both the ES and OS exporters. For more information about this concept and supported exporters, please visit our documentation.
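To make the concept a bit more tangible, here is a minimal, illustrative exporter sketch against Zeebe's exporter SPI (not one of the shipped exporters); it assumes the io.camunda.zeebe exporter API packages and simply logs every record instead of writing it to ES/OS:

import io.camunda.zeebe.exporter.api.Exporter;
import io.camunda.zeebe.exporter.api.context.Context;
import io.camunda.zeebe.exporter.api.context.Controller;
import io.camunda.zeebe.protocol.record.Record;

/** Minimal illustrative exporter: logs each record instead of writing to secondary storage. */
public class LoggingExporter implements Exporter {

  private Controller controller;

  @Override
  public void configure(final Context context) {
    // read exporter arguments from the broker configuration here
  }

  @Override
  public void open(final Controller controller) {
    this.controller = controller;
  }

  @Override
  public void export(final Record<?> record) {
    // a real exporter would push the record to the secondary storage here
    System.out.println(record.toJson());

    // acknowledge progress so Zeebe can compact its log up to this position
    controller.updateLastExportedRecordPosition(record.getPosition());
  }
}

The shipped Elasticsearch and OpenSearch exporters, and the new Camunda Exporter described in this post, follow this same contract.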

The exported data is stored in what we unofficially refer to as Zeebe indices inside ES or OS. Web applications like Tasklist and Operate make use of importers and archivers to read data from Zeebe indices (4), aggregate, and write them back into their indices (5). Based on these indices, users can query and search process data (6).

Performance

Customers have reported performance issues that are inherent in this architecture. For example, the delay before data is shown in Operate can range from around five seconds to, in the worst scenarios, minutes or hours.

The time is spent in processing, exporting, flushing, importing, and flushing again, before the users see any data change. For more detailed information, you can also take a look at this blog post.

This means the user is never able to follow a process instance in real time. But there is a general expectation that it is at least close to real time, meaning it should take at most 1-2 seconds to show updates.

Reducing such latency and improving the general throughput requires a fundamental architecture change.

Scalability

What we can see in our architecture above is that when we scale Zeebe clusters and partitions or set them up for large workloads, the web applications do not scale automatically with it, as they are not directly coupled.

This means additional effort is needed to make sure the web applications can handle certain loads. The current architecture limits the general scalability of the web applications, due to the decoupling of exporter and importer and the lack of real data partitioning in the secondary storage.

We want to make the web application more scalable to handle changing processing workloads.

Installation complexity

You can run the different components of the Camunda platform separately, e.g. separate deployments for Zeebe, Tasklist, Operate, etc. This gives you a lot of flexibility and allows for massive scale. But at the same time, this makes the installation harder to do—even with the help of Helm charts.

We want to support a simpler installation as an alternative. That wasn't possible with the old architecture because there was no single application and the components had to be deployed separately.

Data duplication and resource consumption

Web applications like Operate and Tasklist have historically grown and been developed separately. As we have seen, they could also be deployed separately.

This was also why they had separate schemas. Tasklist used a subset of the Operate schema but added additional indices to store information about user tasks, etc. When both applications were deployed, this caused unnecessary duplication of data in ES or OS.

As a consequence, we consume more disk space than necessary. Furthermore, ES/OS carries a higher indexing load than it should.

We want to reduce this to minimize the memory and disk footprint needed to run Camunda.

One exporter to rule them all

With those challenges understood, we rearchitected our platform to address them. In the new architecture, we have built the Camunda Exporter to replace the exporter/importer combination of the old architecture.

A simplified view of the new streamlined architecture

The Camunda Exporter brings the importer and archiving logic of web components (Tasklist and Operate) closer to the distributed platform (Zeebe).

The exporter consumes Zeebe records, aggregates data, and stores the related data into shared and harmonized indices that are used by both web applications. Archiving of data is done in the background, coupled with the exporter but not blocking the exporter’s progress.

Introducing the Camunda Exporter allows the web applications to scale with Zeebe partitions and simplifies the installation, as importer and archiver deployments will be removed in the future.

The architecture diagram above is a simplified version of the actual work we have done. It shows a greenfield installation with a new cluster (no previous data).

A brownfield installation, where data already exists, is more complex, as shown in the diagram below.

A simplified view of a brownfield installation, where data already exists

We were able to harmonize the existing index schema used by Tasklist and Operate, reducing data duplication and resource consumption. Several indices can now be used by both applications without a need to duplicate the data.

With this new index structure, there is no need for additional Zeebe indices anymore.

Note: With 8.8, we likely will still have the importer/exporter (including Zeebe indices) to make use of Optimize (if enabled), but we aim to change that in the future as well.

Migration (brownfield installation)

Brownfield scenarios, where the data already exists and processes are running in an old architecture, are much more complex than greenfield installations. We have covered this in our solution design and want to briefly talk about it in this blog post. A more detailed update guide will follow with the Camunda 8.8 release.

When you update to the new Camunda version, there will be no additional data migration effort for you as a user. We provide an additional migration application that takes care of enhancing process data (in Operate indices) so it can be used by Tasklist. Other than that, all existing Operate indices can be used by Tasklist.

A simplified view of the brownfield (migration) scenario in the new streamlined architecture

Reducing the installation complexity is a slower process for brownfield installations. Importers still need to run to drain the preexisting data from the indices created by the ES or OS exporters.

After all older data (produced before the update) has been consumed and aggregated, the importers and old exporters can be turned off, but they can also be kept for simplicity. The importers will indicate that they are done via metrics and by writing to a special ES/OS index. More details will be provided in the upcoming update guide.

Conclusion

The new Camunda Exporter helps us achieve a more streamlined architecture, better performance, and stability (especially concerning ES/OS). The target release for the Camunda Exporter project is the 8.8 release.

To recap the highlights of the new Camunda Exporter, we can:

  1. Scale with Zeebe partitions, as exporters are part of partitions. Data ingestion and data archiving scale inherently.
  2. Reduce resource consumption with a harmonized schema. Data is not unnecessarily duplicated between web applications, and ES and OS are not unnecessarily overloaded with index requests for duplicated data.
  3. Improve performance by removing an additional hop. As we no longer need to wait for ES/OS to flush twice before data becomes available, we remove one flush interval from the equation. We also don't need to import the data and store it in Zeebe indices, so we shorten the data pipeline. This was shown in one of our recent chaos days but needs to be further investigated and benchmarked, especially with higher-load scenarios.
  4. Simplify installation by bringing the business logic closer to the Zeebe system. We no longer need separate applications or components for importing and archiving data; this can easily be enabled within the Zeebe brokers. The Camunda Exporter has everything built in.

I hope this was insightful and helps you understand what we are working on and what we want to achieve with the new Camunda Exporter. Stay tuned for more information about benchmarks and other updates.

Join us at CamundaCon to learn more

Looking to learn more about this new architecture and how the Camunda Exporter will help you? I’ll be giving a talk on the new Camunda Exporter at CamundaCon Amsterdam in May. Join us there in person to catch the session and so much more.

Incident.io Daily Statistics with Camunda

In this step-by-step tutorial, learn how you can completely automate incident.io daily statistics using Camunda.

BPMN Process: Get incident statistics

When using tools like incident.io to manage your incidents—not talking about Camunda incidents in your business processes, but incidents that relate to your general system/offering and production system—it is often useful to get an overview of ongoing incidents. Furthermore, it is beneficial to get reports based on certain (custom) filters like affected teams, severity, etc.

Unfortunately, incident.io doesn’t support this type of reporting out-of-the-box. Camunda can help here.

Today, I want to show you how you can get the necessary statistics from incident.io, extract the relevant results, and post them, for example, to Slack to create a daily incident statistics update with Camunda.

Screenshot: Incident statistics daily update

Get details from incident.io

The incident.io API is rather simple and well-documented. All necessary information can be found in their API docs https://api-docs.incident.io/.

API Key

As a first step, you have to create an API token. Follow this guide to create one.

After you have it, make sure to store it somewhere safe.

Incidents API

We want to get the current statistics for incidents; for that, we have to query https://api.incident.io/v2/incidents.

In our daily update, we are only interested in ongoing incidents, so we need to filter with status_category[one_of]=live. Details of this filtering mechanism are also described in their documentation.

It becomes interesting if you want to filter for custom_fields (which you have defined earlier on your own). These can be, for example, the affected team or other details you add to your incidents. Be aware that you have to find the corresponding IDs for the custom_field and also for its potential options.

To find these you can either query the https://api.incident.io/v2/custom_fields endpoint or run a query against the incidents endpoint, as the response contains all necessary information.

Example

After you have found all the necessary filters (or if you just want to experiment), you can use the following script to try your query.

#!/bin/bash
#
# Script to query the incident.io API
set -euo pipefail

if [ -z "${1:-}" ];
then
  echo "Must provide an api token to query the incident.io api"
  exit 1
fi

token=$1

# Custom fields need to be addressed with their IDs
incidents=$(curl --verbose --get https://api.incident.io/v2/incidents \
  -H "Accept: application/json" \
  -H "Authorization: Bearer $token" \
  --data 'custom_field[<CUSTOM_FIELD_ID>][one_of]=<CUSTOM_FIELD_OPTION_ID>&status_category[one_of]=live'
)

count=$(echo "$incidents" | jq '.incidents | length')

echo "$count incidents"
echo "$incidents" | jq '.incidents[].name'

Example usage

$ ./incidentsStats.sh $token 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  459k    0  459k    0     0  1054k      0 --:--:-- --:--:-- --:--:-- 1055k
10 incidents
"support-123"
"support-2124"
"support-20124"
"123124"
"SUPPORT-1232"
"Some other incident"
"test incident"
"SUPPORT-2012"
"support-2014212"
"SUPPORT-112"

With that script working, we have everything ready and can start automating.

Automating with Camunda

You can either get a trial account here, or host Camunda yourself, for example with the provided Self-Managed version.

For simplicity, I will skip the setup details here and assume the use of the Camunda SaaS offering. Furthermore, I will mostly concentrate on the modeling details (using Web Modeler).

Follow these instructions here if you are unsure how to model your first diagram.

Adding Incident.io API key as Connector secret

When querying the incident.io API we need an API key, as described above. To make this available to our process instances and connectors we have to create a secret in our Camunda cluster.

For more on how to create a secret for your Camunda cluster, take a look at this guide.

In the end, it should be similar to the below:

Screenshot: Connector secrets

Query incident statistics via REST Connector

BPMN: REST Connector

After we have created our Camunda cluster, added our Connector secret, and started modeling we can add a REST Connector. Details about this can be found in the documentation here.

In the properties panel of the REST Connector, we have to specify all important details, similar to what we used in our script above.

Screenshot: REST Connector properties panel

One important part we need to add as well is the result expression which would look like this:

{
   incidentIoResponse: response.body
}

This maps the response body to the variable incidentIoResponse. With that, we are ready to query the details of our incidents. You can either test this by creating a process instance and verifying the results, or just continue with the next step.

Extracting incident details

After we get all the incident details from incident.io we need to extract the important details. We can do this by defining a script task in our process model and implementing it with FEEL expressions.

BPMN: Extracting statistics

Depending on what you are interested in, you will want to extract different details. For all the potential properties of incidents, you can take a look at the incident.io API documentation here.

In our example, we are interested in:

  • Incident name
  • Incident Severity
  • Incident permalink, which points to the incident.io page
  • The related Slack details (channel name and ID)
  • The incident commander (with name and Slack ID)
  • Incident count

All of this can be extracted with the following FEEL expression:

{
    incidents:
      for incident in incidentIoResponse.incidents 
      return { name: incident.name,
               severity: incident.severity.name,
               permalink: incident.permalink,
               slack_channel_id: incident.slack_channel_id, 
               slack_channel_name: incident.slack_channel_name,              
               ic_name: incident.incident_role_assignments[item.role.shortform = "commander"].assignee.name[1],                 
               ic_slack_id: incident.incident_role_assignments[item.role.shortform = "commander"].assignee.slack_user_id[1]
             }
    ,
    incidents_count: count(incidentIoResponse.incidents)
}

With the FEEL for loop, we iterate over the incidents and create a new context (object) for each incident.

The FEEL list filter allows us to find the incident commanders and their respective names and Slack IDs.

Sending statistics via Slack Connector

BPMN: Slack Connector added

Camunda supports several Connectors out-of-the-box, not only the REST Connector but also a Slack Connector. In our example, we use the Slack Connector to send our statistics to the respective channel.

Here, too, we need an API key to access the Slack API; you can follow this guide to create a Slack OAuth token.

Similar to the REST Connector, we need to add the Slack OAuth Token to our Connector Secrets. Follow this documentation if you need to know how to create/add them to your cluster.

Screenshot: Connector Secrets

After adding the secret we can start with modeling the Slack Connector.

We have to specify some details, like which channel or user should get the update. Furthermore, we have to specify what the message should look like.

Screenshot: Slack Connector properties

For the Slack message, we can again use a FEEL expression.

":incident-heart: This is your daily incident update :incident-io:\n\n"
+ "We have currently *" +string(incidents_stats.incidents_count) + " ongoing Incidents*\n\n\n"
+
string join(
  for incident in sort(incidents_stats.incidents, function(x,y) x.severity < y.severity)
  return incident.severity + " - <" + incident.permalink + "|" + incident.name + ">" + " IC: <@" + incident.ic_slack_id + ">"
, "\n")

Formatting the Slack message has to follow some guidelines, which we can find in the Slack documentation here.

As the first part of the message, we want to post a header (with some nice emoji; whether these are available depends on your Slack workspace).

As the message field expects a string, we have to concatenate everything into one string. We iterate over the incidents and join them via the string join function. This function allows us to specify a delimiter, which in our case is a newline.

To sort our incident list by severity we can use the FEEL sort function.

We can now experiment with our model; when we create a process instance for the definition, it should already send a message to the specified channel or DM.

Running it every working day

BPMN: Daily incident statistics

As the last step, we want to run the process every working day at the same time. In our example, we always want to do it before lunch (around 12 o'clock CET). To achieve this, we model a timer start event with a cycle expression. Details about this can be found in the documentation here.

The example expression we would use is 0 0 11 * * MON-FRI. We use 11 here because the time is specified in UTC.

Be aware that tools like crontab-guru don't parse this expression correctly, as they don't expect the seconds field (at the beginning). There, the equivalent expression would be 0 11 * * MON-FRI.

Daily incident statistics with Camunda

After modeling all our logic (wiring everything together), we can deploy it to our previously created Camunda cluster. As soon as the configured time is reached, we will see a Slack message like this:

Screenshot: Daily incident statistics

As you can see, there is not much magic behind it; it is quite “easy” to generate a daily report from an external source like incident.io.

You no longer need to scrape the statistics manually and put them into a good format to share with others. We were able to completely automate this.

That is the power of Camunda: “Automate Any Process, Anywhere.”

P.S.: The possibilities for further automation are endless, for example auto-assigning an engineer to an incident based on the currently ongoing incidents (statistics and load).

P.P.S.: If you like this post please let us know on the forum, where you can leave a comment. I’m open to any other feedback as well.

Editor’s note: This post first appeared on Medium. We have republished it here with slight edits for clarity.

Zeebe Performance 2023 Year In Review

2023 was quite a year for Zeebe, and the team has been hard at work improving performance. Get an overview of all the latest technical improvements.

Zeebe is the distributed and fault-tolerant process engine behind Camunda 8 and has been engineered for over five years by a dedicated team at Camunda. At the beginning of 2023, the Zeebe team was split into two separate teams to focus on different aspects of the system. One team concentrates on the distributed system and overall performance (Zeebe Distribution Platform Team, ZDP), and the other fully focuses on all process automation details across BPMN and DMN (Zeebe Process Automation Team, ZPA).

Of course, features like support for new BPMN elements or DMN are more visible than improvements like better throughput with a large state. This is why we want to take some time here to recap what happened in 2023 from the perspective of the ZDP team regarding performance and improvements to the distributed system.

In this blog post, you will find a high-level description of several features we implemented over the year. Most of the time, we will link to related GitHub issues, so you can dive deeper if you want. Most of these features have been benchmarked with our benchmarking setup, which I want to briefly describe upfront.

Benchmarking

We set up performance and load tests for new features or bug fixes where we think it makes sense to see how the change affects the cluster performance. Furthermore, we have weekly benchmarks which are set up automatically to detect regressions.

The basis for our benchmarks is our Camunda Platform Helm chart, on top of which we add some testing applications. One of these applications is in charge of creating new instances at a specific rate; it is called the “Starter.” In all our benchmarks (if not otherwise specified), we run with the same load of 150 process instances per second (PI/s). The second application is the “Worker,” which makes sure that jobs are activated and completed (after a configurable delay). By default, the completion delay is 50ms, which means a process instance takes at least 50ms.
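For illustration, a stripped-down starter/worker pair can be sketched with the Zeebe Java client; the gateway address, job type, and process ID below are assumptions for the sketch, not the actual benchmark applications (those live in the benchmark Helm chart repo):

import io.camunda.zeebe.client.ZeebeClient;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BenchmarkSketch {
  public static void main(String[] args) throws InterruptedException {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder()
            .gatewayAddress("localhost:26500") // assumption: local, insecure gateway
            .usePlaintext()
            .build()) {

      // "Worker": activates jobs and completes them after a simulated 50ms of work
      client.newWorker()
          .jobType("benchmark-task") // hypothetical job type
          .handler((jobClient, job) -> {
            Thread.sleep(50);
            jobClient.newCompleteCommand(job.getKey()).send().join();
          })
          .open();

      // "Starter": creates roughly 150 process instances per second
      final var starter = Executors.newSingleThreadScheduledExecutor();
      starter.scheduleAtFixedRate(
          () -> client.newCreateInstanceCommand()
                  .bpmnProcessId("benchmark") // hypothetical process ID
                  .latestVersion()
                  .variables(Map.of("payload", "big payload goes here"))
                  .send(),
          0, 1_000_000 / 150, TimeUnit.MICROSECONDS);

      Thread.sleep(Duration.ofMinutes(5).toMillis()); // let the benchmark run for a while
      starter.shutdownNow();
    }
  }
}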

We always use the same process model (if not otherwise specified), which consists of one service task, and a big payload.

Basic process model: one service task

In all our benchmarks the Zeebe clusters are configured the same, and only the Docker image changes. We run with three brokers, two standalone gateways, three partitions (replication factor three), three workers, and one starter.

For more details, you can check the readme in our Benchmark Helm chart repo.

After that short excursion, let’s jump into the features we have implemented.

Batch processing

The year started with an interesting topic: how can we improve the process instance execution latency in an environment where the network latency is high? The answer? Batch processing.

I already wrote a full article about this here, where you can find more in-depth details.

Summary

Since Zeebe uses Raft and wants to provide strong consistency, Zeebe needs to make sure that user commands (which are driving the execution of a process instance) are committed before being processed.

A command is committed when it has been replicated to and acknowledged by a quorum of nodes. This can result in high commit latency if the network is slow. Zeebe splits processing into user commands and internal commands. Previously, each of them was committed and processed individually. Our investigations and benchmarks have shown that it is worth batching internal commands together. This is quite similar to what is done in Camunda 7, where we process a process instance until it reaches a wait state.
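As a purely conceptual sketch (the types below are hypothetical, not Zeebe's engine API), batch processing means that the internal follow-up commands spawned by a user command are processed immediately and collected into one batch, which is then committed with a single replication round:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch only; all types here are hypothetical, not Zeebe's engine API. */
public class BatchProcessingSketch {

  record Command(String type) {}
  record RecordEntry(String type) {}
  record ProcessingResult(List<RecordEntry> followUpEvents, List<Command> followUpCommands) {}

  interface Engine {
    ProcessingResult process(Command command);
  }

  /**
   * Processes a user command and all internal follow-up commands it spawns in one go,
   * collecting everything into a single batch that is committed (replicated) once,
   * instead of committing each internal command individually.
   */
  static List<RecordEntry> processBatch(final Engine engine, final Command userCommand) {
    final List<RecordEntry> batch = new ArrayList<>();
    final Deque<Command> pending = new ArrayDeque<>();
    pending.add(userCommand);

    while (!pending.isEmpty()) {
      final ProcessingResult result = engine.process(pending.poll());
      batch.addAll(result.followUpEvents());
      // internal follow-up commands are processed right away within the same batch;
      // the loop naturally stops once the instance reaches a wait state and no
      // further follow-up commands are produced
      pending.addAll(result.followUpCommands());
    }
    return batch; // commit the whole batch with a single replication round
  }
}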

Result

With this change, as you can see below, we were able to reduce the process instance execution latency by half and increase the maximum throughput by 20% in our benchmarks. For more details, see this issue.

Benchmark results for batch processing

Improving the performance of hot data with large state

Through various investigations, benchmarking, and testing, we observed that our system runs into performance issues when working with a larger state. In particular, when we hit a certain amount of state in RocksDB (used in Zeebe), performance unexpectedly drops. We documented this in a chaos day as well.

Following a few customer requests, we worked on some initiatives to resolve this situation. The guiding idea: hot data should always be processed quickly, and cold data or a big state shouldn't impact the performance of new data.

Summary

After evaluating several approaches to improve performance when accessing RocksDB, we found quite an interesting RocksDB configuration that changes how we write our data: it separates key-value pairs, based on a prefix, into their own SST files.

It is worth mentioning here that Zeebe uses only one real column family in RocksDB (to minimize disk space and memory usage) and uses key prefixes to produce an artificial data separation (also called virtual column families). This works because keys are sorted in RocksDB: one can seek to a certain prefix, iterate with the given prefix to find all related data, and stop when a different prefix is encountered.

We observed that with this setup and a lot of data, read access and especially iteration become slower as soon as we hit a certain size limit. By partitioning data into their own SST files, grouped by prefix, we were able to restore performance. The performance was similar to what we would get by separating data into real column families, but with much less overhead.
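For illustration, this is roughly how SST partitioning by key prefix can be enabled through the RocksDB Java API, assuming a RocksJava version that ships SstPartitionerFixedPrefixFactory; the path and prefix length are placeholder values, not Zeebe's actual configuration:

import java.util.ArrayList;
import java.util.List;
import org.rocksdb.ColumnFamilyDescriptor;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.SstPartitionerFixedPrefixFactory;

public class SstPartitioningSketch {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();

    // Split SST files whenever the first 8 bytes of the key (the "virtual
    // column family" prefix) change, so data of different prefixes no longer
    // ends up mixed in the same files.
    final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions()
        .setSstPartitionerFactory(new SstPartitionerFixedPrefixFactory(Long.BYTES));

    final List<ColumnFamilyDescriptor> descriptors = List.of(
        new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions));
    final List<ColumnFamilyHandle> handles = new ArrayList<>();

    try (final DBOptions dbOptions = new DBOptions()
            .setCreateIfMissing(true)
            .setCreateMissingColumnFamilies(true);
         final RocksDB db = RocksDB.open(dbOptions, "/tmp/state", descriptors, handles)) {
      // key = 8-byte prefix ("virtual column family") + the actual key
      db.put(handles.get(0), "00000001-someKey".getBytes(), "someValue".getBytes());
    }
  }
}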

We documented several benchmark results here.

Result

Before our change, our benchmark ran stably until roughly 1 GB of state was reached, at which point the rate of accepted and processed instances dropped. In this scenario, we were blindly creating instances without completing them.

Benchmark before the change: throughput drops after around 1 GB of state

After using the SST partitioning feature from RocksDB, we were able to continuously accept and process new instances until we reached our disk limit:

Benchmark after enabling SST partitioning: new instances are accepted and processed until the disk limit is reached

We ran some chaos days (1, 2) to validate and verify the behavior of this feature, especially how it behaves when it is enabled on existing data or disabled again, without breaking running installations.

Cluster availability should not be impacted by runtime state

Similar to the previous topic, cluster availability was impacted by a large state in Zeebe. We experienced degraded availability (and especially recovery) when the state was large.

For example, we had an incident where a large state caused Zeebe to no longer recover, since our liveness probe (with a 45s timeout) restarted the pod every time. This prevented Zeebe from recovering at all.

Summary

We investigated various options to improve the situation and more quickly recover a broker, which includes copying an existing snapshot, opening the database, etc.

Previously, when a node bootstrapped, we copied an existing snapshot into the runtime folder so that we could open and work on the database (RocksDB). This can be quite time-consuming and I/O-inefficient, depending on the file system and the size of the state.

RocksDB offers checkpoints of an open database, which essentially create hard links to existing SST files; this works great when snapshot and runtime are on the same file system. We already use this for normal snapshots.

One improvement we made was to use the same mechanism on recovery: we open the snapshot as a read-only database and create a checkpoint to make it available as the runtime. This turned out to be much faster than copying all the content, especially with larger states. If source and target are on different file systems, we may still end up copying files, but then the behavior is similar to before.
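Sketched with the RocksDB Java API, the recovery path looks roughly like this (the paths are placeholders, and the real logic lives in Zeebe's snapshot and state modules):

import org.rocksdb.Checkpoint;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

public class SnapshotRecoverySketch {
  public static void main(String[] args) throws Exception {
    RocksDB.loadLibrary();

    try (final Options options = new Options();
         // open the latest snapshot read-only so recovery cannot modify it
         final RocksDB snapshot = RocksDB.openReadOnly(options, "/data/snapshots/latest");
         final Checkpoint checkpoint = Checkpoint.create(snapshot)) {
      // on the same file system this mostly creates hard links to the SST
      // files instead of copying them, which makes recovery much faster
      checkpoint.createCheckpoint("/data/runtime");
    }
  }
}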

We previously replicated snapshots into a pending snapshot directory. If the replication was successful, we copied the complete snapshot into the final snapshot directory; this made sure that only valid and committed snapshots ended up in the snapshot directory. We improved this mechanism so that we no longer need to copy the snapshot: we now write directly into the final folder and mark the snapshot as complete/valid via a checksum file. This helped reduce the time spent during snapshot replication. Additionally, we found a way to reuse previously calculated checksums, which gave us a great performance boost when persisting larger snapshots.

Result

Snapshot persistence has been improved significantly by not calculating checksums multiple times:

Snapshot persistence duration before and after the change

We can see that as soon as the state reaches around one gigabyte, snapshot persistence already takes around 10 seconds (which is the maximum recorded value, so it might be even higher). With our change, we stay under half a second.

Concurrent tasks

Aside from normal stream processing, Zeebe also has to run other tasks, for example triggering timers or cleaning up buffered messages after they expire.

Previously, these tasks interrupted regular stream processing. This was a problem when the tasks took a long time, for example when many timers expired at once. For users with strict latency requirements this is unacceptable, so we needed a solution where these tasks don't interfere with regular processing.

Summary

To keep tasks from interfering with regular processing, we decided that they should run on a different thread, concurrently with regular processing. This was complicated by two factors that we needed to address first. First, we needed to ensure that processing and tasks use different transaction contexts. This is important because tasks should not be able to read uncommitted processing results. For example, when we process a new process instance and create timers for it, the processing might fail and the transaction in which we created the timers is rolled back.

To fix this, we now create a new transaction for every task execution. This ensures that tasks and processing are isolated from each other and can only read committed results. To improve the safety further, we decided that tasks are not allowed to write to the state directly. For example, the job timeout checker task was no longer allowed to update the job state directly. Instead, all tasks have to follow the usual stream processing rules and can only produce new commands that will be processed eventually.

The second issue was that tasks and processing shared mutable state. This worked fine while both were running on the same thread, but as soon as we ran them on two different threads, access to that data needed to be synchronized. We solved this by using standard concurrent data structures such as ConcurrentHashMap.
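As a simplified illustration of that pattern (the map and the printed command below are made up, not Zeebe internals): state shared between the processing thread and the concurrently scheduled task lives in a concurrent map, and the task only produces commands instead of mutating state directly.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ConcurrentTaskSketch {
  // due timestamps per timer key, written by the processing thread and read
  // by the task thread; ConcurrentHashMap makes that shared access safe
  private static final Map<Long, Long> dueDates = new ConcurrentHashMap<>();

  public static void main(String[] args) {
    // the "processing thread" registers a timer that is due in one second
    dueDates.put(1L, System.currentTimeMillis() + 1_000);

    // the task runs concurrently and only produces commands; it never
    // modifies the processing state directly
    final ScheduledExecutorService taskExecutor = Executors.newSingleThreadScheduledExecutor();
    taskExecutor.scheduleAtFixedRate(() -> {
      final long now = System.currentTimeMillis();
      dueDates.forEach((timerKey, dueDate) -> {
        if (dueDate <= now) {
          System.out.println("append TRIGGER command for timer " + timerKey);
        }
      });
    }, 0, 100, TimeUnit.MILLISECONDS);
  }
}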

After addressing both issues, the actual change to run tasks on a new thread and concurrently to processing was trivial.

Result

As a result, processing in Zeebe 8.3 and newer is not interrupted by other tasks. Even if many timers or buffered messages expire at once, normal processing can continue.

This reduces latency spikes and makes processing performance more predictable. While it’s possible to revert to the previous behavior, we collected enough experience with this change to see this as a pure performance and latency improvement that everyone benefits from.

Job push

The following is a summarized version of this blog post, which contains more tests and details. There, we test various workloads as well as resilience use cases.

Before this change, the only way to activate jobs in Zeebe was to poll for them, meaning:

  • The client sends an ActivateJobs request to the gateway
  • The gateway, in turn, sends one request to each partition, polling for jobs, until either it activates enough jobs, or exhausts all partitions.
  • If long polling is enabled and all partitions are exhausted, then the request is parked until new jobs are available. At this point, the gateway starts a new partition polling cycle as described above.
Diagram: job activation via polling

This has several impacts on latency:

  • Every request—whether client to gateway, or gateway to broker—adds a delay to the activation latency.
  • If you have a lot of jobs, you can run into major performance issues when accessing the set of available jobs.
  • The gateway does not know in advance which partitions have jobs available.
  • Scaling out your clients may have adverse effects by sending out too many requests, which all have to be processed independently. Even with long polling, this is only partially mitigated.
  • In the worst-case scenario, we have to poll every partition; even with long polling, when receiving a notification, we may still have to poll all partitions.
  • With long polling, if you have multiple gateways, all gateways will start polling if they have parked requests. Some of them may not get any jobs, but they will still have sent requests to brokers which still all have to be processed. A similar issue exists when many jobs are waiting.

To solve these issues, the team decided to implement a push-based approach to job activation.

Summary

Essentially, we added a new StreamActivatedJobs RPC to our gRPC protocol, a so-called server streaming RPC. In our case, this is meant to be a long-lived stream, such that the call is completed only if the client terminates it, or if the server is shutting down.

The stream itself has the following lifecycle:

  • The client initiates the stream by sending a job activation request much like with the ActivateJobs RPC.
    • Since the stream is meant to be long-lived, however, there is no upper bound on the number of jobs to activate.
  • The gateway registers the new stream with all brokers in the cluster.
    • Note that there is no direct connection between brokers and clients; the gateway acts as a proxy for the client.
  • When jobs are available for activation (e.g. on creation, on timeout, on backoff, etc.), the broker activates the job and pushes it to the gateway.
  • The gateway forwards the job to the client.
Diagram: job activation via push

Experienced readers will immediately notice the risk of overloading clients if the brokers/gateways are pushing too many jobs downstream. To avoid this, we also had to implement a form of client-controlled back pressure. Thankfully, we could reuse the built-in flow control mechanisms from gRPC and HTTP/2.

Result

To compare the advantages of pushing to polling, we ran four different experiments. Note that, unless otherwise specified, we always run a constant throughput of 150 PI/s, and our workers simulate work by waiting 50ms before completing a job.

As our baseline test, we ran a constant throughput of 150 PI/s of a process consisting of a single task:

Benchmark results: job push compared to job polling

The results show a sharp decrease in both the p50 and p99 of the job lifetime (i.e. the time between creation and completion). Since this workload only consists of a single task, this is mirrored in the overall process execution latency. Overall, we see that switching to a push approach yields a p50 latency improvement of 50%, and a p99 improvement of 75%!

Broker Scaling

While Zeebe’s design enables horizontal scalability through partitioning, it lacked dynamic scaling capabilities until now. This much-anticipated feature has finally been implemented. As a stepping stone towards infinite scaling, Zeebe can now dynamically add or remove brokers to an existing cluster. While you cannot yet change the number of partitions, this capability proves particularly valuable when clusters are configured with many partitions and vertical scaling is not feasible.

Summary

The main blocker to dynamic scaling was that most Zeebe components relied solely on the static configuration provided at startup. To address this, we introduced a new topology management mechanism over gossip that embeds cluster configuration within the Zeebe cluster itself. This topology maintains up-to-date information about the brokers forming the cluster and the partition distribution. When new brokers are added, the topology is updated accordingly, ensuring that upon restart, brokers utilize the current topology rather than the outdated static one.

When a cluster has to be scaled up, the existing partitions have to be re-distributed. To minimize disruptions to ongoing processing, this migration process is carried out in a step-by-step manner. For each partition to be moved from one broker to another, the partition is first added to the new broker and then removed from the old one. This transition is executed by Raft (the replication protocol employed for replicating partitions) and coordinated by the new topology management mechanism. To support this, we implemented a joint-consensus mechanism in Raft to allow changing the configuration of a Raft replication group safely and correctly.

Scaling is orchestrated via a REST API exposed by the Zeebe Gateway. You can read more about it in the Camunda documentation.

Result

To demonstrate the scaling, we ran a small experiment scaling a cluster with three brokers and six partitions. The following image shows how throughput scales when the cluster is scaled to six brokers.

Benchmark results: throughput when scaling from three to six brokers

We also held several chaos days to test this feature and build up confidence; you can read more about them in our chaos day blog posts.

Conclusion

As you have seen above, it was a great year from the Zeebe performance and distributed systems perspective. We were able to ship several new features and improvements that make our users' lives better.

To better understand the impact of all these improvements, we have run some more benchmarks comparing different released versions. All the benchmarks have used the same process models (single task) and the same load of 150 PI/s.

In the following screenshots, we show how the process instance execution time (latency) has improved. We use the terms p99, indicating that 99% of requests are completed within the recorded latency, and p50, indicating that 50% of requests are completed within the recorded latency.

Overview: process instance execution latency across releases

Release 8.1 (Oct, 2022)

Benchmark results for release 8.1

We can see that in the 8.1 release, our p99 for completing a process instance with one single task was around four to five seconds. The p50 was around one to two seconds.

Release 8.2 (April, 2023)

Benchmark results for release 8.2

In release 8.2, we already cut the latency in half (mostly due to batch processing), with a p99 of around 2.5 seconds and a p50 of around 0.5 seconds.

Release 8.3 (Oct, 2023)

Benchmark results for release 8.3

In release 8.3, job push allowed us to further reduce the latency to a p99 of ~0.750 seconds and a p50 of ~0.200 seconds, another reduction by a factor of two to three.

This doesn’t take into account changes we made related to big state support, concurrent tasks, or even broker scaling, which would allow us to improve the performance even more.

All in all, 2023 was filled with great achievements by the Zeebe team, and we are looking forward to what is to come in 2024.

Thanks to all who contributed to and reviewed this blog post, namely Christina Ausley, Deepthi Akkoorath, Ole Schoenburg, and Nicolas Pepin-Perreault.

Drinking Our Champagne: Chaos Experiments with Zeebe against Zeebe

Take a closer look at how you can automate and orchestrate chaos experiments using Zeebe.

At Camunda we have a mantra: Automate Any Process, Anywhere. Additionally, we’ll often say “eat your own dog food,” or “drink your own champagne.” 

Two years ago, I wrote an article about how we can use Zeebe to orchestrate our chaos experiments; I called it: BPMN meets chaos engineering. That was the result of a hack day project, in which I worked alongside my colleague Philipp Ossler.

Since then, a lot of things have changed. We made many improvements to our tooling, like creating our own chaos toolkit zbchaos that makes it easier to run chaos experiments against Zeebe (which reached v1.0), improving the BPMN models in use, adding more experiments to it, etc.

Today, I want to take a closer look at how we automate and orchestrate our chaos experiments with Zeebe against Zeebe. After reading this you will see how beneficial it is to use Zeebe as your chaos experiment orchestrator.

The use cases are endless—you can use this knowledge to orchestrate your own chaos experiments, set up your own QA test suite, or use Zeebe as your CI/CD framework.

We will show you how you can leverage the observability of the Camunda Platform stack and how it helps you understand what is currently being executed and where issues may lie.

But first, let’s start with some basics.

Chaos engineering and experiments

 “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

principlesofchaos.org

One of the principles of chaos engineering is automating defined experiments to ensure that no regression is introduced into the system at a later stage.

A chaos experiment consists of multiple stages; three are important for automation:

  • Verification of the steady state hypothesis
  • Running actions to introduce chaos
  • Verification of the steady state hypothesis (that it still holds or has recovered)

These steps can also be cast into a BPMN model, as shown below:

That is the backbone of our chaos experiment orchestration.

Let’s take a closer look at the process models we designed and use now to automate and orchestrate our chaos experiments.

BPMN meets chaos engineering

If you are interested in the resources take a look at the corresponding GitHub repository zeebe-io/zeebe-chaos/.

Chaos toolkit

The first process model is called: “chaosToolkit” because it bundles all chaos experiments together. It reads the specifications of all existing chaos experiments (the specification for each experiment is stored in a JSON file, which we will see later) and executes them one by one via a sequential multi-instance.

For readers with knowledge of BPMN, be aware that in earlier versions of Zeebe it was not possible to transfer variables with BPMN errors, which is why we used return values of CallActivities and later interrupted the SubProcess.

Chaos experiment

The second BPMN model describes a single chaos experiment, which is why it is called “chaosExperiment”. It has similarities (the different stages) to the simplified version above.

Here we see the three stages, verification, introducing chaos, and verification of the steady state again.

 

All of the call activities above are delegated to the third BPMN model.

Action

The third model is the most generic one. It executes any action that is defined in the process instance payload. The payload is a chaos experiment specification, which can also contain timeouts and pause times; these are reflected in the model as well.

Specification

As we have seen, the BPMN process models are quite generic, and all of them are brought to life by a chaos experiment specification.

The chaos experiment specification is based on the OpenChaos initiative and the Chaos Toolkit specification. We reused this specification so the experiments can also be run locally with the Chaos Toolkit.

An example is the following experiment.json

{
    "version": "0.1.0",
    "title": "Zeebe follower restart non-graceful experiment",
    "description": "Zeebe should be fault-tolerant. Zeebe should be able to handle followers terminations.",
    "contributions": {
        "reliability": "high",
        "availability": "high"
    },
    "steady-state-hypothesis": {
        "title": "Zeebe is alive",
        "probes": [
            {
                "name": "All pods should be ready",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["verify", "readiness"],
                    "timeout": 900
                }
            },
            {
                "name": "Can deploy process model",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["deploy", "process"],
                    "timeout": 900
                }
            },
            {
                "name": "Should be able to create process instances on partition 1",
                "type": "probe",
                "tolerance": 0,
                "provider": {
                    "type": "process",
                    "path": "zbchaos",
                    "arguments": ["verify", "instance-creation", "--partitionId", "1"],
                    "timeout": 900
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "Terminate follower of partition 1",
            "provider": {
                "type": "process",
                "path": "zbchaos",
                "arguments": ["terminate", "broker", "--role", "FOLLOWER", "--partitionId", "1"]
            }
        }
    ],
    "rollbacks": []
}
      

The first key-value pairs describe the experiment itself. The steady-state-hypothesis and its content describe the verification stage. All of the probes inside the steady-state-hypothesis are executed as actions in our third process model.

The method object describes the chaos that should be injected into the system. In this case, it consists of one action: terminating a follower (a broker that is not the leader of a Zeebe partition).

I don’t want to go into too much detail about the specification itself, but you can find several examples of the experiments we have already defined here: https://github.com/zeebe-io/zeebe-chaos/tree/main/go-chaos/internal/chaos-experiments

Automation

Let’s imagine we have a Zeebe cluster which we want to run the experiments against. We call it Zeebe target.

As mentioned earlier, the specification is based on the Chaos Toolkit. This means we can run it locally (if we have zbchaos and the Chaos Toolkit installed) via `chaos run experiment.json`. If Zeebe is installed in Kubernetes and we have the right Kubernetes context set, this works with zbchaos out of the box.

Zeebe Testbench

Another alternative is orchestrating the previous example with Zeebe itself. We’ll do this by using a different Zeebe cluster which we’ll call Zeebe Testbench.

Our Zeebe Testbench cluster is in charge of orchestrating the chaos experiments. In this case, zbchaos is a job worker and executes all actions, for example verifying the healthiness of the cluster or of a node, terminating a node, creating a network partition, etc.

We have seen in the chaos experiment specification above that all actions and probes reference zbchaos and specify subcommands. These are executed regardless of whether zbchaos is used directly as a CLI tool or as a job worker. This means if you execute the chaos specification with the Chaos Toolkit, it will execute the zbchaos CLI; if you orchestrate the experiments with Zeebe, the zbchaos workers will handle the specific actions.
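zbchaos itself is written in Go, but the job-worker side of this setup follows the usual Zeebe client pattern; the sketch below shows the idea with the Zeebe Java client, where the gateway address, job type, and variable handling are assumptions for illustration only:

import io.camunda.zeebe.client.ZeebeClient;

public class ChaosActionWorkerSketch {
  public static void main(String[] args) throws InterruptedException {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder()
            .gatewayAddress("zeebe-testbench:26500") // hypothetical Testbench gateway
            .usePlaintext()
            .build()) {

      client.newWorker()
          .jobType("zbchaos") // hypothetical job type for chaos actions
          .handler((jobClient, job) -> {
            // the experiment specification (provider path/arguments) arrives as job variables
            System.out.println("Executing chaos action with: " + job.getVariables());
            // ... run the corresponding subcommand here, e.g. "verify readiness"
            //     or "terminate broker --role FOLLOWER" ...
            jobClient.newCompleteCommand(job.getKey()).send().join();
          })
          .open();

      Thread.sleep(Long.MAX_VALUE); // keep the worker running
    }
  }
}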

From the outside, we deploy the previously mentioned chaos models to Zeebe Testbench. This can happen during the setup of the Zeebe Testbench cluster (or when something changes in the models). New instances can be created either by us locally (e.g. via zbctl or any other client), via a timer, or by our GitHub actions.

With our GitHub actions, it is fairly easy to trigger a new Testbench run, which includes all chaos experiments, and some other tests.

To make this even better, we have automation to create the Zeebe target cluster automatically. That can happen before each chaosToolkit execution, which allows us to always start with a clean state (otherwise errors might be hard to reproduce) and to avoid wasting resources when no experiment is running.

Run chaos experiments regularly

We run our chaos experiments regularly. This means we create a chaosToolkit process instance every day and execute all chaos experiments against a new Zeebe target cluster. The creation of such process instances happens with the earlier-mentioned GitHub actions. This allows us to integrate the chaos experiments more deeply into our CI, which we also use for releases, meaning that we can run such tests before every release.

You can find the related GitHub action here.

If an experiment fails, or all succeed, we are notified in Slack with the help of a Slack Connector.

This happens outside of the chaosToolkit process, which is essentially wrapped in other, larger process models that automate the remaining parts: creating clusters, sending notifications, deleting clusters, etc.

Benefits

Observability

With Operate, you can observe a current running chaos experiment, what cluster it targets, what experiment and action it is currently executing, etc.

In the screenshot above, we can see a currently running chaosToolkit process instance. We can observe how many experiments have been executed (on the left, highlighted in green in the “Instance History”) and how many we still need to process (based on the variables).

Furthermore, we can see in the Variables tab (with the red border) what type of experiment we are currently executing: “Zeebe should be fault-tolerant. We expect that Zeebe can handle non-graceful leader restarts.” And there is even more to dive into.

If we dig deeper into the currently running experiment (we can do that by following the call-activity link), we can see that we are in the verification stage.

Here we are in the verification stage after the chaos has been introduced (highlighted in green). We can investigate which chaos action has been executed, like here (highlighted in red): “Terminate leader of partition two non-gracefully”.

When following the call activity again, we see which verification is currently being executed.

We are verifying that all pods are ready again after the leader of partition two has been terminated. This information can be extracted from the variables (highlighted in red).

As Operate keeps the history of a process, we can also take a look at past experiments and check which actions or chaos have been introduced.

You can see a large history of executed chaos experiments, actions, and several other details.

This high degree of observability is important if something fails. Here you will see directly at which stage your experiment failed, what was executed before, etc. The incident message (depending on the worker) can also include a helpful note about why a stage failed.

Drink your own champagne

This setup might sound a bit complex at first, but once you understand the generic approach, it actually isn't. In contrast to scripting it, the BPMN automation greatly benefits observability.

Furthermore, with this approach, we are still able to execute our experiments locally (which helps with development and debugging) and to automate them via our Zeebe Testbench cluster. It is fairly easy to use and to execute new QA runs on demand. We drink our own champagne, which helps us improve our overall system, and that is actually the biggest benefit of this setup.

It just feels right to use our own product to automate our own processes. We can sit in the driver's seat of the car we build and ship, feel what our users feel, and improve based on that. It allows us to find bugs and issues earlier, to improve our metrics and other observability measures, and to build confidence that our system can handle certain failure scenarios and situations.

I hope this was helpful and enlightened you a bit about what you can do with Zeebe. As I mentioned at the start, the use cases and possibilities for Zeebe are endless, and the whole Camunda Platform stack supports that pretty well.

Thanks to Christina Ausley, Deepthi Akkoorath and Sebastian Bathke for reviewing this blog post.

Zeebe, or How I Learned to Stop Worrying and Love Batch Processing

Learn how we successfully reduced Zeebe's process execution latency with batch processing during a recent hackday.


Hi, I'm Chris, Senior Software Engineer at Camunda. I have worked at Camunda for around seven years, almost six of them on the Zeebe project, and was recently part of a hackday effort to improve Zeebe's process execution latency.

In the past, we have heard several reports from users describing that the process execution latency of Zeebe, the cloud-native workflow and decision engine for Camunda Platform 8, is sometimes sub-optimal. Some of the reports said that the latency between certain tasks in a process model is too high, others that the general process instance execution latency is too high. This can of course also be heavily affected by the hardware used and by configurations that are wrong for certain use cases, but we also know we have something to improve.

At the beginning of this year, after almost three years of COVID-19, we finally sat together in a meeting room with whiteboards to improve the situation for our users. We called these performance hackdays. It was a nice, interesting, and fruitful experience.

Basics

To dive deeper into what we tried and why, we first need to elaborate on what process instance execution latency means, and what influences it.

A diagram of a process instance including a start, a task, and an end.

The image above shows a process model, from which we can create an instance. The execution of such an instance goes from the start event to the end event; the time this takes is the process execution latency.

Since Zeebe is a complex distributed system, where the process engine is based on a distributed streaming platform, there are several factors influencing the process execution latency. During our performance hackdays, we tried to sum up all potential factors and find bottlenecks we could improve. In the following sections, I will summarize them briefly on a high level.

Stream processing

To execute such a process model, as we have seen above, Zeebe uses a concept called stream processing.

Each element in the process has a specific lifecycle, which is divided into the following:

The element lifecycle - a list of commands and events for each one.

A command requests a state change of a certain element, and an event confirms that the state change happened. Termination occurs when elements are canceled, either internally by events or externally by users.

Commands drive the execution of a process instance. When Zeebe’s stream processor processes a command, state changes are applied (e.g. process instances are modified). Such modifications are confirmed via follow-up events. To split the execution into smaller pieces, not only are follow-up events produced, but also follow-up commands. All of these follow-up records are persisted. Later, the follow-up commands are further processed by the stream processor to continue the instance execution. The idea behind that is that these small chunks of processing should help to achieve high concurrency by alternating execution of different instances on the same partition.
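As a simplified illustration, executing a single element roughly produces a chain of records like the following (the real log contains more record types, for example job and variable records):

ACTIVATE_ELEMENT (command)
  -> ELEMENT_ACTIVATING (follow-up event)
  -> ELEMENT_ACTIVATED (follow-up event)
  -> ... the element's work happens, e.g. a job is created and completed ...
COMPLETE_ELEMENT (command)
  -> ELEMENT_COMPLETING (follow-up event)
  -> ELEMENT_COMPLETED (follow-up event)
  -> ACTIVATE_ELEMENT for the next element (follow-up command)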

Persistence

Before a new command on a partition can be processed, it must be replicated to a quorum (typically majority) of nodes. This procedure is called commit. Committing ensures a record is durable, even in case of complete data loss on an individual broker. The exact semantics of committing are defined by the raft protocol.

Committing of such records can be affected by network latency, for sending the records over the wire. But also by disk latency since we need to persist the records on disk on a quorum of nodes before we can mark the records as committed.

State

Zeebe’s state is stored in RocksDB, which is a key-value store. RocksDB persists data on disk with a log-structured merge tree (LSM Tree) and is made for fast storage environments.

The state contains information about deployed process models and current process instance executions. It is separated per partition, which means a RocksDB instance exists per partition.

Performance hackdays

When we started with the performance hackdays, we already had the necessary infrastructure to run benchmarks for our improvements. We made heavy use of the Camunda Platform 8 benchmark toolkit maintained by Falko Menge.

Furthermore, we run weekly benchmarks (the so-called medic benchmark) in which we test for throughput, latency, and general stability. These benchmarks run for four weeks to detect potential bugs, regressions, memory leaks, performance regressions, and more as early as possible. All of this, the infrastructure around it (like Grafana dashboards), and the knowledge about how our system performs were invaluable in making such great progress during our hackdays.

Measurement

We measured our results continuously, which is necessary to see whether you are on the right track. For every small proof of concept (POC), we ran a new benchmark:

Benchmarks of our Zeebe performance
Screenshot of benchmarks over the week

In our benchmark, we used a process based on some user requirements:

A sample process diagram

Our target was a throughput of around 500 process instances per second (PI/s), with a process execution latency goal of under one second per process instance at the 99th percentile (p99), meaning 99% of all process instances should complete in under one second.

The benchmarks have been executed in the Google Kubernetes Engine. For each broker node, we assigned one n2-standard-8 node to reduce the influence of other pods running on the same node.

Each broker pod had the following configuration:

CPU: 8
RAM: 16 gig
Disk: ssd-xfs
Disk size: 100 gig
CPU Threads: 5
IO Threads: 2
Partitions: 24
Replication factor: 4
Log segment size: 16

There were also some other configurations we played around with during our different experiments, but the above were the general ones. We had eight brokers running, which gives us the following partition distribution:

$ ./partitionDistribution.sh 8 24 4
Distribution:
P\N|	N 0|	N 1|	N 2|	N 3|	N 4|	N 5|	N 6|	N 7
P 0|	L  |	F  |	F  |	F  |	-  |	-  |	-  |	-  
P 1|	-  |	L  |	F  |	F  |	F  |	-  |	-  |	-  
P 2|	-  |	-  |	L  |	F  |	F  |	F  |	-  |	-  
P 3|	-  |	-  |	-  |	L  |	F  |	F  |	F  |	-  
P 4|	-  |	-  |	-  |	-  |	L  |	F  |	F  |	F  
P 5|	F  |	-  |	-  |	-  |	-  |	L  |	F  |	F  
P 6|	F  |	F  |	-  |	-  |	-  |	-  |	L  |	F  
P 7|	F  |	F  |	F  |	-  |	-  |	-  |	-  |	L  
P 8|	L  |	F  |	F  |	F  |	-  |	-  |	-  |	-  
P 9|	-  |	L  |	F  |	F  |	F  |	-  |	-  |	-  
P 10|	-  |	-  |	L  |	F  |	F  |	F  |	-  |	-  
P 11|	-  |	-  |	-  |	L  |	F  |	F  |	F  |	-  
P 12|	-  |	-  |	-  |	-  |	L  |	F  |	F  |	F  
P 13|	F  |	-  |	-  |	-  |	-  |	L  |	F  |	F  
P 14|	F  |	F  |	-  |	-  |	-  |	-  |	L  |	F  
P 15|	F  |	F  |	F  |	-  |	-  |	-  |	-  |	L  
P 16|	L  |	F  |	F  |	F  |	-  |	-  |	-  |	-  
P 17|	-  |	L  |	F  |	F  |	F  |	-  |	-  |	-  
P 18|	-  |	-  |	L  |	F  |	F  |	F  |	-  |	-  
P 19|	-  |	-  |	-  |	L  |	F  |	F  |	F  |	-  
P 20|	-  |	-  |	-  |	-  |	L  |	F  |	F  |	F  
P 21|	F  |	-  |	-  |	-  |	-  |	L  |	F  |	F  
P 22|	F  |	F  |	-  |	-  |	-  |	-  |	L  |	F  
P 23|	F  |	F  |	F  |	-  |	-  |	-  |	-  |	L  

Each broker node had 12 partitions assigned. We used a replication factor of four because we wanted to mimic the geo-redundancy setup of some of our users, who had certain process execution latency requirements. Geo redundancy introduces network latency into the system by default, and we wanted to reduce the influence of this network latency on the process execution latency. To make it a bit more realistic, we used Chaos Mesh to introduce a network latency of 35ms between two brokers, resulting in a round-trip time (RTT) of 70ms.

To run with an evenly distributed partition leadership, we used the partition rebalancing API that Zeebe provides.

Theory

Based on the benchmark process model above, we considered the impact of commands and events on the process model (and also in general).

A whiteboard showing how we sketched out the model and counted the commands necessary

We calculated that around 30 commands are necessary to execute the process instance from start to end.

We tried to summarize what affects the processing latency and came to the following formula:

PEL = X * Commit Latency + Y * Processing Latency + OH

PEL – Process Execution Latency

OH – Overhead, which we haven’t considered (e.g. Jobs * Job Completion Latency)

When we started, X and Y were equal, but the idea was to be able to change these factors independently, which is why we split them up. The other latencies break down as follows:

Commit Latency = Network Latency + Append Latency 
Network Latency = 2 * request duration
Append Latency = Write to Disk + Flush
Processing Latency = Processing Command (apply state changes) + Commit Transaction (RocksDB) + execute side effects
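To make this more tangible, here is a rough back-of-the-envelope calculation with illustrative numbers (the 35ms one-way network latency matches the benchmark setup described above; the append and processing latencies are simply assumed for the sake of the example):

Network Latency    = 2 * 35ms = 70ms
Append Latency     = ~5ms (assumed)
Commit Latency     = 70ms + 5ms = 75ms
Processing Latency = ~1ms (assumed)

PEL ≈ 30 * 75ms + 30 * 1ms ≈ 2,280ms

With numbers like these, the commit latency clearly dominates, which is why reducing either the commit latency itself or the factor X (how often a commit is on the critical path) promises the biggest gains.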

Below is a picture of our whiteboard session, where we discussed potential influences and what potential solution could mitigate which factor:

Image of the whiteboard where influences and solutions were discussed

Proof of concepts

Based on the formula, it became clearer to us what affects the process execution latency and where it might make sense to reduce time. For example, reducing the append latency reduces the commit latency, which in turn reduces the process execution latency. Additionally, reducing the factor of how often the commit latency is paid highly affects the result.

Append and commit latency

Before we started with the performance hackdays, there was already one related configuration, which we built more than two years ago and made available as an experimental feature: disabling the raft flush. We have seen several users applying it to reach certain performance targets, but it comes with a cost: it is not safe to use, since on fail-over certain raft guarantees no longer apply.

As part of the hackdays, we were interested in similar performance, but with more safety. This is why we tried several other possibilities and compared them with disabling the flush completely.

Flush improvement

In one of our POCs, we tried to flush on another thread. This gave similar performance to disabling the flush completely, but with similar safety issues. Combining the async flush with awaiting its completion before committing brought back the safety, but also the old (baseline) performance, so this was no solution.

Implementing a batch flush (flushing only after a configured threshold), running it on a separate thread, and waiting for its completion degraded the performance. However, we again had better safety than with disabling the flush.

We thought about flushing asynchronously in batches, without waiting for completion before committing, and making this configurable. This would allow users to trade safety against performance.
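To make the idea more concrete, here is a minimal conceptual sketch of such a batched, asynchronous flush in Go. Zeebe itself is written in Java, so this is only an illustration of the concept, not the actual implementation, and all names are made up:

package journal

import (
	"os"
	"sync"
)

// batchFlusher flushes an append-only log file asynchronously once a
// configured number of appended entries has accumulated.
type batchFlusher struct {
	mu        sync.Mutex
	file      *os.File
	pending   int
	threshold int
	flushReq  chan struct{}
}

func newBatchFlusher(file *os.File, threshold int) *batchFlusher {
	f := &batchFlusher{file: file, threshold: threshold, flushReq: make(chan struct{}, 1)}
	go f.flushLoop()
	return f
}

// onAppend is called after every appended entry; it only signals the flush
// goroutine once the threshold is reached, instead of flushing every write.
func (f *batchFlusher) onAppend() {
	f.mu.Lock()
	f.pending++
	trigger := f.pending >= f.threshold
	if trigger {
		f.pending = 0
	}
	f.mu.Unlock()

	if trigger {
		select {
		case f.flushReq <- struct{}{}: // request a flush
		default: // a flush is already pending, nothing to do
		}
	}
}

// flushLoop runs on its own goroutine, so fsync never blocks the writer.
func (f *batchFlusher) flushLoop() {
	for range f.flushReq {
		_ = f.file.Sync() // fsync; real code would handle the error
	}
}

The important property is that the writer never waits for the fsync to complete; whether that trade-off is acceptable is exactly what we wanted to make configurable.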

Write improvement

We had a deeper look into system calls such as madvise.

Zeebe stores its log in a segmented journal which is memory-mapped at runtime. The OS manages what is in memory at any time via the page cache, but it does not know the application's access patterns. The madvise system call allows us to give the OS hints on when to read, write, or evict pages.

The idea was to provide hints to reduce memory churn and page faults, and to reduce I/O.

We tested with MADV_SEQUENTIAL, hinting that we will access the file sequentially and a more aggressive read-ahead should be performed (while previous pages can be dropped sooner).

Based on our benchmarks, we didn't see much difference under low or medium load. Under high load, however, read I/O was greatly reduced, and we saw slightly increased write I/O throughput due to reduced IOPS contention. In general, there was only a small improvement in throughput and latency, and surprisingly it still showed a similar number of page faults as before.
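For illustration, this is roughly what such a hint looks like at the system call level, sketched here in Go with golang.org/x/sys/unix (Zeebe does the equivalent from Java for its memory-mapped journal segments):

package journal

import (
	"golang.org/x/sys/unix"
)

// mapSequential memory-maps a journal segment read-only and hints to the OS
// that it will be read sequentially, so a more aggressive read-ahead can be
// used and already-read pages may be dropped sooner.
func mapSequential(path string) ([]byte, error) {
	fd, err := unix.Open(path, unix.O_RDONLY, 0)
	if err != nil {
		return nil, err
	}
	defer unix.Close(fd)

	var stat unix.Stat_t
	if err := unix.Fstat(fd, &stat); err != nil {
		return nil, err
	}

	data, err := unix.Mmap(fd, 0, int(stat.Size), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		return nil, err
	}

	// MADV_SEQUENTIAL is only a hint; the kernel is free to ignore it.
	if err := unix.Madvise(data, unix.MADV_SEQUENTIAL); err != nil {
		unix.Munmap(data)
		return nil, err
	}
	return data, nil
}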

Reduce transaction commits

Based on our formula above, we can see that the processing latency is affected by the RocksDB write and transaction commit duration. This means reducing one of these could benefit the processing latency.

State directory separation

Zeebe stores the current state (runtime) and the snapshots in different folders on disk (under the same parent). When a Zeebe broker restarts, we recreate the state (runtime) from a snapshot every time. This avoids having data in the state which might not have been committed yet.

This means we don't necessarily need to keep the state (runtime) on disk, and RocksDB does a lot of IO-heavy work which might not be necessary. The idea was to separate the state directory so that it can be mounted separately (in Kubernetes), which would allow us to run RocksDB in tmpfs, for example via a memory-backed emptyDir volume.

Based on our benchmarks, only p30 and lower have been improved with this POC:

Benchmark results of the state directory separation POC
Disable WAL

RocksDB has a write-ahead log (WAL) to be crash resistant. We don't need this, since we recreate the state from a snapshot every time anyway. We considered disabling it; we will see later in this post what influence it has. It is a single configuration option, which is easy to change.

Processing of uncommitted

We mentioned earlier that we thought about changing the factor of how many commits influence the overall calculation. What if we process commands even if they are not committed yet, and only send results to the user once the commit of those commands is done?

We worked on a POC to implement uncommitted processing, but it was a bit more complex than we thought due to the buffering of requests, etc. This is why we didn’t find a good solution during our hackdays. We still ran a benchmark to verify how it would behave:

Benchmark results of the uncommitted-processing POC

The results were quite interesting and promising, but we considered them a bit too good to be true. A production-ready implementation might behave differently, since we have to consider more edge cases.

Batch processing

Another POC we did was something we called batch processing. The implementation was rather easy.

The idea was to process the follow-up commands directly and continue the execution of an instance until no more follow-up commands are produced. This normally means we have reached a wait state, like a service task. Camunda Platform 7 users will know this behavior, as this is the Camunda Platform 7 default. The result was promising as well:

Benchmark results of the batch processing POC

For our example process model above, this would reduce the number of paid commit latencies from ~30 to ~15, which is significant. The best IO you can do, after all, is no IO.
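Conceptually, the change can be sketched like this (a simplified Go sketch of the idea with placeholder types; the real implementation lives in Zeebe's Java stream processor and has to respect additional limits, such as a maximum batch size):

package engine

// command is a placeholder for an engine command record.
type command struct{ name string }

// processor is a placeholder for the engine logic: processing one command
// applies state changes and may produce follow-up commands.
type processor interface {
	process(cmd command) (followUps []command)
}

// processBatch processes a committed command and then keeps processing the
// follow-up commands it produces, until no further follow-ups are generated,
// i.e. until the instance reaches a wait state (for example a service task).
// All resulting records are written (and committed) as one batch, instead of
// paying the commit latency for every single follow-up command.
func processBatch(p processor, committed command) []command {
	processed := []command{committed}
	queue := p.process(committed)

	for len(queue) > 0 {
		next := queue[0]
		queue = queue[1:]
		processed = append(processed, next)
		queue = append(queue, p.process(next)...)
	}
	return processed
}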

Combining the POCs

By combining several POCs, we reached our target line, which showed us that it is possible and gave us good insights into where to invest to further improve our system in the future.

Benchmark results after combining several POCs

The improvements did not just reduce the overall latency of the system. In our weekly benchmarks, we had to increase the load because the system was able to reach a higher throughput: before, we reached ~133 process instances per second (PI/s) on average over three partitions; now we reach ~163 PI/s on average, while also reducing the latency by a factor of two.

Next

In the last weeks, we took several ideas from the hackdays to implement some production-ready solutions for Zeebe 8.2. For example:

We plan to work on some more like:

You can expect some better performance with the 8.2 release; I’m really looking forward to April! 🙂

Thanks to all participants of the hackdays for the great and fun collaboration, and to our manager (Sebastian Bathke) who made this possible. It was a really nice experience.

Participants (alphabetically sorted):

Thanks to all the reviewers of this blog post: Christina Ausley, Deepthi Devaki Akkoorath, Nicolas Pepin-Perreault, Ole Schönburg, Philipp Ossler, and Sebastian Bathke.

The post Zeebe, or How I Learned to Stop Worrying and Love Batch Processing appeared first on Camunda.

Zbchaos — A new fault injection tool for Zeebe https://camunda.com/blog/2022/09/zbchaos-a-new-fault-injection-tool-for-zeebe/ Tue, 27 Sep 2022 18:00:00 +0000 https://camunda.com/?p=63238 During Summer Hackdays 2022, I worked on a project called “Zeebe chaos” (zbchaos), a fault injection command-line interface (CLI) tool. This allows us engineers to more easily run chaos experiments against Zeebe, build up confidence in the system’s capabilities, and discover potential weaknesses. Requirements To understand this blog post, it is useful to have a certain understanding of Kubernetes and Zeebe itself. Summer Hackdays: Hackdays are a regular event at Camunda, where people from different departments (engineering, consulting, DevRel, etc.) work together on new ideas, pet projects, and more. Check out previous Summer Hackdays here: Summer Hackdays 2020 Summer Hackdays 2019 Zeebe chaos CLI Working on the Zeebe project is not only about engineering a distributed system or a process...

The post Zbchaos — A new fault injection tool for Zeebe appeared first on Camunda.

During Summer Hackdays 2022, I worked on a project called “Zeebe chaos” (zbchaos), a fault injection command-line interface (CLI) tool. This allows us engineers to more easily run chaos experiments against Zeebe, build up confidence in the system’s capabilities, and discover potential weaknesses.

Requirements

To understand this blog post, it is useful to have a certain understanding of Kubernetes and Zeebe itself.

Summer Hackdays:

Hackdays are a regular event at Camunda, where people from different departments (engineering, consulting, DevRel, etc.) work together on new ideas, pet projects, and more.

Check out previous Summer Hackdays here:

Zeebe chaos CLI

Working on the Zeebe project is not only about engineering a distributed system or a process engine, it is also about testing, benchmarking, and experimenting with our capabilities.

We run regular chaos experiments against Zeebe to build up confidence in our system and to determine whether we have weaknesses in certain areas. In the past, we have written many bash scripts to inject faults (chaos). We wanted to replace them with better tooling: a new CLI. This allows us to make it more maintainable, but also lowers the barrier for others to experiment with the system.

The CLI targets Kubernetes, as this is our recommended environment for Camunda Platform 8 Self-Managed, and the environment our own SaaS offering runs on.

The tool builds upon our existing Helm charts, which are normally used to deploy Zeebe within Kubernetes.

To use the CLI you need to have access to a Kubernetes cluster, and have our Camunda Platform 8 Helm charts deployed. Feel free to try out Camunda Platform 8 Self-Managed here.

Chaos Engineering:

You might be wondering why we need this fault injection CLI tool or what this “chaos” stands for. It comes from chaos engineering, a practice we introduced back in 2019 to the Zeebe Project.

Chaos Engineering was defined by the Principles of Chaos. It should help to build confidence in the system’s capabilities and find potential weaknesses through regular chaos experiments. We define and execute such experiments regularly.

You can learn more about Chaos Engineering in my talk at CamundaCon 2020.2.

Chaos experiments

As mentioned, we regularly write and run new chaos experiments to build up confidence in our system and uncover weaknesses. The first thing you have to do for a chaos experiment is to define a hypothesis that you want to prove, for example: processing should still be possible after a node goes down. Based on the hypothesis, you know what kind of property or steady state you want to verify before and after injecting faults into the system.

A chaos experiment consists of three phases:

  1. Verify the steady state.
  2. Inject chaos.
  3. Verify the steady state.

For each of these phases, the zbchaos CLI provides certain features outlined below.

Verify steady state

In the steady state phase, we want to verify certain properties of the system, like invariants, etc.

One of the first things we typically want to check is the Zeebe topology. With zbchaos you can run:

$ zbchaos topology
0 |LEADER (HEALTHY) |FOLLOWER (HEALTHY) |LEADER (HEALTHY)
1 |FOLLOWER (HEALTHY) |LEADER (HEALTHY) |FOLLOWER (HEALTHY)
2 |FOLLOWER (HEALTHY) |FOLLOWER (HEALTHY) |FOLLOWER (HEALTHY)

Zbchaos does all the necessary magic for you: it finds a Zeebe gateway, does a port-forward, requests the topology, and prints it in a compact format. This makes the chaos engineer's life much easier.

Another basic check is verifying the readiness of all deployed Zeebe components. To achieve this, we can use:

$ zbchaos verify readiness

This verifies the status of the Zeebe broker pods and of the Zeebe gateway deployment. If one of these is not ready yet, the command loops and does not return until they are, which is useful in automation scripts.

After you have verified the general health and readiness of the system, you also need to verify whether the system is working functionally. This is also called “verifying the steady state.” This can be achieved by:

$ zbchaos verify steady-state --partitionId 2

This command checks that a process model can be deployed and a process instance can be started for the specified partition. As you cannot influence the partition for new process instances, process instances are started in a loop until that partition is hit. If you don’t specify the partitionId, partition one is used.
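A minimal sketch of such a loop could look like the following. Here, createProcessInstance is a hypothetical helper wrapping a Zeebe client call, and the bit shift assumes Zeebe's key encoding, where the partition id is stored in the upper bits of a record key:

package chaos

import "fmt"

// createProcessInstance is a hypothetical helper that starts one process
// instance via the Zeebe client and returns its process instance key.
func createProcessInstance() (int64, error) {
	// ... call the Zeebe client here ...
	return 0, nil
}

// startInstanceOnPartition starts process instances until one of them lands
// on the wanted partition, since the client cannot choose the partition.
func startInstanceOnPartition(wantedPartition int64, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		key, err := createProcessInstance()
		if err != nil {
			return err
		}
		// Assumption: the partition id occupies the upper bits of the key.
		if key>>51 == wantedPartition {
			return nil
		}
	}
	return fmt.Errorf("no instance hit partition %d after %d attempts", wantedPartition, maxAttempts)
}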

Inject chaos

After we verify our steady state, we want to inject faults or chaos into our system, and afterward check our steady state again. The zbchaos CLI already provides several possibilities to inject faults, outlined below.

Before we step through how we can inject failures, we need to understand what kind of components a Zeebe cluster consists of and what the architecture looks like.

We have two types of nodes: the broker, and the gateway.

A broker is a node that does the processing work. It can participate in one or more Zeebe partitions (internally each partition is a raft group, which can consist of one or more nodes). A broker can have different roles for each partition (Leader, Follower, etc.)

For more details about the replication, check our documentation and the raft documentation.

The Zeebe gateway is the contact point to the Zeebe cluster to which clients connect. Clients send commands to the gateway and the gateway is in charge of distributing the commands to the partition leaders. This depends on the command type of course. For more details, check out the documentation.

By default, the Zeebe gateway is replicated when Camunda Platform 8 Self-Managed is installed via our Helm charts, which makes it interesting to also experiment with the gateways.

Shutdown nodes

With zbchaos, we can shut down brokers (gracefully and non-gracefully) that have a specific role and take part in a specific partition. This is quite useful when experimenting, since we often want to terminate or restart brokers based on their partition participation and role (e.g. terminate the leader of partition X, or restart all followers of partition Y).

Graceful

A graceful restart can be initiated like this:

$ zbchaos restart -h
Restarts a Zeebe broker with a certain role and given partition.

Usage:
zbchaos restart [flags]

Flags:
  -h, --help help for restart
  --partitionId int Specify the id of the partition (default 1)
  --role string Specify the partition role [LEADER, FOLLOWER, INACTIVE] (default "LEADER")

Global Flags:
-v, --verbose verbose output

This sends a Kubernetes delete command to the pod that takes part in the specified partition and has the specified role. This is based on the current Zeebe topology, provided by the Zeebe gateway. All of this is handled by the zbchaos toolkit; the chaos engineer doesn't need to find this information manually.

Non-graceful

Similar to the graceful restart is the termination of a broker. It sends a delete to the specific Kubernetes pod and sets the --gracePeriod to zero.

$ zbchaos terminate -h
Terminates a Zeebe broker with a certain role and given partition.

Usage:
  zbchaos terminate [flags]
  zbchaos terminate [command]

Available Commands:
  gateway Terminates a Zeebe gateway

Flags:
  -h, --help help for terminate
  --nodeId int Specify the nodeId of the Broker (default -1)
  --partitionId int Specify the id of the partition (default 1)
  --role string Specify the partition role [LEADER, FOLLOWER] (default "LEADER")

Global Flags:
-v, --verbose verbose output

Use "zbchaos terminate [command] --help" for more information about a command.
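Under the hood, such a non-graceful termination boils down to deleting the pod with a grace period of zero. With the Kubernetes go-client, that looks roughly like this (a sketch, not necessarily zbchaos's exact code):

package chaos

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// terminatePod deletes the given broker pod immediately, i.e. without
// waiting for the usual graceful shutdown period.
func terminatePod(ctx context.Context, client kubernetes.Interface, namespace, podName string) error {
	gracePeriod := int64(0)
	return client.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}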

Gateway

Both commands above target the Zeebe brokers. Sometimes, it is also interesting to target the Zeebe gateway. For that, we can just append the gateway subcommand to the restart or terminate command.

Disconnect brokers

It is not only interesting to experiment with graceful and non-graceful restarts; it is also interesting to experiment with network issues. This kind of fault uncovers other interesting weaknesses (bugs).

With the zbchaos CLI, it is possible to disconnect different brokers. We can specify in which partition they participate and what role they have. These network partitions can also be set up in only one direction if the --one-direction flag is used.

$ zbchaos disconnect -h
Disconnect Zeebe nodes, uses sub-commands to disconnect leaders, followers, etc.

Usage:
 zbchaos disconnect [command]

Available Commands:
 brokers Disconnect Zeebe Brokers

Flags:
 -h, --help help for disconnect

Global Flags:
 -v, --verbose verbose output

Use "zbchaos disconnect [command] --help" for more information about a command.
[zell ~/ cluster: zeebe-cluster ns:zell-chaos]$ zbchaos disconnect brokers -h
Disconnect Zeebe Brokers with a given partition and role.

Usage:
 zbchaos disconnect brokers [flags]

Flags:
 --broker1NodeId int Specify the nodeId of the first Broker (default -1)
 --broker1PartitionId int Specify the partition id of the first Broker (default 1)
 --broker1Role string Specify the partition role [LEADER, FOLLOWER] of the first Broker (default "LEADER")
 --broker2NodeId int Specify the nodeId of the second Broker (default -1)
 --broker2PartitionId int Specify the partition id of the second Broker (default 2)
 --broker2Role string Specify the partition role [LEADER, FOLLOWER] of the second Broker (default "LEADER")
 -h, --help help for brokers
 --one-direction Specify whether the network partition should be setup only in one direction (asymmetric)

Global Flags:
 -v, --verbose verbose output

The network partition is established via ip routes, which are installed on the specific broker pods (e.g. by adding an unreachable route for the other broker's pod IP).

Right now this is only supported for the brokers, but hopefully, we will add support for the gateways soon as well.

To connect the brokers again, the following can be used:

$ zbchaos connect brokers

This removes the ip routes on all pods again.

Other features

All the described commands support a verbose flag, which allows the user to determine what kind of action is done, how it connects to the cluster, and more.

For all of the commands, a bash-completion can be generated via zbchaos completion, which is very handy.

Outcome and future

In general, I was quite happy with the outcome of Summer Hackdays 2022, and it was a lot of fun to build and use this tool already. I was able to finally spend some more time writing go code and especially a go CLI. I learned to use the Kubernetes go-client and how to write go tests with fakes for the Kubernetes API, which was quite interesting. You can take a look at the tests here.

We plan to extend the CLI in the future and use it in our upcoming experiments.

For example, I recently did a new chaos day (a day I use to run new experiments) and wrote a post about it. In that article, I extended the CLI with features like sending messages to certain partitions.

At some point, we want to use the functionality within our automated chaos experiments as Zeebe workers and replace our old bash scripts.

Special thanks to Christina Ausley and Bernd Ruecker for reviewing this post 🙂

The post Zbchaos — A new fault injection tool for Zeebe appeared first on Camunda.

Advanced Test Practices For Helm Charts https://camunda.com/blog/2022/03/test/ Wed, 30 Mar 2022 13:00:00 +0000 https://camunda.com/?p=46451 Dive into the detailed learnings and experiences from our journey to find a sufficient way to write automated tests for Helm charts.

The post Advanced Test Practices For Helm Charts appeared first on Camunda.

*Camunda Platform 8, our cloud-native solution for process orchestration, launched in April 2022. Images and supporting documentation in this post may reflect an earlier version of our cloud and software solutions.


I'm a distributed systems engineer working on the Camunda Zeebe project, which is part of Camunda Platform 8. I'm highly interested in SRE topics, so I started maintaining the Helm charts for Camunda Platform 8.

I’m excited to share below the detailed learnings and experiences I had along my journey of finding a sufficient way to write automated tests for Helm charts. At the end of this blog post, I’ll present to you the current solution we’re using, which is meeting all our requirements.

Please, be aware that these are my personal experiences and might be a bit subjective, but I try to be as objective as possible.

How it Began

We started with the community-maintained Helm charts for Zeebe and Camunda Platform 8-related tools like Tasklist and Operate. This project lacked support and had stability issues.

In the past, we often had issues with the charts being broken, sometimes because we added a new feature or property, or because a property had never been used before and was hidden behind a condition. We wanted to avoid that and give users a better experience.

In early 2022, we at Camunda wanted to create some new Helm charts, based on the old ones we had. The new Helm charts needed to be officially supported by Camunda. In order to do that with a clear conscience, we wanted to add some automated tests to the charts.

Prerequisites 

In order to understand this blog post, you should have some knowledge of the following topics:

Helm Testing. What is the Issue?

Testing in the Helm world is, I would say, not as well evolved as it should be. Some tools exist, but they lack usability or need too much boilerplate code, and sometimes it's not really clear how to use them or how to write the tests.

Some posts around that topic already exist, but there aren’t many. For example:

This one really helped us: Automated Testing for Kubernetes and Helm Charts using Terratest.

It explains how to test Helm charts with Terratest, a framework for writing tests for Helm charts and other Kubernetes-related things.

We compared Terratest, writing golden file tests (here's a blog post about that), and using Chart Testing (CT). You can find the details in this GitHub issue.

This issue contains a comparison between the test tools, as well as some subjective field reports, which I wrote during the testing. It helped me to make some decisions.

What and How to Test

First of all, we separated our tests into two parts, with different targets and goals:

  • Template tests (unit tests) – which verify the general structure.
  • Integration tests – which verify whether we can install the charts and use them.

Template tests

With the template tests, we want to verify the general structure. This includes whether the output is valid YAML, whether the right values are set, whether default values change, and whether they are set at all.

For template tests, we combine golden files and Terratest. Generally speaking, golden files store the expected output of a certain command, or the response for a specific request. In our case, the golden files contain the rendered manifests that are output when you run `helm template`. This allows us to verify that the default values are set and only change in a controlled manner, which reduces the burden of writing many tests.

If we want to verify specific properties (or conditions), we can use the direct property tests. We will come to that again later.

This allows us to use one tool (Terratest) and separate the tests per manifest, such as a test for the Zeebe statefulset, the Zeebe gateway deployment, etc. The tests can easily be run via the command line, an IDE, or CI.

Integration tests

With the integration tests we want to test for two things:

  1. Whether the charts can be deployed to Kubernetes, and are accepted by the K8s API.
  2. Whether the services are running and can work with each other.

Other things, like broken templates, incorrectly set values, etc., are caught by the tests above.

So to turn it around, here are potential failure cases we can find with such tests:

  1. Specifications that are in the wrong place (look like valid yaml), but aren’t accepted by the K8s API.
  2. Services that aren’t becoming ready because of configuration errors, and they can’t reach each other.

We could also catch the first case with other tools that validate manifests against the K8s API, but not the second one.

In order to write the integration tests, we tried out the Chart Testing tool and Terratest. We chose Terratest over Chart Testing. If you want to know why, read the next section. Otherwise, you can simply skip it.

Chart testing

While trying to write the tests using Chart Testing, we encountered several issues that made the tool difficult to use, and the tests difficult to maintain.

For example, the options to configure the testing process seem rather limited – see the CT install documentation for available options. In particular, during the Helm install phase, our tests deploy a lot of components (Elasticsearch, Zeebe) that take ages to become ready. However, Chart Testing times out by default after three minutes, and we didn't find a way to adjust this setting. As such, we were never able to run a successful test using the `ct` CLI.

Another painful point was the way the tests are shipped and executed, and eventually how results are reported. Simply speaking, the Chart Testing tool wraps the Helm CLI and runs the `helm install` and `helm test` commands. To be executed via `helm test`, the tests have to be configured and deployed as part of the Helm chart. This means the tests have to be embedded inside a Docker image, which might not be super practical, and the Helm chart also needs to be modified to ship with the additional test settings.

If the tests fail in CI and you want to reproduce the failure, you need the `ct` CLI locally, then run `ct install` to redeploy the whole Helm chart and run the tests. When the tests fail, the complete logs of all containers are printed, which can be a large amount of data to inspect. We found it difficult to iterate on the tests and quite cumbersome to debug them when they were failing.

All the reasons above pushed us to use Terratest (see next section) to write the tests. The benefit here is that we have one tool for both unit and integration tests, and more control over it. It makes it easy to run and debug the tests. In general, the tests were also quite simple to write, and the failures were easy to understand.

For more information regarding this, please check the comments in the GitHub issue.

Helm Chart Tests In Practice

In the following section, I would like to present how we use Terratest, and what our new tests for our Helm charts look like.

Golden files test

We wrote a base test, which renders given Helm templates and compares them against golden files. The golden files can be generated via a separate flag. The golden files are tracked in git, which allows us to see changes easily via a git diff. This means if we change any defaults, we can directly see the resulting rendered manifests. These tests ensure that the Helm chart templates render correctly and the output of the templates changes in a controlled manner.

Golden Base

package golden

import (
	"flag"
	"io/ioutil"

	"regexp"

	"github.com/gruntwork-io/terratest/modules/helm"
	"github.com/gruntwork-io/terratest/modules/k8s"
	"github.com/stretchr/testify/suite"
)

var update = flag.Bool("update-golden", false, "update golden test output files")

type TemplateGoldenTest struct {
	suite.Suite
	ChartPath      string
	Release        string
	Namespace      string
	GoldenFileName string
	Templates      []string
	SetValues      map[string]string
}

func (s *TemplateGoldenTest) TestContainerGoldenTestDefaults() {
	options := &helm.Options{
		KubectlOptions: k8s.NewKubectlOptions("", "", s.Namespace),
		SetValues:      s.SetValues,
	}
	output := helm.RenderTemplate(s.T(), options, s.ChartPath, s.Release, s.Templates)
	regex := regexp.MustCompile(`\s+helm.sh/chart:\s+.*`)
	bytes := regex.ReplaceAll([]byte(output), []byte(""))
	output = string(bytes)

	goldenFile := "golden/" + s.GoldenFileName + ".golden.yaml"

	if *update {
		err := ioutil.WriteFile(goldenFile, bytes, 0644)
		s.Require().NoError(err, "Golden file was not writable")
	}

	expected, err := ioutil.ReadFile(goldenFile)

	// then
	s.Require().NoError(err, "Golden file doesn't exist or was not readable")
	s.Require().Equal(string(expected), output)
}
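If a change to the default values is intentional, the golden files can be regenerated by passing that flag to the test binary, for example with something like `go test ./... -args -update-golden` for the packages that use the golden base test (the exact invocation or Make target depends on the repository layout); the resulting git diff then shows exactly how the rendered manifests changed.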

The base test allows us to easily add/write new golden file tests for each of our sub charts. For example, we have the following test for the Zeebe sub-chart:

package zeebe

import (
	"path/filepath"
	"strings"
	"testing"

	"camunda-cloud-helm/charts/ccsm-helm/test/golden"

	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/stretchr/testify/require"
	"github.com/stretchr/testify/suite"
)

func TestGoldenDefaultsTemplate(t *testing.T) {
	t.Parallel()

	chartPath, err := filepath.Abs("../../")
	require.NoError(t, err)
	templateNames := []string{"service", "serviceaccount", "statefulset", "configmap"}

	for _, name := range templateNames {
		suite.Run(t, &golden.TemplateGoldenTest{
			ChartPath:      chartPath,
			Release:        "ccsm-helm-test",
			Namespace:      "ccsm-helm-" + strings.ToLower(random.UniqueId()),
			GoldenFileName: name,
			Templates:      []string{"charts/zeebe/templates/" + name + ".yaml"},
		})
	}
}

Here, we test the Zeebe resources (service, serviceaccount, statefulset, and configmap) with default values against the golden files. Here are the golden files.

Property test:

As described above, we sometimes want to test specific properties, like conditions in our templates. Here it's easier to write specific Terratest tests.

We do that for each manifest, like the statefulset, and then call it `statefulset_test.go`.

In such a go test file, we have a base structure, which looks like this:

type statefulSetTest struct {
	suite.Suite
	chartPath string
	release   string
	namespace string
	templates []string
}

func TestStatefulSetTemplate(t *testing.T) {
	t.Parallel()

	chartPath, err := filepath.Abs("../../")
	require.NoError(t, err)

	suite.Run(t, &statefulSetTest{
		chartPath: chartPath,
		release:   "ccsm-helm-test",
		namespace: "ccsm-helm-" + strings.ToLower(random.UniqueId()),
		templates: []string{"charts/zeebe/templates/statefulset.yaml"},
	})
}

If we want to test a condition in our templates, which looks like this:

 spec:
      {{- if .Values.priorityClassName }}
      priorityClassName: {{ .Values.priorityClassName | quote }}
      {{- end }}

Then, we can easily add such tests to the `statefulset_test.go` file. That would look like this:

func (s *statefulSetTest) TestContainerSetPriorityClassName() {
	// given
	options := &helm.Options{
		SetValues: map[string]string{
			"zeebe.priorityClassName": "PRIO",
		},
		KubectlOptions: k8s.NewKubectlOptions("", "", s.namespace),
	}

	// when
	output := helm.RenderTemplate(s.T(), options, s.chartPath, s.release, s.templates)
	var statefulSet v1.StatefulSet
	helm.UnmarshalK8SYaml(s.T(), output, &statefulSet)

	// then
	s.Require().Equal("PRIO", statefulSet.Spec.Template.Spec.PriorityClassName)
}

In this test, we set the priorityClassName to a custom value like “PRIO”, render the template, and verify that the object (statefulset) contains that value.

Integration test

Terratest allows us to write not only template tests, but also real integration tests. This means we can access a Kubernetes cluster, create namespaces, install the Helm chart, and verify certain properties.

I'll only present the basic setup here, since anything more would go too far. If you're interested in what our full integration test looks like, check this out; there we set up the namespaces, install the Helm charts, and test each service we deploy.

Basic Setup:

//go:build integration
// +build integration

package integration

import (
  "os"
  "path/filepath"
  "strings"
  "time"

  "testing"

  "github.com/gruntwork-io/terratest/modules/helm"
  "github.com/gruntwork-io/terratest/modules/k8s"
  "github.com/gruntwork-io/terratest/modules/random"
  "github.com/stretchr/testify/require"
  "github.com/stretchr/testify/suite"
  v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type integrationTest struct {
  suite.Suite
  chartPath   string
  release     string
  namespace   string
  kubeOptions *k8s.KubectlOptions
}

func TestIntegration(t *testing.T) {
  chartPath, err := filepath.Abs("../../")
  require.NoError(t, err)

  namespace := createNamespaceName()
  kubeOptions := k8s.NewKubectlOptions("<KUBERNETES_CLUSTER_NAME>", "", namespace)

  suite.Run(t, &integrationTest{
     chartPath:   chartPath,
     release:     "zeebe-cluster-helm-it",
     namespace:   namespace,
     kubeOptions: kubeOptions,
  })
}

Similar to the property tests above, we have a base structure that allows us to write the integration tests and sets up the test environment. It allows us to specify the target Kubernetes cluster via kubeOptions.

In order to separate the integration tests from the normal unit tests, we use go build tags. The first lines above define the tag `integration`, which allows us to run these tests only via `go test -tags integration ./.../integration`.

We create the Kubernetes namespace name either randomly (using a helper from Terratest) or based on the git commit if triggered as a GitHub action. We’ll get to that later.

func truncateString(str string, num int) string {
  shortenStr := str
  if len(str) > num {
     shortenStr = str[0:num]
  }
  return shortenStr
}

func createNamespaceName() string {
  // if triggered by a github action the environment variable is set
  // we use it to better identify the test
  commitSHA, exist := os.LookupEnv("GITHUB_SHA")
  namespace := "ccsm-helm-"
  if !exist {
     namespace += strings.ToLower(random.UniqueId())
  } else {
     namespace += commitSHA
  }

  // max namespace length is 63 characters
  // https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names
  return truncateString(namespace, 63)
}

The Go testify suite allows us to run functions before and after a test, which we use to create and delete a namespace.

func (s *integrationTest) SetupTest() {
  k8s.CreateNamespace(s.T(), s.kubeOptions, s.namespace)
}

func (s *integrationTest) TearDownTest() {
  k8s.DeleteNamespace(s.T(), s.kubeOptions, s.namespace)
}

The example integration test is fairly simple: we install the Helm chart with default values and wait until all pods are available. For that, we can use some helpers that Terratest offers, for example:

func (s *integrationTest) TestServicesEnd2End() {
  // given
  options := &helm.Options{
     KubectlOptions: s.kubeOptions,
  }

  // when
  helm.Install(s.T(), options, s.chartPath, s.release)

  // then
  // await that all ccsm related pods become ready
  pods := k8s.ListPods(s.T(), s.kubeOptions, v1.ListOptions{LabelSelector: "app=camunda-cloud-self-managed"})

  for _, pod := range pods {
     k8s.WaitUntilPodAvailable(s.T(), s.kubeOptions, pod.Name, 10, 10*time.Second)
  }
}

As written above, our actual integration test is far more complex, but this should give you a good idea of what you can do. Since Terratest is written in Go, we could write all our tests in Go, use mechanisms like build tags, and use Go libraries like testify. Terratest makes it easy to access the Kubernetes API, run Helm commands like `install`, and validate the outcome. I really appreciate its verbosity, since the rendered Helm templates are printed to standard out when running the tests, which helps with debugging. After implementing the integration tests, we were quite satisfied with the result and with the code-based testing approach, which stands in contrast to the separate abstraction around the tests that you would have with the Chart Testing tool.

After creating such integration tests, we, of course, wanted to automate them. We did that with GitHub actions (see next section).

Automation

As written above, we automate our tests via GitHub Actions. For normal tests, this is quite simple. You can find an example here of how we run our normal template tests. 

It becomes more involved for integration tests, where you want to connect to an external Kubernetes cluster. Since we use GKE, we also use the corresponding GitHub actions to authenticate with Google Cloud and get the credentials.

Follow this guide to set up the needed workload identity federation. This is the recommended way to authenticate with Google Cloud resources from outside, replacing the old usage of service account keys. Workload identity federation lets you access resources directly, using a short-lived access token, and eliminates the maintenance and security burden associated with service account keys.

After setting up the workload identity federation, the usage in GitHub actions is fairly simple.

As an example, we use the following in our GitHub action:


   # Add "id-token" with the intended permissions.
    permissions:
      contents: 'read'
      id-token: 'write'
      
    steps:
    - uses: actions/checkout@v3
    - id: 'auth'
      name: 'Authenticate to Google Cloud'
      uses: 'google-github-actions/auth@v0'
      with:
        workload_identity_provider: '<Workload Identity Provider resource name>'
        service_account: '<service-account-name>@<project-id>.iam.gserviceaccount.com'
    - id: 'get-credentials'
      name: 'Get GKE credentials'
      uses: 'google-github-actions/get-gke-credentials@v0'
      with:
        cluster_name: '<cluster-name>'
        location: 'europe-west1-b'
    # The KUBECONFIG env var is automatically exported and picked up by kubectl.
    - id: 'check-credentials'
      name: 'Check credentials'
      run: 'kubectl auth can-i create deployment'

This is based on the examples in google-github-actions/auth and google-github-actions/get-gke-credentials. Checking the credentials is the last step; it verifies whether we have enough permissions to create a deployment, which is necessary for our integration tests.

After this, you just need to install Helm and Go in your GitHub action container. To run the integration tests, you can execute go test with the integration build tag (as described above). We use a Makefile for that. Take a look at the full GitHub action.

Last Words

We are now quite satisfied with the new approach and tests. Writing such tests has allowed us to detect several issues in our Helm charts, which is quite rewarding. It's fun to write and execute them (the template tests are quite fast), and they always give us good feedback.

Side note: what I really like about Terratest is not only the functionality and how easy it is to write the tests, but also its verbosity. On each run, the complete rendered template is printed, which is quite helpful, and when something fails, it's clear where the issue is.


I hope this knowledge and the examples above help you. Feel free to tweet me if you have any thoughts to share or better ideas on how to test Helm charts. 🙂

The post Advanced Test Practices For Helm Charts appeared first on Camunda.

Argon2 as password-hashing function in Camunda https://camunda.com/blog/2017/02/customize-pw-hash/ Fri, 24 Feb 2017 13:00:00 +0000 https://wp-camunda.test/customize-pw-hash/ Introduction On the new version of the Camunda Engine Platform (7.7) the user passwords, which are stored in the database, are by default hashed with a SHA-2 family algorithm. Before the passwords are hashed, they are concated with an individual random generated salt for each user, to prevent dictionary and rainbow table attacks. For someone who needs a more secure hashing algorithm Camunda introduce a new API, which allows to customize and exchange the default hashing algorithm. In this blog post I will present this customization and will use argon2 as hashing algorithm. Argon2 is a password-hashing function 1, which is considered as state of the art and also won the Password Hashing Competition at the end of 2015 2....

The post Argon2 as password-hashing function in Camunda appeared first on Camunda.

Introduction

In the new version of the Camunda engine platform (7.7), the user passwords stored in the database are hashed by default with a SHA-2 family algorithm.
Before the passwords are hashed, they are concatenated with an individual, randomly generated salt for each user to prevent dictionary and rainbow table attacks.

For anyone who needs a more secure hashing algorithm, Camunda introduces a new API which allows customizing and exchanging the default hashing algorithm.
In this blog post I will present this customization and use argon2 as the hashing algorithm. Argon2 is a password-hashing function 1, which is considered state of the art and also won the Password Hashing Competition at the end of 2015 2.

Customization

To use a different password hashing function you have to implement the PasswordEncryptor interface.
This interface offers the methods to hash and verify the password. In the following example, the
argon2 implementation 3 is used to hash and verify the password.

package org.camunda.bpm.unittest;

import de.mkammerer.argon2.Argon2;
import de.mkammerer.argon2.Argon2Factory;
import org.camunda.bpm.engine.impl.digest.Base64EncodedHashDigest;
import org.camunda.bpm.engine.impl.digest.PasswordEncryptor;
import org.camunda.bpm.engine.impl.digest._apacheCommonsCodec.Base64;

/**
 * @author Christopher Kujawa <christopher.zell@camunda.com>
 */
public class Argon2HashAlgorithm extends Base64EncodedHashDigest implements PasswordEncryptor {
  public String hashAlgorithmName() {
    return "argon2";
  }

  @Override
  public boolean check(String password, String encrypted) {

    // Create instance
    Argon2 argon2 = Argon2Factory.create();

    // Verify password
    return argon2.verify(new String(Base64.decodeBase64(encrypted)), password);
  }

  @Override
  protected byte[] createByteHash(String password) {

    // Create instance
    Argon2 argon2 = Argon2Factory.create();

    // Hash password
    // 2 iterations, 65536 Memory, 1 parallelism
    String hash = argon2.hash(2, 65536, 1, password);
    return hash.getBytes();
  }
}

In order to use the created PasswordEncryptor implementation which uses argon2, you have to
set the passwordEncryptor property of the ProcessEngineConfiguration. This can be done in the camunda.cfg.xml
and could look like the following snippet:

  <bean id="processEngineConfiguration" class="org.camunda.bpm.engine.impl.cfg.StandaloneInMemProcessEngineConfiguration">

    <!--
      ...
    -->

    <!-- password hash algorithm -->
    <property name="passwordEncryptor">
      <bean class="org.camunda.bpm.unittest.Argon2HashAlgorithm" />
    </property>

  </bean>

For the complete example see this repository. For further information
about password hashing in Camunda, see the documentation.

The post Argon2 as password-hashing function in Camunda appeared first on Camunda.

Camunda BPM 7.6.0-alpha5 Released https://camunda.com/blog/2016/10/camunda-bpm-760-alpha5-released/ Fri, 14 Oct 2016 15:00:00 +0000 https://wp-camunda.test/camunda-bpm-760-alpha5-released/ Camunda 7.6.0-alpha5 is here and it is packed with new features. The highlights are: Implementation of the BPMN Conditional Event Batch Cancellation of Historic Process Instances Huge performance improvements due to caching of Scripting Engines and Compiled Scripts in DMN Engine Expressions in Signal and Message Event Names Cockpit Usability Improvements Pluggable Deployment Cache 10 Bug Fixes List of known Issues The complete release notes are available in Jira. You can Download Camunda For Free or Run it with Docker. BPMN Conditional Event This release introduces support for the BPMN Conditional Event. This is the last BPMN 2.0 even type which was not already supported by the Camunda Engine. The Conditional Event provides use cases such as waiting in the...

The post Camunda BPM 7.6.0-alpha5 Released appeared first on Camunda.

Camunda 7.6.0-alpha5 is here and it is packed with new features. The highlights are:

  • Implementation of the BPMN Conditional Event
  • Batch Cancellation of Historic Process Instances
  • Huge performance improvements due to caching of Scripting Engines and Compiled Scripts in DMN Engine
  • Expressions in Signal and Message Event Names
  • Cockpit Usability Improvements
  • Pluggable Deployment Cache
  • 10 Bug Fixes

List of known Issues

The complete release notes are available in Jira.

You can Download Camunda For Free
or Run it with Docker.

BPMN Conditional Event

This release introduces support for the BPMN Conditional Event. This is the last BPMN 2.0 event type which was not already supported by the Camunda Engine. The Conditional Event provides use cases such as

  • waiting in the process until a certain condition is fulfilled,
  • canceling an activity or scope when a condition is fulfilled,
  • creating a new token when a condition is fulfilled.

This is an extremely powerful feature which allows users to build very dynamic and event-driven processes.
See for example the following process model:

event driven process

If the state is invalid after processing, the work will be repeated. The conditional event can also be used if the processing should be canceled and the current state should be archived before that.

In Camunda, conditional events are data-driven: they are evaluated when a variable changes.
Say the execution is waiting in the last user task. If the following code is executed, the conditional boundary event will be triggered.

  taskService.setVariable(task.getId(), "state", "invalid");

A corresponding conditional event definition can look like this:

  <conditionalEventDefinition>
    <condition xsi:type="tFormalExpression">${state == "invalid"}</condition>
  </conditionalEventDefinition>

For more information see the documentation of conditional events.

Note that there are currently known issues (CAM-6862), which we are addressing for the next alpha release.

Batch Delete of Historic Process Instances

This feature has been requested quite frequently by our users: cleaning up historic data once it is no longer needed. This release introduces both an API and a Cockpit Plugin (UI) to batch delete historic Process Instances based on either a list of IDs or a search query:

Batch Delete of Historic Process Instances Screenshot

This feature is part of the bigger “Batch Operations” cluster, in the context of which we have already provided the possibility to cancel large numbers of running process instances. Currently on its way, but not yet part of this release, is the feature to set job retries as a batch.

Performance Improvements of DMN Engine

A potential customer evaluated our DMN Engine against another very popular open source business rules engine in a project. Of course our engine was a lot faster, but when they showed us the numbers, we thought it could be better still. So we introduced caching of Script Engines and Compiled Scripts (Groovy, …), which made it even faster.

Caching of compiled scripts is something we have already supported for some time in BPMN, and now users can enjoy the same benefits and fast performance in DMN.

As a result, we could process the workload (500K evaluations of a decision table) in 12 seconds compared to 76 seconds without the caching. (By comparison, the other open source rule engine took over 2 minutes longer.)

This is the second performance improvement we bring to the DMN engine in the context of the 7.6.0 release cycle.

Expressions in Signal and Message Event Names

It is now possible to use expressions in signal and message event names. This allows subscribing to messages and signals for which the name is only known at runtime. It enables advanced workflow patterns like synchronizing tokens on different branches or canceling groups of process instances based on a common key.

Consider the following example:

example process

The signal thrown on the upper branch is supposed to cancel the task on the bottom branch. This always worked, but the problem with using signals is that they are broadcasted. So while the signal sent from the upper branch would effectively cancel the task on the bottom branch in the current process instance, it would also trigger all catching signal events with the same name in all other process instances.

This release introduces the possibility to use expressions in signal names and thereby allows scoping the signal to the current process instance:

<signal name="cancel-${execution.processInstanceId}" />

The documentation has the details for message events and signal events.

Cockpit Usability Improvements

This release adds a couple of usability improvements to cockpit.

For example, all search and filter bars now use the same search widgets for consistency:

search widgets

Pluggable Deployment Cache

It is now possible to plug in custom implementations for the caches used for BPMN Process Definitions, CMMN Case Definitions, and DMN Decision Definitions. Also, the default cache implementation now has a configurable eviction setting to prevent out-of-memory errors in scenarios where users deploy tens of thousands of process definitions.

Read more about it in the docs.

What’s next?

We are getting closer to the 7.6.0 release which is scheduled for November 30, 2016. The last larger open topic is support for DMN DRDs (Decision Requirements Diagrams) in Modeling and Cockpit. Execution Support for DRDs has been added with 7.6.0-alpha2 and we are now working on bringing them into monitoring (Cockpit) and the Modeling tools (bpmn.io / Camunda Modeler).

Feedback Welcome

Please try out the awesome new features of this release and provide feedback by commenting on this post or reaching out to us in the forum.

The post Camunda BPM 7.6.0-alpha5 Released appeared first on Camunda.

]]>
Camunda BPM 7.6.0-alpha4 Released https://camunda.com/blog/2016/09/camunda-bpm-760-alpha4-released/ Fri, 23 Sep 2016 15:00:00 +0000 https://wp-camunda.test/camunda-bpm-760-alpha4-released/ Camunda 7.6.0-alpha4 is here and it is packed with new features. The highlights are: Batch Cancellation of Process Instances CMMN Monitoring in Cockpit New Home Page for the Webapplication Improved Metrics API 25 Bug Fixes The complete release notes are available in Jira. You can Download Camunda For Free or Run it with Docker. Batch Cancellation of Process Instances The new alpha version comes along with a new batch operation. This feature is only available in the Enterprise Edition of Cockpit. It is now possible to cancel process instances asynchronously, based on search criteria andor a list of process instance Ids. This allows users to cancel a huge number of process instances. The page is accessible from the process instance...

The post Camunda BPM 7.6.0-alpha4 Released appeared first on Camunda.

]]>
Camunda 7.6.0-alpha4 is here and it is packed with new features. The highlights are:

  • Batch Cancellation of Process Instances
  • CMMN Monitoring in Cockpit
  • New Home Page for the Webapplication
  • Improved Metrics API
  • 25 Bug Fixes

The complete release notes are available in Jira.

You can Download Camunda For Free or Run it with Docker.

Batch Cancellation of Process Instances

The new alpha version comes along with a new batch operation. This feature is only available in the Enterprise Edition of Cockpit.

It is now possible to cancel process instances asynchronously, based on search criteria and/or a list of process instance IDs. This allows users to cancel a huge number of process instances.

Batch Process Instances Cancellation

The page is accessible from the process instance search on the dashboard in Cockpit.

Batch Cancel operations are also available in the Java and REST API.
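On the Java side, a cancellation batch can be created roughly as follows. This is a sketch: the deleteProcessInstancesAsync method and its parameters should be verified against the RuntimeService Javadoc, and the instance IDs, definition key, and delete reason are placeholders.

// Sketch: create a cancellation batch via the Java API; verify the exact
// deleteProcessInstancesAsync overloads against the RuntimeService Javadoc.
RuntimeService runtimeService = processEngine.getRuntimeService();

List<String> processInstanceIds = Arrays.asList("instanceId1", "instanceId2");
ProcessInstanceQuery query = runtimeService.createProcessInstanceQuery()
    .processDefinitionKey("invoice");

// IDs and query are combined into one batch; one of the two may be null.
Batch batch = runtimeService.deleteProcessInstancesAsync(
    processInstanceIds, query, "canceled because no longer needed");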

CMMN Monitoring in Cockpit

This release introduces Monitoring Capabilities for CMMN in Cockpit. This feature is only available in the Enterprise Edition of Cockpit.

On the dashboard, a new tile named “Cases” is available:

CMMN Cockpit

Clicking on “Case Definitions” leads us to the Cases Dashboard where we can see a list of deployed Case Definitions and search for Case Instances:

CMMN Cockpit

When we select a Case Definition, the Case Definition View opens. This view shows the diagram of the latest version of the Case Definition. On the diagram, we can see how many times a particular Plan Item (Task) has been created and completed. We can also search for Case Instances:

CMMN Cockpit

Finally, if we drill down into a particular Case Instance, the Case Instance View opens. This view provides all relevant information about this particular Case Instance. In addition, it is possible to modify variables and terminate the Case Instance:

CMMN Cockpit

CMMN Human Tasks in Tasklist

As a cherry on top, Tasklist now displays the CMMN diagram if the current task is part of a CMMN Case:

CMMN Diagram in Tasklist

New Home Page for the Webapplication

We have added a new “Home Page” to the Camunda Webapplication:

Camunda Webapplication Home Page

This is the first page the user sees when opening the Webapplication. It provides 3 basic functionalities:

  • Overview of which applications are available to the user.
  • Possibility to edit the user profile and change the password.
  • List of additional links which can be customized.

Improved Metrics API

This alpha release introduces the new metric interval query:

// startTime and endTime are java.util.Date values delimiting the queried range
long sixtyMinutesIntervalInSeconds = 60 * 60;

List<MetricIntervalValue> intervals = managementService.createMetricsQuery()
    .startDate(startTime)
    .endDate(endTime)
    .name(Metrics.ACTIVTY_INSTANCE_START)
    .interval(sixtyMinutesIntervalInSeconds);

The query returns a list of MetricIntervalValue objects, each of which provides the number of started (created) activity instances during an interval of 60 minutes. The results can be plotted to give an impression of load over time:

Activity Start Interval Metric

Similar data points can be retrieved for the other available metrics: activity-instance-end, job-acquisition-attempt, job-acquired-success, job-acquired-failure, job-execution-rejected, job-successful, job-failed, job-locked-exclusive, and executed-decision-elements.
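
To pull the raw data points out of the query result, for example to feed them into a charting library, a loop like the one below is enough. The getter names on MetricIntervalValue are assumed from the Javadoc and worth double-checking:

// Print one data point per 60-minute bucket; getters on MetricIntervalValue
// (getName, getTimestamp, getValue) are assumed from the Javadoc.
for (MetricIntervalValue interval : intervals) {
  System.out.println(interval.getName() + " @ " + interval.getTimestamp()
      + ": " + interval.getValue());
}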

Feedback Welcome

Please try out the awesome new features of this release and provide feedback by commenting on this post or reaching out to us in the forum.

The post Camunda BPM 7.6.0-alpha4 Released appeared first on Camunda.

]]>