Selected stories

The following are selected stories about my work on machine learning, distributed systems, event sourcing and system integration. They primarily cover open source work but also include some closed source work from industry projects.

Machine learning

Since 2017, I have been developing AI search engines for digital asset management (DAM) system providers, supporting semantic, cross-modal search of images, videos, sounds and documents in asset databases. Assets can be retrieved with a combination of semantic search, keyword matching and metadata filtering. The search engines also support facial recognition for identity-constrained searches and image aesthetics assessment for selecting images with the highest perceived quality. AI-based query analysis improves the semantic understanding of user queries, leading to better search result quality. AI models in the index and search pipelines are fine-tuned on both customer-specific and synthetic data for a better understanding of the customer’s business domain. An initial version of the AI search engine was offered by MerlinOne as the Merlin Accelerated Intelligence (AI) Suite. Its commercial success was a major factor in the acquisition of MerlinOne by Canto in 2023.
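
To make the retrieval approach more concrete, here is a minimal sketch of how semantic similarity, keyword matching and metadata filtering can be combined into a single ranking. All names, data layout and the weighting scheme are illustrative assumptions, not the product's actual implementation.

```python
import numpy as np

def hybrid_search(query_embedding, query_keywords, metadata_filter, assets, alpha=0.7):
    """Rank assets by a weighted combination of semantic and keyword scores,
    after dropping assets that do not match the metadata filter."""
    results = []
    for asset in assets:
        # Metadata filtering: skip assets that do not satisfy all filter criteria.
        if not all(asset["metadata"].get(k) == v for k, v in metadata_filter.items()):
            continue
        # Semantic score: cosine similarity between query and asset embeddings.
        a = asset["embedding"]
        semantic = float(np.dot(query_embedding, a) /
                         (np.linalg.norm(query_embedding) * np.linalg.norm(a)))
        # Keyword score: fraction of query keywords present in the asset's keywords.
        keyword = len(query_keywords & asset["keywords"]) / max(len(query_keywords), 1)
        results.append((alpha * semantic + (1 - alpha) * keyword, asset["id"]))
    return sorted(results, reverse=True)
```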

More recently, I started to explore LLM-based agentic systems, inspired by tools like AutoGen, crewAI or LangChain. Although most applications use the power of commercial, API-based models, I am also interested in using smaller, open LLMs for deployment in resource-constrained environments. I experimented with fine-tuning 7B LLMs on planning tasks with synthetic agent trajectories and was able to reach GPT-4 level planning performance. Application examples include agentic RAG with open source search tools, combined with the use of other tools for sending emails, updating calendars or executing generated code. Despite promising results, it is also important to understand the limitations of LLMs in planning tasks and options for using them in combination with external verifiers.
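
As an illustration of the basic pattern behind such systems, here is a minimal sketch of an agent loop that alternates between LLM-proposed tool calls and tool observations. The `call_llm` function and the tool set are placeholders of my own, not the API of AutoGen, crewAI or LangChain.

```python
import json

def search_documents(query: str) -> str:
    return f"(stub) top documents for '{query}'"  # placeholder search tool

TOOLS = {"search_documents": search_documents}

def run_agent(task: str, call_llm, max_steps: int = 5) -> str:
    # The LLM is prompted to answer with JSON: either a tool call
    # {"tool": ..., "input": ...} or a final answer {"answer": ...}.
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply_text = call_llm(history)
        history.append({"role": "assistant", "content": reply_text})
        reply = json.loads(reply_text)
        if "answer" in reply:
            return reply["answer"]
        # Execute the proposed tool call and feed the observation back.
        observation = TOOLS[reply["tool"]](reply["input"])
        history.append({"role": "tool", "content": observation})
    return "max steps exceeded"
```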

I also implemented models from several research papers from scratch, either as an exercise or because I needed a custom implementation in a project. Examples include the single-image super-resolution models EDSR, WDSR and SRGAN in the super-resolution repository (citations), selected components from several image captioning papers to implement an image captioning transformer in the fairseq-image-captioning repository (citations), and the multimodal perception models Perceiver, Perceiver IO and Perceiver AR in the perceiver-io repository (citations). All of these repositories are now used and/or referenced in other research work (see citations). I especially enjoyed implementing Perceiver AR, an auto-regressive sequence model with cross-attention to long-range inputs and rotary position embeddings. It was a great opportunity to build and train a GPT-like LLM from scratch.
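
The attention pattern that makes Perceiver AR interesting can be sketched in a few lines of PyTorch. This is an illustrative single-head version with rotary position embeddings omitted for brevity, not the implementation from the perceiver-io repository: queries come only from a latent suffix of the sequence, which cross-attends to the entire prefix plus itself under a causal mask.

```python
import torch
import torch.nn.functional as F

def perceiver_ar_attention(x, n_latents, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d) projection matrices.
    q = x[-n_latents:] @ w_q   # queries only for the latent suffix
    k = x @ w_k                # keys and values over the full sequence
    v = x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    # Causal mask: latent i may attend to the whole prefix and to latents 0..i.
    seq_len = x.shape[0]
    prefix_len = seq_len - n_latents
    mask = torch.ones(n_latents, seq_len, dtype=torch.bool)
    mask[:, prefix_len:] = torch.tril(torch.ones(n_latents, n_latents, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v  # (n_latents, d)
```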

My work on Bayesian statistics and Bayesian methods for machine learning is collected in the bayesian-machine-learning repository (citations). Each notebook in this repository covers a single topic and combines an introduction, mathematical basics and a simple implementation. A direct connection from theory to running code is something I missed in the available literature when I started learning about Bayesian approaches. This repository is an attempt to improve on that, and writing it also helped me to better understand each topic. It has been gratifying to receive encouraging feedback, for example on Variational inference in Bayesian neural networks and Bayesian optimization, or directly in the repository. This response suggests the repository may be helpful for others in their learning journey as well.
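
In that spirit, here is a small theory-to-code example of the kind the notebooks aim for (my choice of example here, not an excerpt from the repository): the closed-form posterior of Bayesian linear regression with a zero-mean Gaussian prior of precision alpha and Gaussian noise of known precision beta.

```python
import numpy as np

def posterior(X, y, alpha=2.0, beta=25.0):
    """Exact posterior N(m_N, S_N) over the weights of a linear model."""
    # Posterior precision: S_N^-1 = alpha * I + beta * X^T X
    S_N_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
    S_N = np.linalg.inv(S_N_inv)
    # Posterior mean: m_N = beta * S_N X^T y
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N
```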

Distributed systems

From 2014 to 2017, I mainly worked on the global distribution of an international customer’s in-house digital asset management platform. State replication across multiple datacenters, low-latency write access to replicated state and write-availability during inter-datacenter network partitions were important requirements from the very beginning. We decided to follow an event sourcing approach for persistence and developed an event replication mechanism that preserves the causal ordering of events in event streams. For replicated state, we used a causal consistency model which is the strongest form of consistency that is still compatible with AP of CAP. The implementation was based on both generic and domain-specific operation-based CRDTs. I was responsible for the complete distributed systems architecture and the development of the generic platform components. These components evolved into the Eventuate open source project of which I’m the founder and lead developer. Eventuate has several production deployments.
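
To illustrate the operation-based CRDT idea, here is a minimal sketch of an OR-Set, a replicated set that relies on causal delivery of operations: a remove ships the unique tags it observed at its source replica, so it must not overtake the add operations that created those tags. This is an illustration of the general concept, not Eventuate's implementation.

```python
import uuid

class ORSet:
    def __init__(self):
        self.entries = {}  # element -> set of unique tags

    def prepare_add(self, element):
        # Each add is made unique with a fresh tag.
        return ("add", element, uuid.uuid4().hex)

    def prepare_remove(self, element):
        # Ship all tags currently observed for the element at this replica.
        return ("remove", element, frozenset(self.entries.get(element, set())))

    def effect(self, op):
        # Applied at every replica; commutes for concurrent operations
        # given causal delivery of the prepared operations.
        kind, element, payload = op
        if kind == "add":
            self.entries.setdefault(element, set()).add(payload)
        else:
            # Remove only the observed tags; tags from concurrent adds survive.
            remaining = self.entries.get(element, set()) - payload
            if remaining:
                self.entries[element] = remaining
            else:
                self.entries.pop(element, None)

    def value(self):
        return set(self.entries)
```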

We also used Eventuate to build distributed systems composed of microservices that collaborate via events. With Eventuate, services can rely on consuming events from local event logs in correct causal order and without duplicates. They can also rely on the write-availability of local event logs during network partitions and on the reliable delivery of written events to collaborators. Eventuate-based microservice systems are conceptually similar to those that can be built with Apache Kafka and Kafka Streams, but Eventuate additionally implements a causal consistency model for systems that are distributed across multiple datacenters.
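
Duplicate-free consumption can be achieved by tracking per-log processing progress, roughly as in the following sketch (hypothetical names and data layout, not Eventuate's API):

```python
class EventConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.progress = {}  # source log id -> highest processed sequence number

    def on_event(self, log_id, seq_nr, event):
        if seq_nr <= self.progress.get(log_id, 0):
            return  # duplicate delivery, already processed
        self.handler(event)
        self.progress[log_id] = seq_nr
```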

I had already worked on globally distributed systems in 2004, when I developed a distributed computing solution for an international pharmaceutical company. In their drug discovery pipeline, they used several computing services, deployed at different locations around the globe, to analyze chemical structures for their biological activity. The solution I developed integrated these computing services and enabled researchers to run them with a single mouse click from a chemical spreadsheet. It managed the reliable execution of compute jobs, persisted the results and delivered them back to the user. The solution ran in production for many years and was an integral part of the company's drug discovery pipeline.

Event sourcing

I have been using event sourcing in my projects since 2011. I started applying the approach during the development of an electronic health record for an international customer. Event sourcing proved to be the right choice in this project, given the demanding read and write throughput requirements and the flexibility needed to integrate with other healthcare IT systems. I later generalized that work in the Eventsourced open source project, which I developed in collaboration with Eligotech. Eventsourced adds persistence to stateful Akka actors by writing inbound messages to a journal and replaying them on actor restart. Eventsourced was used as the persistence solution in Eligotech products.
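
The core mechanism can be sketched in a few lines (a conceptual illustration in Python, not Eventsourced's Scala API): inbound messages are journaled before they are applied, and a restarted actor rebuilds its state by replaying the journal.

```python
class JournaledActor:
    def __init__(self, journal):
        self.journal = journal  # append-only list standing in for durable storage
        self.state = 0
        for message in journal:  # recovery: replay previously journaled messages
            self.apply(message)

    def receive(self, message):
        self.journal.append(message)  # persist before applying
        self.apply(message)

    def apply(self, message):
        if message == "increment":
            self.state += 1

# Usage: after a simulated restart, state is recovered from the journal.
journal = []
a = JournaledActor(journal)
a.receive("increment"); a.receive("increment")
b = JournaledActor(journal)  # restart: state rebuilt by replay
assert b.state == 2
```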

In 2013, Eventsourced attracted the interest of Lightbend (formerly Typesafe), and we decided to start a collaboration to build Akka Persistence, which is now the official successor of Eventsourced. I was responsible for the complete development of Akka Persistence, from initial idea to production-quality code. Akka Persistence has numerous production deployments today and is used as the persistence mechanism in the Lagom microservices framework. I also developed the Cassandra storage plugin for Akka Persistence, which is now the officially recommended plugin for using Akka Persistence in production.

In 2014, I started to further develop the ideas behind Akka Persistence in the Eventuate open source project. Among other features, Eventuate additionally supports the replication of persistent actors, up to global scale. The replication mechanism supports a causal consistency model, the strongest form of consistency that is still compatible with AP of CAP. The concepts of Eventuate are closely related to those of operation-based CRDTs, as further described in this blog post (see also section Distributed systems).
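
Causal ordering between events is commonly tracked with vector clocks. The following sketch shows the happened-before and concurrency checks at the heart of such a mechanism (an illustration of the concept, not Eventuate code):

```python
def happened_before(vc1, vc2):
    """True if the event stamped vc1 causally precedes the event stamped vc2."""
    keys = set(vc1) | set(vc2)
    return (all(vc1.get(k, 0) <= vc2.get(k, 0) for k in keys)
            and any(vc1.get(k, 0) < vc2.get(k, 0) for k in keys))

def concurrent(vc1, vc2):
    # Neither event causally precedes the other.
    return not happened_before(vc1, vc2) and not happened_before(vc2, vc1)

# An event at replica A and an independent event at replica B are concurrent;
# causal consistency only constrains the order of causally related events.
assert concurrent({"A": 1}, {"B": 1})
assert happened_before({"A": 1}, {"A": 1, "B": 1})
```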

System integration

In 2006, I started to work on a project at ICW in which we integrated the hospital information systems of several customers using IHE standards. The technical basis for these integration solutions was the Apache Camel integration framework, for which I developed integration components that implement the actor interfaces of several IHE profiles, as well as a DSL for processing HL7 messages and CDA documents (see also this article for an introduction). In 2009, these extensions were open sourced as the Open eHealth Integration Platform (IPF), of which I’m the founder and initial lead developer. IPF has many production deployments in international customer projects today and is still actively developed by ICW, the sponsor of the open source project. IPF is a central component of ICW’s eHealth Suite and provides connectivity to a wide range of healthcare information systems. Its standards compliance has been certified in several IHE Connectathons. During my work on IPF, I also became an Apache Camel committer.

To meet increasing scalability requirements in some IPF projects, I started to investigate alternatives to Apache Camel’s routing engine. I decided to use Akka actors for message routing and processing, which proved to be a better basis for scaling IPF applications under load. The result of these efforts was the akka-camel module, which I contributed to Akka in 2011. It implements a generic integration layer between Akka actors and Apache Camel components, including the IHE components of IPF.

I also developed other routing engine alternatives that follow a purely functional programming approach. A first attempt was scalaz-camel, which is now superseded by the Streamz project, which I actively developed with other contributors over many years. Streamz allows application developers to integrate Apache Camel components into FS2 applications with a high-level integration DSL, and it also supports that DSL on top of Akka Streams. Streamz is now the official replacement for akka-camel and part of the Alpakka ecosystem.