R is a powerfull tool for data scientists but a huge pain for software developers and system administrators.
It’s hard to automate and integrate into other systems. That’s because it was developed mostly in academic environments for adhoc analytics instead of IT departments for business processes.
As every meal gets better if you gratinate it with cheese, every command line tool gets better if you wrap it in a PowerShell Commandlet.


You can do a lot with the R command line tools Rscript.exe, but the developer of R seem to have a strange understanding of concepts like standard streams and return codes. So i created the PowerShell module PSRTools, that provides Commandlets for basic tasks needed for CI/CD pipelines.

If you have build dependecies, you have to install a R package from a repository. You can specify a snapshot to get reproducable builds. An example is:

Install-RPackage -Name 'devtools' -Repository 'https://mran.microsoft.com/' -Snapshot '2019-02-01'

In case you have the package on the disk you can install it with the Path parameter.
In both cases you can specify the library path, to install it on a specific location.

Install-RPackage -Path '.\devtools.tar.gz' -Library 'C:\temp\build\library'

If you have inline documentation in your R functions, you have to generate Rd files for the package. There is the Commandlet New-RDocumentation for that.
The build can be executed using New-RPackage. Both Commandlets have the parameters Path for the project location and Library for required R packages.

Build Script

The build script depends a little on your solution. It could be something like:

param (

$Repository = 'https://mran.microsoft.com/'
$Snapshot = '2019-02-01'
$Library = New-Item `
    -ItemType Directory `
    -Path ( Join-Path (
    ) (

Import-Module PSRTools -Scope CurrentUser
Set-RScriptPath 'C:\Program Files\Microsoft\R Open\R-3.5.2\bin\x64\Rscript.exe'

Install-RPackage -Name 'devtools' -Repository $Repository -Snapshot $Snapshot -Library $Library
Install-RPackage -Name 'roxygen2' -Repository $Repository -Snapshot $Snapshot -Library $Library

New-RDocumentation -Path $Project -Library $Library
$Package = New-RPackage -Path $Project -Library $Library
Copy-Item -Path $Package -Destination $StagingDirectory

For build scripts i recommend InvokeBuild, but wanted a simple example.

Continuous Integration

To achive automated builds, you have to integrate it in a build script.
I tested it successfully in Appveyor and Azure Pipelines. Both work pretty much the same.

If you don’t have your own build agent, that can be prepared manually, you need to install R and RTools in your build script.
That can be done using Chocolatey.

In your appveyor.yml it is like:

  - choco install microsoft-r-open
  - choco install rtools
  - ps: Build.ps1 -Project $env:APPVEYOR_BUILD_FOLDER -StagingDirectory "$env:APPVEYOR_BUILD_FOLDER\bin"

In you azure-pipelines.yml it is like:

- script: |
    choco install microsoft-r-open
    choco install rtools
  displayName: Install build dependencies
- task: PowerShell@2
    targetType: 'inline'
    script: |
      Build.ps1 -Project '$(Build.SourcesDirectory)' -StagingDirectory '$(Build.ArtifactStagingDirectory)'

Not that terrible anymore, right?

As software engineers in fast-paced projects we risk getting overwhelmed by huge workloads and deadlines. Usually, the work is fun and that is why we can cope with it over longer periods, but we need moments to breath and think about other things. I do not want to talk about general work-life balance here. I want to talk about the work-portion here. We need to make breaks or follow a slight activity (like take a short walk in the nearest park). Another thing we experienced to be refreshing are retrospectives! Obviously their objective is to improve processes to increase efficiency and discover impediments (check the SCRUM GUIDE, p. 14), but they are also fun during times with high workloads. Therefore, if you feel like skipping them to have more time that is “productive” – resist.

Shout out

Following are some ideas on how to make your team’s retrospective a bit more exciting. Be careful that they fit your team. Do not force your colleagues to do what they do not want, especially as the following ideas might be a bit too cheesy.
A shout out to various sites in the web where we took ideas and inspiration

In addition, an extra shout out to my colleague Harald Wittmann for coming up with some great ideas.
Without further ado, here are ideas for retrospectives to inspire your team.

Visual and playful

Super hero

Everybody likes superheroes, so do we. It does not matter if you are a fan of the MCU, DC or if you feel indifferent about Superman, Thor & friends; in this retrospective, the hero is just a vessel. They help the team to jump out of character and enable you to see your processes from a different angle.

Superhero retreat: Visualize your main base here. In our case, it is called "the engine tower", because there is "engine" in our project name. In your case, this should represent your project. It is the place where you feel comfortable, a safe space. Here you should collect what you did great, what your strength are and what you can rely on.
Sidekick: Every hero has someone to help him in dangerous times. In the retrospective, this is where you can call for help: write down where you need support. An example could be, "we need more time for research on a topic before we start implementing".
Gadgets: Image a belt with all your tools for converting ideas into software. What tools/skills can you improve or add to your belt?
Villain: This one is easy – what are obstacles in your way to achieve your objectives? Write down what could interrupt and/or destroy your goals.

The three little pigs

Play your team through the fable of the three little pigs. This is visual and just a tiny bit childish. This enables you to jump out of role (again). A piglet in a fairy tale can say completely different things than a senior software engineer. In addition, this retrospective is understandable easily.

The three huts (straw/sticks/brick) visualize things that can slip/rip easy, things that work, but can be improved and things that the team can rely on. Finally, we have the Big Bad Wolf. Like the villain previously, he resembles risks, obstacles and impediments.

The boat

As you might realize by now, we are a visual team. For us, the biggest benefit is that you view things from a different point and say things you would not have thought of or talked about.

The island is your goal. It is the ultimate objective. Be careful, we are talking process goals not software features or releases.
The wind is what gets you there. Collect what makes you efficient and helps you to continue and focus your efforts.
In contrary, post things that delay you near the anchor. Bigger risks (i.e. not just slowing, but also stopping your work) would find their place near the rocks.

The box

The box follows the same approach as the previous ideas, but in a more basic and easy understandable way. Draw a box. It shows your current environment/toolbox. Collect what is in there (keep/recycle), what should be in the box, but is not yet and what you want to kick out.

If you want to, you could try this retro with a physical box.

The traditional

While the visual ideas are great, they require some set up and preparation. That is why we mix in more traditional retrospectives, too.

Liked learned lacked

Like the following ones, this follows the typical retrospective pattern, where you collect what was good, bad and what can be improved.

Liked: What went well, what was new that improved your process?
Learned: When some was harder than expected or took longer than estimated, try to derive what you learned from that to not repeat the mistake.
Lacked: What are things that slowed your progress? What are things you want to get better in? An example could be "degree of automation (around product tooling)".

Sad mad glad

The same approach as the liked – learned – lacked.

What made you sad last sprint (what kept your efforts below expectations) or even mad (things that made parts of the sprint fail). Do not forget to collect what went good (glad).

Start Stop Continue

A mix of the simpler retrospectives and "the box". Sadly, it does not rhyme like the other ones.

Start: Things you could improve, e.g. plan in more time for the presentation of concepts/mock ups to validate them with the business colleagues.
Stop: Things you should stop in the next sprint, e.g. let requests grow too big.
Continue: Things that went well and you should not forget about, e.g. pick a topic and discuss it in a two-man session while you walk through the nearby park.

I Wish / I Like / If everything was possible

This focusses on “I wish” (what could have been better, what is the team/process lacking) and “I like” (what did we do well, what learnings occurred during the sprint) at first.

Then we want to get an element we had in previous ideas (e.g. what would be your superpower…): "if everything was possible…." This enables us to dream and speak openly. While these things are utopic, you can try to make a step in their direction. An example could be, "If everything was possible, I would only write clean code". This could show I had troubles with this in the past. Either I did not have enough time to refactor a section, lack knowledge about clean code or something along those lines. Now we can derive actions from; like look for a book or a training, make a session with typical code smells, estimate more time for refactoring, plan in more code reviews…
In the usual work environment, I would not just say, "Hey folks, I write dirty code" (well maybe…).

Other Ideas

I hope that you found some inspiration reading this. To top it off, here are two additional tips:

  • Before the retrospective, go back to your backlog and write down what you achieved during the last sprint. Then at the start of the meeting, just listen the topics and challenges you overcome. This helps the team to think back and remember things they want to bring up in the retrospective. For us a moral boost helps with such a highly creative artifacts as a scrum retrospective.
  • Try to visualize your team structure somehow. This could be a football field or the ship we saw previously. What position would everybody take? (Think about captain, on board scrubbing the deck…)

Do you have ideas for retrospectives? Leave a comment or send me an email!

In data analytics, there are a lot of nice and shiny buzzwords, products and concepts. Before you decide anything, you should be clear about your actual and future needs. Your analytics infrastructure should enable you to analyze data. But there are aspects that architectures support differently and you have to trade off. There is no free lunch.

Here are some explanations that should help you to orientate what you need and enable you to compare different approaches.

Data Location

It’s important where the data is located. You can leave the data on the source system and read it per analysis. Or you can copy the data to an analytics system.

Pros of leave and read are:

  • You will save space.
  • You will have always the most recent data.

Pros of copy are:

  • You may get fast analyses, since you can optimize the storage for analytics.
  • You maintain fast operations, since you don’t read data that the source system wants to use at the same time.
  • You get consistent results, since you control the updates.
  • You can implement historization and don’t lose information.

Usually the data is copied, but if the volume is big enough or you don’t have the resources you may want to leave it on the source.

Data Structure

How or if your data is structured has a significant impact on your analytics infrastructure. There are structured data, semi structured data and unstructured data. Structured data are for example relational databases. Semi structured data contains information how to separate values and identify structure. Examples for tabular semi structured data are simple Excel sheets or CSV files. Examples for hierarchical are XML or JSON that are often used by web services and APIs. Examples for unstructured data are images, PDF or plain text.

Analytics is all about reduction of information to be consumable by humans or processes that humans create and understand. Reduction requires structured data. Semi structured data can be transformed into structured data. Unstructured data may be transformable into structured data, but not always, not that easy or not error free. Avoid Excel, PDF or Text as data source, whenever possible.

Data Transformation

Data from different sources may be hard to combine, since there is no common identifier or different formats. There are again two approaches: "Schema on read" and "Schema on write". "Schema on read" means, you leave the data in its raw form and transform it when you analyze it. "Schema on write" means, if you write the data you transform it into a common format. You may change the format, the data types, normalize it and deduplicate it.

Pros of "Schema on read" are:

  • You may save effort on integrating new data.
  • The raw data may contain more information than the integrated data.

Pros of "Schema on write" are:

  • It’s less effort for analyses using integrated data in development and computation terms.
  • Investment in quality pays out more, due to the reuse of transformed data.
  • Preaggregated data will speed up analyses.

Data Volume

It’s hard to say what data volume is big, but anyway it should have a big influence on your individual solution. In analytics systems, performance is often provided by redundancy. Some results that are used several times or with reduced latency are precalculated and stored.
So one piece of information is stored many times in different ways. That’s performance efficient but not storage efficient.
Obviously that may be a problem with a lot of data.

Usually one can say that data that is automatically generated, e.g. from sensors or logging functions may come in high volumes. Manually generated data like orders in your ERP system or master data usually not.

Data Velocity

Data velocity means how much time elapses from data generation to analysis. High velocity may come along with other restrictions or increased effort.

Most common are scenarios with updates on a daily-basis. For regulatory supervision it may be enough to update your data once a quarter.

A often misused term in this topic is real-time. If your system is real-time capable, it means that you guarantee a result in a specified time. That’s important for example in embedded systems in automotive or industrial environments. In business context real-time is used as best effort latency and only in special cases necessary. Imagine a manager that makes decisions, based on reports that are updated and changed every 5 minutes. That holds the risk to react to random events instead of pursuit a strategy. That may be different in a cloud-based application scenario, where you want to scale-up or scale-down the system based on the usage. Or think about a process that changes the prices in a ecommerce scenario based on the recent sales.


You see if someone tries to sell you something without listening to your requirements, there is a good chance to end up with something that does not deliver what you need or may be accomplished with less effort.

As IT consultants, we try to solve problems on a daily basis. This is our normal workload, our daily business. But this is not our only duty, we need to keep up with the technical evolution, we need to learn continuously to satisfy our customers. This is why we read about new things in blogs, visit meetups in our free time and go to conferences (like the one this post is inspired by "down to earth architecture" by Uwe Friedrichsen @SAS 2019 in Munich). We are influenced by all these channels and need to be careful how to use the knowledge in our working environments or we end up with one of these stereotypical types of bad software architecture:

Stackoverflow architecture (or google-driven architecture)

We have a problem to overcome in our software system and are not familiar with the topic. Therefore, we search the internet for books, blogs or tutorials. We find a slightly related solved problem on Stackoverflow and copy the solution without much thinking.
We are not talking about copy-pasting code here, but rather abstract solutions like "where should ids be generated in CQRS." We do not want to downplay the absolute knowledge found on Stackoverflow, but we should be sure the solution we found actually fits the problem and/or adapt accordingly.

Conference-driven architecture

Whatever conferences you visit, you always feel attached to your track or topic. These could be things like micro-services, domain-driven design or EventSourcing. While these are very good solutions to their respective problems, they might solve problems you aren’t even encountering in your domain or there are other good solutions.
Additionally, most of the time, we are not starting an application from scratch. If we visit conferences regularly and always incorporate the hot topics, we end up with a mess after some time.

Hype-driven architecture

Similar to the conference-driven architecture, we find the hype-driven architecture. Every (new) application needs to be distributed into micro services. Of course, that’s not true. There are huge benefits in following a micro-service (or SCS) approach, but there are also challenges, constrains and problems! Learning and especially applying a framework is often useful. However, you should not force a framework onto your system if there is no need for it! Most of the time, learning how to solve your domain’s problems (e.g. how to handle consistency in distributed systems or mastering personal data and GDPR) is more beneficial than being a master of a framework.

Strategic architecture (aka PowerPoint architecture)

Usually, when you join a project there are some PowerPoint slides describing the architecture of the system or application. You go through these, but your colleagues advise against doing so: "these are for compliance" or "we made this for the latest steering committee". When the slides diverge too much from the actual structure or code, misunderstandings are about to happen! While there are reasons to display different aspects of your software to different stakeholders, try to minimize this.

Tunnel-vision architecture

As a software engineer or architect, you need to work on some topics in excessive detail. We need to build walls around us and analyze problems in-depth! Occasionally, we need to look around, too. With more experience, we learn to balance the extremes. Especially for younger developers, there is a risk of over-engineering one detail or creating problems on other ends of the system.

Blast-from-the-past architecture

Technology advances, business models evolve and the underlying software architecture needs to do this, too. There are challenges that a lot of software components face; an example is versioning of web APIs. A versioning concept of system-to-system APIs with /v1/, /v2/, /v3/ might work for applications that had a release once a month and a breaking change once a year, but probably won’t work for a fast paced API in an API economy where time-to-marked is a driving factor.

Big design up front

In a world with perfect information, where all user needs and every aspect of your system are clear, Big Design Up Front (BDUF) could work. BDUF is closely related to the waterfall approach of developing software. This clashes with the agile world. Similar to communism and capitalism, BDUF and agile development are two paradigms where neither is inherently bad or good – it’s just that one is more practical in real life. Especially in a fast-moving world where innovation is key, agile development won the battle and there is no place for BDUF architecture.

One-size-fits-it-all architecture

Develop your application as a polyglot, domain-driven micro-service architecture with CQRS and EventSourcing. Use Kubernetes as container orchestrator with Helm for deployment, Prometheus and Grafana for monitoring and GIT as source control system. Frontend is Angular, machine learning is done in python and we use Mongo and Cassandra for persistency. Caching is done through redis and the whole application needs to be cloud agnostic and conform to all cloud native principles. While this is a noble approach and a turn-on for software engineers, this might not suite our business needs in any way. We could solve many problems with this technology selection, but we are likely over-engineering and not optimizing our efforts.

Accidental architecture

Remember the cone of uncertainty? When you start developing a product, about everything is blurry. You don’t know the user-needs; you don’t know the scale of your application and so on. At this stage, you might not be able to find solutions to some problems as you cannot answer essential questions. At this point, you need to act accordingly! Work with interfaces, adapters and libraries that can be switched later easily or don’t put too much effort in some components as you will either replace them later or implement a more sophisticated version anyways.
Don’t just "do it" or you will end up with a mess of decisions that nobody wanted to make. Another way accidental architecture happens is the development team is either unaware of or under-experienced to identify key-issues.

How do we make sure not to end up with one of this? I’ll look for a more detailed answer in another article, but it boils down to this: we should ask why we need architecture initially.
We have requirements, constrains, problems etc. We figure out solutions (for example with an approach like "orient – explore – evaluate – support" from Uwe Friedrichsen). When we follow this path, we protect our systems from the types of bad architecture above. As unlikely as it seems, if we end up with an architecture that is similar to the ones above, it’s fine. We engineered it with the right intentions. Additionally, learn when to not use certain solutions and follow Uwe Friedrichsen’s advices:

  • Think holistically
  • Resist hyper-specialization
  • Get a T-shape profile
  • Leave your comfort zone once in a while
  • Understand your domain
  • Don’t fall for hypes
  • Cope with technology explosion
  • Master the foundation design
  • Don’t overact

I recently had the task to automate a program, with a COM interface, and integrate it in a database application.
I already used PowerShell to automate Docker, SqlPackage and others.
So my first thought was to use PowerShell in this case too, but due to the complexity of the task i decided against it.
I ended up with a C# solution with round about 300 classes and i’m happy with it, but it brought me to the question what good criteria are to choose between PowerShell and something else.

Basically PowerShell is a nice hammer, but not every problem is a nail.
And since almost every programming language is turing complete, you can every problem in every language, but they have their pros and cons.
This is especially true for .NET-based languages like C#, F# and PowerShell, since they share the same libraries.
So you can develop a nice graphical user interface using Windows Presentation Foundation in PowerShell, even though is was originally designed for C# applications.

There are many problems out there, that you can solve in PowerShell with less code than in other languages, which makes it faster to write and easier to maintain.
But now i will show you some cases when PowerShell is a little painful and other languages are a better choice.

Inhertiance and Polymorphism

PowerShell is object-oriented and where object-orientation is, there is polymorphism not far. So you define interfaces and maybe several implementations for it.
But since PowerShell is a interpreted language there is no type or interface checking before runtime.
In default, everything is of type Object and you see if a method is available, when you execute the code.
You can assert types, but not have to.

There are different ways to get new objects in PowerShell.
Often they are created in Commandlets that are written in C#, like Get-Process or New-Item.
Another common option is to create a custom object using New-Object -Type PsCustomObject -Property @{ 'Foo' = 'Bar' }.
That creates a generic object that can be extended by any property or method in runtime.
Another option is to create the object in PowerShell but write the class definition in C#.
You can do that from existing .NET libraries or even in runtime.
Store the C# code in a variable and add the classes with Add-Type.
That were the options for PowerShell version 4, but in version 5 classes were introduced.

All that methods have their reason.
Let me explain that using some questions:

  • Why would you want to create a PowerShell class, if you can use a PsCustomObject?
  • Why would you want to create a PsCustomObject, if you can use a Hashmap?

Commandlets are the default if you want to use existing PowerShell modules.
Hashmaps are the default if you need custom attributes in an object.
But if you want to pass data to existing Commandlets, for example to write it CSV files, then the easiest way is to use a PsCustomObject.
If you write your own functions that require parameters with specific properties and methods, then its better to define a class, that can be easily validated.
The next level of complexity is, if you write a function that has parameters that may be of the one or of the other type with the same interface.
Then you start thinking about abstract methods, reuse of code between these classes.
Here C# supports more expressions to simplify the code and compilation time validation improves the quality.

So if you start to create inherited classes in PowerShell you probably gone to far.
Maybe it’s better to create a PowerShell Module in C# or a .DLL in C# and include it in your Powershell code.

Concurrency and Parallel Computing

Since PowerShell supports using the System.Threading library of .NET, you can do multicore computation in PowerShell.
In some cases this is not even a bad idea.
A common case where PowerShell is used is automation and integration of other tools.
For example run a compiler, call a web service and so on. These tools may produce output that you might want to process while the tool is still working.
Sometimes you have to do that since otherwise you would get a overflow of the output buffer and you don’t get the entire output.
In that case you can define PowerShell block as variable and register it as a event handler.
But there are other cases where you have more parallel processes that need to be synchronized somehow and that may communicate in between. Then C# or F# has better expressions to manage asychronous calls.

Software development is hard. Sure, there are things that can make your live easier (e.g. Containers or ubiquitous language), but sadly there is "No Silver Bullet" as Frederick P. Brooks Jr. concludes in this 16-pager. With our advance in technology, development becomes easier and faster. But some things may not bring the redemption we hoped they were (like "automatic" programming code or even OOP and somewhat newer AI).
One of the more promising members of the redemption-club is the "Great Designer" (p.15) of the software system. They build software "faster, smaller, simpler, cleaner […] with less effort". Today, we call someone with the skillset described by Brooks a "software architect".

In 2019, I went to a great Summit in Munich where Trisha Gee (@trisha_gee) held the key note about required skillset for a software architect. I want to share her insights mixed with my views here:

Master of communication

The software architect is a master of communication. Obviously, this is not limited to verbal communication, but also includes writing skills. Writing does not stop with good programming and documentation skills. Things that matter are e-mails, slack and twitter! Asking questions like "what are we building?" and "what skills does the team have?" are as important as listening to the answers and translating it into software.

"Your code does not speak to the machine. It speaks to the next one who reads it!"

Talk to different people. Talk to developers, domain experts and users. Try to get a feel for their problems, challenges and constrains within their domain.

Adaptability & open minded-ness

Be openminded! There are a thousand views on a simple topic. Users und domain experts might change their minds rapidly, technology and processes change. It is your job to order things, estimate the impact and derive actions.

It’s not the year of K8s!

No, Kubernetes, AI and agile development are not the magic solution to every problem. Always learn what’s needed.

Prioritization & time management

We all work in projects. There is always too much work for too few people – deal with it. Allocate time for yourself. Make a plan for your work, for time at home and absolutely free time. Mental health is an essential part of a "Great Designer". As an architect, your time is limited and valuable. You cannot learn everything, but try to keep up.

Stay technical

Most of the things up until this point are non-technical. But be careful; do not underestimate the "Business Analyst Movement". Trisha points out that especially women are pushed into non-technical and softer roles too often. Don’t become a PM, stay an architect.

Scale out

At some point in the history of software engineering we got to the point where we understood, scaling out may be better than scaling up [Admiral Grace Hopper]. The same applies for great engineers. Instead of just getting better, help others to get better.

If you want to be 10 times more productive, teach 9 people your skillset.

Use "pair programming" more often, but do not stop at development. Do it for deployments and troubleshooting with a DevOps engineer and for domain building with a business analyst. Code reviews and walkthroughs "are not for finding bugs only – they are about sharing information and writing the best system you can". If your company supports it, Trisha recommends 20% time. Another idea to share are book clubs where five people read a book – one or two chapters each and tell the others about key information in their part. This way everybody can get a little knowledge and decide if it’s worth reading the whole thing.

"Nobody knows how good you are! Teaching makes you look good."

There are different ways of teaching and being taught. You can teach in internal, informal (or less formal) sessions during lunch time called

Visit user groups and speak on conferences. As usual, there are pros and cons for each type. Decide what’s the best for you.

If you don’t like sharing with foreigners, share with your colleagues. This way you avoid too narrow specialization and knowledge-silos.

Retention and recruitment

Being a good architect means finding new projects and interesting topics in your environment. That is the easy part. Also, watch out for new colleagues and keep your team(s) happy! Be a good role model, be a paragon for great designers.

Community support

We love stack overflow! We visit conferences and we gather at meetups. You cannot explore every technology yourself – especially not as an emerging architect. You need to consume what the community can provide, but you also need to give back. You can talk about your personal challenges when your first big project failed or you commit to an open source project: maybe there are easy enhancements for your favorite JavaScript library or you build a python wrapper for a public REST API.
Do you like Goldman Sachs? Probably. But aren’t they an evil banking company? Probably. Nevertheless, their developers are vivid contributors to Java libraries. They published their enhanced version of JavaCollections (called GS Collections) and influenced a lot of things like the Java Streaming API.
The same things goes for Microsoft. They open source their .NET Core platform as part of the .NET foundation and publish their code to the best IDE ever created on GitHub.

As data-driven and AI-first applications are on the advance, we extend our best practices for DevOps and agile development with new concepts and tools. The corresponding buzz words would be continuous intelligence and continuous delivery for machine learning (CD4ML).

For our current project, we researched, tried different approaches and build a proof of concept for a continuously improved machine learning model. That’s why I got interested in this topic and went to a meet up at Thoughtwork’s office. Christoph Windheuser (Global Head of Artificial Intelligence) shared their experience in this field and gave a lot of insights. The following post summarizes these thoughts [1] with some notes from our learning process.

CD4ML continuous intelligence cycle

The continuous intelligence cycle

1- Acquire data

Get your hands on data sets. There are multiple ways, most likely the data is bought, collected or generated.

2- Store, clean, curate, featurize information

Use statistical and explorative data analysis. Clean and connect your data. At the end, it needs to be consumable information.

3- Explore models and gain insights

You are going to create mathematical models. Explore them, try to understand them and gain insights in your domain. These models will forecast events, predict values and discover patterns.

4- Productionize your decision-making

Bring your models and machine learning services into production. Apply your insights and test your hypothesizes.

5- Derive real life actions and execute upon

Take actions on your gained knowledge. Follow up with your business and gain value. This generates new (feedback) data. With this data and knowledge, you follow up with step one of the intelligence cycle.

Productionizing machine learning is hard

There are multiple experts collaborating in this process circle. We have data hunters, data scientists, data engineers, software engineers, (Dev)Ops specialists, QA engineers, business domain experts, data analysts, software and enterprise architects… For software components, we mastered these challenges with CI/CD pipelines, iterative and incremental development approaches and tools like GIT and Docker (orchestrators). However, in continuous delivery for machine learning we need to overcome additional issues:

  • When we have changing components in software development, we talk about source code and configuration. In machine learning and AI products, we have huge data sets and multiple types and permutations of parameters and hyperparameters. GitHub for example denies git pushes with files bigger than 100mb. Additionally, copying data sets around to build/training agents is more consuming than copying some .json or .yml files.
  • A very long and distributed value chain may result in a "throw over the fence" attitude.
  • Depending on your current and past history, you might need to think more about parallelism in building, testing and deploying. You might need to train different models (e.g. a random forest and an ANN) in parallel, wait for both to finish, compare their test results and only select the better performing.
  • Like software components, models must be monitored and improved.

The software engineer’s approach

In software development, the answer to this are pipelines with build-steps and automated tests, deployments, continuous monitoring and feedback control. For CD4ML the cycle looks like this [1]:

CD4ML Pipelines

There is a profusely growing demand on the market for tools to implement this process. While there are plenty of tools, here are examples of well-fitting tool chains.

stack discoverable and accessible data version control artifact repositories cd orchestration (to combine pipelines)
Microsoft Azure Azure Blobstorage / Azure Data lake Storage (ADLS) Azure DevOps Repos & ADLS Azure DevOps Pipelines
open source with google cloud platform [1] Google cloud storage Git & DVC GoCD
stack infrastructure (for multiple environments and experiments) model performance assessment monitoring and observability
Microsoft Azure Azure Kubernetes Service (AKS) Azure machine learning services / ml flow Azure Monitor / EPG *
open source with google cloud platform [1] GCP / Docker ml flow EFK *

* Aside from general infrastructure (cluster) and application monitoring, you want to:

  • Keep track of experiments and hypothesises.
  • Remember what algorithms and code version was used.
  • Measure duration of experiments and learning speed of your models.
  • Store parameters and hyperparameters.

The solutions used for this are the same as for other systems:

search engine log collector visual layer
EFK stack elasticsearch fluentd kibana
EPG stack elasticsearch prometheus grafana
ELK stack elasticsearch logstash kibana

[1]: C.Windheuser, Thoughtworks, Slideshare: https://www.slideshare.net/ChristophWindheuser/cd4ml-thoughtworks-meetup-munich-christoph-windheuser-may-8th-2019


Most people want to learn new things. It could be a new skill, a new hobby or simply to broaden your general knowledge horizons. We have this desire to learn and grow, but yet we struggle to find the discipline to achieve our learning goals. I’m sure all of us can attest to learning intentions – be it from a new year’s resolution or some other source of inspiration – that died a silent death along the road side.

So, why can’t we achieve our learning goals? A complex question indeed, but a part of the problem is that we have to actively do something in order to get where we want to be. For example, if you want to learn a new language, then you have to open a book and read; or log onto a website and complete the lessons and tests; or go to evening classes at your local school or college. And there are more than enough challenges in our lives that prevent us from doing this diligently! But what if we could still learn without actually doing something actively, just by engaging in our normal daily routine?

Active vs. Passive Learning

As already mentioned, Active Learning means that you have to initiate an action by yourself to achieve a desired learning objective. It is a decision and a discipline that you have to set into motion by yourself.

Passive learning, on the other hand, means that you learn without initiating something by yourself. To explain this in more detail, let’s continue with the example of learning a new language. Learning a new language actively means that you have to read a book or go to a class. Now, let’s try to find an example of how you could learn a new language passively.

Let’s say that it takes you an hour to drive to work every day. In this time, you could listen to audio tapes or CDs that help you to learn a new language. You are going to drive to work in any ways, so why not use this time to learn something new? This is a good initial example to dive into the idea of passive learning, but it still has some shortcomings: you have to make the decision to switch the language CD on instead of listening to your favorite music or the radio (even though you are not doing anything actively once it is switched on); audio alone is not necessarily enough to learn a new language, since you might also want to look at grammar structures and alphabet of the new language. But at least we have made some progress. We don’t have to open a book or go to a class anymore. In the next section, we look at how we can use technology to further expand the idea of passive learning.

Technology and Passive Learning

The digital era is upon us. Technology is pervasive throughout society. As a result, we also consume large amounts of information electronically. We surf the web to inform ourselves about topics that interest us; we read the news online; and we use a variety messaging systems and social media – to name just a few! These are things that we are going to do every day as part of our routine. So, can we build in a Passive Learning experience while going about our daily routine? The answer is yes, and in the next section we illustrate how this can be achieved by means of a practical example.

Technology and Passive Learning: A Practical Example

In this section we will look at a practical example, again in the context of learning a language. Vocabulary is an important building block in the language learning process. Within a learning context, it is important for us to map words from one language to another so that we can learn the vocabulary of the new language. Flash cards often get used to achieve this goal. The idea is simple: you have a word on one side of a card and you flip the card to see the meaning of this word in another language.

Flash card software (flipping the card with a mouse click) has also been around for a long time. The problem is that this still requires the Learner to be motivated and do something actively. So, we need to find a way in which the Learner can get exposed to the new vocabulary in a passive way.

As mentioned in the previous section, we consume large amounts of information electronically these days. Let’s say that we consume online information in English and we want to learn German. Our proposal is to now develop a web browser plugin that will replace some of the English words on websites with German words. Selecting the correct amount of words to replace is important, since it should still be easy for the reader to understand the text without too much effort. As a starting point, our suggestion is to replace only 10% of the nouns. The image below has some sample text that shows the difference between the original website and the transformed website. You should still be able to understand the content of the transformed website without too much additional effort. Try it out for yourself!

From a programmatic point of view, it is not difficult to tokenize and extract the nouns in a piece of text. Most programming languages have either built-in capability or 3rd party Libraries that does just that. Below is a JavaScript code-snippet (using the pos-tag lib) illustrating this concept.

fs.readFile('input.html', 'utf8', (err, data) => {  
    if (err) throw err; 
    const result = pos(data);
    //extract all the nouns – pos stands for ‘part of speech’
    const nouns = result.filter(item => item.pos === 'NN');        
    nouns.forEach((item) => {
        //get the translation of the extracted nouns    
        var trResult = getTranslation(item.word, 'en', 'de');
        data = data.replace(item.word, '<strong>' + trResult.translation + '</strong>');


Completing the circle: Reintroducing Active Learning into the Passive Learning Experience

So far, we have been making good progress in creating a passive language learning experience. But we can go even further!

The idea is to reintroduce a form of active learning back into our current model. The words that we replaced in our original source text will be created as hyperlinks; when the Learner clicks on one of these words, we will provide more information about the word.
In our case, we will link to a WordNet browser. Wordnet is kind of like an intelligent electronic dictionary that, amongst others, provides synonyms and word-meanings in context. The image below is an example of a popup WordNet browser that would display once the Learner clicks on one of the hyperlinked words in the source text.

The active learning that takes place here is different from the active learning as described earlier. In this case, the Learner would click on the hyperlink out of curiosity and consequently also learn something. It differs from the scenario described earlier, in the sense that the Learner does not have to find some kind of internal motivation to set the learning process into motion. The learning happens as a result of curiosity that was generated by the embedded Passive Learning experience.


Passive Learning ideas can be embedded into technology that we are using on a daily basis. We illustrated how passive learning can be used in the context of language learning as part of our daily web browsing experience. We also showed how Active Learning could be reintroduced into the learning process as a result of the Passive Learning context in which the Learner is operating. This example only scratches the surface of what is really possible when combing passive learning and technology. Some questions – to name but a few – that come to mind for possible future work in this area, are the following:

  • Can the idea be introduced into messaging platforms such as Skype, Slack and WhatsApp? These messaging technologies are pervasive and get used by millions of people on a daily basis.
  • We should also be able to expand the idea so that it applies to a variety of language pairs. Also, we only looked at replacing a certain percentage of nouns in the text, but we can also include adjectives, adverbs and verbs, and make it configurable to suit the Learner’s needs. The image below illustrates how such a configurable setup could look like.
  • And finally, what about other learning domains? Can we make adjustments so that the Passive Learning experience is also possible in other domains such as Math, Engineering, Biology and Social Sciences?

I went to a great session about CQRS, Event Sourcing and domain-driven Design (DDD) on the Software Architecture Summit. The speaker Golo Roden (@goloroden) did a fantastic job selling these concepts to his audience with a great storytelling approach. He explained why CQRS, Event Sourcing and DDD fit together perfectly while replicating the www.nevercompletedgame.com for his daughter. This is what he shared with us.

Domain-driven Design

The more enterprise-y your customer the weirder the neologisms get.

We – as software engineers – struggle to understand business and domain experts. Once we understand something, we try to map it to technical concepts. Understood the word "ferret"? Guess we need a database table called "ferret" somehow. We then proceed to inform our business colleagues, "deploying a new schema is easy as we use Entity Framework or Hibernate as OR-mapper". He thinks we understood, we think he understood. Actually, nobody understood anything.
As software engineers we tend to fit every trivial and every complex problem into CRUD-operations. Why? Because its "easy" and everyone does it. If it was that easy, software development would be effortless. Rather trying to fit problems in a crud pattern, we should transform business stories into software.
That’s why we should use domain-driven design and ubiquitous language.
Golo Roden proceeds to create a view on the nevercompletedgame with ubiquitous language. So nobody asks, "what does open a game mean" and there is no mental mapping.
I won’t go into detail here, but an example can show why we need this.

  • Many words have one meaning: When developing a software for a group of people, sometimes we call them users, sometimes end-users, sometimes customers etc. If we use different words in the code or documentation and developers join a project later – they might think there is a difference between these entities.
  • One word has many meanings: Every insurance software has "policies" somewhere in its system. Sometimes it describes a template for a group of coverages, sometimes it’s a contract underwritten by an insurance, sometimes a set of government rules. You don’t need to be an expert to guess this can go wrong horribly.


Asking a question should not change the answer

Golo Roden jokes, "CQRS is CQS on application level", but actually it’s easy to understand this way, once you read a single article about CQS. Basically, it’s a pattern where you separate commands (writes) and queries (reads): CQS.

  • Writes do not return any values and change the state of an object.
    stack.push(23); // pushes value 23 onto the stack; returns nothing
  • Reads return a value and don’t change the state.
    stack.isEmpty() // does not change state; returns a isEmpty boolean
  • But don’t be fooled! Stacks are not following the CQS pattern.
    stack.pop() // returns a value and changes state

Separating them on application level means: exposing different APIs for reading (return a value; do not change state) and writing (change state; do not return value *). Or phrased differently: Segregate responsibilities for commands and queries: CQRS.

* for http: always returns 200 before doing anything

Enforcing CQRS could have this effect on your application:

For synchronizing patterns see patterns like the saga pattern or 2 phase commit . For more reference see: Starbucks Does Not Use Two-Phase Commit

Event Sourcing

When talking about databases (be it relational or NoSQL) often we save the current state of some business item persistently. When we are ambitious we save a history of these states. Event sourcing follows a different approach. There is only one initial state, change requests to this state (commands) and following manipulating operations (events). When we want to change the state of an object, we set up a command. This triggers an event (that’s published so some kind of queue) and most likely is persistent in a database.

Bank account example: we start with 0 € and do not change this initial value when we add or withdraw money. We save the events something like this:

Date EventId Amount Message
2019-01-07 e5f9e618-39ad-4979-99a7-342cb1962266 0 account created
2019-01-11 f2e98590-7795-4cf7-bdc2-1794ad39874d 1000 manual payment received
2019-01-29 cbf44bfc-7a5e-4514-a906-a313a6e0fb5e 2000 saylary received
2019-02-01 32bc638c-4783-45b8-8c1e-bebe2b4528a1 -1500 rent payed

When we want to see the current balance, we read all the events and replay what happened.

const accountEvents = [0, 1000, 2000, -1500];
const replayBalance = (total, val) =>  total + val;
const accountBalance = accountEvents.reduce(replayBalance);

Once every n (e.g. 100) values we save a snapshot to not have to replay too many events. Aside from the increased complexity this has some side effects which should not be unadressed.

  • As we append more and more events, the data usage increases endless. There are ways around removing "old" events etc. and replacing them with snapshots, but this destroys the intention of the concept.
  • Additionally, as more events are stored, the system gets slower as it has to replay more events to get the current state of an object. Though, snapshotting every n events can get deterministic maximum execution time.

While there are many contra arguments there is one the key benefit why its worth: your application is future proof, as you save "everything" for upcoming changes and new requirements. Think of the account example from the previous step. You can implement/analyze everything of the following:

  • "How long does it take people to pay their rent once they got their salary"
  • "How many of our customers have two apartments? How much is the difference between both rents?"
  • "How many of our customers with two apartments with at least 50% in price difference need longer to pay off their car credit?"

To sum it up and coming back to our initial challenge, our simple CRUD application with domain-driven Design, CQRS and event sourcing would have transformed our architecture to something like this:

While this might solve some problems in application and system development this is neither a cookie-cutter approach nor "the right way" to do things. Be aware of the rising complexity of your application, system and enterprise ecosystem and the risk of over-engineering!

Ein solides File Management ist der erste Schritt zu guter Datenqualität im Datawarehouse. Wie kann man File Management im Modern Datawarehouse realisieren? Ein Ansatz sind die Azure Datafactory und Azure Databricks. Azure Functions bieten hier eine gute Alternative. Bevor ich euch die Lösung zeige, will ich das Problem noch etwas eingrenzen.

File Management

Häufig müssen verschiedenste Arten von Dateien in ein Datawarehouse importiert werden.
CSV-Dateien sind bei großen Datenvolumen, im Vergleich zu HTTP-Schnittstellen, schnell exportiert und schnell importiert.
XLSX-Dateien sind üblich, wenn es eine Schnittstelle mit einem manuellen Prozess gibt und ein Anwender Daten in Excel bearbeitet.
XML-Dateien erlauben den Austausch von komplexen Datenstrukturen zwischen Systemen.
Ein bewährtes Vorgehen ist für Schnittstellen-Dateien Metadaten zu erfassen, etwa wann eine Datei registriert und geladen wurde. Gleichzeitig sollten diese Dateien archiviert werden. Manchmal müssen die Dateien noch konvertiert werden.

Das Modern Datawarehouse

Realisiert man klassische Datawarehouse-Architekturen in der Cloud führt kaum ein Weg an Infrastructure as a Service vorbei. Das größte Potenzial bietet die Cloud aber bei Platform as a Service, da im Leerlauf weniger Ressourcen verbraucht werden und damit Kosten sinken können.
Eine Datawarehouse-Architektur die auf Platform as a Service realisiert werden kann ist das von Microsoft vorgeschlagene Modern Datawarehouse

Das sieht folgendermaßen aus:
Modern Datawarehouse Diagramm

Azure Functions

Azure Functions sind zustandslose Webservices, die auf unterschiedlichste Ereignisse registriert werden können und dann eine bestimmte Aufgabe erledigen sollen. Ereignisse können Zeitpläne, HTTP-Requests oder Änderungen in einem Azure Blob-Store sein. Kosten entstehen pro Aktivierung einer Funktion und skalieren so gut mit der entstandenen Last.

Azure Functions

Problemstellung und Lösungsansatz

Die Datenhaltung im Modern Datawarehouse ist keine Relationale Datenbank sondern ein Data Lake. Ein Data Lake ist im Kern ein Dateisystem.
Die Datenintegration legt Dateien im Lake ab und Konsumenten lesen die Dateien. Dabei ist essentiell, dass die Daten geordnet abgelegt werden. Nach Möglichkeit müssen die Daten partitioniert werden, sodass bei Zugriffen nicht alle Dateien durchsucht werden müssen.

Welche Komponente übernimmt jetzt das File Management? Eigentlich müsste die Azure Data Factory diesen Aspekt abdecken. Ist aber noch recht limitiert und eignet sich eher für simples Data Movement. Alternativ kann das File Management in Databricks implementiert werden.
Azure Functions sind hier aber preislich günstiger und nicht weniger leistungsfähig.

Rechnet man mit wenigen hundert Dateien, die pro Tag verarbeitet werden wird man das Gratisvolumen von 1 Mio. Zugriffen kaum ausschöpfen.
Ein anderer Vorteil gegenüber der Azure Data Factory ist, dass man hier vollwertige Programmiersprachen wie C# oder Python verwenden kann. Mit Konzepten wie Vererbung kann ein großer Teil des Codes wiederverwendet werden und muss pro Datei-Typ nur sehr wenig Code zusätzlich geschrieben werden.


Das Design ist simpel. Es gibt drei Azure Storage Accounts. Einen für den Input, einen für das Archiv und einen für den Data Lake. Im Input und im Archive existiert jeweils für jeden Dateityp ein Blob Container. Im Lake Storage gibt es nur einen gemeinsamen Container.
Jetzt wird eine Datei im Input abgelegt. In meinem Beispiel ist das ein Microsoft Flow, der einen Anhang aus einer E-Mail mit einem speziellen Betreff ablegt. Das könnte aber genauso ein AzCopy-Aufruf von einem Dienstleister sein.
Eine Azure Function ist auf neue Dateien im Input Blob Container registriert und wird aktiviert (Punkt 1 in der folgenden Grafik).
Die Function erzeugt ein Verzeichnis im Archiv Container mit dem aktuellen Zeitstempel und kopiert die Datei unbearbeitet in das neue Verzeichnis. Damit können keine Dateien überschrieben werden oder verloren gehen (Schritt 2).
Im Lake Container gibt es jeweils je Dateityp ein Verzeichnis.
Die Function vergibt der Datei einen neuen eindeutigen Namen, konvertiert bei Bedarf den Zeichensatz und legt die Datei im Lake Storage ab.
Dazu werden noch die Metadaten als eigener Dateityp abgelegt, die bei Analysen verknüpft werden können (Schritt 3). Zuletzt wird noch die Datei aus dem Input gelöscht.

File Management - Flow

Die Struktur der Function ist in der folgenden Grafik gezeigt. Die gesamte Logik liegt in der File-Klasse und ist damit wiederverwendbar. In der spezifischen Klasse wird nur festgelegt in welchem Container der Input liegt und in welchem Verzeichnis die Daten im Lake abgelegt werden sollen. Die Metadaten werden als eigene einheitliche Klasse realisiert und können so leicht serialisiert und im Lake abgelegt werden.

File Management - Structure


Für die Demo habe ich eine DEV-Umgebung in Azure erzeugt. Zusätzlich zu den erklärten Services benötigen die Azure Functions noch einen weiteren Storage Account und einen App Service Plan. Dazu werden noch die Application Insights verwendet um die Lösung zu überwachen.

Ressource Group

Die Application Insights zeigen auch, dass die Lösung tatsächlich läuft.

Application Insights

Im Storage Explorer kann man dann sehen, wie die Dateien im Lake abgelegt werden. Für jede Input-Datei wurde über eine GUID identifizierbar der Inhalt als CSV und die Metadaten als JSON abgelegt.

Lake - files
Lake - zep