In the previous articles of this series, I talked about why we need a new computing model and what that model should look like. The “why” was pretty clear – we’re drowning in accidental complexity while building distributed systems. The “what” painted a picture of a unified computing environment that makes building distributed applications more tenable. Now comes the challenging part – how do we get there?
Starting at the Foundation
When you’re building a skyscraper (yes, I’m bringing back that analogy from Part “Why”), you don’t start with the penthouse. You start with the foundation. In our case, that foundation is a programming model that treats distributed computing as a first-class citizen rather than a bolt-on addition.
As I mentioned earlier, we’re still using programming models that were designed for single-threaded programming on individual computers. We’ve been building distributed applications long enough that we can now identify some of the more durable building blocks of distributed systems. These include inter-service communication, authentication and security, state management, and fault tolerance. We can use these building blocks to design a new programming model that’s specifically tailored for distributed systems.
A new programming model needs to be accompanied by a programming language designed specifically for this new model of distributed systems. Hold the “not another language” thought until you see what it would be capable of doing.
The Power of Grease
Simply removing friction, without necessarily adding any new capability, makes a great difference in how much something gets used. We see it daily in how differently designed apps fare.
It’s the same with programming. Think about how C unlocked systems programming for a whole new generation of developers while still letting assembly wizards dive deep when needed. Or how garbage collection in languages like Java enabled millions of developers to build complex applications faster.
My poisons were PHP and Go. PHP was phenomenally powerful in the way it eliminated the “build” phase by integrating tightly with apache2 and its request lifecycle. Go, for me, unlocked parallel programming in a way I could never approach with C++, Python, or Java. The elimination of frictional, error-prone boilerplate, combined with a well-defined conceptual model, is what made these languages such a step up in productivity over their predecessors.
The other thing I’ve learned over decades of programming is that defaults matter immensely. Make something the default, and that’s what people will use 90% of the time. On the other hand, make it require explicit opt-in, and suddenly it’s a “power user feature” that most developers will avoid until absolutely necessary.
That’s why in Krys [1] – the programming language for MetaComputer™ – I’ve flipped some of the defaults around so that the minimum-effort logic is the one most suitable for distributed systems. Here’s a brief preview of Krys:
// A straight-forward PVP game player
type Player {
    id: String
    health: Int
    stamina: Int
    kills: Int
}

fn (p Player) modHealth(amount: Int) Int {
    p.health += amount
    return p.health
}

fn (p Player) modStamina(amount: Int) Int {
    p.stamina += amount
    return p.stamina
}

fn (p Player) addKill() {
    p.kills += 1
}
Now let’s look at the juicy bits – a little game service, followed by a line-by-line explanation:
 1  service Game {
 2      state players: Player
 3
 4      fn join(player: Player) {
 5          players[player.id] = player
 6      }
 7
 8      fn attack(byId: String, onId: String) {
 9          let attacker = players[byId]
10          let target = players[onId]
11
12          if attacker.modStamina(-10) < 0 {
13              return ErrNotEnoughStamina
14          }
15          let h = target.modHealth(-10) handle {
16              any => attacker.modStamina(10)
17          }
18          if -10 <= h <= 0 {
19              attacker.addKill() handle { none }
20          }
21      }
22
23      fn healDrop(x: Int, y: Int) {
24          in {
25              for p in playersInRange(x, y) {
26                  p.modHealth(10)
27              }
28          } handle { any => return }
29      }
30
31  }
Line 1: We define Game as a service. A service is a distributed entity that can be deployed across multiple nodes (not specifically defined as containers, VMs, or bare-metal machines) in an elastically scaling group. It can have state, and its methods turn into remote endpoints.
Line 2: We define a player registry called players, but this is a state type. This means it’s a distributed persistent store of Player objects.
Line 4: The join function adds a player to the registry. Being a service method, it’s a remote endpoint that can be called from anywhere in the network. There is no service discovery, marshalling/unmarshalling, or network handling code to be written for it.
Line 5: The persistent state is usable as if it were a local object – we’re just adding an element to it. Since players is a remote data store, the insertion can fail, or there may already be an existing entry for the player. Yet there is no error handling code here. Why? Because Krys functions have implicit error handling and error propagation: if the insertion fails, the function returns an error to the caller. Note that we haven’t declared any error return from the join function – every function implicitly returns an optional error, and the error types form a separate type tree. Of course, errors can also be explicitly returned or handled.
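Krys doesn’t exist yet, but the implicit propagation just described maps neatly onto exceptions. Here’s a rough Python sketch – PlayerStore and StoreError are invented for illustration – of how a failed remote insertion flows back to the caller without any handling code inside join:

```python
class StoreError(Exception):
    """Stands in for Krys's separate error type tree."""

class PlayerStore:
    """A toy stand-in for the distributed players registry."""
    def __init__(self, failing=False):
        self.data = {}
        self.failing = failing  # simulate a remote insertion failure

    def insert(self, key, value):
        if self.failing:
            raise StoreError("remote insert failed")
        self.data[key] = value

def join(players, player):
    # No error-handling code here: as in Krys, a failed insert simply
    # propagates to the caller as an (implicit) error return.
    players.insert(player["id"], player)
```

The caller decides whether to handle the StoreError or let it propagate further, which is exactly the default Krys picks for you.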
Line 8: This function defines a game action, attack. The attacker requires some stamina to be able to attack the target. The attacker loses stamina while the target loses health. If the attacker makes the target’s health go down to 0, that counts as a kill. Let’s go through it.
Line 9: We fetch the attacker from the registry. This is a remote operation, but we don’t have to write any network handling code. We also don’t have to explicitly check whether a player lookup failed. The unhandled error will cause the function to terminate and return the error to the caller.
Line 10: We fetch the target player, same as above. By the way, Krys calls all functions asynchronously by default, returning control to the caller immediately. Only when the caller needs the result does it wait for the response. This is one of the things Krys does to make distributed code cleanly readable while still being efficient.
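To make the async-by-default behaviour concrete, here’s a small Python sketch using futures; fetch_player is an invented stand-in for the remote registry lookup. Both calls are dispatched immediately, and the caller blocks only where the results are actually read:

```python
from concurrent.futures import ThreadPoolExecutor
import time

pool = ThreadPoolExecutor(max_workers=4)

def fetch_player(pid):
    time.sleep(0.05)  # pretend this is a network round trip
    return {"id": pid, "health": 100, "stamina": 100}

# Both lookups are dispatched in parallel; neither line blocks.
attacker_future = pool.submit(fetch_player, "a1")
target_future = pool.submit(fetch_player, "t1")

# Only here, where the values are needed, do we wait for responses.
attacker = attacker_future.result()
target = target_future.result()
```

In Krys the futures and the `.result()` calls would be invisible – the compiler inserts the wait at the first point of use.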
Line 12: We deduct 10 stamina points from the attacker. This is a remote API call and we check its return value.
Line 13: If the stamina was insufficient (< 0), we return a specific error. If the call fails, on the other hand, we rely on the implicit error handling to propagate the error to the caller. Note, once again, that the attack function does not have an explicit return type. This is because error returns are implicit and optional. Since the error types are a distinct type tree, the compiler can infer the return flow as well as perform implicit error handling.
Line 15: We deduct 10 health points from the target. This is a remote call and we check its return value.
Line 16: Here we do explicit error handling in the handle block. If there’s any error – any being a catch-all keyword – we add 10 stamina points back to the attacker.
Line 18: We check whether this specific attack caused the target’s health to drop to or below 0.
Line 19: If the attack killed the target, we increment the attacker’s kill count. In this case, rather than handling errors, we ignore them. This is a signal to the runtime that it’s a “fire and forget” call. The compiler will then dispatch it as a unidirectional message, ensuring that it gets delivered eventually.
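A rough Python approximation of such a fire-and-forget dispatch – with an invented transport callback standing in for the runtime’s messaging layer – might look like this:

```python
import threading
import time

delivered = []

def send_oneway(message, transport):
    """Dispatch a unidirectional message and return immediately.

    A background thread keeps retrying until the transport accepts the
    message, approximating "delivered eventually". In Krys this retry
    machinery would live in the runtime, not in user code.
    """
    def deliver():
        while not transport(message):
            time.sleep(0.01)  # simple fixed backoff between attempts
        delivered.append(message)
    threading.Thread(target=deliver, daemon=True).start()
```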
Line 23: This function demonstrates the true syntactic charms of Krys. The function drops a healing item at a specific location and heals all players within a certain radius.
Line 24: The in block is a special construct that allows all errors in a block of code to be handled together. We’ll see why this is important here.
Lines 25 and 26: We fetch all players near the drop location coordinates and loop over them. The loop fires modHealth calls in parallel to all the players, while the in block from Line 24 ensures that errors are not implicitly checked until we reach the handle section. This avoids the premature awaiting of responses within each loop iteration that would otherwise have been the default.
Line 28: This handle block covers the block of code demarcated by in. In this case, we’re suppressing all errors. So this is a best-effort multicast operation, which we are not concerned about failing.
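The in/handle pairing can be approximated in Python with a thread pool: dispatch everything first, then inspect (here: ignore) all the outcomes in one place. heal_drop and mod_health are illustrative stand-ins for the Krys code above:

```python
from concurrent.futures import ThreadPoolExecutor

def heal_drop(players, mod_health):
    # "in" block: fire all calls in parallel without awaiting each one.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(mod_health, p) for p in players]
    # "handle" block: look at all outcomes together; any => suppressed.
    for f in futures:
        f.exception()  # best effort: failures are observed and ignored
```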
Now let’s take a look at the system architecture that this code would compile into and get deployed onto.
The broader feature set
So far I’ve only shown a small part of what Krys can do. Here are some of the other features that might convince you of the worthiness of its existence.
Service Memory Model
First class support for services also means that we have a new memory model, one that is highly suitable for multi-instance services. The model controls aspects like lifetimes, shareable scope, mutability, and safe concurrent access.
State Types
Traditional programming languages focus on pushing a raw data store (RDBMS, KV Store, etc.) into the application and burden the programmer with the complexity of managing it. Krys, on the other hand, encapsulates the data store into the implementation of the state types and provides the persistent types as services with methods to interact with. First class support for private data is also in the works.
Under-the-hood Resilience
Traditional programming languages have been unaware of distributed system reliability, leaving such tasks to libraries and other infrastructure components. Krys, on the other hand, has first class support for transparent, resilient inter-service communication that supports adherence to SLOs.
The system condenses service interaction failure modes into a few well-defined standard error types. The programmer can clearly identify whether an operation returned an error (i.e. a functional or business logic error) or whether it failed due to the current condition of the distributed system.
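One way to picture this split is as two separate exception families, sketched here in Python with invented class names; a caller (or the runtime) can then decide, for instance, that only system-condition failures are worth retrying:

```python
class FunctionalError(Exception):
    """The operation ran, but business logic rejected it."""

class SystemConditionError(Exception):
    """The distributed system itself failed: timeout, unreachable, overload."""

class ErrNotEnoughStamina(FunctionalError):
    pass

class ErrTimeout(SystemConditionError):
    pass

def is_retryable(err):
    # Functional errors are final; system-condition errors may clear up.
    return isinstance(err, SystemConditionError)
```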
Secure Communication
Krys has built-in support for secure communication between services, including access controls, mutual TLS, and encryption. Defense against supply-chain attacks is also in the works.
Built-in CI/CD
Krys supports a build phase and a deploy phase out of the box. The deploy phase has support for setting up isolated environments. Services and persistent types automatically get deployed onto auto-scaling instance clusters. Persistent types also include data lifecycle management, CDC, schema evolution and, of course, backup and restore.
Let me say this again: Krys will not only build and package your code, but actually roll it out onto physical infrastructure based on the topology it infers from the service definition code. Repeatability to the max.
Not to forget, this capability would work on private infrastructure as well, not just public clouds.
Deep Observability
Having access to the raw code and the understanding of the runtime topology, Krys can automatically inject observability into the services. This is not mere observability, though, but actual system awareness that includes not only system performance, but also the distinction between success, functional errors and communication errors.
This awareness is then integrated into the resiliency setup. Individual instances, for example, can autonomously decide on hedging, backoffs and retries. They can apply rate limiting or circuit breaking based on the system’s current state. [2] Their peer instances, being built on the same protocol, can use cooperative rather than adversarial strategies to maintain system health.
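As an illustration of one such autonomous policy, here’s a minimal Python sketch of retry with exponential backoff (the parameters are arbitrary); hedging and circuit breaking would layer similar machinery onto the same call path:

```python
import time

def call_with_backoff(op, attempts=4, base_delay=0.01):
    """Retry op with exponential backoff; re-raise once the budget is spent."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            time.sleep(base_delay * (2 ** i))  # 10ms, 20ms, 40ms, ...
```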
SuperState™
SuperState™ is a new level of automation for creation and management of “Big Data”. I can’t talk about it much yet, but it’s designed to make system-wide creation of insights more well-defined, and make the integration of derived insights back into functional code a breeze.
The Path Forward
So far, this is still a thought experiment, albeit a deeply thought-through one. There are many what-ifs and how-abouts that I haven’t covered in this post. I’ve dealt with distributed systems long enough to know what those questions are, and I either have answers to them or know that they can be answered.
Building this system is a massive undertaking. That’s why it requires taking an incremental approach:
- First, focus on getting the core programming model right. This is the make-or-break part of the project.
- Then comes the MetaCompiler™, starting with simple deployment patterns and gradually adding more sophisticated optimisations.
- Finally, we build out the full MetaComputer™ environment with all its tooling and ecosystem support.
What’s Next
Getting from here to there is going to be an interesting journey. If you’re interested in following this journey or potentially contributing to it, check out the MetaComputer™ organisation on GitHub.
Remember how in Part “Why” I talked about not enjoying programming as much anymore because of all the accidental complexity? Well, I’m enjoying thinking about this system because every time I figure out how to solve one of these fundamental problems, I can feel us getting closer to making distributed systems development fun again.
That’s really what this is all about – not just making distributed systems development more productive (though that’s important), but making it enjoyable again. Making it something where you can focus on solving real problems instead of wrestling with infrastructure.
Ref: All MetaComputer™ Articles
1. The name is derived from “Crystal”, as in the glass.
2. There’s already a working prototype for this.