
Thoughts about latency, observability, and backend systems for charging infrastructure

Industry news · 15.10.25 · 6 minute read

By Mikhail Aksenov, Principal Engineer at Zaptec.

Every software systems engineer has a nemesis — and that nemesis is the speed of light. You can’t beat it; you have to accept that the transmission of information is not free at any level.

And while it may seem silly at first (we’re talking nanoseconds for modern CPUs), every hop adds up, and every network transmission matters. In the end, everything affects the user experience.

Ever wondered why every public charging station gives you a message saying “a session start might take up to a minute”? That’s because of the speed of light. Imagine that the actual charging station is 1,000 km away from the data centre making all the decisions.

In ideal conditions, it takes roughly 3 ms for information to travel that distance one way. Multiply that by two for a round trip (assuming all other processing is instantaneous), and suddenly the milliseconds are piling up.
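
A quick back-of-the-envelope check of that figure (a sketch in Python; 300,000 km/s is the speed of light in a vacuum, and signals in fibre travel roughly a third slower):

    # One-way and round-trip time for a signal covering 1,000 km
    # at the speed of light in a vacuum.
    distance_km = 1_000
    speed_of_light_km_s = 300_000
    one_way_ms = distance_km / speed_of_light_km_s * 1_000
    print(f"one way: {one_way_ms:.1f} ms, round trip: {2 * one_way_ms:.1f} ms")
    # one way: 3.3 ms, round trip: 6.7 ms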

Smart charging — and IoT in general — is a tricky business. It involves distributed systems with many compute devices on the edge of the network. These devices are connected to backend systems deployed in the cloud, which in turn may call other providers for necessary integrations.

As they say, the first step to solving a problem is admitting that the problem exists. Well, that’s easy.

The next question is: How bad is it? How do we understand what contributes most to the time it takes from starting a session in the app to the actual response from the charging station? Wouldn’t it be a dream to see how much time we spend on every hop in the system?

Here comes observability — understanding a system’s behaviour through observation [1]. A natural follow-up is: What should we observe, and how should we start?

The natural reaction for a software engineer is to add logs at every possible point of interest. While this works great locally, it quickly becomes very expensive for large-scale IoT systems that handle thousands of requests per second.

The maths is fairly simple: if a device sends a data point every five minutes, one million devices will send about 3,300 requests per second. So logs alone are not an option, though they are still very useful for troubleshooting purposes.
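
The same arithmetic as a sketch, for the sceptical:

    # One million devices, one data point per device every five minutes.
    devices = 1_000_000
    interval_s = 5 * 60
    print(f"{devices / interval_s:,.0f} requests per second")  # 3,333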

Luckily, we can use metrics — statistics selected to evaluate or monitor a target. Besides the obvious minimums, maximums, and averages (the average is quite pointless most of the time, to be fair, yet still widespread in the field), one of the most useful options is percentiles.

In performance engineering, percentiles are simple yet powerful. A percentile is a statistical measure that characterises the value below which a given percentage of samples falls.

For example, take a set of times it took to open a web page. Order the list from lowest to highest and pick the value that sits 99 percent of the way through it. We can then say that 99 percent of our requests were executed faster than this value.

In reality, we need to look at the data distribution more closely and check what the tail of the list looks like. Normally, everything after the 95th percentile is the major contributor to slow responses for various reasons [2].
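
A minimal sketch in Python (the sample data is made up, standing in for real page-load times) shows how percentiles expose a tail that the average glosses over:

    import random

    # Made-up page-load times in milliseconds: mostly fast,
    # with a small fraction of slow outliers forming the tail.
    samples = [random.gauss(120, 15) for _ in range(9_500)]
    samples += [random.gauss(900, 200) for _ in range(500)]

    def percentile(values, p):
        """Value below which roughly p percent of the samples fall."""
        ordered = sorted(values)
        rank = min(len(ordered) - 1, int(len(ordered) * p / 100))
        return ordered[rank]

    for p in (50, 95, 99):
        print(f"p{p}: {percentile(samples, p):.0f} ms")
    print(f"mean: {sum(samples) / len(samples):.0f} ms")  # the mean hides the tail

Here the mean sits around 160 ms and looks harmless, while p99 reveals the near-second waits that real users occasionally hit.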

With this knowledge, we can do a couple of things:

  1. Sum up all tail latencies, multiply them by X (3, for example), and get roughly our one-minute start time for a charging session.
  2. Fight tail latency to get the best possible experience, using the many approaches people have invented over the years [3]; one such approach is sketched a little further below.

Did you notice how AI has shaped our perception of latency? As soon as one replaces the word “Loading” with “Thinking,” expectations change drastically. But at its core, horrible system latency is simply being masked.

That’s a legitimate use of the first approach, because sometimes it’s not feasible to make the waiting time short — either the task is too compute-intensive, or it’s just not justified economically.
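
For the second approach, a classic technique from The Tail at Scale [2] is the hedged request: if the primary call hasn’t answered within roughly the 95th-percentile latency, fire a duplicate to another replica and take whichever response lands first. A minimal sketch with Python’s asyncio (the replica call and the 50 ms hedge delay are illustrative assumptions, not production values):

    import asyncio
    import random

    # Stand-in for a backend call that is usually fast but
    # occasionally lands in the latency tail.
    async def call_replica(name):
        delay = random.choices([0.02, 0.5], weights=[95, 5])[0]
        await asyncio.sleep(delay)
        return f"response from {name}"

    # Hedged request: if the primary hasn't answered within the hedge
    # delay (roughly the p95 latency), duplicate the request to a
    # second replica and take whichever response arrives first.
    async def hedged_request(hedge_delay=0.05):
        primary = asyncio.ensure_future(call_replica("replica-a"))
        done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
        if done:
            return primary.result()
        backup = asyncio.ensure_future(call_replica("replica-b"))
        done, pending = await asyncio.wait(
            {primary, backup}, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # the slower duplicate is no longer needed
        return done.pop().result()

    print(asyncio.run(hedged_request()))

The price is a few percent of duplicated requests, which is usually a good trade against a user staring at a spinner.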

The third and most complex part of the observability triad is traces — complete execution paths recorded from start to finish.

In a dream world, every request made by a car would be reflected across the full system, with all metadata attached through its lifecycle. Despite being technically possible, it’s very expensive to collect and store so much information, given that it won’t be used often.
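
The usual compromise is sampling: record complete traces for only a small fraction of requests. A minimal sketch with the OpenTelemetry Python SDK (chosen here purely for illustration; the span names, station id, and the 1% ratio are assumptions):

    # Requires the opentelemetry-sdk package.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
    from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

    # Keep full traces for 1% of requests so storage costs stay sane.
    provider = TracerProvider(sampler=TraceIdRatioBased(0.01))
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("charging-session")

    # Each nested span records how long one hop of a session start takes.
    with tracer.start_as_current_span("start_session") as span:
        span.set_attribute("station.id", "station-42")  # illustrative metadata
        with tracer.start_as_current_span("authorise"):
            pass  # backend authorisation call would go here
        with tracer.start_as_current_span("send_command_to_station"):
            pass  # command to the charging station would go here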

As Eda wrote in the article about integrations powering EV charging, having many sources of latency is inevitable.

The best we can do from a systems perspective is to provide as much observability as possible: use built-in cloud provider tooling, write our own monitoring agents, follow industry practices, and finally finish that statistics book we added to the reading list five years ago.

From our experience, a balanced combination of metrics, sampled traces, and carefully selected logs works very well in most cases — especially since the most obscure part is usually understanding what happens between cloud-provider components.

To complete the picture, we strongly believe in adopting the same practices we use for the backend everywhere: production lines, devices, testing labs. In the end, the whole EV charging station fleet is one huge, complex distributed system.

With all that said, the next natural question is: What other solutions to the latency problem might we have? Could we live without all the backend systems involved? Could we build a smart charging park completely disconnected from the internet?

My answer is: probably.

It’s still a distributed system with all its quirks. The CAP theorem doesn’t go away. It requires considerable compute power at the edge. There must still be a connection to a management system if payment is required.

Network media would have to be reconsidered — no one wants to run consensus protocols over Wi-Fi.

And most importantly, software updates would be slower, since edge devices are naturally harder to update than backend applications.

Even if we eventually have offline smart charging parks, we’ll still need to monitor them.

I personally don’t see a viable alternative to backend-heavy systems at this point in the field of smart charging. But I do believe that observability is the key to coping in the never-ending fight with physics — to provide the best possible user experience for the EV charging industry.

References

1. Gregg, B. Systems Performance: Enterprise and the Cloud. Second edition. Addison-Wesley, 2021.

2. Dean, J. and Barroso, L.A. The Tail at Scale. Communications of the ACM, Volume 56, Issue 2 (Feb. 2013).

3. Enberg, P. Latency: Reduce Delay in Software Systems. Manning, 2024.