All technologies required to build a fully open source / open hardware system for industrial automation are now available, mostly from European Free Software publishers and Open Source Hardware vendors. We will explain in this article how to build such a system. We will then discuss possible limitations of certain standards and problems which remain to be resolved.
A possible architecture for an industrial automation system has four segments: Cloud, Edge, IoT and Client.
The diagram below illustrates the role of each segment:
The Cloud segment, which can be a public cloud or a private cloud, runs backend servers for applications such as ERP or a data lake. It could even run MES applications as long as there is some form of proxy at the edge for the real time part of manufacturing execution.
The Edge segment runs services that must be deployed next to the production line. This includes services that provide real time control logic to IoT, real time signal processing to Remote Radio Head (RRH), real time proxy of a cloud hosted MES, network management services, etc. ERP could also be deployed at the edge if the wide area network (WAN) is not reliable enough to access the cloud segment.
The IoT segment does very simple operations to convert analog signals to digital and vice-versa. The IoT usually runs a micro-controller, possibly without an operating system. It is connected to the Edge either through a local area network (LAN) or through a wide area network (WAN). Typical LAN can be Ethernet or Wi-Fi. Typical WAN can be LTE, NR, NB-IoT, etc.
The Client segment provides a user interface to display and control the system. It can also run some local processing.
Industrial automation is an industry looking for standards, but which is dominated by mutually incompatible protocols from Siemens, Schneider, Beckhoff, Omron, Rockwell, etc.
Most parts of an industrial automation system could now be implemented based on four standards:
- OPC-UA (IEC 62541);
- TSN (IEEE 802.1);
- POSIX (ISO/IEC 9945-1:1996);
- HTML5 (W3C).
OPC-UA defines a standard way to exchange messages between Cloud, Edge and IoT. It provides a standard way to describe message content, address each device and ensure a limited form of resiliency with OPC-UA PubSub. It is a fast evolving standard closely linked to the progress of Industry 4.0. It has a lot of open source implementations.
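As an illustration of how OPC-UA addresses a device or one of its variables, the sketch below parses the textual NodeId notation (`ns=<namespace index>;<type>=<value>`, where the type letter is `i`, `s`, `g` or `b`). The `NodeId` class, the `parse_node_id` function and the example identifiers are hypothetical names chosen for this article. This is a minimal sketch and not the API of any particular OPC-UA library.

```python
from dataclasses import dataclass

# Identifier type letters used in the OPC-UA NodeId string notation.
_ID_TYPES = {"i": "numeric", "s": "string", "g": "guid", "b": "opaque"}


@dataclass(frozen=True)
class NodeId:
    """Hypothetical container for a parsed NodeId (illustration only)."""
    namespace: int
    id_type: str
    identifier: object


def parse_node_id(text: str) -> NodeId:
    """Parse 'ns=<index>;<type>=<value>'; a missing 'ns=' part means namespace 0."""
    namespace = 0
    rest = text
    if text.startswith("ns="):
        ns_part, sep, rest = text.partition(";")
        if not sep:
            raise ValueError(f"missing identifier in {text!r}")
        namespace = int(ns_part[3:])
    kind, sep, value = rest.partition("=")
    if not sep or kind not in _ID_TYPES:
        raise ValueError(f"unknown identifier type in {text!r}")
    # Only numeric identifiers are converted; other types are kept as text.
    identifier = int(value) if kind == "i" else value
    return NodeId(namespace, _ID_TYPES[kind], identifier)
```

For instance, `parse_node_id("ns=2;s=Line1.Press.Temperature")` yields namespace 2 with a string identifier, which is the kind of address an Edge service would use to read a value exposed by an IoT device.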
TSN defines a standard for deterministic networking that can handle time constraints typical of video or field buses. Its specification is very wide and covers scheduled traffic (see "Time-Sensitive Networking: From Theory to Implementation in Industrial Automation"). However, it is still far from being widely implemented or adopted. Implementing it requires changes in the Ethernet driver code or chipset. Its open source implementation for Linux is still partial.
POSIX is one of the very few standards for service deployment which is independent of the operating system. The GNU C Library is a mature implementation of POSIX licensed as Free Software which has always taken care to provide upward compatibility. Many IoT operating systems support a POSIX compatibility layer which may not be complete (ex. POSIX compatibility layer for eCos, POSIX wrapper for RIOT).
HTML5 is another standard for service deployment which is independent of the operating system. Although this is little known, HTML5 can run autonomous processes offline and has the power to completely replace POSIX at the Edge. It defines a wide range of APIs which cover data persistence (IndexedDB), multi-processing (Web Workers) and local service provisioning (Service Workers). Pyodide demonstrates how HTML5 can be used to deploy autonomous A.I. processing inside a browser.
The cloud segment mainly requires a mature Service Lifecycle Management (SLM) system that can automate the lifecycle (build, configure, run) and all aspects of service management:
- disaster recovery;
A service can be anything: a database, a virtual machine (VM), an HTTPS front-end, ERP, MES, data lake, etc.
Services are deployed on bare metal servers similar to those used by Facebook (Open Compute Project). The use of virtualisation or OS namespaces/jails for services is optional. Thanks to the recursive architecture of SlapOS, service lifecycle management can be self-deployed and self-tested. Slave services provide a way to partition a single service into multiple sub-services which can be provisioned individually.
Networking is based on full IPv6 (just like Facebook infrastructure) with modern low latency routing (RFC 6126).
Multi-protocol data collection is implemented through the fluentd protocol.
ERP and MES on the cloud segment are based on proprietary standards. It could be interesting to see how to integrate them based on the ISA-95 standard reformulation promoted by the OPC Foundation.
As far as we know, there is no standard OPC-UA profile for Service Lifecycle Management. It could be interesting to reformulate SLAP protocol into OPC-UA.
The Edge segment is remotely configured and managed by the cloud segment. It is autonomous enough to keep on operating in case of Wide Area Network (WAN) outage.
The Edge segment infrastructure can be based on micro-servers (Olimex open hardware) or on high performance servers (OCP open hardware).
The Edge segment usually provides a Content Delivery Network (CDN) to support high performance HTTPS content delivery and caching. Another common type of edge service is a multi-protocol data collection gateway that converts any data collection protocol (ex. MQTT, syslog, OPC-UA) into the data collection protocol expected by the cloud segment, in our case fluentd.
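The conversion done by such a gateway can be sketched as follows. A fluentd event is a tag, a timestamp and a record. The two functions below map an MQTT publication and a raw syslog line onto that triple. The function names, the tag prefixes (`mqtt.`, `syslog.raw`) and the record keys are assumptions made for illustration. A real gateway would also serialise the event with MessagePack before sending it over the fluentd forward protocol, which this minimal sketch leaves out.

```python
import json


def from_mqtt(topic: str, payload: bytes, now: float) -> tuple:
    """Map an MQTT publication to a (tag, time, record) event."""
    # Hypothetical tag convention: MQTT topic levels become tag parts.
    tag = "mqtt." + topic.strip("/").replace("/", ".")
    try:
        value = json.loads(payload)
    except ValueError:
        # Not JSON (or not valid UTF-8): keep the payload as text.
        value = payload.decode("utf-8", errors="replace")
    return (tag, int(now), {"value": value})


def from_syslog(line: str, now: float) -> tuple:
    """Map a raw syslog line to a (tag, time, record) event, header left unparsed."""
    return ("syslog.raw", int(now), {"message": line.rstrip("\n")})
```

Whatever the source protocol, the cloud segment then only has to understand one event shape, which is the point of placing the gateway at the edge.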
The Edge segment can provide IPv6 addresses to neighbouring IoT and access to resilient WAN through multiple links (ex. LTE/NR, FTTx, etc.).
The Edge segment can run a local Service Lifecycle Management (SLM) system dedicated to IoTs connected to the same local area network (LAN).
Some proposed standards for OPC-UA profiles at the Edge are emerging:
The IoT segment is considered here as an extension of the Edge segment. The purpose of each IoT is to act as an extension of the "IoT Logic" service deployed at the edge which does additional processing required to convert digital events to/from analog signal. Communication between Edge and IoT segments may require strict time constraints typical of real time applications.
For example, the Edge segment could implement a complete LTE radio physical emulation (a.k.a. eNodeB) whereas the IoT segment acts as a Remote Radio Head (RRH) which modulates a 2,600 MHz carrier with a 20 MHz signal. Although this example is related to telecommunication infrastructure rather than industrial automation, it shows clearly the difference of algorithmic complexity between the Edge and the IoT segments.
The IoT segment does simple and slowly evolving - yet possibly very fast - processing whereas the edge segment does complex (and constantly evolving) processing.
The IoT segment may have too few resources (RAM, CPU, storage) to run a full POSIX operating system. The Edge segment runs a full-blown POSIX operating system.
The IoT segment does not provide much standard API, for now. The Edge segment is based on POSIX standard.
The only thing common to all possible firmware for IoT seems to be the C language. OPC-UA also provides common ground to simplify interfacing Edge and IoT (addressing, payload schema, payload transport, etc.), just like USB simplified interfacing personal computers with an ever growing ecosystem of devices.
Yet, multiple incompatible abstractions still exist as a possible API for IoT software developers:
- POSIX OS API abstraction (μClinux);
- custom OS API abstraction (RIOT, Mongoose, FreeRTOS, mbed, etc.);
The market is still very fragmented. Also, as far as we know, there is no standard OPC-UA library for the IoT segment. Yet, open62541 is commonly used in combination with other IoT libraries or operating systems.
The client may use the Teres open hardware laptop from Olimex, a smartphone or an industrial tablet.
It runs an operating system and an HTML5 browser derived from Chromium. This browser was modified to remove any leak of data to Google.
Data analysis and visualisation are based on the Iodide and Pyodide frameworks created by Mozilla or on similar frameworks (plotly, perspective).
The client segment could in theory act as an Edge segment or as an IoT segment. HTML5 can actually do much more than what most developers believe. Implementing a complete A.I. engine in HTML5 is quite easy. Such an engine could drive an IoT in real time.
The adoption of OPC-UA and TSN for industrial automation involves certain risks or questions listed below.
TSN could become a beautiful standard without implementation. IEEE standards such as 802.11 have already experienced this issue. The Point Coordination Function (PCF), which provides a way to ensure a form of determinism over Wi-Fi and to solve the hidden station problem, is still implemented by virtually no chipset (except Atmel). The TSN standard is so wide that it could be uneconomical for any vendor to implement it entirely. This could delay interoperability. Even Intel seems to be struggling to implement TSN entirely with OpenAvnu (see "The Road Towards a Linux TSN Infrastructure").
TSN is a layer-2 standard in a layer-3 world. Routing is the dominant form of networking between cloud, edge and IoT nowadays. One could argue that this makes TSN unsuitable for a modern networking infrastructure which combines distributed radio (ex. LTE, 5G) and wired networks (ex. Ethernet, CPRI, USB, etc.). Routing (see "Delay-based Metric Extension for the Babel Routing Protocol") and traffic control (see "tc-fq_codel (8) - Linux Man Pages") approaches might make more sense for a truly unified architecture.
TSN could be overkill for industrial automation. A complete LTE/NR physical signal can be transported to IoT over a standard 10 GbE switch and processed at the Edge. Is there anything in industrial automation which imposes stricter time constraints than that?
OPC-UA does not define standard payloads. Vendors of OPC-UA hardware could embed binary data into payloads as a way to ensure their data formats remain secret and mutually incompatible with other vendors.
POSIX or TRON might be a better HAL for IoT. Instead of trying to invent yet another abstraction or Hardware Abstraction Layer (HAL), it might be easier to rely on proven abstractions such as POSIX or iTRON already deployed in the industry and supported for decades. A/UX BSD Unix could run on a 512 KB Macintosh with a 68030 CPU. μClinux requires less than 200 KB to operate. RIOT provides partial POSIX support. eCos and RTEMS provide both POSIX and TRON APIs.
Existing OS could be a better HAL for IoT. Instead of trying to invent yet another abstraction or Hardware Abstraction Layer (HAL), it might be easier to rely on an existing abstraction such as Mongoose OS or LiteOS.
Unsolved Problems and Opportunities
Current standards (OPC-UA, TSN, POSIX, HTML5) do not provide a solution for the following problems:
- Time-sensitive routing (TSR);
- Standard API for non-POSIX IoT;
- Standard cross-platform build and OTA upgrade for non-POSIX IoT.
Selected technologies (SlapOS, open62541) have some limitations:
- Lack of implementation of TSN for most network controllers;
- Lack of proven resiliency of OPC-UA PubSub in most implementations;
- Lack of implementation of non-POSIX service lifecycle management in SlapOS;
- Lack of support of real-time resources or time sensitive orchestration in SlapOS.
Each unresolved problem can be viewed as an opportunity for Open Hardware and Free Software in the field of industrial automation:
- better support of TSN in Linux kernel (ex. AccessTSN project);
- time sensitive routing (TSR) protocol;
- proven resilient PubSub implementation (ex. based on Intel DPS for IoT);
- time sensitive extension of SlapOS;
- real time resource extension of SlapOS;
- implementation of IoT support in SlapOS including cross-platform build and OTA upgrade;
- OPC-UA schema for SlapOS SLM;
- OPC-UA schema for ERP;
- OPC-UA schema for MES;
- standard library for OPC-UA IoT logic and processing;
- OPC-UA schema for fluentd;
- OPC-UA support in fluentd and fluentbit.
In most factories, industrial automation can be implemented with Modbus TCP over Ethernet for which many mature I/O products exist from suppliers such as Wago, Advantech (ADAM) or ICPDAS. With 100 Mbps per I/O and 1 Gbps or even 10 Gbps at the edge server side, there is no actual problem of latency, jitter or determinism. One should just be careful enough to provide enough CPU and RAM to the virtual PLC (ProviewR) or to the process running on the I/O side.
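To give an idea of how simple Modbus TCP is on the wire, the sketch below builds a "Read Holding Registers" request (function code 0x03): a 7-byte MBAP header followed by a 5-byte request. The function name and its parameters are hypothetical. Only the frame layout comes from the Modbus specification, and the sketch does not open any connection.

```python
import struct


def read_holding_registers_request(transaction_id: int, unit_id: int,
                                   start_address: int, quantity: int) -> bytes:
    """Build a Modbus TCP 'Read Holding Registers' (0x03) request frame."""
    if not 1 <= quantity <= 125:
        raise ValueError("quantity must be between 1 and 125 registers")
    # PDU: function code, first register address, number of registers.
    pdu = struct.pack(">BHH", 0x03, start_address, quantity)
    # MBAP header: transaction id, protocol id (0 for Modbus),
    # number of bytes that follow (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu
```

The resulting 12 bytes would typically be written to a TCP socket connected to the I/O module on port 502. With frames this small, a 100 Mbps link leaves a large margin, which is consistent with the remark above about latency.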
However, some applications require precise synchronisation with microsecond level jitter and determinism which goes beyond networking only. What is actually needed is the ability to ensure that two processes provisioned on two different systems (edge server and IoT for I/O) are able to communicate within a certain guarantee of latency and jitter. Solving this problem requires work at three different levels:
- networking protocols;
- operating system's scheduler;
- orchestrator's scheduler.
To our knowledge, no solution (open source or proprietary) covers all aspects. Some ad-hoc proprietary solutions may cover one aspect (networking) or two aspects (networking and operating system) in specific cases. A general solution remains to be invented though.
Introducing time constraints in networking can be achieved through IEEE 802.1 TSN but few vendors implement it. It can also be achieved through well known protocols: NTP and PTP. PTP achieves clock accuracy in the sub-microsecond range. Some NTP clients on LAN may also achieve similar results under specific conditions though. PTP is supported by many switches including some open source hardware of the Open Compute Project. NTP is a pure software solution.
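The arithmetic behind this kind of clock synchronisation can be sketched with the classic four-timestamp exchange used by NTP (PTP's delay request-response mechanism relies on a comparable calculation). The function name and the timestamp values in the usage note are made up for illustration. The sketch assumes a symmetric network path, which is also the assumption these protocols make.

```python
def offset_and_delay(t1: float, t2: float, t3: float, t4: float) -> tuple:
    """Estimate clock offset and round-trip delay from four timestamps.

    t1: request sent (client clock)     t2: request received (server clock)
    t3: reply sent (server clock)       t4: reply received (client clock)
    """
    # Offset of the server clock relative to the client clock.
    offset = ((t2 - t1) + (t3 - t4)) / 2
    # Time spent on the network, excluding server processing time.
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

With timestamps in microseconds, `offset_and_delay(0, 1500, 1600, 300)` gives an offset of 1400 and a round-trip delay of 200. Any asymmetry between the two directions of the path translates directly into an offset error, which is one reason why hardware timestamping in switches and network controllers matters.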
Bringing time constraints from network to operating system requires either extending Linux or using specific operating systems. Linutronix has created a patch for the Linux kernel called PREEMPT_RT which brings latencies of less than 10 to 60 microseconds to userspace processes on low performance hardware. Linutronix has also implemented PTP for the Linux kernel. The AccessTSN project conducted by Linutronix brings a common code base to the Linux kernel which can accommodate, through a unified API, different technologies to manage time constraints (PTP, time division, 802.1 TSN) at different levels (network drivers, userland process).
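The userspace latency mentioned above can be observed with a periodic loop that records how late it wakes up after each deadline. The sketch below is only indicative: the function name is hypothetical, and Python interpreter overhead makes it unsuitable for characterising a real-time kernel, for which a C program would normally be used.

```python
import time


def measure_wakeup_latency(period_ns: int, cycles: int) -> list:
    """Return, for each cycle, the wake-up lateness in nanoseconds."""
    latencies = []
    deadline = time.monotonic_ns() + period_ns
    for _ in range(cycles):
        remaining = deadline - time.monotonic_ns()
        if remaining > 0:
            time.sleep(remaining / 1e9)
        # Lateness relative to the absolute deadline, so errors do not accumulate.
        latencies.append(time.monotonic_ns() - deadline)
        deadline += period_ns
    return latencies
```

The spread between the smallest and largest value of `measure_wakeup_latency(1_000_000, 1000)` gives a rough idea of the jitter of a 1 ms task. On Linux, running the process under a real-time scheduling policy (for example through `os.sched_setscheduler` with `os.SCHED_FIFO`, which requires privileges) would be the next step before comparing kernels.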
In a real world system, tasks are allocated to different nodes, some of which are connected on the same LAN while others are reachable through routers. Time constraints should take into account not only a single LAN but the entire communication path from one task running on one system to another task running on another system, possibly interconnected through LANs and routers. Tasks themselves may suffer excess jitter if too many tasks are allocated on a single operating system. Ensuring end-to-end time constraints in a deterministic way may thus make it necessary to:
- extend a routing protocol such as Babel with metrics that take into account time constraints handled by AccessTSN;
- extend the scheduler of an orchestrator such as SlapOS to define how many tasks can be allocated per computing node while respecting given latency and jitter constraints.
Once this is achieved, end-to-end determinism with microsecond jitter may become possible with IEaaS.
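The two extensions listed above can be sketched together. The first function selects the path with the lowest worst-case latency, which is what a delay-aware routing metric would provide. The second decides whether a node can accept one more task. Everything here is hypothetical: the function names, the topology format, and above all the assumption that jitter grows linearly with the number of tasks on a node, which is a deliberate simplification and not a measured model.

```python
import heapq


def lowest_latency(links: dict, source: str, target: str) -> float:
    """Dijkstra over per-link worst-case latencies (microseconds)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, latency in links.get(node, {}).items():
            new_cost = cost + latency
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return float("inf")  # target unreachable


def can_place(task_count: int, jitter_per_task_us: float,
              jitter_budget_us: float) -> bool:
    """Admit one more task only if the assumed linear jitter fits the budget."""
    return (task_count + 1) * jitter_per_task_us <= jitter_budget_us
```

An orchestrator would accept a placement only when both checks pass: the latency of the selected path stays below the end-to-end bound and `can_place` returns true on each node involved.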
Two technologies should be banned from any industrial automation project:
- Linux containers (including Docker);
- OpenStack.
Docker is not a bad technology. However, most users tend to believe that it provides portability from one Linux distribution to another. This is not the case due to the kernel ABI mismatch problem, a problem that is not specific to Docker itself but to Linux binary portability in general. As a result, running Docker binary images at the Edge provides no guarantee of stability, unless both the Docker image and the Edge server are based on the same Linux kernel with the same compilation options. Moreover, containers share the host kernel and thus provide weaker isolation than virtual machines, which makes them easier to exploit.
Other (fixable) issues with Docker - and LXC containers - include lack of support of some system calls, increased difficulty to debug kernel related issues (ex. network corruption) or lack of repeatable builds in China due to network restrictions. Another (non fixable) issue with Docker is that it is based on Linux, not on POSIX. It is thus not portable to other POSIX operating systems (ex. OpenBSD, μClinux).
All current Docker limitations were solved in SlapOS 10 years ago.
The case of OpenStack is different. Any project using OpenStack has a very high probability to explicitly waste taxpayers' money and implicitly promote proprietary solutions (Huawei, Amazon, Google, Microsoft, etc.).
OpenStack is a bloated project run by a bloated community that has produced unstable software and wasted huge amounts of taxpayers' money. Its design does not follow the basic principles of self-converging systems defined by Mark Burgess, without which it is impossible to reliably operate a large, complex system. As a consequence, OpenStack systems need to be entirely rebooted from time to time. The average number of unexpected reboots of an OpenStack VM operated by OVH or Rackspace is 1 to 5 times per year. This compares with 0.11 reboots per year for an average bare metal server operated by OVH.
One of the most famous OpenStack projects is the French government sovereign cloud. Rather than using reliable European technologies (GANDI, NiftyName, Proxmox, SlapOS, etc.), highly subsidised companies such as Orange, Thales, Bull (now Atos) and SFR decided to support OpenStack. Ten years later, Orange operates an OpenStack cloud... provided by Huawei and based on a heavily modified version of OpenStack. French taxpayers' money has thus sponsored Chinese industry and proprietary software rather than French or European pioneering SMEs and Free Software.
Nearly all European research projects based on OpenStack have produced very few results that are in use today: Reservoir, CompatibleOne, EASI Clouds, Nuage, Andromede, etc. SlapOS is one of the few stealth results produced by two of these projects.
Many large companies which tried to operate their own OpenStack cloud also failed and now rely on Amazon AWS, Microsoft Azure (ex. Walmart) or Google Cloud. The list of failures cannot be published here because few CIOs are ready to admit it in public. However, anyone can find online examples of failures such as "British Telecom threatens to abandon OpenStack in its current form". Failure is so frequent with OpenStack that it is now part of its own marketing with all kinds of suspicious arguments, such as the size of the team (reminder: it takes 2 days for a single engineer to deploy SlapOS entirely).
Scalable Business Models
An Open Hardware / Free Software solution for industrial automation could be widely adopted if it is:
- available worldwide;
- supported worldwide.
Reliability is easy to achieve as long as simple technologies are adopted, rather than fashionable technologies which eventually become a dead-end.
Five business models can support this level of scalability:
- luxury service (ex. McKinsey);
- branded support (ex. RHEL);
- branded hardware (ex. Olimex);
- online services (ex. ViFiB);
- copyright licensing (ex. MySQL, LASO).
The luxury service business model consists of selling highly qualified, personalised service at a high price. What the customer gets is the certainty of receiving consistent service from bright brains. It requires setting up a specific education and knowledge sharing process across the organisation. This model was adopted in Nexedi to provide ERP5 professional services.
Branded support consists of distributing a package of Free Software under a brand which is proprietary and attaching high quality services to this brand. Red Hat's Linux distribution is based on this idea. Red Hat provides at the same time a system with branded support (RHEL) and a system with no support (CentOS) which share the same code.
Branded hardware consists of distributing open hardware devices with a brand. What the customer gets is the certainty of getting a working device. It requires setting up a global logistics network. Olimex is based on this model and seems to be a very profitable company.
Online services consist of providing services online that support the implementation of Free Software. What the customer gets is effortless deployment and maintenance. It requires automating the maintenance service, based on data protected by trade secret. This model is possibly one of the best candidates for IEaaS since it protects the freedom of users (everything is open source) while at the same time providing added value (based on data which cannot be shared).
Copyright licensing consists of providing the same code as the Free Software version but under a different license. MySQL is for example installed in many Cisco routers. Cisco has been licensing MySQL code under a proprietary license.
The most suitable business models for IEaaS are: luxury for integration services, branded hardware for micro-servers and I/O and online services for remote lifecycle management.