Within three months, Institut Mines-Télécom and Nexedi used SlapOS to automate the provisioning and operation of the clusters of virtual machines that power the TeraLab Big Data platform. This included adding support for Ansible and Packer to SlapOS virtual machine provisioning as well as extending re6stnet to support IPv4-only technologies such as Hadoop. Tight security restrictions on data utilization required creative ways of handling interconnectivity, while our fully automated system ensured that Institut Mines-Télécom would receive a scalable and future-proof solution to meet the growing demand for Big Data analysis.
In 2010 the French government set up a fund called PIA as part of the FSN (National Fund for a Digital Society) to support projects that help large corporations and academia conduct research in the field of Big Data and foster new ways of thinking about data utilization. The rationale behind it was that strict IT security requirements and constraints on data usage effectively prevented Big Data analysis in both state and public organizations (army, tax, insurance). The resulting reluctance to experiment created an innovation-averse environment in a quickly evolving market, putting French organizations at risk of serious competitive disadvantage.
One of the solutions receiving initial funding was TeraLab, set up by Institut Mines-Télécom. Marketing the platform around the purchase and use of Bullion servers (also used for the simulation of nuclear detonations) provided enough incentive, trust and capacity (8 terabytes of memory, 240 cores) for the project to start successfully. Virtual machines (VMs) were set up manually, allowing users on the cluster to conduct initial research. However, with over 100 VMs now deployed for more than 20 customers, the infrastructure quickly became unmanageable, as every VM had to be maintained through manual command-line operations.
There was a need for a scalable, industrial-grade solution with a strong security focus that could run a cluster of VMs. While SlapOS did not boast the high-end profile of other providers, our solution was proven to work, as we use SlapOS ourselves as the backbone for our own systems and all Nexedi implementations. With Cloudwatt seemingly being abandoned through a merger, VMware not providing a sovereign solution and OpenStack seemingly requiring too large an investment to get running reliably, SlapOS won the bid with an offer that was several times cheaper and also included billing and ticketing on top of cloud orchestration.
The key difference from previous implementations of SlapOS was the need to automatically set up a full cluster of VMs rather than a single machine. In addition, VMs had to be interconnected and at the same time isolated from one another.
The Big Data focus requires every VM to have very high input/output performance for networking and storage. The sensitive nature of the data being processed requires tight security and access rules to be enforced. The system had to run Hadoop and provide a fully automated management console for handling security access as well as VM setup, ordering and user management.
An initial challenge we had to solve was that Hadoop runs on IPv4, with no easy way to patch or port it to support IPv6. As we planned to use our own re6stnet for IPv6 networking within the implementation, we eventually decided to extend re6stnet to also support IPv4 so that Hadoop could be integrated. This took about one man-month to develop and can likely be reused in other scenarios or with other legacy software. A useful feature was thus added to re6stnet as part of this project.
After that, the main task was to extend the SlapOS KVM virtual machine profile to automatically provision a cluster of VMs instead of a single machine. To achieve this, we added support for Ansible and Packer to the SlapOS KVM recipes, giving us both "micro-level" server provisioning and "macro-level" cluster orchestration, both of which will also be useful for future implementations.
While working on this we also had to find a way to enable communication between VMs without relying on switching and root access, because this posed too great an intrusion risk. As we already operate our own software-defined network (SDN), re6stnet, which uses routing based on the Babel protocol (RFC 6126), we eventually decided to also use routing to set up communication between VMs. This trick essentially gave us the equivalent of a latency-optimized VLAN (virtual local area network) over a WAN (wide area network): routes define the isolation policy of each individual VM, and access is regulated through the respective firewall settings.
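The idea of isolating VMs by routing rather than by switching can be illustrated with a small sketch. This is not how re6stnet is implemented; the VM names, the addresses (taken from the 2001:db8::/32 documentation range) and the `can_reach` helper are all hypothetical, and the sketch only models the rule "a VM can send traffic to a destination only if it holds a route covering that destination".

```python
import ipaddress

# Hypothetical addresses, one per VM (documentation prefix 2001:db8::/32).
ADDRESSES = {
    "tenant1-vm1": "2001:db8:1:1::10",
    "tenant1-vm2": "2001:db8:1:2::10",
    "tenant2-vm1": "2001:db8:2:1::10",
}

# Hypothetical per-VM routing tables: each VM only receives routes to the
# prefixes it is allowed to talk to. No route means no connectivity.
ROUTES = {
    "tenant1-vm1": ["2001:db8:1:2::/64"],  # may reach tenant1-vm2
    "tenant1-vm2": ["2001:db8:1:1::/64"],  # may reach tenant1-vm1
    "tenant2-vm1": [],                     # isolated from tenant 1
}


def can_reach(src: str, dst: str) -> bool:
    """Return True if `src` holds a route whose prefix covers `dst`'s address."""
    dst_addr = ipaddress.ip_address(ADDRESSES[dst])
    return any(
        dst_addr in ipaddress.ip_network(prefix) for prefix in ROUTES[src]
    )
```

In this model, VMs of the same customer see each other because they were handed routes to each other's prefixes, while a VM of another customer simply has no path. In a real deployment a firewall would add a second layer of access control on top of the routes.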
After three months of development our solution was put into production, providing a network of isolated yet interconnected VMs deployed across multiple datacenters. TeraLab manages its cluster through an access server with a single point of entry (X.509 certificates, OpenVPN, HTTPS). Both this server and its backup were also set up using Ansible playbooks.
One of the main reasons we were able to set up the new implementation so quickly was the simplicity of the SlapOS design. With just over 3000 lines of "glue" between "buildout" devops and ERP5 accounting, SlapOS is quite small and very mature, having been used by Nexedi for many years. For this implementation, we started from our VM deployment base and extended it with the specific requirements of TeraLab, which let us work in all of the security and isolation features step by step while still running on heterogeneous hardware and GNU/Linux distributions. This "growing of a system through extensions" has once again proved faster and easier to implement than trying to "replicate the complete feature set" from the beginning.
Finally, as SlapOS is a fully declarative, stateless and autonomous system that uses promises throughout, exceptions thrown during VM deployment are handled by slapgrid, which runs the deployment script again as many times as needed until the VMs reach the desired state defined by the end user. SlapOS is thus able to recover from unexpected situations, from network failure to hardware failure, that in almost any other system would result in a failed deployment and an undefined system state. Its simpler architecture, based on only two software components (ERP5 and buildout), also means that SlapOS is cheaper and more reliable than competing solutions: a SlapOS master with configurator can be installed and ready to work in a couple of hours. We haven't seen this with any other system so far. Going a little further, SlapOS provides high-quality, open-source, reliable and proven cloud infrastructure services while, judging by available press coverage, other high-end solutions seem to fail.
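The "run again until the desired state is reached" behaviour described above can be sketched in a few lines. This is a minimal illustration of the general convergence pattern, not slapgrid's actual code: the `converge` function, the `FlakyCluster` class and the `max_runs` limit are assumptions made for the example.

```python
from typing import Callable, Dict


def converge(
    desired: Dict[str, str],
    read_state: Callable[[], Dict[str, str]],
    deploy: Callable[[], None],
    max_runs: int = 10,
) -> int:
    """Re-run `deploy` until the observed state equals `desired`.

    Failures raised by `deploy` are logged and retried instead of aborting.
    Returns the number of deployment runs that were needed.
    """
    runs = 0
    while read_state() != desired:
        if runs >= max_runs:
            raise RuntimeError("state did not converge")
        runs += 1
        try:
            deploy()
        except Exception as exc:  # any failure just triggers another run
            print(f"run {runs} failed: {exc}; retrying")
    return runs


class FlakyCluster:
    """Hypothetical cluster whose deployment fails a given number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.state = {"vm1": "stopped", "vm2": "stopped"}

    def read_state(self) -> Dict[str, str]:
        return dict(self.state)

    def deploy(self) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("simulated network failure")
        self.state = {"vm1": "running", "vm2": "running"}
```

For example, with `cluster = FlakyCluster(failures=2)`, calling `converge({"vm1": "running", "vm2": "running"}, cluster.read_state, cluster.deploy)` survives the two simulated failures and succeeds on the third run. Because the loop compares observed state against a declared target rather than tracking which steps already ran, it needs no memory of earlier attempts.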
The project also proved that it is possible to develop and set up an automated, customized private infrastructure on existing hardware at an operating cost of 1k-2k € per month while fulfilling all security and data access requirements. Moreover, with SlapOS, a single person was able to spend half of his time maintaining a large number of computers (120 in this case) while doing development the rest of the time, proving along the way that small local French companies can more than compete with large international corporations.
For the future there are plans to extend the implementation to support Big Data software beyond Hadoop (Jupyter, Wendelin and others) to further strengthen its technical analysis capabilities.
SlapOS automates the application operation lifecycle; Ansible automates the base operating system lifecycle
re6st and Babel completely automate the network management of a datacenter with low-latency routing
SlapOS bare metal nanocontainers are useful for services that require a low footprint (memory, disk) or no overhead at all and have fewer isolation constraints