Multi-core Python HTTP faster than Go

A multi-core Python HTTP server (much) faster than Go (spoiler: Cython)

EuroPython logo

Jean-Paul Smets

jp (at) nexedi (dot) com

Bryton Lacquement

bryton (dot) lacquement (at) nexedi (dot) com

 

This presentation shows the progress Nexedi has made in bringing the Python language closer to its corporate goals in terms of performance and concurrency.

QR Code

You may find this presentation online by scanning the QR Code above.

Agenda

  • Nexedi
  • Conclusion
  • Rationale
  • Code

The presentation has four parts.

First we provide some information about Nexedi.

Then we jump straight to the conclusions, so that those who are in a hurry can leave.

Then we take some time to explain the rationale behind our efforts.

And last we show some code.

Nexedi

Nexedi - Profile

Nexedi World Map
  • Largest Free Software Publisher in Europe
  • Founded in 2001 in Lille (France) - 35 engineers worldwide
  • Enterprise Software for mission critical applications
  • Build, deploy, train and run services
  • Profitable since day 1, long term organic growth
Nexedi is probably the largest Free Software publisher in Europe, with more than 10 products and 15 million lines of code. Nexedi does not depend on any investor and has been profitable since day 1.

Nexedi - Clients

Nexedi References and Map

nexedi.com/success

Nexedi clients are mainly large companies and governments looking for scalable enterprise solutions such as ERP, CRM, DMS, data lake, big data cloud, etc.  

Nexedi - a Python Shop

Nexedi Software Stack

stack.nexedi.com

Nexedi software is primarily developed in Python, with some parts in Javascript.

Rapid.Space: ERP5, buildout, re6st, etc.

https://rapid.space

One of the recent services launched by Nexedi is a Big Data cloud entirely based on Open Hardware and entirely running on Free Software that anyone can contribute to. The target cost is 30% to 60% lower than low cost European clouds, and 10 times lower than US based public clouds. Target use cases include performance testing, disaster recovery and big data batch processing.

Conclusions

 

2016: uvloop Blazing fast Python networking...

https://magic.io/blog/uvloop-blazing-fast-python-networking/

https://pyvideo.org/europython-2016/fast-async-code-with-cython-and-asyncio.html

This presentation was inspired by a presentation at EuroPython in 2016 which showed that uvloop could lead to very fast HTTP servers, even faster than in golang.

...as long as python does nothing

https://www.nexedi.com/NXD-Document.Blog.UVLoop.Python.Benchmark

But we later found that as soon as some Python code was added, performance dropped by at least an order of magnitude. Also, all tests were run on a single core, whereas the golang HTTP server runs on multiple cores if needed. We therefore thought the results were quite biased.

We thus tried to find a better solution.

2018: HTTP Hello World

 https://www.nexedi.com/NXD-Blog.Multicore.Python.HTTP.Server

By using LWAN and Cython, we managed to create a small HTTP server in Python that runs faster than the fastest golang HTTP server and that also scales linearly on multiple cores.

2018: Multi-core HTTP Fibonacci

 https://www.nexedi.com/NXD-Blog.Multicore.Python.HTTP.Server

And if code is added to this server using Cython cdef functions to create some dynamic pages (e.g. a page with a Fibonacci result), the server remains as scalable as golang and is still faster.

We have thus been able to demonstrate how to equal Go in concurrency and beat it in performance: use Cython with the nogil option.

2018: Coroutines

500 000 empty coroutines in several technologies with a concurrency model

Name                               | Language | Spawn time (sec) | Run time (sec) | Total time (sec)
asyncio                            | Python   | 1.11             | 9.70           | 10.82
asyncio (with uvloop)              | Python   | 1.11             | 6.83           | 7.91
gevent                             | Python   | N/A              | N/A            | 8.30
goroutines                         | Go       | N/A              | N/A            | 0.39
Lwan coroutines with work-stealing | Cython   | 1.15             | 0.27           | 1.49

https://www.nexedi.com/NXD-Blog.Cython.Multithreaded.Coroutines

We studied the underlying LWAN library further and found that it includes a coroutine library. Based on the ideas of Juliusz Chroboczek's system programming project at Paris 7 University, we created a small coroutine library for Cython and compared it with asyncio, gevent and goroutines on an "empty coroutine" benchmark.

Surprisingly, it is an order of magnitude faster than existing Python libraries. The run-time part is probably as good as in golang. The spawn time is still high, most likely because we rely on malloc, but it could be improved.
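To make the "empty coroutine" benchmark concrete, here is a minimal sketch of how the asyncio row could be measured with the standard library only. The split between a spawn phase (creating the tasks) and a run phase (driving them to completion) is our assumption about what the two columns mean; the function names are hypothetical and this is not the benchmark code used for the table.

```python
import asyncio
import time


async def empty():
    """A coroutine that does nothing at all."""
    return None


async def bench(n):
    # Spawn phase: create one Task per empty coroutine on the running loop.
    t0 = time.perf_counter()
    tasks = [asyncio.create_task(empty()) for _ in range(n)]
    t1 = time.perf_counter()
    # Run phase: let the event loop drive every task to completion.
    await asyncio.gather(*tasks)
    t2 = time.perf_counter()
    return {"count": n, "spawn": t1 - t0, "run": t2 - t1, "total": t2 - t0}


def run_benchmark(n=500_000):
    return asyncio.run(bench(n))


if __name__ == "__main__":
    r = run_benchmark()
    print("spawn %.2fs, run %.2fs, total %.2fs" % (r["spawn"], r["run"], r["total"]))
```

Timings depend heavily on the machine and the Python version, so the numbers printed by this sketch are not expected to match the table.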

Future: cdef class nogil

    
    cdef class Foo nogil:
        cdef double a
        cdef double b

        cdef int bar(self):
            self.b = 1.0
            self.a = self.b

Then...

    cdef int baz() nogil:
        o = Foo()
        o.bar()

    baz()
          
        

Based on those results, we expect to extend the Cython language to turn Extension Types, also known as cdef classes, into a full object system independent of CPython. This way, one can use Cython to develop high performance algorithms that never hit the GIL of CPython, while keeping all the benefits of easy integration with CPython.

References

All the details that led to the conclusions above are available in the references linked above.

Rationale

Let us now dig into the rationale behind our motivation to improve the concurrency and performance of Python. In this section, we express Nexedi's views in an unambiguous way. Our views might differ from what is usually expressed in the Python community. Yet, they are based on the actual experience of bringing large Python-based systems with millions of users to market and maintaining them for more than 10 years. They are also based on the observation that some early Python adopters are leaving Python due to difficulties in adopting Python 3 or in pushing the evolution of the Python 3 language in the direction they actually care about most.

State of Python: Nexedi Perspective

  • as good as it was in 2000 when we selected it
  • favourite language of developers in 2018
  • still (very) slow
  • still (very) poor concurrency (models)
  • still unusable inside a Web browser
  • competitors: JS, Go

In 2018, the Python language is in our eyes as good as it was in 2000 when Nexedi selected it as its primary language (No 2 was... Ruby), due to its reflexive object system and to the existence of an object database (ZODB) based on pickles. The existence of ZODB was a key argument for us because we had known since 1993 that object-relational mapping cannot work by design (despite how much people still try) and that an object database was essential for clustering and passing objects from one process to another.

In 2018, Python is the favourite language of developers, which is great news.

But in 2018, it is still very slow. Compared to other dynamic languages, and to Javascript in particular, it is unacceptably slow. What was acceptable for a scripting language in the 90s, at a time when other scripting languages (Tcl, AppleScript, etc.) or dynamic languages (CLOS, Smalltalk, Visual Basic, etc.) were equally slow, is no longer acceptable considering the progress made by runtime implementations.

The concurrency model of Python is still very poor. Asyncio, by adding keywords to the language specification, actually imposes a specific and questionable concurrency model, one that was fashionable in the early 2010s but that, like any fashion, is not future-proof. It does not provide much to handle multi-core. And it is not reflexive. In a sense, it does by changing the language something that a library could have done. In doing so, it ignores the existence of solutions such as Actalk that provide a generic, universal approach to object oriented concurrent programming, able to cover all forms of concurrency: agent programming, map reduce, implicit concurrency, etc.

In 2018, Python is still unusable in a web browser, either because CPython compiled to WebAssembly is too slow or because other implementations of Python are too restrictive.

The weaknesses of Python in 2018 are becoming too visible. Many projects that would have started 10 years ago in Python now start with go or javascript.

Python 3 Benefits: Nexedi Perspective

  • ...
  • ...
  • ...
  • ...
  • long term support by a dozen of individuals -> > 200 K€ / year

The state of Python in 2018 includes the state of Python 3. Nexedi has just started to migrate all its code to Python 3. Yet, we see close to zero benefit in doing this, besides better long term maintenance (Python 2.7 maintenance will stop in a few months or years) and its side effects: better system libraries, new versions of libraries not released for Python 2.7, etc.

However hard we search, we cannot say what Python 3 brings to our products and to our customers.

Python 3 Losses: Nexedi Perspective

  • Incompatible strings -> 200 K€ + 30 K€/year patch maintenance
  • Slower -> 20 K€/year
  • Requires more memory -> 20 K€/year
  • Needlessly increasingly complex -> 5 K€/staff
  • Perverted by fashion (ex. asyncio, unicode) -> 10+ K€/year

Porting Nexedi's code to Python 3 has a high cost. The incompatibility that was introduced with pickles (in contradiction with the format stability that had been specified and promised previously), the introduction of unicode strings and the use of two types (string, byte array) instead of one in Python 2 will generate a porting cost of 200 K€. We will also have to create a patch and maintain it so that we can cast byte arrays transparently into strings. Without that, we would not be able to keep our promise to customers: "no data migration". Unlike other vendors (especially those that use ORM), Nexedi does not charge customers for migrating data from one version to another, thanks to ZODB and pickles.

Python 3 is slower on all our tests, no matter the version. This will lead to optimisation and hardware costs.
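The string incompatibility can be illustrated with the standard pickle module alone. The sketch below feeds Python 3 two byte sequences of the form Python 2 writes for an 8-bit str at protocol 2 (these literals are our own illustration, not Nexedi data) and shows that the result depends on the `encoding` argument of `pickle.loads`: the default fails on non-ASCII content, while `encoding="bytes"` yields a different type than the application originally stored.

```python
import pickle

# Protocol 2 pickles of the Python 2 strings 'abc' and '\xc3\xa9'
# (the latter is an 8-bit str holding UTF-8 encoded "é").
PY2_ASCII = b"\x80\x02U\x03abcq\x00."
PY2_UTF8 = b"\x80\x02U\x02\xc3\xa9q\x00."


def load_variants(data):
    """Load the same Python 2 pickle under several `encoding` settings."""
    results = {}
    for enc in ("ASCII", "latin1", "bytes"):
        try:
            results[enc] = pickle.loads(data, encoding=enc)
        except UnicodeDecodeError:
            # The default ("ASCII") cannot decode non-ASCII 8-bit strings.
            results[enc] = None
    return results


if __name__ == "__main__":
    print(load_variants(PY2_ASCII))
    print(load_variants(PY2_UTF8))
```

None of the three settings gives back "a string that behaves as before" for every stored value, which is why a transparent cast has to be patched in and maintained if existing data is to be read without migration.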

Python 3 consumes more memory. This will lead to optimisation and hardware costs.

And Python 3 language PEPs are slowly turning the language, which used to be simple and minimal, into a complex and inconsistent one, especially in terms of reflexivity of the object model. This will lead to training costs, without much benefit (except maybe for optional type annotations).

And lately, the Python 3 language has been subject to changes that were influenced more by fashion than by rationality and the scientific state of the art. This is the case with unicode binary strings, an approach that was fashionable at the end of the 90s but that other modern languages did not adopt (they use UTF-8 strings instead). It is also the case with asyncio, which introduces a low level form of concurrent object programming that is questionable, very specific (ES6 did the same) and unnecessary (see how Actalk does it). For a programming language that has to be maintained for at least 50 years, fashion is actually a plague.

Python 3 Industry Adoption: Instagram

  • Dropbox dropped it (and pyston or pypy) for go
  • Google prohibits its use in production
  • Youtube wrote a python to go transpiler
  • Chinese clouds run on Java or Go

Instagram has adopted Python 3.

But Python 3 has experienced little industry adoption, which is quite normal considering the incompatibilities it introduced. All projects (e.g. GRUB 2, Zope 3) that introduced big changes from one version to the next also introduced a 10-year delay in their industry adoption, or died. Breaking upward compatibility is always a huge, and possibly deadly, mistake for a software project.

Not all Free Software projects are careless about upward compatibility. The Linux kernel and glibc projects have policies to protect upward compatibility. Thanks to these policies, releasing a new version of a Linux/glibc based software can be achieved in two steps: first quickly, by recompiling as is (upward compatibility), then after some time, by leveraging new features (feature evolution).

This is what we could do until Python 2.7, with of course some unnecessary glitches (ex. introduction of changes in the rounding algorithm of python in contradiction with its mathematical specification). Yet, our code base could evolve quite smoothly for 15 years.

Certain changes in the Python 3 language prevent such a smooth evolution. Instead of spending a few weeks or months, adopting Python 3 means for us spending a couple of man-years due to changes in strings, pickles and AST.

If porting a big system to Python 3 or redeveloping its core from scratch have similar cost, it is natural to consider alternatives to Python 3.

Dropbox for example first tried to use pypy then created pyston in order to accelerate their Django based infrastructure. Pypy did not bring speed improvements. Pyston (that was also supported by Nexedi during one year) progressed too slowly and was dropped. Dropbox instead went in the direction of rewriting core parts of its infrastructure in Go language.

Google has a reputation of prohibiting the use of Python in production systems.

Youtube wrote a Python to Go transpiler.

Chinese cloud providers such as Qiniu rely heavily on the Go language where US cloud providers would have used Python 10 years ago, and on Java in other cases.

Alternative to Python 3: Go?

  • Too costly (for Nexedi) to migrate
  • Not so easy to interface with C libraries
  • Nothing as good as PyData ecosystem

The lack of adoption of Python 3 by the industry in areas where Python used to be strong is quite frightening for a small company like Nexedi that relies on Python.

We thus conducted our own analysis. Could we switch to Go language?

The answer is no because it is too costly (even more than migrating to Python 3). Also it is not that easy to interface with C libraries (not as easy as using Cython). And there is nothing similar to the PyData ecosystem.

PyData ecosystem is currently the fastest growing social group of the Python community. Python libraries such as NumPy, Pandas, Scipy, Scikit-learn, Keras, PyTorch, etc. offer to developers a range of data science algorithms which no other language can offer. 

Alternative to Python 3: Javascript?

  • Too costly (for Nexedi) to migrate
  • Not multicore
  • Poor concurrency model (ES6)

Could we migrate to Javascript then? Some python to Javascript transpilers even exist.

The answer is also no. For the same cost reasons. 

But also because Javascript provides no improvement in terms of concurrency: same problem of lack of multi-core support as Python, same poor concurrency in ES6 model as asyncio in Python 3.

Browser Python: Iodide

https://www.nexedi.com/erp5-Iodide.OfficeJS.Overview

Nexedi supports Pyodide - we need more hands

We of course studied all alternatives to the CPython runtime, searching for a faster one with JIT or multi-core support: pypy, micropython, pyjion, pyparallel, etc. (see High Performance Multi-core Python at Nexedi).

We also studied all implementations of Python for web browsers.

None of the alternatives brought good results, except Pypy for performance in a few cases and for impressive performance in a web browser. But pypy has other issues: no improvements over CPython for multi-core, still harder to interface with C/C++ (but this is improving).

After some time, we finally found a way that we believe has some future: use Python for high-level scripting or reflexive programming (as it was intended for), use something else for high performance.

This something else already exists: it is called Cython. It has a syntax similar to Python. It is already widely adopted. It is fully compatible with Python 3 (Cython is a superset of Python language).

This something else could also be C, C++, FORTRAN, etc. which interface with CPython very easily thanks to Cython.

The power of using Python as a driver and something else such as Cython for performance is showcased in a project called Pyodide. 

Pyodide is a project that originates at Mozilla  and that brings a complete Python based data science notebook into a Web browser.

Pyodide is now also sponsored by Nexedi (see OfficeJS Iodide: Ubiquitous Data Science Notebooks for Business and Education) through the work of Roman Yurchak, who added dynamic loading of Python modules, accelerated the build process, fixed stability issues and is on the way to releasing scipy and scikit-learn.

Pyodide is really interesting for us because, despite the extreme slowness of CPython runtime compiled to Web Assembly running in a Web browser, Pyodide is actually fast and usable.

Running a Pyodide notebook generally involves executing CPython (10 times slower than native) and NumPy (nearly the same as native). On average, Pyodide is only 50% slower than native.
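These figures fit together under a simple two-component model, which is our assumption and not something measured in the talk: if a fraction f of the native run time is spent in interpreter code (10 times slower in the browser) and the rest in NumPy-like compiled code (same speed), the overall slowdown is 10·f + (1 − f). The sketch below solves this for the quoted "50% slower", read as a slowdown factor of 1.5.

```python
def overall_slowdown(interp_fraction, interp_slowdown=10.0, native_slowdown=1.0):
    """Slowdown factor when `interp_fraction` of native time is interpreter code."""
    return interp_fraction * interp_slowdown + (1.0 - interp_fraction) * native_slowdown


def interp_fraction_for(target, interp_slowdown=10.0, native_slowdown=1.0):
    """Interpreter share of native time that yields the `target` slowdown."""
    return (target - native_slowdown) / (interp_slowdown - native_slowdown)


if __name__ == "__main__":
    f = interp_fraction_for(1.5)
    print("interpreter share of native time: %.1f%%" % (100 * f))
    print("check: slowdown = %.2f" % overall_slowdown(f))
```

Under this model, a 1.5 slowdown corresponds to roughly 1/18 of the native run time (about 5.6%) being spent in the interpreter, which is consistent with Python acting as a driver for compiled code.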

High Performance multi-core Python Cheat Sheet

Use python whenever you can rely on any of...
  • multiprocessing
  • cython's nogil option
  • multi-core execution within wendelin.core distributed shared memory
  • cython's parallel module

Use golang whenever you need at the same time...
  • low latency
  • high concurrency
  • multi-core execution within single userspace shared memory
  • something else than cython's parallel module or nogil option

Cython also provides a language feature called nogil which releases CPython's GIL. Releasing the GIL is something that NumPy does a lot (without Cython). If you use NumPy on a multi-core system, you will find that all cores are being used efficiently. You will also find a lot of nogil instructions in the LXML or Scikit-Learn libraries, which are based on Cython.

This means that high performance programming in Python is possible: through multi-processing (as ERP5 does with wendelin.core to share variables across processes) or by releasing the GIL in Cython (as Scikit-Learn does).
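The effect of releasing the GIL can be observed from plain Python threads whenever the heavy work happens in C code that drops the GIL. As a stand-in for NumPy or a Cython nogil section, the sketch below uses the standard library's hashlib, which is documented to release the GIL while hashing buffers larger than 2047 bytes. The function names are hypothetical and the example is only an illustration of the principle.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(block):
    # The hashing runs in C with the GIL released for large buffers,
    # so several calls can make progress on different cores at once.
    return hashlib.sha256(block).hexdigest()


def parallel_digests(blocks, workers=4):
    """Hash every block in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blocks))


if __name__ == "__main__":
    data = [bytes([i]) * 50_000_000 for i in range(8)]
    print(parallel_digests(data)[:2])
```

The same thread pool running a pure-Python function would not scale across cores, since each thread would need the GIL for every bytecode it executes.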

Yet, some kinds of software do not fit into this: those that need fine grained concurrency (like coroutines) and low latency within the same execution space. This is where go shines, until now.

Python 3 Performance Improvement Goals

  • Compatible
  • As fast as C
  • As multicore as Go
  • As reflexive as Actalk concurrency

Based on the former cheat sheet, we have expressed some goals of improvement of Python 3 (through Cython, which is a superset of Python 3).

We want compatibility, obviously, to ease adoption. Cython does that very well.

We want to be as fast as C. Cython does that very well too with cdef functions and classes.

We want concurrency to be as multicore as Go and as reflexive as Actalk. This is not yet the case.

Filling the missing parts

  • fast, multi-core dynamic HTTP server [DONE]
  • fast co-routines based on work stealing scheduler [DONE]
  • multi-core cdef class instantiation
  • multi-core memory manager
  • state of the art OOCP

We have thus defined a work plan. 

Step 1: prove that we can create an HTTP server in Python that runs faster than the fastest in Go, over multiple cores. This is done.

Step 2: prove that we can run co-routines as efficient as those of Go, with work stealing algorithms. This is mostly done.

Step 3: implement instantiation in Cython cdef class without depending on CPython's GIL. This is ongoing.

Step 4: implement a memory manager for Cython cdef class without depending on CPython's GIL. This is ongoing.

Step 5: implement a concurrency model that is state of the art. This has not started yet, but we did some experiments already through PyGolang and CMFActivity.

Code

We will now have a closer look at the code we wrote as part of Step 1 and Step 2.

Feature comparison of C coroutine libraries

 

Feature                                    | cpc | libtask | lthread | libdill | libmill | libco
non-blocking IO and network interface      | ?   | Y       | Y       | Y       | Y       | Y
lightweight and fast context switch        | Y   | Y?      | ?       | ?       | ?       | ?
contained memory impact                    | Y   | Y?      | ?       | ?       | ?       | ?
efficient scheduler                        | N?  | N       | ?       | ?       | ?       | ?
thread-safe                                | ?   | N       | Y       | Y       | Y       | Y
multi-threads                              | Y   | N       | ~N      | N       | N       | N
system to guarantee atomicity (mutex...)   | ?   | N       | Y?      | pthread | pthread | ?
communications between tasks (channels...) | ?   | Y       | ?       | Y       | Y       | ?

https://www.nexedi.com/NXD-Blog.Cython.Multithreaded.Coroutines

High performance HTTP servers are usually based on a good coroutine library. We researched all the C libraries we knew of, patched them to get multi-core support where needed, and compared their features.

The result is shocking: there is no coroutine library in C that covers the kind of features that Go provides natively.

Performance comparison of C coroutine libraries

              | 1 thread | 2 threads | Comment
Python+uvloop | 40k      | X         | Not multi-core, performance drastically decreases if server has to process a request
Go            | ~33k     | ~39k      | High performance and multi-core
libtask       | 6k - 7k  | 7k - 14k  | Slower than Go
lthread       | ~4k      | ?         | Poor performance, not multi-core
libdill       | ~60      | ?         | Poor performance, not multi-core
libco         | 300 - 4k | ?         | Poor performance, not multi-core

https://www.nexedi.com/NXD-Blog.Cython.Multithreaded.Coroutines

What is even more shocking is that most coroutine libraries in C are... slower than Python with uvloop. All tests were run on a mid-range laptop with a quad-core i7 CPU.

Then came LWAN: high speed and low memory

500 000 empty coroutines in several technologies with a concurrency model

Name                               | Language | Spawn time (sec) | Run time (sec) | Total time (sec)
asyncio                            | Python   | 1.11             | 9.70           | 10.82
asyncio (with uvloop)              | Python   | 1.11             | 6.83           | 7.91
gevent                             | Python   | N/A              | N/A            | 8.30
goroutines                         | Go       | N/A              | N/A            | 0.39
Lwan coroutines with work-stealing | Cython   | 1.15             | 0.27           | 1.49

https://github.com/lpereira/lwan

Bryton Lacquement then discovered LWAN and tried it. He wrote a Cython wrapper around it. He implemented a simple work stealing algorithm. And he achieved impressive performance with coroutines.

It is likely that, had he used a dedicated memory manager instead of malloc, it would have surpassed the performance of Go's goroutines.
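For readers unfamiliar with work stealing, here is a toy model of the idea in pure Python. It is not the Cython/LWAN implementation: coroutines are modelled as generators, workers are simulated round-robin in a single thread instead of running as OS threads, and all names (`Worker`, `task`, `demo`) are hypothetical. Each worker takes work from the tail of its own queue and, when idle, steals from the head of the busiest other queue.

```python
from collections import deque


def task(name, steps, log):
    """A coroutine-like generator: yields `steps` times, then records its name."""
    for _ in range(steps):
        yield
    log.append(name)


class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()  # this worker's own run queue
        self.executed = 0     # number of resume steps performed
        self.stolen = 0       # number of tasks taken from other workers

    def next_task(self, workers):
        if self.queue:
            return self.queue.pop()  # own work: take from the tail
        # Idle: pick the busiest other worker and steal from the head.
        victim = max((w for w in workers if w is not self),
                     key=lambda w: len(w.queue), default=None)
        if victim is not None and victim.queue:
            self.stolen += 1
            return victim.queue.popleft()
        return None

    def step(self, workers):
        t = self.next_task(workers)
        if t is None:
            return False
        self.executed += 1
        try:
            next(t)               # resume until the next yield
            self.queue.append(t)  # not finished: keep it locally
        except StopIteration:
            pass                  # finished: drop it
        return True


def run(workers):
    # One step per worker per round; the list forces every worker to step.
    while any([w.step(workers) for w in workers]):
        pass


def demo(n_tasks=8, n_workers=4):
    log = []
    workers = [Worker(i) for i in range(n_workers)]
    # All tasks start on worker 0, so the other workers must steal.
    for i in range(n_tasks):
        workers[0].queue.append(task("t%d" % i, 3, log))
    run(workers)
    return log, workers
```

Taking own work from one end and stealing from the other keeps thieves away from the task a worker is currently busy with, which is the usual motivation for this layout.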

Define a server

So here is how one uses LWAN in practice from Python / Cython to create a Web server. The code looks quite familiar.

Define Hello Handler

    
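    from libc.string cimport strlen

    # lwan_request, lwan_response, lwan_strbuf_set_static and HTTP_OK are
    # assumed to be declared elsewhere (cdef extern from the LWAN headers).
    cdef public int handle_root(lwan_request *request, lwan_response *response) nogil:
        # A C string literal, handed to the response buffer as a static string.
        cdef char *message = "Hello, World!"
        response.mime_type = "text/plain"
        lwan_strbuf_set_static(response.buffer, message, strlen(message))
        return HTTP_OK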
          
        

https://github.com/lpereira/lwan

Defining a handler as a Cython cdef function requires some extra care with C types. Hopefully, by wrapping a standard string library (e.g. the C++ STL) into Cython, the code could look better in the future.

Run

    
    import lwan
    lwan.run()
          
        

Running LWAN is just two lines of Python.

It is really fast

The result is impressive. The LWAN based Python / Cython server spans multiple cores and surpasses the fastest golang HTTP server.

Define Fibonacci Handler

It is possible to execute some dynamic code inside a handler and release the GIL. This is a standard Cython feature. Here, we execute a Fibonacci calculation.
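The shape of such a dynamic handler can be sketched in pure Python. This is only a stand-in: the handler described in the talk is a Cython cdef function that runs without the GIL inside LWAN, which this sketch does not reproduce. The request path format (`/fib?n=10`) and the function names are assumptions made for the illustration.

```python
from urllib.parse import urlparse, parse_qs


def fib(n):
    """Iterative Fibonacci using only integer arithmetic."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def handle_fib(path):
    """Return (status, mime_type, body) for a path such as '/fib?n=10'."""
    query = parse_qs(urlparse(path).query)
    try:
        n = int(query["n"][0])
        if n < 0:
            raise ValueError("n must not be negative")
    except (KeyError, ValueError):
        return 400, "text/plain", b"expected ?n=<non-negative integer>"
    return 200, "text/plain", str(fib(n)).encode("ascii")


if __name__ == "__main__":
    print(handle_fib("/fib?n=10"))
```

Because the computation touches only integers, a typed port of `fib` needs no Python objects, which is what makes it a natural candidate for a nogil function.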

It really scales

Again, the result is impressive. Python with Cython scales on multiple cores and runs faster than Go.

This last example demonstrates that it is already possible with LWAN wrapper and Cython nogil option to develop web application servers or cloud infrastructure services that scale as well and run as fast as Golang, yet integrate much more easily with existing code in C/C++/FORTRAN and leverage the vast legacy of Python libraries especially in data sciences.

Thank You

 
  • Nexedi SA
  • 147 Rue du Ballon
  • 59110 La Madeleine
  • France
 

For more information, please contact Jean-Paul, CEO of Nexedi (+33 629 02 44 25).