Arm is a relatively new player in the server market. One implementation using its IP is the Cavium ThunderX, a system designed for massively parallel applications. This Arm-based hardware is available from bare-metal cloud providers such as Packet, which is what we used to assess performance when running Nextcloud. Nextcloud, by its nature, scales extremely well across multiple cores, and our tests show up to 50% better performance than a similarly priced Intel system.
Nextcloud is the most popular self-hosted private cloud platform, delivering the productivity benefits of file sync, share and online collaboration technology without the privacy and security risks of public clouds. It enables customers like the German Federal Government to ensure data stays under control, providing unique privacy and security capabilities and easy compliance with legal requirements like the GDPR and HIPAA.
With cross-platform mobile and desktop sync clients, users connect regularly to the server for updates to files, creating a constant background load on the server. The web interface offers users rich functionality, with over 100 apps providing calendar, mail and contacts, audio/video calls and chat, bookmark syncing, password management and dozens of other features. This requires the server to be quick and responsive even under high load.
On most larger Nextcloud instances, such as the one at the Technical University in Berlin, the constant background load from clients checking in for updates consists of PROPFIND calls and accounts for 80-90% of the total load. This was a major focus of this study.
Arm, a British/Japanese company calling itself “The Architects of Tomorrow”, has some 5,000 employees, making it a tiny blip on the radar of the tech world. Yet its technology reaches 80% of the global population, and 100 billion (yes, with a b!) Arm-based chips have already shipped. This is due to its business model: unlike Intel, it only provides its core designs to third parties, which design and produce the full chips.
These range from micro-controllers for hard drives, as used by Western Digital, to custom-designed, desktop-class CPUs, as used by Apple in its iPad Pro devices.
Over the last few years, Arm has started to make inroads into the server market. Several vendors have started to produce hardware, one being the Cavium ThunderX. This SoC (System-on-Chip) offers 96 Armv8 (64-bit) cores and is available at Packet for a mere $0.50 per hour.
We compared its performance to a similarly priced Intel system, the Type 1E, which offers 8 threads on 4 Intel E3-1578L cores with 32 GB of memory at the same $0.50 per hour.
The two systems follow vastly different approaches to computing. The Intel contains 4 very powerful, dual-threaded cores, offering performance for single-threaded applications that is nearly unparalleled in the market. However, this comes at a price: the cores are big and power hungry.
A workload that scales efficiently over large numbers of cores, on the other hand, benefits greatly from the approach of the Cavium ThunderX, which offers no fewer than 96 small cores. Each core on its own might be modest, but together they can put on quite a show, as our results demonstrate.
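The trade-off between a few fast cores and many slow ones can be illustrated with a deliberately idealized model. This is only a sketch: it assumes every in-flight request occupies exactly one execution unit, treats the Intel's 8 hardware threads as 8 equal units, and uses per-core rates in made-up relative units that are not measurements from our tests. The function names are hypothetical.

```python
def throughput(parallel_requests, cores, per_core_rate):
    """Idealized throughput: busy cores times the rate of a single core.

    Requests beyond the number of cores simply wait, so throughput
    stops growing once every core is busy.
    """
    return min(parallel_requests, cores) * per_core_rate


def crossover_concurrency(few_cores, many_cores, fast_rate, slow_rate):
    """Parallel-request level at which the many-core system catches up.

    Returns None if the many slow cores can never match the few fast
    ones, i.e. their combined peak is not higher.
    """
    if many_cores * slow_rate <= few_cores * fast_rate:
        return None
    # The few-core system peaks at few_cores * fast_rate; the many-core
    # system reaches that level once this many of its cores are busy.
    return few_cores * fast_rate / slow_rate


# Core counts come from the article; the rates are ASSUMED for illustration.
INTEL_THREADS, ARM_CORES = 8, 96
FAST_RATE, SLOW_RATE = 1.0, 0.25  # hypothetical relative units

if __name__ == "__main__":
    print("crossover at",
          crossover_concurrency(INTEL_THREADS, ARM_CORES, FAST_RATE, SLOW_RATE),
          "parallel requests")
    for level in (30, 75, 100):
        print(level,
              throughput(level, INTEL_THREADS, FAST_RATE),
              throughput(level, ARM_CORES, SLOW_RATE))
```

The model ignores memory, I/O and scheduler effects entirely. Its only purpose is to show why the many-core system falls behind at low concurrency and pulls ahead once enough requests arrive in parallel.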
Nextcloud is designed to scale nearly perfectly across large numbers of independent cores. Each request to a Nextcloud server is handled by a new, fresh and independent PHP process which can run on its own CPU or even on its own server – yes, a cluster of Raspberry Pi systems has been done!
Of course, there are bottlenecks in the infrastructure around Nextcloud, including storage and its use of a database (MySQL, for example, scales badly beyond 4-8 cores). For this reason, we designed our Global Scale architecture for installations with tens to hundreds of millions of users.
We used a standard Ubuntu 16.04 installation and the same configuration on both the Intel and Arm systems. Our goal was to assess the performance of the Nextcloud application itself, eliminating network, I/O, RAM, database, storage and caching performance from the equation. In a real-world, large-scale Nextcloud installation, database, caching and storage tend to run on their own, optimized servers, but such a setup would make benchmarking harder as it suffers from unpredictable network latencies.
We thus benchmarked locally, using a simple ‘PROPFIND’ request without any actual uploading or downloading. Mid-sized installations around the 30,000-user mark, such as the one at the TU Berlin, show that this represents around 80-90% of the load on the system, making it a representative benchmark.
We employed Apache with PHP 7, setting up Apache to use the prefork MPM and mod_php. As a database we used SQLite, as this can provide nearly unlimited scalability for these simple read-only requests. While this setup is not practical in real-life installations, it isolates the specific performance of the CPU without letting other bottlenecks obscure the differences. The results model the performance of an application server, assuming that other, separate servers provide database, file access and user directory services.
We tested using ApacheBench (ab) with 100,000 requests per run, trying an increasing number of parallel requests. We ran all benchmarks twice and averaged the results, recording both the 99% serve time in milliseconds and the number of requests per second.
Example command we ran:
ab -n 100000 -c 500 -m PROPFIND http:///nextcloud/remote.php/webdav/
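The procedure described above (an increasing number of parallel requests, two runs per level, results averaged) could be scripted roughly as follows. This is a minimal sketch and not the script used for these tests: the host name in `BASE_URL`, the list of concurrency levels and all helper names are assumptions made for illustration. The parser relies on the "Requests per second" line and the percentile table that ab prints in its standard report.

```python
import re
import subprocess

# ASSUMPTION: placeholder host; the command above does not state the real one.
BASE_URL = "http://nextcloud.example/nextcloud/remote.php/webdav/"


def build_ab_command(url, concurrency, requests=100000):
    """Return the argument list for one ApacheBench PROPFIND run."""
    return ["ab", "-n", str(requests), "-c", str(concurrency),
            "-m", "PROPFIND", url]


def parse_ab_output(text):
    """Extract (requests per second, 99% serve time in ms) from an ab report."""
    rps = re.search(r"Requests per second:\s+([\d.]+)", text)
    p99 = re.search(r"^\s*99%\s+(\d+)", text, re.MULTILINE)
    if rps is None or p99 is None:
        raise ValueError("could not find the expected lines in the ab output")
    return float(rps.group(1)), int(p99.group(1))


def average_runs(runs):
    """Average a list of (requests per second, 99% time) tuples."""
    count = len(runs)
    return (sum(r[0] for r in runs) / count,
            sum(r[1] for r in runs) / count)


def run_sweep(levels, repeats=2, url=BASE_URL):
    """Run ab `repeats` times per concurrency level and average the results."""
    results = {}
    for level in levels:
        runs = []
        for _ in range(repeats):
            report = subprocess.run(build_ab_command(url, level),
                                    capture_output=True, text=True,
                                    check=True).stdout
            runs.append(parse_ab_output(report))
        results[level] = average_runs(runs)
    return results

# Example call (levels are illustrative): run_sweep([30, 75, 100, 300, 500])
```

Keeping the command builder and the parser separate from `run_sweep` means both can be checked without a running server.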
We used monitoring tools to ensure there were no other bottlenecks. For example, RAM usage never exceeded roughly 8 gigabytes. Our goal was to saturate the CPU cores as much as possible, without undue load on other elements of the system.
In the graphs we show the 99% serve time (how long it takes to serve 99% of the requests, an indication of how long an individual user might end up having to wait for a response) and the total number of requests per second handled by the server, showing the absolute capacity of the server at this level of parallel requests. Note that in all the graphs, Intel is blue and the Arm/Cavium system is green.
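To make the two metrics concrete, here is a small sketch of how they can be computed from raw per-request timings. It uses the simple nearest-rank percentile definition. ApacheBench may compute its percentile table slightly differently, and the function names are hypothetical.

```python
def serve_time_percentile(durations_ms, percent=99):
    """Nearest-rank percentile of request durations.

    Returns the smallest observed duration within which at least
    `percent` percent of the requests completed. `percent` is expected
    to be an integer between 1 and 100.
    """
    if not durations_ms:
        raise ValueError("no samples")
    ordered = sorted(durations_ms)
    # Integer ceiling of percent * n / 100, giving a 1-based rank.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def requests_per_second(completed_requests, wall_seconds):
    """Total throughput: completed requests divided by elapsed wall time."""
    return completed_requests / wall_seconds
```

A single slow outlier barely moves the requests-per-second figure but shows up quickly in the 99% serve time, which is why the graphs report both.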
The Intel cores are substantially faster than the Arm cores, but at the same price point the Arm cores are far more plentiful, so you can expect the two systems to perform differently at different numbers of parallel requests. This is indeed what the results show, though we were surprised at how quickly the Cavium Arm cores overtook the Intel ones. At 30 parallel requests the Intel outperformed the Arm, but at 75 the Arm already had a performance advantage of more than 15%, and at 100 simultaneous requests and more it delivered a roughly 40% performance uplift.
At 300, performance dropped a little as the Arm server had some trouble sustaining I/O performance. At this point, the database was likely slowing it down, something we would not expect to happen in a real scenario with a separate database server. The new ThunderX2, meanwhile, addresses this issue and offers an average of 3.7x the performance of the ThunderX in AnandTech benchmarking, making it an even more compelling solution.
The performance of the Cavium ThunderX Arm server in our workload was impressive. Considering the price point and the nature of a first-generation product for an entirely new market, an advantage of up to 40% is substantial. Together with Arm, we are looking at the latest generation of the Cavium systems, which promises better scalability and performance per dollar, and we plan to pit it against the latest and greatest from Intel.
For our customers, these results offer an interesting insight into which hardware suits their Nextcloud servers. While databases might benefit from fewer, stronger cores, a fleet of small cores can deliver impressive results as a Nextcloud application server.