Overview
Every day we hear of cybersecurity threats and attacks on the nation’s corporate and public infrastructure. Data breaches that leak sensitive information cost organizations both money and goodwill. The volume of such attacks is increasing rapidly: for example, ransomware attacks grew by 41% in 2022. The cost of such attacks is spiraling too: the average cost of a data breach in the United States in 2022 was $9.44 million. In this environment, every organization faces a pressing imperative to defend its applications and data from prying eyes and malicious hands.
Web Application Firewalls (WAFs) are critical in the quest to protect an organization’s digital assets. Business imperatives impel organizations to expose their applications and services via APIs, but it is essential to protect them with a WAF. ModSecurity, a popular open source WAF, has been so influential that other WAF products have adopted, or at least accept, the language it uses for defining firewall rules. The Open Worldwide Application Security Project (OWASP), a nonprofit foundation for software security, publishes the Core Rule Set, a “set of generic attack detection rules for use with ModSecurity or compatible” WAFs. Organizations can supplement it with custom rule sets, written in the same language, that are specific to their applications.
But there are winds of change in the world of ModSecurity: it has been declared to reach End of Life in 2024. The open source community has responded by creating Coraza, a WAF that succeeds ModSecurity while also aiming to provide a broader rule API and to support a wider range of environments in this cloud era. OWASP has embraced Coraza as the chosen successor. Since Coraza also has many corporate sponsors, it is expected to be widely adopted.
While WAFs are critical to security, they are expected to safeguard application APIs without adversely impacting web workload performance. WAFs are also subject to the overall push to reduce data center power consumption, for both operational cost and environmental benefits. The two are related: boosting WAF performance reduces the number of CPU cores needed to meet a given Service Level Objective, thus saving power.
Tetrate, an Intel Network Builders program member, has therefore been collaborating with Intel to characterize Coraza’s performance, identify areas for improvement, and work with the community to carry out those improvements. This joint initiative has proven fruitful: we found that the logging framework was consuming an undue amount of CPU time due to excessive heap allocations, identified a suitable replacement design that reduces those allocations, and got it implemented as open source. This change delivered significant performance improvements, with heap allocations reduced by up to 82% and running time by up to 45.4%.
Let us now look a bit deeper at how we accomplished this.
Performance Study and Optimization
Coraza can be deployed in different ways: as a library linked with a web application, as a standalone server deployed in the ingress path of a set of applications, or as a WASM plugin running on a platform that supports the proxy-wasm ABI, such as Envoy. In the first two configurations, Coraza runs as native code, where performance analysis is relatively easy. The last configuration is of interest in cloud-native deployments, especially those with service meshes, but there Coraza runs as a WASM plugin, where performance analysis is not as straightforward. Furthermore, the Go language version and the memory allocator differ between the native and WASM builds. For these reasons, the performance analysis was done in two phases: first with the standalone (native code) deployment, then with the WASM plugin. The study was also done on two generations of Intel processors: Intel® Xeon® Gold 6152 (formerly Skylake) and Intel® Xeon® Gold 6338 (formerly Ice Lake). In all cases, the standard Core Rule Set was used as the benchmark.
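For a flavor of the library deployment mode, here is a minimal sketch of embedding Coraza in a Go program. It assumes the Coraza v3 Go module and uses a single inline rule in place of the full Core Rule Set; treat the exact rule and identifiers as illustrative rather than prescriptive.

package main

import (
	"fmt"
	"log"

	"github.com/corazawaf/coraza/v3"
)

func main() {
	// Build a WAF instance from a small inline directive set; a real
	// deployment would load the OWASP Core Rule Set files instead.
	waf, err := coraza.NewWAF(coraza.NewWAFConfig().
		WithDirectives(`SecRuleEngine On
SecRule REQUEST_URI "@contains /admin" "id:1001,phase:1,deny,status:403"`))
	if err != nil {
		log.Fatal(err)
	}

	// Each HTTP request is evaluated in its own transaction.
	tx := waf.NewTransaction()
	defer tx.Close()

	tx.ProcessURI("/admin?user=1", "GET", "HTTP/1.1")
	if it := tx.ProcessRequestHeaders(); it != nil {
		fmt.Printf("request interrupted: rule %d, action %s\n", it.RuleID, it.Action)
	}
}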
Consider standalone Coraza running on the Ice Lake platform. When we run the Core Rule Set and look at the CPU profile, the top ten entries with the original logging framework look like this:
$ go tool pprof cpu.prof
...
(pprof) top10
...
flat | flat% | sum% | cum | cum% | function |
1210ms | 8.12% | 8.12% | 1500ms | 10.06% | regexp.(*machine).add |
780ms | 5.23% | 13.35% | 2850ms | 19.11% | runtime.scanobject |
770ms | 5.16% | 18.51% | 3030ms | 20.32% | runtime.mallocgc |
670ms | 4.49% | 23.00% | 850ms | 5.70% | runtime.findObject |
670ms | 4.49% | 27.50% | 700ms | 4.69% | runtime.pageIndexOf (inline) |
580ms | 3.89% | 31.39% | 580ms | 3.89% | runtime.memclrNoHeapPointers |
510ms | 3.42% | 34.81% | 890ms | 5.97% | regexp.(*machine).step |
420ms | 2.82% | 37.63% | 420ms | 2.82% | runtime.nextFreeFast (inline) |
370ms | 2.48% | 40.11% | 900ms | 6.04% | regexp.(*Regexp).tryBacktrack |
340ms | 2.28% | 42.39% | 340ms | 2.28% | runtime.memmove |
The CPU profile shows that regular expression processing takes a significant share of the overall time. However, a large portion of the time is also spent in the Go language runtime, including garbage collection, which indicates a high number of heap allocations.
The CRS benchmark output confirms this:
BenchmarkCRSCompilation-4 | ... | 719534 allocs/op |
BenchmarkCRSSimpleGET-4 | ... | 29884 allocs/op |
BenchmarkCRSSimplePOST-4 | ... | 45216 allocs/op |
BenchmarkCRSLargePOST-4 | ... | 45391 allocs/op |
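For reference, allocation counts and profiles like these can be collected with the standard Go toolchain; an invocation along the following lines (the package path and benchmark pattern are illustrative, not necessarily the exact ones used here) produces per-operation allocation counts together with CPU and memory profiles:

$ go test -bench=CRS -benchmem -cpuprofile cpu.prof -memprofile mem.prof ./testing/coreruleset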
A look at the memory profile indicates that log statements tend to be hot spots for heap allocation. The reason is that the logging framework used by Coraza passes arguments to log functions as a variadic list of interface values. The Go compiler performs escape analysis to decide whether a variable lives on the stack or the heap; in this case it falls back to the heap, so every argument (except perhaps the message format) in every log call results in a heap allocation. Worse, these allocations occur even when debug logs are disabled. That is a huge overhead.
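To illustrate the pattern (with a hypothetical debugf helper, not Coraza’s actual API): the arguments to a variadic log call are boxed into interface values, and escape analysis typically cannot keep them on the stack, so they are heap-allocated on every call, even when the level check fails immediately:

package main

import "log"

// debugEnabled would normally come from configuration; it is false here,
// yet the call below still pays the allocation cost.
var debugEnabled = false

// debugf mimics a variadic logging API: args is []interface{} under the hood.
func debugf(format string, args ...interface{}) {
	if !debugEnabled {
		return // too late: the arguments have typically already escaped to the heap
	}
	log.Printf(format, args...)
}

func handleRequest(ruleID int, uri string) {
	// Boxing ruleID and uri into interface{} values usually forces heap
	// allocations, repeated on every request for every log statement.
	debugf("rule %d matched on %s", ruleID, uri)
}

func main() {
	handleRequest(942100, "/index.php?id=1")
}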
We investigated alternative logging frameworks that could reduce this overhead. One that stood out is zerolog, which prides itself on zero or low allocations per log call. The community was very receptive to the proposed change; together we iterated on it and, in short order, merged a zerolog-inspired implementation of our own into Coraza’s repository. We adopted a design based on zerolog rather than using zerolog directly because Coraza is often deployed as part of a larger environment, such as the Caddy web server or the Envoy proxy, which may have its own logging framework.
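The new design follows the zerolog pattern of a fluent, strongly typed event API. The following sketch illustrates the pattern only; it is not Coraza’s actual interface. When debug logging is disabled, the logger hands back a no-op event, so arguments are never boxed and nothing is allocated:

package main

import "fmt"

// Event is a per-message builder. Typed methods (Str, Int, ...) avoid
// boxing arguments into interface{} values.
type Event interface {
	Str(key, value string) Event
	Int(key string, value int) Event
	Msg(msg string)
}

// noopEvent is handed out when the level is disabled; every call is a no-op.
type noopEvent struct{}

func (e noopEvent) Str(string, string) Event { return e }
func (e noopEvent) Int(string, int) Event    { return e }
func (noopEvent) Msg(string)                 {}

// printEvent is a trivial "real" event that formats fields into a buffer.
type printEvent struct{ buf []byte }

func (e *printEvent) Str(key, value string) Event {
	e.buf = append(e.buf, ' ')
	e.buf = append(e.buf, key...)
	e.buf = append(e.buf, '=')
	e.buf = append(e.buf, value...)
	return e
}

func (e *printEvent) Int(key string, value int) Event {
	return e.Str(key, fmt.Sprintf("%d", value))
}

func (e *printEvent) Msg(msg string) { fmt.Println(msg + string(e.buf)) }

// Logger hands out events; callers chain typed fields and finish with Msg.
type Logger struct{ debugEnabled bool }

func (l Logger) Debug() Event {
	if !l.debugEnabled {
		return noopEvent{}
	}
	return &printEvent{}
}

func main() {
	logger := Logger{debugEnabled: false}
	// With debug disabled, this chain runs entirely on the no-op event.
	logger.Debug().Int("rule_id", 942100).Str("uri", "/index.php?id=1").Msg("rule matched")
}

A call site chains typed field methods and finishes with Msg; when the level is disabled, the whole chain is essentially free.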
After the switch to the new zerolog-inspired implementation, the CPU profile when running the CRS looks different:
$ go tool pprof cpu.prof
...
(pprof) top10
...
flat | flat% | sum% | cum | cum% | function |
1540ms | 11.02% | 11.02% | 1790ms | 12.81% | regexp.(*machine).add |
620ms | 4.44% | 15.46% | 2470ms | 17.68% | runtime.scanobject |
540ms | 3.87% | 19.33% | 540ms | 3.87% | runtime.memclrNoHeapPointers |
520ms | 3.72% | 23.05% | 1220ms | 8.73% | regexp.(*Regexp).tryBacktrack |
520ms | 3.72% | 26.77% | 810ms | 5.80% | runtime.findObject |
490ms | 3.51% | 30.28% | 2200ms | 15.75% | runtime.mallocgc |
450ms | 3.22% | 33.50% | 460ms | 3.29% | runtime.pageIndexOf (inline) |
430ms | 3.08% | 36.58% | 830ms | 5.94% | regexp.(*machine).step |
400ms | 2.86% | 39.44% | 400ms | 2.86% | runtime.nextFreeFast (inline) |
370ms | 2.65% | 42.09% | 370ms | 2.65% | runtime.memmove |
The profile shows that many of the Go runtime elements now consume a smaller percentage of the time. The allocations report from the CRS benchmark output shows an even stronger result for the HTTP request benchmarks (GET and POST): the number of heap allocations has been reduced by more than 80%!
BenchmarkCRSCompilation-4 | ... | 718060 allocs/op |
BenchmarkCRSSimpleGET-4 | ... | 4895 allocs/op |
BenchmarkCRSSimplePOST-4 | ... | 7811 allocs/op |
BenchmarkCRSLargePOST-4 | ... | 7970 allocs/op |
The overall benchmark time also drops, though the improvement varies by benchmark. For the SimplePOST benchmark, the average time falls from about 8 milliseconds/op before the change to 4.39 milliseconds/op after the switch to the new logging framework. That is more than a 45% reduction!
The performance analysis with Coraza running as a WASM plugin on Envoy also shows significant improvements after the switch to the new logger. Specifically, the total time to run the ftw (Framework for Testing WAFs) test suite decreased noticeably after the switch. These results confirm that the change benefits Coraza in all its form factors, whether standalone or as a WASM plugin.
Conclusion
The joint initiative from Intel and Tetrate has improved the performance of Coraza, an emerging Web Application Firewall. The fruitful collaboration with the community also augurs well for the future. The performance profile further shows that regular expression processing and literal matching are hot spots whose share of the workload has increased after the logging optimizations. This points to the next focus area: there is potential to accelerate both regular expression processing and literal matching using Hyperscan.
As organizations focus more on cybersecurity and adopt open-source projects for faster time to market, community initiatives such as these Coraza improvements by Intel and Tetrate will become increasingly important and valuable.
Notices & Disclaimers
Intel technologies may require enabled hardware, software or service activation. No product or component can be absolutely secure. Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, Xeon and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.