SDN Developers Report Key Lessons in Testing OpenDaylight Performance

OpenDaylight (ODL) is an open source SDN platform designed to serve a broad set of use cases and end user types, mainly service providers, large enterprises, and academic institutions. Uniquely, ODL provides a wide range of network services across all domains: data center, WAN, NREN, metro, and access.

With ODL, controller functionality is literally in the hands of application designers, as opposed to being hard-wired (and thus restricted) by controller designers. This unique flexibility is due to an evolved model-driven service abstraction layer (MD-SAL) that allows for the easy addition of new functionality in the form of southbound plugins, network services, and applications.

In March of this year, ODL published a Performance Report based on the newly released Beryllium, focusing on real-world, end-to-end application performance. This report generated approximately 2,000 downloads, providing many prospective (and even existing) users with key data points for a comprehensive understanding of how OpenDaylight (and potentially other SDN technologies) can be leveraged in the world's largest networks.

Why the focus on real-world, end-to-end application metrics? ODL has well over 100 deployments, detailed in the user stories of many global service providers, including Orange, China Mobile, AT&T, T-Mobile, Comcast, KT Corporation, Telefonica, TeliaSonera, China Telecom, Deutsche Telekom, and Globe Telecom. As these key end users and the broad ecosystem of developers continue to use ODL to software-control their networks, they need to know what to expect not only in terms of ODL functionality but also the application performance characteristics of that functionality in a live network deployment.

Given all the possibilities in testing a platform as broad as ODL, savvy readers requested additional context around some of our results. For instance, developers and end users wondered about the differences that might be expected in the latest (SR1) release of Beryllium, as well as other key factors that might affect performance. Some were curious about the differing benefits of testing in single-instance versus clustered configurations (both of which are supported in production ODL deployments), and our reasons for using multiple methods for accessing controller functionality (i.e., Java versus REST).

Accordingly, we just updated the report to give a more comprehensive picture of ODL’s performance when programming network devices using the industry’s most complete set of southbound protocols, including OpenFlow, NETCONF, BGP, PCEP, and OVSDB. As before, we also provided reference numbers for other controllers (ONOS and Floodlight) for the southbound protocols they support (principally OpenFlow).  

ODL works closely with OPNFV in support of an open Controller Performance Testing project (CPerf), which will provide easily referenceable, application-relevant benchmarks for the entire networking industry to make tomorrow's networks more agile, available, secure, and higher performing. As such, we strongly encourage (and have already invited) developers and users from all open SDN controllers to participate in CPerf.

To discuss the results and other topics of interest in the open source controller world, we sat down with OpenDaylight developers and members of the Performance Report team. Luis Gomez, Marcus Williams, and Daniel Farrell offer their insights into the report and its impact on the SDN ecosystem.

Please give us some background on who you are, where you work, and which open source networking projects you work on.

Luis Gomez, Principal Software Test Engineer at Brocade. I am a committer on the Integration/Test project and a member of the OpenDaylight Technical Steering Committee (TSC) and the Board of Directors.

Marcus Williams. I am a Network Software Engineer on Intel's SDN Controller Team. I work on the OVSDB and Integration/Test projects in OpenDaylight, and I am a committer on the OPNFV Controller Performance Testing (CPerf) project.

Daniel Farrell, Software Engineer on Red Hat’s SDN Team. I’m the Project Technical Lead of OPNFV CPerf (SDN Controller Performance Testing) and OpenDaylight Integration/Packaging (delivery pipelines, integration into OPNFV). I’m also a committer to OpenDaylight Integration/Test and on OpenDaylight’s TSC.

What were the key findings from the Performance Report?

Luis: One key finding was that ODL performed similarly to other well-known open source controllers (e.g., ONOS, Floodlight) under the same test conditions. Another key finding was the effect of batching and parallelism on system throughput: batching multiple flow add/modify/delete operations into a single REST request on the northbound API increased the flow programming rate by nearly an order of magnitude (8x). Batching benefits also extend to southbound protocols; for example, the L2/L3 FIB programming rate using NETCONF batch operations was nearly an order of magnitude (8x) faster than using OpenFlow single operations. On the other hand, adding more devices on the southbound side did not behave as expected in some tests (such as OpenFlow), where the performance figures did not change much with the number of switches. This is because we used Mininet/OVS OpenFlow agents on fast machines with plenty of memory and CPU resources, as opposed to hardware switches, which have much less powerful CPUs; a few of these OVS agents are normally enough to stress the controller.
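To make the batching point concrete, here is a minimal sketch (in Python, using the requests library) of the difference between programming flows one REST call at a time and sending them in a single batched request. The controller address, credentials, node ID, paths, and flow bodies are illustrative placeholders based on typical Beryllium-era RESTCONF conventions, not the exact payloads or harness used in the report.

```python
# Sketch: single-flow vs. batched flow programming over a northbound
# RESTCONF API. All URLs, credentials, and payload fields below are
# illustrative assumptions, not values taken from the Performance Report.
import requests

BASE = "http://127.0.0.1:8181/restconf/config"
NODE = "opendaylight-inventory:nodes/node/openflow:1/table/0"
AUTH = ("admin", "admin")          # placeholder credentials
HDRS = {"Content-Type": "application/json"}

def flow_body(flow_id, dst):
    """Build one minimal flow entry (match on an IPv4 destination, no actions)."""
    return {
        "id": str(flow_id),
        "table_id": 0,
        "priority": 100,
        "match": {
            "ethernet-match": {"ethernet-type": {"type": 2048}},
            "ipv4-destination": dst,
        },
        "instructions": {"instruction": []},
    }

def add_flows_one_by_one(n):
    """One HTTP round trip per flow: simple, but slow at scale."""
    for i in range(n):
        requests.put(
            f"{BASE}/{NODE}/flow/{i}",
            auth=AUTH, headers=HDRS,
            json={"flow-node-inventory:flow": [flow_body(i, f"10.0.{i % 255}.0/24")]},
        ).raise_for_status()

def add_flows_batched(n):
    """All flows for the table in one request body: the batching pattern
    behind the roughly 8x gain Luis describes."""
    flows = [flow_body(i, f"10.0.{i % 255}.0/24") for i in range(n)]
    requests.put(
        f"{BASE}/{NODE}",
        auth=AUTH, headers=HDRS,
        json={"flow-node-inventory:table": [{"id": 0, "flow": flows}]},
    ).raise_for_status()
```

The batched variant amortizes HTTP, authentication, and datastore-commit overhead across many flow entries in a single request, which is why it scales so much better than issuing one request per flow.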

Daniel: We entertained some interesting discussions around the use of REST as opposed to a native Java API to program the controller. This led us to add context around this testing decision in the second version of the report. Virtually all end users employ REST for its ease of deployment and maintenance. Given the more direct connection of a Java API, it naturally yields higher numbers, in OpenDaylight or any other controller, by multiple orders of magnitude (literally hundreds to thousands of times faster to add flows internally in the controller). While such metrics may be useful to developers enhancing the controller, they don't represent end-to-end system performance. Therefore, understanding the performance profile of using real or simulated devices attached to the controller, or of using a REST interface, informs end users about the use cases that are most suitable for the controller. In our southbound tests, we do use a Java API, but the performance is measured at the device rather than internally in the controller.

What prompted you to create these tests?

Luis: As SDN has gained momentum and increased use among telcos, enterprises, and others, we were often asked how OpenDaylight would perform in different scenarios, so we wanted to create tests that show OpenDaylight's performance in common use cases (e.g., device programming, control channel failures). It is important to note that every test is fully described and reproducible, so people can see for themselves and validate our numbers in their own environments.

Marcus: A broad set of people across the community created these tests to show the usability of OpenDaylight. Future adoption of SDN depends largely upon having a usable solution. We created this set of tests to help tell the story of OpenDaylight usability, by underscoring its ability to perform and scale in many common use cases. Since we wanted the results to be user-facing, we did the work of nicely presenting them in a white paper instead of our usual developer-oriented wikis.

Were there any major surprises?

Luis: We learned a lot about our own controller by doing this exercise. For example, we did not get comparable programming performance numbers until we disabled datastore persistence (i.e., writing flow configuration to hard disk) or installed faster solid state drives (SSDs) on which to persist the database. We also noticed that none of the other controllers we evaluated persisted the configuration by default, so we disabled this feature in OpenDaylight in order to run a commensurate test.
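For readers who want to try the same comparison, the sketch below shows one way persistence could be toggled off before a benchmark run. The install path is a placeholder, and the config file and "persistent" key reflect the commonly documented setting for ODL's datastore; both should be verified against the specific release being tested.

```python
# Rough sketch: disable datastore persistence for a benchmark run by
# setting "persistent=false" in the datastore config shipped with the
# ODL Karaf distribution. The install path is a placeholder, and the
# file/key names are assumptions to check against the release in use.
from pathlib import Path

ODL_HOME = Path("/opt/opendaylight")  # placeholder install location
cfg = ODL_HOME / "etc" / "org.opendaylight.controller.cluster.datastore.cfg"

lines = cfg.read_text().splitlines() if cfg.exists() else []
lines = [line for line in lines if not line.startswith("persistent=")]
lines.append("persistent=false")  # set back to "true" (the default) after testing
cfg.write_text("\n".join(lines) + "\n")
```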

Marcus: We found out quickly that it is challenging to synchronize procedures and environment setups across teams and continents. We saw widely differing numbers depending on disks (owing to the datastore persistence issue mentioned by Luis) and environmental configuration. For example, using the command-line tool tuned-adm, we could configure our systems to use a throughput-performance profile. This profile turns off power savings in favor of performance and resulted in around a 15% performance improvement in OpenDaylight OpenFlow tests.
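As an illustration of the kind of environmental configuration Marcus describes, here is a small sketch that applies the throughput-performance profile with tuned-adm before a test run and restores the previous profile afterwards. The run_benchmark() function is a hypothetical placeholder for whatever test harness is being driven.

```python
# Sketch: apply the tuned "throughput-performance" profile for the
# duration of a benchmark run, then restore whatever profile was active.
import subprocess

def active_profile():
    # "tuned-adm active" prints something like "Current active profile: balanced"
    out = subprocess.run(["tuned-adm", "active"], capture_output=True, text=True, check=True)
    return out.stdout.strip().rsplit(":", 1)[-1].strip()

def run_benchmark():
    # Hypothetical placeholder for the actual test harness
    # (e.g., an OpenFlow flow add/delete throughput test).
    pass

previous = active_profile()
subprocess.run(["tuned-adm", "profile", "throughput-performance"], check=True)
try:
    run_benchmark()
finally:
    # Put the machine back the way we found it.
    subprocess.run(["tuned-adm", "profile", previous], check=True)
```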

Daniel: I was surprised by how much batching Northbound API requests (which Luis mentioned earlier) improved performance across the board (OpenFlow 8x, NETCONF 8x, BGP 10x). Since ODL at that time was the only SDN controller to support REST API batching (ONOS subsequently added similar functionality), we were pleasantly surprised at the dramatic impact on performance. I was also surprised by how consistently and quickly the new design of ODL's OpenFlow plugin collects operational flow information relative to the prior ODL design or other controllers.

What do end users look for in network performance tests? How do you decide which tests to run?

Luis: There are many, many tests one can run; we have focused on tests that we see as most relevant for the user, and that represent real network deployment scenarios being contemplated at this stage of SDN's maturity. It is very important to look at end-to-end scenarios where the controller is just a piece of the overall solution. For this report, we tested single plugins and single controller instances, but future versions will include multi-plugin and cluster scenarios. Iterating toward these fuller implementations has a number of advantages. For example, a single controller instance has fewer variables, and it's easier to isolate root causes for performance differences in (for instance) southbound protocols/plugins such as OpenFlow and NETCONF. Also, starting with a single instance establishes a baseline for comparison with future testing of clustered and/or federated configurations.

Marcus: I completely agree with Luis. The next phase of testing will be more solutions-focused. I think end users look for tests that provide relevant metrics for their use case or solution needs. Clustered scenarios and the interaction of multiple plugins, as well as external software interaction and integration, will be key to gathering the user-focused metrics needed to move the industry to adopt SDN solutions.

Daniel: OpenDaylight is a platform with support for a broad set of protocols and use cases. Our large and diverse community has a large and diverse set of performance metrics they care about. Part of our S3P (Security, Scalability, Stability, and Performance) focus in Beryllium was to create many new tests for these metrics, including tracking changes over time in CI. So, as Luis and Marcus said, there are many tests to select from. We focused on a set of end-to-end tests we thought were representative of the types of ODL deployments our User Advisory Group has identified. For the OpenFlow use case, the northbound REST API flow programming, statistics collection, and southbound flow programming tests were interesting because they could be executed on other well-known, primarily OpenFlow controllers like ONOS or Floodlight. Other OpenDaylight protocols such as NETCONF, OVSDB, BGP, and PCEP were also tested, and the results show that OpenDaylight has the performance required for many other interesting use cases.

Do you plan on refreshing the report again, and if so when?

Luis: My belief and desire is to produce a performance report after every release. In addition, we will run regular performance tests against real and emulated devices through the OPNFV CPerf project. Upon customer demand, we may also run reports against larger networks; in the meantime, just such a report is currently available through one of our members and CPerf collaborators, Intracom Telecom. This report compares Lithium and Beryllium on topologies of up to 6,400 switches, with successful installations of up to 1M flows.

Marcus: Our next release is Boron in the fall, and we are working hard to provide an enhanced version of this report soon after that release. In the meantime, we are working through the OPNFV CPerf project to create objective, SDN-controller-independent, industry-standard performance, scale, and stability testing on real hardware using realistic, large, automated deployments.

Daniel: The report has been extremely well received, so it looks like we'll continue to refresh it. OpenDaylight's experts in creating new performance tests are collaborating with OPNFV's standards experts, who have been thinking for years about which SDN controller metrics are important to test. Eventually, we'd like to create something like a continuously updating performance report from OPNFV's Continuous Integration (CI).

Where can I get the report?

It’s available at https://www.opendaylight.org/resources/odl-performance.