Second time’s a charm



Conventional wisdom says that rewriting existing code with new technologies is not a good use of time. Nevertheless, we had an opportunity to do some of that, and in our case it let us reflect on our existing design choices and technology stack. Here are our initial thoughts and experiences from the process.

What we realized early on was that this was an opportunity to improve on our original design. Some of the issues we identified manifest themselves as never-to-be-addressed TODOs and FIXMEs sprinkled throughout the codebase. Others were not evident to us as authors of the code, looking through the lens of our technology choices, frameworks, and general assumptions. That’s why a set of new technologies can be helpful in providing a fresh perspective. The other major thing that helped identify problems was adding new members to the team. There’s nothing like a well-aimed “Why?” question to make you revisit and rethink your design decisions.

To be more specific, the code being ported is a general-purpose translation framework that provides access to the configuration and state of network devices in a unified manner. An example would be implementing standard OpenConfig APIs for a Cisco IOS classic device that exposes only unstructured CLI access. The framework is capable of running close to a device, acting as a NETCONF/RESTCONF YANG agent, or it can be part of an automation framework talking to the device remotely, e.g. over CLI via SSH. The original code (still actively maintained and deployed in production) was developed in the Java and Kotlin programming languages, runs in a JVM, and uses OpenDaylight and OSGi as base frameworks.



More about the framework within the FRINX UniConfig solution can be found here.



As for the new technology stack, it is a much less restrictive C++ codebase under Facebook’s magma project (https://github.com/facebookincubator/magma), which is part of their connectivity effort (https://connectivity.fb.com/).

Here are some of the issues we’ve been able to address so far:

  • Make translation code stateless and immutable
  • Drop support for runtime registration and unregistration
  • Move YANG-generated code out of the core framework
  • Improve performance and scale of the CLI IO layer

There are more improvements we have been able to implement in our solution so far, and there are definitely more to come. Some were known beforehand and others were identified thanks to the new technology stack and fresh team members. We’ll keep you in the loop on the lessons we pick up along our journey.

We know that keeping code stateless is a good practice, but we failed to keep the translation code in our original design completely stateless. Our framework injects some dependencies into the translation code during instantiation and passes others during execution, which forces the code to hold on to the injected dependencies and thus keep some state. By simply passing all of the dependencies during execution, there are no resources to be held in the translation handlers, making them immutable. The big advantage is that we can keep just a single instance of the translation code in our system (instead of one per device) and call it concurrently without having to consider any thread-safety issues. In addition, C++ lets us declare our API as “const” and thus enforce immutability on translation code implementations.
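To illustrate the idea (the types and names below are ours for this post, not the actual framework API), a handler with no members and const-qualified methods can be shared across all devices and threads:

    #include <string>

    // Illustrative stand-in for the per-call context; in the real code
    // this would carry the CLI session and anything else a handler needs.
    struct DeviceContext {
        std::string deviceId;
    };

    class InterfaceReader {
    public:
        // All dependencies arrive as arguments; the const qualifier lets
        // the compiler reject any attempt to keep state between calls.
        std::string readConfig(const DeviceContext& ctx,
                               const std::string& interfaceName) const {
            return "show configuration of " + interfaceName +
                   " on " + ctx.deviceId;
        }
        // No injected members, so a single instance is trivially safe
        // to call from many threads for many devices at once.
    };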

Coming from the OSGi world, we always assumed a highly dynamic environment where code and services can come and go at any time. That seemed like a good match for our framework-plus-translation-plugins architecture. But the truth is that this dynamism never added user value, and we could drop its support, along with a lot of code and configuration, for the sake of simplicity.
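As a rough sketch of how much simpler this gets (the names here are hypothetical), the plugin registry can be built once at startup and treated as read-only afterwards, with no lifecycle tracking at all:

    #include <map>
    #include <memory>
    #include <string>

    struct TranslationPlugin { /* translation handlers for one device type */ };

    // Populated once during startup and never mutated afterwards, so there
    // are no service trackers, lifecycle callbacks, or locks to maintain.
    std::map<std::string, std::shared_ptr<TranslationPlugin>> buildRegistry() {
        std::map<std::string, std::shared_ptr<TranslationPlugin>> registry;
        registry.emplace("cisco-ios-classic", std::make_shared<TranslationPlugin>());
        registry.emplace("junos", std::make_shared<TranslationPlugin>());
        return registry;  // callers treat this as immutable
    }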

YANG is a big part of our solution, since we utilize existing YANG-based standards such as OpenConfig to unify various vendor-specific APIs. The actual YANG-compliant data can be stored in various formats, such as JSON, DOM structures, or even generated helper classes. The problem is that generated classes should only be part of the client code; the core of the solution should not be aware of them at all. That is exactly the mistake we made before, and now we had a chance to cleanse the framework code and make it more lightweight, leaving the use of (or lack of) generated classes to the client code.
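One way to picture the split (the types below are illustrative; we use folly::dynamic here only as an example of a generic DOM): the core produces and consumes raw DOM data, and binding to generated classes happens, optionally, at the client’s edge:

    #include <folly/dynamic.h>
    #include <string>

    // Core API: traffics only in a generic DOM and knows nothing about
    // generated YANG classes. (Stubbed out for illustration.)
    folly::dynamic readState(const std::string& interfaceName) {
        return folly::dynamic::object("name", interfaceName)("enabled", true);
    }

    // Client code: may bind the DOM to a generated helper class if it
    // wants to. "OpenconfigInterface" is a hypothetical generated type.
    struct OpenconfigInterface {
        std::string name;
        bool enabled;

        static OpenconfigInterface fromDom(const folly::dynamic& d) {
            return {d["name"].asString(), d["enabled"].asBool()};
        }
    };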

Talking to a device over CLI sounds pretty straightforward, but there are many intricacies you have to address in order to support thousands of concurrent connections while invoking dozens of commands on each device, all with a minimal number of threads. Our original Java code achieves very high performance, but we were still able to identify some bottlenecks, address them, and improve the code significantly. This was possible thanks to a better understanding of the async IO design while using low-level libraries such as libevent, libssh, and folly.
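As a minimal sketch of the pattern (SshSession is a placeholder, not a real libssh type): a small, fixed IO thread pool drives all the sessions, and each command returns a future instead of tying up a thread per connection:

    #include <folly/executors/IOThreadPoolExecutor.h>
    #include <folly/futures/Future.h>
    #include <memory>
    #include <string>

    // Placeholder for a libssh-backed CLI session.
    struct SshSession {
        std::string run(const std::string& cmd) {
            return "output of: " + cmd;  // stubbed send/receive
        }
    };

    // Commands become queued tasks on a shared pool, so thousands of
    // sessions can make progress on just a handful of threads.
    folly::SemiFuture<std::string> runAsync(std::shared_ptr<SshSession> session,
                                            std::string cmd,
                                            folly::Executor* io) {
        return folly::via(io, [session, cmd = std::move(cmd)] {
            return session->run(cmd);
        }).semi();
    }

    // Usage: folly::IOThreadPoolExecutor pool(4);
    //        auto fut = runAsync(session, "show version", &pool);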



