Second time's a charm
Conventional wisdom says that rewriting existing code with new technologies is not a good use of time. Nevertheless, we had an opportunity to do some of that, and in our case it also gave us a chance to reflect on our existing design choices and technology stack. Here are our initial thoughts and experiences from the process.
What we realized early on was that the rewrite provides an opportunity to improve on our original design. Some of the issues we identified manifest themselves as never-to-be-addressed TODOs and FIXMEs sprinkled throughout the codebase. Others were not evident to us as authors of the code, looking through the lens of our technology choices, frameworks and general assumptions. That’s why a set of new technologies can be helpful in providing a fresh perspective. The other major thing that helped identify problems was adding new members to the team. There’s nothing like a well aimed “Why?” question to make you revisit and rethink your design decisions.
To be more specific, the code being ported is a general purpose translation framework that provides access to the configuration and state of network devices in a unified manner. An example would be implementing standard OpenConfig APIs for a Cisco IOS classic device that only exposes unstructured CLI access. The framework can run close to a device, acting as a NETCONF/RESTCONF YANG agent, or it can be part of an automation framework talking to the device remotely, e.g. over CLI via SSH. The original code (still actively maintained and deployed in production) was developed in the Java and Kotlin programming languages, runs in a JVM, and uses OpenDaylight and OSGi as base frameworks.
More about how this framework fits into the FRINX UniConfig solution can be found here.
As for the new technology stack, it is a much less restrictive C++ codebase under Facebook’s magma project (https://github.com/facebookincubator/magma), which is part of their connectivity effort (https://connectivity.fb.com/).
Here are some of the issues we’ve been
able to address so far:
- MAKE TRANSLATION CODE STATELESS AND IMMUTABLE
- DROP SUPPORT FOR RUNTIME REGISTRATION AND UNREGISTRATION
- MOVE YANG GENERATED CODE OUT OF THE CORE FRAMEWORK
- IMPROVE PERFORMANCE AND SCALE OF THE CLI IO LAYER
There are more improvements we have been able to make in our solution so far, and there are definitely more to come. Some were known beforehand and others were identified thanks to the new technology stack and fresh team members. We’ll keep you in the loop on the lessons we learn along the way.
We know that keeping code stateless is good practice, but we failed to keep the translation code in our original design completely stateless. Our framework injects some dependencies into the translation code during instantiation and passes others during execution, which forces the code to hold on to the injected dependencies and thus keep some state. By simply passing all the dependencies during execution, there are no resources to be held in the translation handlers, making them immutable. The big advantage is that we can keep just a single instance of the translation code in our system (instead of one per device) and call it concurrently without having to consider any thread safety issues. In addition, C++ lets us declare our API as “const” and thus enforce immutability on translation code implementations.
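To make this concrete, here is a minimal C++ sketch, with hypothetical type and method names rather than the actual framework API, of a translation handler that receives all of its dependencies per call and exposes only const methods:

```cpp
#include <string>

// Assumed CLI transport interface, for illustration only.
struct Cli {
  virtual std::string executeRead(const std::string& cmd) const = 0;
  virtual ~Cli() = default;
};

// A stateless, immutable handler: no members, all dependencies passed per
// call, so a single instance can serve every device concurrently.
class InterfaceConfigReader {
 public:
  // "const" (plus the absence of mutable state) lets the compiler enforce
  // immutability on the implementation.
  std::string readCurrentAttributes(const std::string& ifcName,
                                    const Cli& cli) const {
    return cli.executeRead("show running-config interface " + ifcName);
  }
};
```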
Coming from the OSGi world, we always assumed a highly dynamic environment where code and services can come and go at any time. That seemed like a good match for our framework + translation plugin architecture. But the truth is that this dynamism never added user value, and we could drop its support, along with a lot of code and configuration, for the sake of simplicity.
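As an illustration of the consequence, once runtime (un)registration is gone, the full set of translation plugins is known at startup and can live in a fixed, immutable registry. The sketch below uses hypothetical types, not the real plugin API:

```cpp
#include <map>
#include <memory>
#include <string>

// Assumed plugin interface, for illustration only.
struct TranslationPlugin {
  virtual ~TranslationPlugin() = default;
};

// Built once at startup and never mutated afterwards, so no dynamic
// registration/unregistration machinery is needed.
class PluginRegistry {
 public:
  explicit PluginRegistry(
      std::map<std::string, std::shared_ptr<TranslationPlugin>> plugins)
      : plugins_(std::move(plugins)) {}

  std::shared_ptr<TranslationPlugin> get(const std::string& deviceType) const {
    auto it = plugins_.find(deviceType);
    return it == plugins_.end() ? nullptr : it->second;
  }

 private:
  const std::map<std::string, std::shared_ptr<TranslationPlugin>> plugins_;
};
```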
YANG is a big part of our solution since we utilize existing YANG based standards such as OpenConfig to unify various vendor specific APIs. The actual YANG compliant data can be stored in various formats such as JSON, DOM structures or even generated helper classes. The problem is that generated classes should only be part of the client code, and the core of the solution should not be aware of them at all. That’s exactly what we got wrong before, and now we had a chance to cleanse the framework code and make it more lightweight, leaving the use of (or lack of) generated classes entirely to the client code.
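A small sketch of the idea, using folly::dynamic as a generic, schema-agnostic data representation (the function name is illustrative, not the framework’s real API):

```cpp
#include <folly/dynamic.h>
#include <folly/json.h>
#include <iostream>
#include <string>

// The core only ever produces and consumes generic JSON-like documents;
// YANG-generated helper classes, if the client wants them at all, stay in
// client code.
folly::dynamic readInterfaceConfig(const std::string& ifcName) {
  // In the real system a translation handler would populate this from device
  // output; hard-coded here for illustration.
  return folly::dynamic::object("name", ifcName)("enabled", true);
}

int main() {
  folly::dynamic config = readInterfaceConfig("GigabitEthernet0/1");
  std::cout << folly::toJson(config) << std::endl;  // print the config as JSON
}
```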
Talking to a device over CLI sounds pretty straightforward, but there are many intricacies you have to address in order to support thousands of concurrent connections while invoking dozens of commands on each device using a minimal number of threads. Our original Java code achieves very high performance, but we were still able to identify some bottlenecks, address them and improve the code significantly. This was possible thanks to a better understanding of async IO design while working with low level libraries such as libevent, libssh and folly.
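To illustrate the calling convention rather than the actual implementation (which sits on top of libevent and libssh), here is a sketch using folly futures and a small IO thread pool, with the SSH session reduced to a stub:

```cpp
#include <folly/executors/IOThreadPoolExecutor.h>
#include <folly/futures/Future.h>
#include <string>
#include <vector>

// Assumed async session wrapper: each command returns a future instead of
// blocking a thread, so a handful of IO threads can multiplex many devices.
struct SshSession {
  folly::SemiFuture<std::string> executeCommand(std::string cmd) {
    // Stub: the real code would send the command over libssh and fulfil a
    // promise from a libevent callback.
    return folly::makeSemiFuture("output of: " + std::move(cmd));
  }
};

int main() {
  folly::IOThreadPoolExecutor ioPool(4);  // few threads, many sessions
  SshSession session;

  std::vector<folly::Future<std::string>> pending;
  for (const char* cmd : {"show version", "show running-config"}) {
    pending.push_back(session.executeCommand(cmd).via(&ioPool));
  }
  folly::collectAll(std::move(pending)).get();  // wait for all replies
}
```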