Thoughts on Software System Rewrite

Background

Sometimes, in discussions about software systems, an idea to rewrite from scratch pops up. The system in question is usually a few years old, used heavily in production, and has one or more pain points. Problems can be anything from poor performance to low-quality code to maintenance costs. The promise is that new technology, architecture, development practices, and frameworks will solve all problems.

Starting with a clean slate is tempting and may sound very good on paper (or, more likely, in a presentation). However, a deeper analysis of the reasoning for the rewrite, potential pitfalls, implementation, go-live planning, and business involvement reveals that such a decision is far more complicated than it appears on the surface.

Weak Reasons for Rewriting a System

Ignorance

It is hard to deal with an unfamiliar code base. Accidental complexity piled upon essential complexity, new business concepts, inappropriate naming, inadequate architecture - all can be very intimidating. Wouldn’t it be great if it could all be thrown away, started from scratch, organized in an elegant framework, using all the latest technologies and trends, where all is clear?

There are reasons why things are as they are. Surely, part of it may be bad developer practices or outdated architecture. But there is a lot of hidden knowledge in there - how the business operates, edge cases, bug fixes, weird stuff with a reasonable explanation. Without this knowledge, any rewrite is doomed.

Investing time in getting to know the driving forces behind the current code base reduces rewrite urges. Documentation and especially decision records are a good place to start. Remember that decisions were taken at a point in time in the past, with the knowledge, skill, and technology available at that time.

A more hands-on approach by diving into the code also helps in understanding the system and some of the decisions. In the absence of documentation, this is the only way besides the tribal knowledge in the organization. Unfortunately, the code can tell how something is done, but rarely reveals why. In any case, refactoring, addition of unit tests, and small improvements here and there can go a long way to understanding the system. Investigating issues and fixing bugs is enlightening because of the narrow focus on concrete functionality.

Resume-Driven Development

We developers are always attracted to the latest shiny thing - new programming languages, new frameworks, new architecture styles, new cloud services, new technologies. Desire for learning is a great virtue and should be encouraged.

Sometimes the motivation is purely selfish - with this language/framework/architecture on my resume, it would be easy to get a more interesting and better-paid job. How to achieve this? Easy, use it in my current work. There are variations of this - we should use X because there are blogs about it, conference talks, or even a book.

Adopting a new technology or architecture because it is hot and everyone is talking about it is not a good reason. Often, these approaches are applied to a concrete problem in a specific context. Remember, there is No Silver Bullet.

However, going deep to understand a new technology, the motivation behind it, the problems it solves, its strengths and weaknesses, and when it is appropriate to use it, is quite a useful thing to do. If it still sounds like a good idea for the problem at hand, a proof of concept will further reveal its feasibility.

Technology Envy

Similar to resume-driven development, technology envy is looking at another company and associating its success with its tech stack. Adopting the technology someone else uses to solve a problem does not mean it will automatically solve your problem. There is more than technology in an organization - the business specifics, the region it operates in, customer demographics, the way of working, the skill level of its teams, the scale it works at, and many more.

Investigating a case study is a source of knowledge about good technical and organizational practices in a given context. Even if it is not a fit, it will broaden your worldview. Also, remember that "Choose Boring Technology" is an option, and sometimes it is the best one.

Company Politics

This may sound ridiculous, but never underestimate corporate politics. Imagine you have a vocal developer advocating for a system rewrite. A manager can decide this is a career opportunity - being “the hero” to lead the system rewrite that will open the path to untold riches. Imagine this manager has the ear of someone higher up…

Strong Reasons for Rewriting a System

Cost Reduction

Reducing the cost of operating a system is a strong reason from a business point of view. Newer, cheaper technology, running on commodity hardware or in the cloud, can greatly reduce the cost of ownership. More people with the right skills are available to hire. A path for future development and improvements opens.

Obsolete Technology

Technology out of support poses various risks. It no longer receives fixes for security vulnerabilities, and it is hard to maintain and develop due to a lack of skilled people. It may run on old or specialized hardware that is hard to maintain or replace if needed. It may not scale to the business needs.

Business Changes

Businesses can pivot in another direction, move to a different model, or target new markets. A system becomes inadequate for the problems it should solve. Assumptions upon which it was built are no longer valid.

System Rewrite Pitfalls

Focusing entirely on the pain points of a software system leaves a blind spot for the pieces that work well. Usually, most of the problems are concentrated in one or a few areas, which may actually be relatively small compared to the whole system.

A large portion of simple functionality, like CRUD, basic reporting, exports/imports, and external integrations, may not be perfect, but at least they don’t cause problems too often. If doing a complete rewrite, these also have to be rewritten.

Deployment pipelines, backups, support tools, and other operational concerns also need to be taken into account.

Feature Parity Race

While a rewrite is going on, the old system can be patched with bug fixes, features may be upgraded, or even new ones introduced. If the requirement is that the new system should do exactly as the old one, this becomes a race. The new system development should not only cover all existing functionality but also catch up with new additions. A chance to review, analyze, optimize, and improve the features in the new system is missed.

Big Bang Release

A long rewrite in isolation, ending up in a big bang release, is the worst possible scenario. A move from the old to the new system is never smooth. All kinds of problems surface only when the new system is put in a live environment, at actual scale, with actual data. From system access to performance problems, to features not working - everything is possible. Rollback to the old system may not be an option. The only option is fix-forward, which is quite stressful on a live system.

Data Migration

Data migration is the most thankless task. It needs to be reliable, preserve data integrity, take care of all edge cases, transform data from the old to the new model, and have decent performance. It takes time, and feedback loops are long. It’s a throw-away thing, used only once. Worst of all, it is usually left for the last minute.

Newly developed data models are cleanly structured and normalized (if a relational one is used). Legacy systems may contain data inconsistencies or workarounds, accumulated over time, that do not fit in the new data model. A non-standard data model (e.g., de-normalized relational data) may have been used to tackle performance issues.
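To make the mismatch between legacy data and a clean new model more concrete, here is a minimal sketch in Python. The field names (name, email) and the validation rules are assumptions made up for illustration, not taken from any real system. The idea is only that rows which do not fit the new model are set aside for review instead of being silently dropped or forced in.

```python
def migrate_customers(legacy_rows):
    """Transform legacy rows into the new model, setting aside rows that do not fit.

    'name' and 'email' are hypothetical fields used only for this illustration.
    """
    migrated, rejected = [], []
    for row in legacy_rows:
        # Legacy data may hold None, stray whitespace, or inconsistent casing.
        name = (row.get("name") or "").strip()
        email = (row.get("email") or "").strip().lower()
        if not name or "@" not in email:
            # Keep the original row untouched so it can be reviewed by hand.
            rejected.append(row)
            continue
        migrated.append({"name": name, "email": email})
    return migrated, rejected
```

Running something like this against a copy of production data early, and watching the size of the rejected list, may shorten the feedback loop long before the actual switchover.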

Organizational Damage

Consider the following imaginary story:

A system rewrite is announced, promising that every problem of the current system will be gone. At the kickoff, a new team is formed, including all the best people from the company. Furious arguments about technology, frameworks, and architecture go on for some time. At last, work starts, but it quickly becomes evident that there is way more work than expected. The selected approach works fine for the pain points, but is not quite there for the rest of the cases. Workarounds and bending technology to do things it is not designed to do are becoming more common. People are getting frustrated, and some even leave. Meanwhile, the team left to deal with the old system is reduced to a skeleton crew, on top of being denied playing with the new cool stuff. They also become frustrated and start leaving. The new system is being delayed, and the old system is not getting enough care. A common narrative in the business becomes “the new system will do this” regarding anything not possible at the moment. Consultants and contractors are hired…

An exaggerated and grim story, but still a very much possible one. If not handled properly, a rewrite can be devastating to morale and company culture.

Partial Rewrite

It is worth considering whether a full rewrite can be avoided at all. Working within the system itself can be more time- and cost-effective than starting from scratch. Incremental improvements by refactoring, optimization, modernization, and internal redesign and restructuring can be gradually pushed to production. Techniques like feature switches can provide a safety net if something goes wrong. Deploying changes early allows fast feedback loops directly from the production environment, early problem detection, and correction. At the same time, it minimizes customer disturbances and takes advantage of already existing system infrastructure, processes, and knowledge.
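As an illustration of a feature switch, here is a minimal sketch in Python. The switch name USE_IMPROVED_TOTAL, the use of an environment variable, and both total functions are assumptions invented for this example. A real system would more likely read switches from configuration or a dedicated service.

```python
import os

def legacy_total(prices):
    """Existing implementation, kept in place as the fallback."""
    total = 0.0
    for price in prices:
        total += price
    return round(total, 2)

def improved_total(prices):
    """Refactored implementation, rolled out behind the switch."""
    return round(sum(prices), 2)

def pick_total(env):
    """Choose the implementation; the switch is on only when set to '1'."""
    if env.get("USE_IMPROVED_TOTAL", "0") == "1":
        return improved_total
    return legacy_total

def order_total(prices, env=os.environ):
    # Callers never know which implementation served them.
    return pick_total(env)(prices)
```

If the improved code misbehaves in production, turning the switch off restores the old behavior without a new deployment, which is the safety net mentioned above.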

Incremental Rewrite

It may be possible to take an incremental approach to rewriting where the new system is gradually built in small steps, taking over one piece of functionality at a time. Consider a variation of the Strangler Fig pattern, where the new system is in front of the old one and initially only acts as a proxy. When a feature is ready in the new system, it replaces the usage of the old one. Another approach is to introduce some form of thin router in front of both systems, deciding which one should handle a given task. This has the advantage of releasing rewritten pieces of the system early, resulting in fast feedback and early problem detection. It also allows switching back to the old system when things go wrong.
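The thin router idea can be sketched in a few lines of Python. The class name ThinRouter, its methods, and the notion of a "feature" string are hypothetical and exist only for this illustration. In practice the router might be a reverse proxy, an API gateway rule, or a message dispatcher.

```python
class ThinRouter:
    """Sends each request to the old or the new system, one feature at a time."""

    def __init__(self, old_system, new_system):
        # Both systems are modeled as plain callables taking a request.
        self._old = old_system
        self._new = new_system
        self._migrated = set()

    def migrate(self, feature):
        """Start serving this feature from the new system."""
        self._migrated.add(feature)

    def roll_back(self, feature):
        """Return this feature to the old system if things go wrong."""
        self._migrated.discard(feature)

    def handle(self, feature, request):
        target = self._new if feature in self._migrated else self._old
        return target(request)
```

A possible usage: everything starts on the old system, then single features are moved over and, if needed, moved back.

```python
router = ThinRouter(lambda request: "old", lambda request: "new")
router.migrate("invoices")
print(router.handle("invoices", {}))  # served by the new system
print(router.handle("reports", {}))   # still served by the old system
```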

If such an approach is not feasible, consider how the new system can be exposed to production data and loads as early as possible. Automating data replication and/or synchronization, running both systems in parallel with live payloads, and comparing their results can uncover problems early.
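A parallel run with results comparison could look roughly like the following Python sketch. The function name parallel_run and the modeling of both systems as plain callables are assumptions for illustration. The old system's answer is treated as authoritative here, and the new system is only observed.

```python
def parallel_run(payloads, old_system, new_system):
    """Feed the same payloads to both systems and collect disagreements.

    Returns a list of (payload, expected, actual) tuples, where 'expected'
    comes from the old system and 'actual' from the new one.
    """
    mismatches = []
    for payload in payloads:
        expected = old_system(payload)
        try:
            actual = new_system(payload)
        except Exception as exc:
            # A crash in the new system is recorded, not propagated.
            mismatches.append((payload, expected, f"error: {exc!r}"))
            continue
        if actual != expected:
            mismatches.append((payload, expected, actual))
    return mismatches
```

An empty result over a representative set of live payloads is not proof of correctness, but a growing or shrinking mismatch list gives a concrete signal of how close the new system is.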

These approaches are not easy and require deep thinking and time to implement. Once in place, however, they will have a huge payoff when the new system goes live.

Complete Rewrite

If a full system rebuild is the only option, it should be made clear to everyone in the business that it isn’t purely technical work. The effort should be supported at all levels, including business leadership.

The goals and expected business and customer value should be crystal clear. The same goes for the identified risks and the “unknown unknowns” coming from the complexity of the task.

The strategy for replacing the system and the roadmap for implementation should be realistic and keep expectations in check.

Conclusion

Rewriting a software system is a hard decision and is even harder to follow through with. Careful consideration of the reasons for rewriting may reveal alternative approaches. A deep analysis should take into account all possibilities of partial and incremental rewrite. For any chance of success, a good understanding of the goals and risks is mandatory, and the entire business should be involved and support the effort.