The purpose of this post is to document my recent journey moving from a typical layered web architecture to an architecture based around CQRS and Event Sourcing. I am not going to describe CQRS and Event Sourcing themselves, as there are already great blog posts on the web that do the job perfectly well. Instead, this post is about the difficulties I faced in planning and keeping the project on track during the transition.
I hope to give you an understanding of my project, the existing architecture, and the reasons why we moved to CQRS.
The product is a platform for driving networks of digital signage players, broadcasting video, and generating playlists from content that is also managed by the platform. Many distributed components make up the platform, controlled by workflows. We have developed a UI, plus a REST/SOAP-based API to allow for integration (SaaS). The product has been designed as an enterprise-level product and requires many servers, so it needs to tick all the non-functional requirement boxes (scalability, availability, reliability, etc.).
Being a C# guy, I wrote it in .NET 4.0. The UI was developed using Silverlight 4 (for our sins) and we plan to move to HTML5 in the very near future (that's another story). On the server side, as mentioned above, we provide a REST API and host our web services on IIS 7.5 through WCF. The database is SQL Server 2008 R2. We also have a number of Windows services and lots of MSMQ.
When I started on this project, a prototype which had served its purpose had already been developed using Silverlight 4, RIA Services, Entity Framework, and SQL Server. When we started on the production version, I chose to start from scratch. Mainly because we are agile and wanted to drive out the code through TDD, and we wanted to drive out our build and deployment processes as well. At the time it was believed that the biggest network we would support was around 10,000 digital signage players. Taking other things into consideration, like the skills of the team and cost of ownership, it seemed best to "keep it simple" and go for a typical layered web (CRUD) architecture that suited the technology stack. Our layers were:
- From the top, Silverlight 4 using Prism 4.
- MVVM pattern working with presentation models and service proxies.
- Proxies that communicate with web services asynchronously using WCF channel factory (binary encoding over http).
- Web services exposed over IIS using WCF.
- Stateless services that interact with the domain and map WCF data contracts using AutoMapper.
- Domain (POCOs with logic)
- Persistence layer using NHibernate 3 and Fluent NHibernate.
- SQL Server 2008 R2.
The Windows services communicated with the web services and listened on MSMQ queues.
All in all, this was a typical architecture. It was nicely decoupled, the ORM was abstracted from the domain, and we had a high level of code coverage from 600-odd unit tests and many integration tests; it was a code base to be proud of.
Why change? We had a valid, well-understood architecture in place. But our non-functional requirements changed: we were originally talking about our biggest network being 10,000 players, which grew to 200,000+ and up to 1 million. We needed to scale, be available 99.99% of the time, and be auditable. We could be dealing with thousands of web requests per second. Amongst these changes we were also starting to feel pain with the current architecture. Now, I am not going to speak ill of this architecture, but we did find parts of the code base that started to smell.
- Fat view models. It's common to implement CRUD behaviours for the entities in your model and expose them through your web services. This is fine, but what we found was that our application logic was being assembled in our view models. A view model would be injected with many proxies to get data from the various web services, which was then mashed up to form the UI.
- Fat services. With all the will in the world, logic would find its way out of the domain and into a service, even with peer code reviews.
- Multiple data mappings. We read data from the database through an ORM into the domain (transform 1); a requesting web service then maps the domain object into a message object (DTO) (transform 2); and the UI takes the message object and transforms it into a presentation object (transform 3). Although this might not seem like a lot, and AutoMapper made it easier, it's mundane code that needs testing because it can go wrong.
- Interacting with the database. I have been pro-ORM, and most of the projects I have worked on over the last four years used ORMs; I have been happiest with NHibernate or Castle ActiveRecord. But with the volumes of data we were dealing with, and performance being paramount, we needed finer control over the SQL. So it was back to crafting our own stored procedures, mainly to respect the database and have the ability to tune it accordingly. We also have a complex domain model, and we could unknowingly fetch half the database back through one query. Yes, we could lazy load, but then we could end up with a "select n + 1" issue, and getting that balance right is a problem we could do without.
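To illustrate the lazy-versus-eager balance described above, here is a minimal sketch against NHibernate's LINQ provider. The `Playlist`/`PlaylistItem` entities are hypothetical, not our production model; the trade-off is the point.

```csharp
using System.Collections.Generic;
using System.Linq;
using NHibernate;
using NHibernate.Linq;

public class Playlist
{
    public virtual int Id { get; set; }
    public virtual IList<PlaylistItem> Items { get; set; }
}

public class PlaylistItem
{
    public virtual int Id { get; set; }
}

public class PlaylistQueries
{
    private readonly ISession _session;
    public PlaylistQueries(ISession session) { _session = session; }

    // Lazy loading: one query for the playlists, then one extra query
    // per playlist the moment Items is touched -- the "select n + 1" problem.
    public int CountItemsLazily()
    {
        return _session.Query<Playlist>()
                       .ToList()
                       .Sum(p => p.Items.Count);
    }

    // Eager fetch: a single joined query, but with a complex domain model
    // it is easy to drag far more of the graph back than the caller needs.
    public int CountItemsEagerly()
    {
        return _session.Query<Playlist>()
                       .Fetch(p => p.Items)
                       .ToList()
                       .Sum(p => p.Items.Count);
    }
}
```

Neither option is wrong in isolation; the problem is having to make this call query by query across a large relational model.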
I first came across CQRS about two years ago. At first, I was sceptical. About six months later it just clicked, and now I am completely sold on the concept.
More importantly, implementing an architecture based on CQRS and Event Sourcing rectified our current issues and satisfied all the non-functional requirements. It also gave us a lot of other benefits.
Solving the current issues:
- Fat view models: We return the data needed for a view in one query, straight from a stored procedure. Each view model needs to know only one proxy to get its data, so we no longer need to shape the data in the view model.
- Fat services: Our services are just facades over our command handlers. Each command handler deals with a single behaviour.
- Multiple data mappings: These are reduced. The results from our query store are mapped straight into a message object (DTO). We still need to map in the UI, though.
- Specific queries: we bring back only the data we need. Even before changing the query store to denormalised tables, we were seeing the benefits while still querying our relational database.
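The command side described above ("services are just facades over our command handlers") can be sketched like this. All names here are hypothetical illustrations, not our production code, and the event-store plumbing is elided.

```csharp
using System;

// A command captures a single intended behaviour, not a CRUD update.
public class RenamePlayerCommand
{
    public Guid PlayerId { get; set; }
    public string NewName { get; set; }
}

public interface IHandles<TCommand>
{
    void Handle(TCommand command);
}

// Each command handler deals with exactly one behaviour.
public class RenamePlayerHandler : IHandles<RenamePlayerCommand>
{
    public void Handle(RenamePlayerCommand command)
    {
        // Load the aggregate from the event store, invoke the behaviour,
        // and persist the resulting events (elided in this sketch).
    }
}

// The WCF service shrinks to a thin facade that dispatches commands,
// so business logic has nowhere to hide outside the domain.
public class PlayerService
{
    private readonly IHandles<RenamePlayerCommand> _renameHandler;

    public PlayerService(IHandles<RenamePlayerCommand> renameHandler)
    {
        _renameHandler = renameHandler;
    }

    public void RenamePlayer(Guid playerId, string newName)
    {
        _renameHandler.Handle(new RenamePlayerCommand
        {
            PlayerId = playerId,
            NewName = newName
        });
    }
}
```

Because the facade has no logic of its own, the "fat services" smell has no room to grow back.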
Beyond solving our current issues, the concept gave us so much more.
Planning the change
Before starting the move, I had to plan out many things. First, I had to present the idea to my team and get them to buy into it. I also needed the nod from management, so I had to plan how the change was going to happen. I had to present the architecture and how we would move to it, while still delivering new features and without knocking timescales into another year.
Most of the refactoring would be on the server side, so we could change the implementation behind each service call. The calls you start with are important. We chose to tackle the lookup data and administration service calls first, knowing that converting those parts of the domain and database would give us a good foundation moving forward.
Intercepting service calls one by one seemed the best approach, because the risk was small and each change was easy to revert. The intercepting itself would be easy, but driving out the CQRS-based architecture was not, so we allowed a block of time to create the new databases, domain, etc. We also had to logically group the features to try and do a feature per sprint.
Although we were moving to a new architecture, we still had to ensure that the old database was updated and that the existing features still worked. This was not a big issue, as we could listen to events coming through the event bus and update the old database accordingly. This felt good: we were using a clear benefit of the new architecture to keep the old stuff in sync. Once the whole system had been moved over, we would remove this bridge and the old database.
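The bridge can be sketched as an event subscriber that replays each event into the legacy schema. The event, interface, and table names below are hypothetical, assuming a simple publish/subscribe bus.

```csharp
using System;

// An event already flowing through the new event bus.
public class PlayerRenamed
{
    public Guid PlayerId { get; set; }
    public string NewName { get; set; }
}

public interface ISubscribeTo<TEvent>
{
    void Handle(TEvent @event);
}

// Abstraction over the legacy relational table being kept in sync.
public interface ILegacyPlayerTable
{
    void UpdateName(Guid playerId, string name);
}

// The bridge: subscribes to new-architecture events and applies the
// equivalent change to the old database until it can be retired.
public class LegacyDatabaseBridge : ISubscribeTo<PlayerRenamed>
{
    private readonly ILegacyPlayerTable _legacy;

    public LegacyDatabaseBridge(ILegacyPlayerTable legacy)
    {
        _legacy = legacy;
    }

    public void Handle(PlayerRenamed @event)
    {
        _legacy.UpdateName(@event.PlayerId, @event.NewName);
    }
}
```

Removing the bridge later is just a matter of unsubscribing these handlers; nothing in the new architecture depends on them.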
Hurdles that got in the way
Although I planned the move, things happened that I didn't see coming.
At the beginning, I worked in isolation to craft the implementation on a separate branch, while the rest of the team carried on with the normal daily development. At various points I tried to walk the team through the new architecture to raise their level of understanding. I say I tried: CQRS was new to them and they needed more exposure to really get it. My mistake was not getting the team involved more frequently.
I got to the point where I had to refactor the core domain and could not pussyfoot around it any longer, so I just went for it. I was blessed that the team got a fairly stable build out to relieve the pressure, so I could merge my branch back into the main trunk. This was painful: I had been working on the branch for three months, and although I had planned out which parts of the code base to work on, changes and conflicts still happened.
Once the code was merged, we tackled the core domain, which took another three months. Progress was slow because I had to educate the team, and the team had to make mistakes as part of that learning. Our deadlines slipped, the board needed to be updated, and this affected the marketing and product launch. This was caused by not educating the team at the beginning and not enough stakeholder management.
Standing in the light at the end of the tunnel
After six months of hard work, the team and I completed the refactoring. If I had to do it again (which I hope I do), I would produce the software architecture description document and other documentation sooner, and manage stakeholders better.
Our product has benefitted massively from CQRS. Was it worth it? Absolutely.