Promotion App v3 — The journey from a monolith to a scalable app

For those who did not attend BTW or could not make it to our talk on Promo v3, the application that holds all the promotions you see on the eMAG platforms, here is a short recap of what we talked about:

1. The problem at hand — the monolith
2. How we organized to handle it
3. What we actually did
4. What we accomplished
5. What we learned

First, a short history of our Promotion app. The previous version of the application (v2) was released in 2013 and was designed to serve Romania and its 40k active products (1P), most of which were included in the 12k active promotions. Meanwhile, the business was moving fast: we integrated Bulgaria and Hungary, and the big product expansion came (eMAG added new products and categories each month and redefined itself from an electro IT company to an everything store). As the categories covered more and more ground, they also brought different selling strategies that made them more attractive to the customer (tires sold well when bought together, stoves sold well with ovens, books had a tight market and required higher discounts, fashion introduced sizes, etc.).
All this had to be added to the initial architecture, and slowly but surely it built up a huge technical debt.

At the beginning of 2017 we had 250k active products for Romania, Bulgaria and Hungary and over 120k promotions, which is a tenfold increase in promotions since 2013.

The problem at hand — the monolith
Looking at the technologies we used, we can list:
• Our own apps
• PHP 5.3
• CakePHP
• Zend 1
• Symfony 2
• Memcached
• RabbitMQ
• MySQL: we moved from 5.5 to 5.7 about a year ago, which helped us a lot, but it still could not quite keep up with our data load

To understand how the Promo app V2 was handling everything, we gathered some numbers in the table below. Viewing a random promo is one of the most basic actions in the app, yet the page took up to 42 s to load. A single promo could hold up to around 624 products, and it took us around 2.5 h to send 30k promos to the website.

Action | Promo V2
View a random promo (page load) | 42 s
Add a specific product to a promo | 14 s
No. of products in a single promo | 624
Send 30k promos to the website | 2.5 h

Definitely, something started to smell fishy, so we breathed deeply and prepared to dig deeper into the situation.

How we organized to handle it
The process required us to be Agile, so our team included Developers, Testers, Product Owners and more, all with the goal of building the product as fast as possible.

The team also had to sustain the organic growth of Promo V2 while developing the new application, which wasn’t easy.

The to do list
• Get user feedback
• Create an architecture structure
• Define a minimum viable product (MVP)
• Find the right tools
• Build a Roadmap

We had to find out what our users wanted to keep, remove or add in the new version of the Promo App. What we discovered later is that the user always wants more, so our desire to know everything from the start led us to our first mistake in the MVP approach.

What we actually did
Where does any application start? Very simple: with its users! That’s why the first thing we did was talk to them and get their input on how they would like the new application to work, what features it should have and how it could make their lives easier. Some important ideas we took note of:

• Focus on the human user, with fast-loading pages and short response times, and redesign the interface so that it no longer holds the user up while major processing is done in the backend

• Have a gatekeeper to ward off any excess requests from external applications that might slow down the backend

• Have a single point of contact between the promo app and these external applications, and make our backend as independent as possible, thus allowing us to deploy new code rapidly without needing to notify or change anything in the external clients. In addition, server isolation was easier this way

Therefore, the new promotion processor app would:
• hold the bulk of the business logic
• validate, save and process promotions, and communicate with external apps interested in promo generation
• be the only one with a connection to the database
• handle cache manipulation
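
To make this split concrete, here is a minimal Python sketch. It is illustrative only and not the actual application code: the class names, the in-memory dict standing in for the database and the per-client request cap are all assumptions made for the example. The gatekeeper rejects excess requests before they reach the processor, and only the processor touches storage.

```python
from typing import Callable, Dict, List


class PromoProcessor:
    """Hypothetical processor: holds the business logic and is the only storage owner."""

    def __init__(self) -> None:
        self._db: Dict[str, dict] = {}  # in-memory stand-in for the real database

    def save_promo(self, promo: dict) -> str:
        # Validation lives here, next to the data, not in the gatekeeper.
        if not promo.get("products"):
            raise ValueError("a promo needs at least one product")
        self._db[promo["id"]] = promo
        return promo["id"]


class Gatekeeper:
    """Hypothetical gatekeeper: caps requests per client within a sliding time window."""

    def __init__(self, processor: PromoProcessor, max_requests: int,
                 window_seconds: float, clock: Callable[[], float]) -> None:
        self._processor = processor
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock  # injected so the behaviour is easy to test
        self._hits: Dict[str, List[float]] = {}

    def handle(self, client: str, promo: dict) -> dict:
        now = self._clock()
        # Keep only the timestamps that still fall inside the window.
        recent = [t for t in self._hits.get(client, []) if now - t < self._window]
        if len(recent) >= self._max:
            self._hits[client] = recent
            return {"status": 429, "error": "too many requests"}
        recent.append(now)
        self._hits[client] = recent
        return {"status": 200, "id": self._processor.save_promo(promo)}


# Usage: two requests per second are allowed, the third is turned away.
fake_time = [0.0]
gate = Gatekeeper(PromoProcessor(), max_requests=2, window_seconds=1.0,
                  clock=lambda: fake_time[0])
promo = {"id": "promo-1", "products": ["sku-1"]}
statuses = [gate.handle("website", promo)["status"] for _ in range(3)]
print(statuses)
```

Because external clients only ever talk to the gatekeeper, the processor behind it can be changed and redeployed without the clients noticing, which is the independence described above.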

We went back to the technologies we used, did our research, came to some conclusions and ended up in a productive brainstorming session that not only helped us improve the application, but also improved the overall experience of the team when it comes to designing and building something new.

Since we wanted to use Entry (the gatekeeper in front of the processor) as a proxy, we built a test that received a POST request, made an API call to an external resource and responded to the initial client with a hardcoded result. We ended up choosing Node.js over PHP for Entry since it performed better under this particular server requirement. We discarded Go because we did not have enough experience with it and we personally did not like the syntax that much.

The Battle between MongoDB and Cassandra
We were so divided between Mongo and Cassandra for holding the bulk of our promotion data that we organized around 6–7 meetings just to discuss pros and cons. They all ended in a “final battle” meeting where our top two programmers each gave a full presentation for one of the databases. We had 7 challenges that had to be completed with each database, and for each one we compared the time it took and the sample code needed for the operations (selects, updates etc.). Each challenge mimicked a live situation we could encounter.

Challenge examples:
• A campaign starts and we have to create, validate and send many promos to the website at the same time (simulates major inserts)
• Find out whether a product is present in multiple promotions that are active at the same time (selects for the same data in multiple documents)
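
The logic behind the second challenge can be sketched in a few lines of plain Python. This is an in-memory illustration of the question being asked, not the MongoDB or Cassandra queries from the meetings; the promo structure, field names and sample data are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

# Hypothetical promo documents: an id, an active period and a list of products.
promos = [
    {"id": "promo-1", "start": date(2017, 1, 1), "end": date(2017, 1, 31),
     "products": ["tire-a", "stove-b"]},
    {"id": "promo-2", "start": date(2017, 1, 15), "end": date(2017, 2, 15),
     "products": ["tire-a"]},
    {"id": "promo-3", "start": date(2017, 3, 1), "end": date(2017, 3, 31),
     "products": ["tire-a", "stove-b"]},
]


def overlapping_promos(promos):
    """Map each product to the pairs of promos that contain it and overlap in time."""
    by_product = defaultdict(list)
    for promo in promos:
        for product in promo["products"]:
            by_product[product].append(promo)

    conflicts = {}
    for product, group in by_product.items():
        pairs = [
            (a["id"], b["id"])
            for a, b in combinations(group, 2)
            # Two closed date ranges overlap when each starts before the other ends.
            if a["start"] <= b["end"] and b["start"] <= a["end"]
        ]
        if pairs:
            conflicts[product] = pairs
    return conflicts


conflicts = overlapping_promos(promos)
print(conflicts)  # {'tire-a': [('promo-1', 'promo-2')]}
```

With 120k promotions, checking every pair in application code like this would not scale, which is why the challenge measured how each database handled the same question.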

At the end of each challenge, we all voted on the best solution. The winner was the database that gathered the most votes (that is, won most of our challenges) and was the easiest to understand and maintain. It ended up being MongoDB, and the main reasons were the document format of our data, the existing (minimal) know-how, which made the integration easier, and, at that time, the Doctrine support for Mongo. We also chose to keep some data in MySQL for easier access to the complicated filtering queries needed in the UI.

To sum it up, we ended up using PHP 7, Symfony 3, Node.js, MongoDB, MySQL 5.7, Redis and RabbitMQ. For support we used tools such as VividCortex, NewRelic, Grafana, JIRA, Logstash and Kibana, plus internal tools (ATF, eDeploy and the HX Team).

What we accomplished
Remember the table with the tasks performed in the app? After all the work, we definitely saw improvements. For example, the page load went from 42 s to 2 s.

Action | Promo V2 | Promo V3
View a random promo (page load) | 42 s | 2 s
Add a specific product to a promo | 14 s | 2 s
No. of products in a single promo | 624 | 6k
Send 30k promos to the website | 2.5 h | 10 min
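
To put these numbers in perspective, the speedups can be worked out with a few lines of Python. The figures come from the table above; reading "6k" as roughly 6000 products is an assumption, so the last ratio is approximate.

```python
# (before, after) pairs taken from the table, converted to a common unit per row.
measurements = {
    "view a random promo (s)": (42, 2),
    "add a product to a promo (s)": (14, 2),
    "send 30k promos (min)": (2.5 * 60, 10),  # 2.5 h expressed in minutes
}


def speedup(before: float, after: float) -> float:
    """How many times faster the new version is."""
    return before / after


speedups = {name: speedup(b, a) for name, (b, a) in measurements.items()}

# Capacity growth for products in a single promo; "6k" is assumed to mean ~6000.
capacity_growth = round(6000 / 624, 1)

for name, factor in speedups.items():
    print(f"{name}: {factor:g}x faster")
print(f"products per promo: about {capacity_growth}x more")
```

In other words, page loads became 21 times faster, adding a product 7 times faster, and sending 30k promos to the website 15 times faster.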

What we learned
• Build an MVP
• Backwards compatibility is a bitch
• Not everything is a nail

Build an MVP
Because we already had a feature-rich application (V2), the customer had a high standard as to what made an application functional. This made things significantly harder for us when we started defining our MVP. The functionalities that we initially viewed as nice-to-haves ended up being must-haves for the end user. Because of this, the estimations and complexity grew accordingly. Also, as a chain reaction:
• the time slots reserved for this project started colliding with another major project (Black Friday)
• dependencies that were initially agreed with other teams had to be moved to other time slots
• hardware acquisition also had to be considered, since we now required more firepower

If I had to do it all again, I would definitely focus more on finding a way to reduce the scope of the initial release and then iterate constantly.

Backwards compatibility is a bitch
Backwards compatibility is such a nice thing to have from a user point of view but such a hard thing to implement from a developer point of view. We had so many systems that depended on the monolith’s way of doing things that a lot of the development (around 25–30%) went into making sure that the new didn’t break the old and that other applications had a smooth switch to the new one. Backwards compatibility also limits your flexibility in thinking up new ways of doing things: you always have to make sure you still comply somehow with the old style. Most of the performance issues and bugs that we still face are related to inconsistencies caused by V3 and V2 still communicating and exchanging data. I’m not saying you shouldn’t do it, just that the cost is high and you should be aware of that.

Not everything is a nail
To wrap it all up, I would conclude that developing the new promotions app was a journey that taught us a lot about experimenting with different technologies and about what you must do to come up with a great product. In the end, all the battles of the brains we had along the way paid off immensely.

Hopefully you can take a look at how we did things and learn something that might help you on your project, whatever that project might be.