Imagine you have to rewrite an existing web service to move to a new payment gateway (PSP, or payment service provider) for business reasons. Your first thought might be to replace the old gateway with the new one in its entirety and roll that out. That is a naive approach, especially with payment gateways, which have their own SLAs, agreements with acquiring banks, risk and fraud detection software, and so on, all of which make the switch risky in terms of conversions, revenue, customer retention and eventually the business. In this blog post, we discuss the approach we took to mitigate risks while switching payment gateways and why it was so important.
It all begins with our old payment service
Our old payment service is written in Python 2 and is tightly coupled with the old payment gateway. When we first dissected the problem, we thought that integrating a brand new payment gateway within the same flows, URLs and Python Bottle views would be trivial. Once we started working on the first POC, we realized that we were already generating a lot of spaghetti code, as the API flows for the two payment gateways were entirely different. While the old payment gateway's API design forced us to rely heavily on Redis and Gevent to optimize user and payment flows, there was absolutely no reason to reintroduce that dependency with the APIs of the new one.
A/B Tests Are Good As Long As They Are Simple
A/B tests are very powerful in making product decisions. At dubizzle, these tests have traditionally been more oriented towards the user facing components such as page flows, page components, placements and products.
However, these tests complicate things when you want to test underlying systems that are highly dependent on each other.
While we were still working on our POC, we realized that the A/B test we wanted to perform should not be done using tools such as Optimizely even if we somehow managed to integrate the new gateway within the same views and user flows. Here is why:
- Going the Optimizely route would have required us to implement A/B test logic in each and every view of the web service to ensure that all requests for a transaction are directed to the correct payment gateway based on a cookie. This could be something like the following in every view, and is definitely not neat:
    if cookie == 'OLD':
        use old payment gateway API
    else:
        use new payment gateway API
- Debugging two different payment gateways that share the same namespaces for things such as Redis, UUID generation and logging would definitely backfire, and we would be scratching our heads in no time.
Note that once a user lands in bucket A (the web service talking to the old payment gateway), the web service needs to ensure that the payment flow for that user continues on the gateway on which it was started. For instance, the web service should never initiate a transaction on the old payment gateway and try to finalize it on the new one. This problem can be solved using sticky sessions, which both Optimizely and HAProxy support.
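To make the per-view branching and the stickiness requirement concrete, here is a minimal Python sketch. It is not our actual service code: the cookie name `ab_bucket`, the view functions and the in-memory `transactions` dict are all hypothetical and only illustrate why this logic would have to be repeated in every view.

```python
# Hypothetical sketch: cookie name, view names and storage are assumptions.
BUCKET_COOKIE = "ab_bucket"


def gateway_for(cookies):
    """Pick the gateway from the A/B cookie. Every view would need this call."""
    return "old" if cookies.get(BUCKET_COOKIE) == "OLD" else "new"


def initiate_view(cookies, transactions, order_id):
    """Start a transaction and remember which gateway it started on."""
    gateway = gateway_for(cookies)
    transactions[order_id] = gateway
    return gateway


def finalize_view(cookies, transactions, order_id):
    """Finalize a transaction, refusing to switch gateways mid-flow."""
    gateway = gateway_for(cookies)
    started_on = transactions[order_id]
    if started_on != gateway:
        raise RuntimeError(
            "transaction started on %s gateway, cannot finalize on %s"
            % (started_on, gateway)
        )
    return gateway
```

Even in this toy form, the bucket check leaks into every view, which is the clutter we wanted to keep out of the application.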
A tale of two payment services
Once we had some idea of the problem we were dealing with, we decided to write a new payment service integrated with the new gateway and compare the two through a 50–50 split A/B test. The test had to fulfill at least the following:
- The look and feel of the card details page for the new service should exactly match the old one so that we do not skew conversions. Conversions are counted as card “Submit” events in this case.
- No internal backend task or API call should increase the transaction failure rate in the new service compared to the old one. This means that most user flows should work essentially the same way as in the old one.
A/B Test via HAProxy
Since Optimizely was out of the equation, we decided to leverage HAProxy for running the test.
HAProxy is a powerful layer 4 and layer 7 load balancer with an extensive set of features. One such feature is cookie-based persistence in a backend, which operates at layer 7.
We configured our HAProxy to have three backends:
- Main backend
- Old service backend
- New service backend
To give you a taste, here is an example of our HAProxy backends:
You might be wondering why we had static backends. This will become clear once we get to the HAProxy frontend, so let’s discuss the main backend first.
We went for a weighted round robin approach to load balance requests between the old service and new-service-new-pg. This makes sense for an A/B test where you just have to split the traffic into buckets A and B, with the caveat that any request that lands in bucket A should never land in bucket B for the duration of that session. We achieved this neatly through HAProxy's cookie directive, which is very powerful. Assuming that any user who initiates a transaction would complete it within 2 hours, we told HAProxy to discard a session cookie after 2 hours and generate a new one based on what round robin decides for that request.
This solved two important use cases for us:
- It gave us the flexibility to tweak the A/B test weights and have the change propagate to all users within 2 hours, without breaking any sessions or user flows.
- It increased the probability of a user attempting transactions on both services during the lifetime of the A/B test. This helped us identify issues where one gateway would reject the same credit card that the other one accepted. We were amazed to see how things can break in bigger contexts where multiple cascaded systems spread across continents are involved!
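The behaviour described above can be approximated with a small Python model. This is only a sketch of the idea, not how HAProxy is implemented: the class name, the `old-service` server name and the `(server, issued_at)` cookie tuple are assumptions, and real HAProxy interleaves weighted servers more smoothly than this naive schedule.

```python
import itertools


class StickyBalancer:
    """Toy model of weighted round robin with a persistence cookie and maxlife."""

    def __init__(self, weights, maxlife_seconds):
        # Expand integer weights into a repeating schedule (naive weighted RR).
        schedule = [name for name, weight in weights.items() for _ in range(weight)]
        self._cycle = itertools.cycle(schedule)
        self.maxlife_seconds = maxlife_seconds

    def route(self, cookie, now):
        """Return (server, cookie) for a request arriving at time `now` (seconds)."""
        if cookie is not None:
            server, issued_at = cookie
            if now - issued_at <= self.maxlife_seconds:
                # Cookie still valid: stay in the same bucket.
                return server, cookie
        # No cookie, or cookie older than maxlife: let round robin decide again.
        server = next(self._cycle)
        return server, (server, now)
```

With equal weights and a 2 hour maxlife, a user keeps hitting the same server for up to 2 hours after the cookie was issued and is re-bucketed on the first request after that, which is also why a weight change reaches everyone within one maxlife window.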
There is a small vulnerability in our setup that you may not have noticed yet. Consider the following diagram:
A user starts a transaction on the old payment service at the 45 min mark after receiving the cookie. They then proceed to the bank's 3-D Secure page at the 1 hr 59 min mark and are redirected back to our success URL after the 2 hr mark. Since HAProxy is configured with a cookie maxlife of 2 hrs, it will discard the session cookie and try to insert a new one on the redirect to the success URL. If we are unlucky, round robin might tie the new session to the new payment service, which would not know how to handle a success redirect that was configured by the old service.
In our case, we chose to ignore this problem because we had seen earlier that users who initiate a transaction usually complete it in under two hours. But can you imagine the severity of this problem if we had set the cookie maxlife to 1 min, for example?
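The arithmetic behind this edge case can be sketched in a few lines of Python. The function names and the `old-service` label are hypothetical, and the misroute estimate assumes that, from a single user's point of view, the round robin pick is effectively proportional to the configured weights.

```python
def cookie_expired(issued_at_min, request_at_min, maxlife_min):
    """True if the cookie is older than maxlife when the request arrives."""
    return request_at_min - issued_at_min > maxlife_min


def misroute_probability(current_server, weights):
    """Rough chance that a re-bucketed request lands on a different server."""
    total = float(sum(weights.values()))
    return 1.0 - weights[current_server] / total


weights = {"old-service": 50, "new-service-new-pg": 50}

# Cookie issued at minute 0, redirect back from 3-D Secure at minute 121.
lost_with_2h = cookie_expired(0, 121, maxlife_min=120)
# With a 1 minute maxlife, even the 3-D Secure step at minute 119 is too late.
lost_with_1m = cookie_expired(0, 119, maxlife_min=1)
risk = misroute_probability("old-service", weights)
```

Under these assumptions, a user who crosses the maxlife boundary mid-transaction has roughly a 50% chance of being sent to the wrong service on a 50–50 split, so the shorter the maxlife, the more transactions are exposed.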
Let’s come back to why we had the old_service and new_service backends along with main. HAProxy is usually configured with a frontend proxy that handles all ACLs (Access Control Lists), which in our case was configured like this:
These webhooks and endpoints that you see in the configuration are specific views that handle requests originating from the external gateways. Since the old and new services cannot speak to each other’s payment gateway, putting them under a round robin LB would mean 400s or 404s for as much as 100% of the requests. Also, since these requests come from payment gateways, there is no need for any sort of persistence, because they do not involve flows that span different views (all requests are touch and go with a 200).
The best place to fix this problem was the LB, and we did so by defining ACLs that route requests to the appropriate application servers through the old_service and new_service static backends. In simple words, this is how a typical conversation would go between a request originating from a payment gateway, HAProxy, and the application server it eventually hits:
Payment Gateway: Hey, I am an internal request from old payment gateway and I need to tell you that I have successfully received the payment.
HAProxy: Hey, I recognize you! You should pass through door old_service to reach your destination application.
Destination Application: Hey, I know how to process you! Let’s activate the order for this user. I will give you the number 200 in return!
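The routing decision in that conversation can be modelled in Python as a simple prefix match. This is an illustration of the idea rather than our HAProxy configuration: only the `webhook-new` path and the backend names `main`, `old_service` and `new_service` come from this post, while the `/webhook-old` prefix and the function name are assumptions.

```python
# Hypothetical ACL table: path prefix -> static backend.
# "/webhook-old" is an assumed path used only for illustration.
ACL_RULES = [
    ("/webhook-old", "old_service"),
    ("/webhook-new", "new_service"),
]


def choose_backend(path):
    """Send gateway-originated requests to a static backend, the rest to main."""
    for prefix, backend in ACL_RULES:
        if path.startswith(prefix):
            # Matched an ACL: bypass round robin and cookie persistence.
            return backend
    # Everything else goes through the A/B test in the main backend.
    return "main"
```

Requests that match an ACL never see the round robin or the cookie, so a webhook from one gateway cannot end up at the service that only knows the other gateway.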
Monitoring Request Flows
With all conditions in place, how would you test such a complicated A/B test?
HAProxy provides very precise logs for debugging such complicated systems. Let’s go through some examples from our actual A/B test logs to understand this.
Look at the persistence flags (--, NI, VU and VN) recorded for each request made for the same order id. These flags give a lot of information on how persistence was handled by the client, the server and HAProxy, and they are among the most important indicators to look at while testing and debugging.
Quoting HAProxy docs here:
-- : Persistence cookie is not enabled.
This is the case where the request path is webhook-new and it doesn’t make sense to put it under the A/B test.
NI : No cookie was provided by the client, one was inserted in the response. This typically happens for first requests from every user in “insert” mode, which makes it an easy way to count real users.
This is where round robin decides the test bucket for the user.
VU : A cookie was provided by the client, with a last visit date which is not completely up-to-date, so an updated cookie was provided in response. This can also happen if there was no date at all, or if there was a date but the “maxidle” parameter was not set, so that the cookie can be switched to unlimited time.
VN : A cookie was provided by the client, none was inserted in the response. This happens for most responses for which the client has already got a cookie.
This is how HAProxy finds the current bucket for the user and directs them to the correct backend.
Notice that once HAProxy picks server new-service-new-pg and sets a cookie, all subsequent requests from that user are directed to new-service-new-pg through the main backend.
For everything else that matches one of the ACLs, HAProxy directs the request without setting any cookie. This also ensures that A/B test results are not skewed by HTTP calls from computers rather than humans.
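When going through many log lines, it can help to decode these flags programmatically. The sketch below assumes you have already extracted the four-character termination state field from each HTTP log line (its last two characters are the persistence flags discussed above). The function names and the sample states are made up for illustration, and only the four flag pairs covered in this post are mapped.

```python
# Persistence flag pairs discussed in this post, with what they meant for our test.
PERSISTENCE_FLAGS = {
    "--": "persistence cookie not enabled (request routed by an ACL)",
    "NI": "no cookie from client, one inserted (round robin picked a bucket)",
    "VU": "valid cookie from client, updated cookie sent in response",
    "VN": "valid cookie from client, none inserted (user stays in bucket)",
}


def describe_persistence(termination_state):
    """Describe the persistence flags of a 4-character termination state."""
    if len(termination_state) != 4:
        raise ValueError("expected a 4-character termination state")
    flags = termination_state[2:]
    return PERSISTENCE_FLAGS.get(flags, "other flags, see the HAProxy docs")


def count_new_users(termination_states):
    """Count requests where a cookie was inserted, i.e. first visits."""
    return sum(1 for state in termination_states if state[2:] == "NI")


# Made-up sequence for one user: first request, two follow-ups, then a webhook.
sample_states = ["--NI", "--VN", "--VN", "----"]
new_users = count_new_users(sample_states)
```

Counting NI entries gives a quick estimate of how many users were assigned a bucket, while a steady run of VN entries for one order id indicates that persistence held throughout the flow.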
Architecture Layout For The A/B Test
Note that DNS and Edge Tier are common across all our microservices. All the A/B test magic happens after the traffic passes through HAProxy load balancers.
While this has worked perfectly for us, there are more complicated load balancing use cases that could require a lot more parameters than just session persistence. Shopify has covered this in an excellent blog post that leverages Nginx and OpenResty.
This A/B test was a major team effort across the dubizzle infrastructure team and product engineering. Thanks to everyone involved!
Hope you enjoyed this post. Feel free to add comments or ask questions. You can also reach out to me on Twitter: @mrafayaleem