Create your own AWS #RabbitMQ Cluster, the #dubizzle way

A group of software engineers and SREs at dubizzle was recently tasked with building a stable RabbitMQ cluster for production. We rely heavily on Celery, and with our Docker-based RabbitMQ images failing almost every week – especially over weekends – we decided to take some time, do some research and come up with a healthy, robust cluster to support the massive number of tasks handled by Celery.

RabbitMQ

A brief description of RabbitMQ, as per Pivotal – the company that now owns RabbitMQ: RabbitMQ is the most widely deployed open source message broker. It is used by many of the internet giants for different purposes. At dubizzle, on our journey into microservices, we chose RabbitMQ as our communication bus over alternatives like Kafka/Kinesis and VerneMQ, and at some point we even considered writing our own.

RabbitMQ offers amazing flexibility, and once deployed you should rarely need to revisit it except for minor maintenance tasks. It also comes with a lot of out-of-the-box features that make working with it on a daily basis easier, like the nice admin interface and a RESTful API. RabbitMQ also supports many programming languages, with a wide variety of protocols, clients and libraries.

Our infrastructure

We rely on AWS for almost 99% of our services, both internal and public-facing; other than some third-party integrations, we are one of the largest AWS accounts in the region. We used to have a Docker-based RabbitMQ cluster built from open source images on Docker Hub, which failed to serve our needs: frequent downtime affected us badly. So instead of relying on Docker images, we decided to build the RabbitMQ cluster from scratch on bare EC2 machines. After a lot of investigation into the features that come with RabbitMQ plugins, we decided to implement the cluster using an EC2 launch configuration (with cloud-config user data), an Auto Scaling group and the RabbitMQ autocluster plugin, which comes with AWS support.
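As a sketch of what this clustering setup boils down to: the autocluster plugin's AWS backend is driven by a handful of environment variables seen by the RabbitMQ server process. The exact variable names depend on the plugin version, and the region value below is a placeholder, so treat the launch configuration script linked later as the authoritative source:

```shell
## Fragment: environment for the RabbitMQ server process
## (our launch configuration user data sets the equivalent)
export AUTOCLUSTER_TYPE=aws          # use the AWS discovery backend
export AWS_AUTOSCALING=true          # discover peers via the Auto Scaling group
export AWS_DEFAULT_REGION=eu-west-1  # placeholder: your cluster's region
export AUTOCLUSTER_CLEANUP=true      # drop nodes whose instances were terminated
```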

So, by the end of this blog post, we should have a working 3-node RabbitMQ cluster on AWS, with fault tolerance and automatic clustering of new nodes if any of the old nodes fail.

Let’s get our hands dirty

Code

The GitHub Gist at the following link contains the code we will use as the launch configuration user data, which will then be used by the Auto Scaling group – addressed later in the post – on AWS.

Some notes regarding the Gist

  • The Gist has comments that explain almost every step along the way, so please feel free to leave a comment asking about any part that might need more explanation.
  • You should change the password values on lines 71 and 76 to something secure.

Steps

  • On your AWS account, go to EC2 and, from the left sidebar menu, click Launch Configurations, then Create launch configuration. On the screen that appears, choose the AMI that suits you; for this tutorial we will be working with Ubuntu Server, and the latest supported LTS version at the time of writing is 16.04. So, beside that distro entry, click Select.
  • On the next screen, you will have to choose the instance type based on the load RabbitMQ is expected to handle. We tested multiple instance types, and finally chose t2.medium, considering the load will be moderate and consumers should always be running. For the sake of this tutorial, we will keep the default option, t2.micro. Then click Next: Configure details at the bottom.
  • To continue to the next step, we need to set up a new IAM role for the cluster, which will let the nodes discover each other through the AWS APIs. For this tutorial, we will call it rabbitmq-cluster-iam-role, with the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "autoscaling:DescribeAutoScalingGroups",
                    "autoscaling:DescribeAutoScalingInstances",
                    "ec2:DescribeInstances"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
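    The policy above grants read-only discovery permissions only. If you create the role from the CLI (aws iam create-role) instead of the console, the role also needs a trust policy allowing EC2 instances to assume it; the standard one looks like this:

    ```json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ec2.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    ```

    The console creates this trust relationship for you automatically when you pick EC2 as the service for the role.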
    
  • Launch configuration details:
    • Assign the launch configuration name on the next screen; for now we will call it RabbitMQ Cluster.
    • In the IAM role select box, choose the rabbitmq-cluster-iam-role we created in the previous step.
    • You can also enable CloudWatch monitoring on that form.
    • Click on Advanced Details and a new form will appear on the same page; paste the script linked earlier into the User data text box.
    • For IP Address Type, we will choose not to assign public IP addresses to the cluster instances, as we will be adding a load balancer in front of the cluster that can be used to access it either publicly or privately.
    • Click Next: Add Storage

  • RabbitMQ writes messages to disk if they are not consumed from memory after some period of time, so unless you change the configuration to keep messages in memory, or have consumers always running, the nodes will need a good amount of storage. Choose the amount you feel is appropriate; we will choose 10 GB. Then click on Next: Configure Security Group.

    • You will experience failures in the cluster if messages fill up your disk space, which is why it's always good to make sure your consumers are healthy and consuming messages at a steady pace.
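    Relatedly, RabbitMQ raises a disk alarm and blocks publishers once free disk space falls below its disk_free_limit (50 MB by default), which on a small volume leaves very little warning. It can be worth raising the threshold so the alarm triggers earlier; a sketch, assuming the classic Erlang-term rabbitmq.config format used by RabbitMQ 3.6:

    ```erlang
    %% /etc/rabbitmq/rabbitmq.config (fragment)
    [
        {rabbit, [
            %% Trigger the disk alarm (and block publishers) when free
            %% space drops below 2 GB instead of the 50 MB default.
            {disk_free_limit, {absolute, 2000000000}}
            %% Alternatively, relative to total RAM:
            %% {disk_free_limit, {mem_relative, 1.0}}
        ]}
    ].
    ```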
  • In this step, we will create a security group that allows communication between the RabbitMQ cluster nodes for clustering and synchronisation, and also lets us connect to any node directly, or through the load balancer, via the console or the management UI. As per RabbitMQ's documentation, we will need to allow the following ports:
    • 22: for ssh connections
    • 4369: epmd, a peer discovery service used by RabbitMQ nodes and CLI tools
    • 5672: used by AMQP 0-9-1 and 1.0 clients without TLS (5671 is the TLS port)
    • 15672: HTTP API clients and rabbitmqadmin (only if the management plugin is enabled)
    • 25672: used by Erlang distribution for inter-node and CLI tools communication and is allocated from a dynamic range (limited to a single port by default, computed as AMQP port + 20000).

    Note that this step depends on your AWS configuration; mainly, you should allow access through the mentioned ports only from the subnets that need to reach these nodes. That's why creating security groups is not covered in detail in this tutorial.
    Click Review
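    If you later script the security group instead (for example with boto3's authorize_security_group_ingress), the ingress rules for these ports can be generated in the IpPermissions shape the EC2 API expects. A sketch; the 10.0.0.0/16 CIDR is a placeholder for your own VPC or subnet range:

    ```python
    # Build one TCP ingress rule per RabbitMQ port, in the shape the
    # EC2 AuthorizeSecurityGroupIngress API expects.
    RABBITMQ_PORTS = {
        22: "SSH",
        4369: "epmd peer discovery",
        5672: "AMQP 0-9-1 / 1.0 clients",
        15672: "HTTP management API",
        25672: "Erlang inter-node / CLI communication",
    }

    def ingress_rules(cidr):
        """Return a TCP ingress rule for each RabbitMQ port, limited to cidr."""
        return [
            {
                "IpProtocol": "tcp",
                "FromPort": port,
                "ToPort": port,
                "IpRanges": [{"CidrIp": cidr, "Description": desc}],
            }
            for port, desc in sorted(RABBITMQ_PORTS.items())
        ]

    # Placeholder CIDR: substitute the range of the subnets that need access.
    rules = ingress_rules("10.0.0.0/16")
    ```

    The resulting list can be passed directly as the IpPermissions argument when authorizing ingress on the group.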

  • Review your launch configuration details and if everything looks fine, just press Create launch configuration from the bottom of the page.
  • A popup will now appear from which you can choose the key pair you will use to connect to these instances. Either create a new key pair or use an existing one.

    Make sure you have saved it somewhere secure and then press Create launch configuration.

  • Now we will create an Auto Scaling group for the recently created launch configuration. Press Create an Auto Scaling group using this launch configuration, and the following window will appear:
  • We add the group name, set the number of instances to 3 in the Group size field and choose the network and subnets; again, this is totally based on your AWS account configuration.
  • Click Configure Tags at the top. There we will add one tag that sets the node name when an instance is created, so we can easily identify and manage the nodes from EC2 later on. Then click on Review.

  • On the review window, if everything is ok, click the Create Auto Scaling group button at the bottom of the page. You should see the following window.
  • Now, let’s create a load balancer which will then be used to access the cluster. From the left menu, click on Load Balancers, then click on the Create Load Balancer button.
  • Choose the Classic Load Balancer and click Continue.

  • Choose a load balancer name (for now we will use RabbitMQ Cluster LB), choose which VPC the load balancer should reside in and, depending on whether you want a public or a private load balancer, check the Create an internal load balancer checkbox accordingly.
    Make sure you add the listening ports 5672 and 15672 as per the following screenshot.

    Choose subnets based on the VPC, then continue to Assign Security Groups.
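    For reference, the two listeners can be expressed in the shape the Classic Load Balancer API (aws elb create-load-balancer --listeners) expects. Note that AMQP traffic on 5672 must be forwarded as plain TCP, since it is not an HTTP protocol; the management UI on 15672 can be either:

    ```json
    [
        {
            "Protocol": "TCP",
            "LoadBalancerPort": 5672,
            "InstanceProtocol": "TCP",
            "InstancePort": 5672
        },
        {
            "Protocol": "HTTP",
            "LoadBalancerPort": 15672,
            "InstanceProtocol": "HTTP",
            "InstancePort": 15672
        }
    ]
    ```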
  • This step is to choose the security group for the load balancer, which might be totally different from the security group of the nodes created earlier. Make sure whichever security group you choose allows access to the load balancer via ports 5672 and 15672.
  • On the Configure Health Check tab, you can define the health check and how it is performed. The default settings should be fine.
  • Keep clicking next till you reach the Review page and click Create.
  • Now we need to attach our load balancer to the Auto Scaling group, so from the left menu, click on Auto Scaling Groups, choose the RabbitMQ Cluster ASG entry, then click Edit.
  • In the load balancer field, search for and select the newly created rabbitmq-cluster-lb. Then click Save.
  • We should now have a running AWS-based RabbitMQ cluster with a load balancer distributing traffic across all its nodes.
  • Depending on whether you chose the load balancer to be public or private, go directly to its URL or tunnel to it. You should be prompted to log in; use admin:admin as username:password, as set in the launch configuration code from the GitHub Gist.
  • After login, you should see your cluster nodes healthy, up and running.
  • Go on, try deleting one of the nodes from EC2, and watch a new one come up almost instantly and join the cluster.
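You can also verify (or automate) that check against the management HTTP API, whose /api/nodes endpoint reports whether each node is running. A minimal sketch; the hostname below is a placeholder for your load balancer's DNS name, and admin:admin matches the credentials from the Gist:

```python
# Query node status through the RabbitMQ management API.
import base64
import urllib.request

def build_request(host, user="admin", password="admin"):
    """Build a /api/nodes request with HTTP basic auth."""
    req = urllib.request.Request("http://%s:15672/api/nodes" % host)
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

def stopped_nodes(nodes):
    """Given the decoded /api/nodes payload, return names of nodes not running."""
    return [n["name"] for n in nodes if not n.get("running")]

# Example payload in the shape /api/nodes returns (trimmed to the fields used):
sample = [
    {"name": "rabbit@node-1", "running": True},
    {"name": "rabbit@node-2", "running": False},
]
```

In practice you would pass build_request(...) to urllib.request.urlopen, decode the JSON body and feed it to stopped_nodes; an empty result means all cluster members are up.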

Notes

  • When we first tested this with one of our VPCs on AWS, we had a problem with name resolution on the cluster nodes. We got in touch with the AWS team, who told us that name resolution of non-standard TLDs does not work in VPCs created before October 2016. So, after a couple of hours of trying to debug this, and after creating the ticket with AWS, we had to create a new VPC that would allow the nodes to resolve each other.
  • We also found another problem with non-standard private IP ranges in VPCs created before October 2016; if that applies to you, make sure you use a standard private IP range within your VPC.
  • We will keep updating this section along the way; we know enhancements can be made, so keep checking back for updates, and do let us know if you try this and hit any problems, so we can improve this guide.