GetAccept (YC W16) is an e-signature tool with document tracking and smart sales automation. Before launching our platform at the end of 2015, we spent months researching and configuring the hosting environment for our SaaS application, and we would like to share the result with you.
Many of you have probably experienced long application latencies, hassles pushing out new releases and sleepless nights restoring databases. We definitely have. So we decided to put an end to this and design our SaaS application from the ground up with these problems in mind. We aimed for 100% application uptime and full redundancy on all layers. 18 months into the business we have kept to our goal: not a single second of downtime and no service windows.
How did we achieve this?
We are a lean team and decided the whole environment needed to be easy for anyone on the team to manage, upgrade and downgrade. We adopted a no-SSH philosophy early: we should never have to log in to a server or type terminal commands. From a scalability and redundancy perspective this turned out to be great, since we’ve been forced to rethink the way we design the application and configure the environment. It also improves security and minimizes access to sensitive data. To keep the environment SSH-free we use only standard PHP components and the default EC2 configuration, and we set up local development environments with the same versions as the production environment.
Using Amazon OpsWorks we created a PHP layer, a load balancer and applications for our app and our API. We configured each stack using Chef, OS packages and custom JSON configuration. Each region is set up as its own stack to give us better control and monitoring. This also means we can scale each region up or down with a single click in OpsWorks, which automatically handles code deployments, load balancing and redundancy as well. We use SQS for worker queues and the Elastic Beanstalk worker tier for autoscaled backend workers, a combination that has never failed for us. An easy way to configure Elastic Beanstalk instances is to use the .ebextensions folder for OS packages, yum and other commands. In the root folder you can also use cron.yaml to have workers run scheduled tasks instead of configuring this manually on the servers.
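To make the queue-and-worker flow concrete, here is a minimal sketch in Python (illustration only, not our production code). It assumes the usual worker tier behavior, where the daemon on each instance POSTs every SQS message body to the application and treats an HTTP 200 response as success. The job type `send_reminder`, the queue URL and both function names are hypothetical.

```python
import json


def build_job_message(queue_url, job_type, payload):
    """Return keyword arguments for an SQS send_message call.

    The job is serialized as JSON so that any worker can decode it.
    """
    body = json.dumps({"type": job_type, "payload": payload}, sort_keys=True)
    return {"QueueUrl": queue_url, "MessageBody": body}


def handle_job(raw_body, handlers):
    """Process one message body and return the HTTP status to answer with.

    200 tells the worker tier that the job is done and the message can be
    deleted; any other status leaves the message in the queue for a retry.
    """
    job = json.loads(raw_body)
    handler = handlers.get(job["type"])
    if handler is None:
        return 500  # unknown job type: do not acknowledge the message
    handler(job["payload"])
    return 200
```

The dictionary returned by `build_job_message` is meant to be passed to an SQS client, for example `sqs.send_message(**message)` with boto3, while `handle_job` would sit behind the HTTP endpoint that the worker tier calls.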
We put the pieces together using Amazon Route 53 for our DNS hosting. It routes each user of the application to the closest application stack using the GeoLocation routing policy. Each region then has its own load balancer to split traffic between instances and to handle redundancy in case a server goes down.
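The routing setup can be sketched as a set of geolocation alias records, one per regional load balancer. The Python below only builds the change batch in the shape that the Route 53 ChangeResourceRecordSets API expects. The domain name, set identifiers, load balancer DNS names and hosted zone IDs are placeholders, not our real values.

```python
def geo_record(name, set_id, geo, elb_dns, elb_zone_id):
    """Build one UPSERT change for a geolocation alias record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,  # must be unique per record of this name
            "GeoLocation": geo,
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,
                "DNSName": elb_dns,
                # Let Route 53 skip this record when the target is unhealthy.
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_change_batch(name, regions):
    """Build the ChangeBatch covering every regional load balancer."""
    return {
        "Changes": [
            geo_record(name, r["id"], r["geo"], r["elb_dns"], r["elb_zone_id"])
            for r in regions
        ]
    }


# Placeholder regions: Europe goes to one stack, everyone else to the default.
REGIONS = [
    {"id": "eu", "geo": {"ContinentCode": "EU"},
     "elb_dns": "eu-elb.example.invalid.", "elb_zone_id": "ZEXAMPLEEU"},
    {"id": "default", "geo": {"CountryCode": "*"},
     "elb_dns": "us-elb.example.invalid.", "elb_zone_id": "ZEXAMPLEUS"},
]

batch = build_change_batch("app.example.invalid.", REGIONS)
```

With boto3 the batch would be applied with `client.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. The record with `CountryCode` set to `*` is the default that catches every location not matched by a more specific record.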
For our database layer we were early testers of the new Aurora DB, but we had to turn to RDS for MySQL since Aurora did not offer cross-region replication at the time. To maximize speed and redundancy we replicate the database to each region. Each replica uses load-balanced SSH tunnels to communicate with the master DB. You have probably figured it out by now, but this means that a whole datacenter or even a whole region can be down: our application automatically redirects the user to the second closest region, and the user can continue to work without disturbance.
Let’s take a step back first to understand why we would like the application hosted in multiple regions. In our business we work with displaying and signing documents combined with media such as video. We also have customers with a concern about hosting document data outside their own region (think US documents hosted in Europe or vice versa).
The obvious solution here would be to use a CDN that distributes files to local nodes upon request. In our case, testing showed that the initial load time through a CDN was drastically higher than when serving from S3 in the same region. Our solution was to let each customer select the preferred document hosting region (S3 region) when creating the account, with the lowest-latency region selected as the default.
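The region selection itself is simple. A minimal sketch, assuming the latency to each candidate S3 region has already been measured from the customer's browser; the region names, bucket names and measurements are made up for the example.

```python
def pick_default_region(latency_ms):
    """Return the region with the lowest measured latency."""
    if not latency_ms:
        raise ValueError("no latency measurements available")
    return min(latency_ms, key=latency_ms.get)


def resolve_bucket(customer_choice, latency_ms, buckets):
    """Use the customer's explicit choice, else the lowest-latency region."""
    region = customer_choice or pick_default_region(latency_ms)
    return buckets[region]


# Hypothetical measurements and bucket names.
measured = {"eu-west-1": 38.0, "us-east-1": 112.5}
buckets = {"eu-west-1": "docs-eu-example", "us-east-1": "docs-us-example"}

default_bucket = resolve_bucket(None, measured, buckets)
chosen_bucket = resolve_bucket("us-east-1", measured, buckets)
```

An explicit choice always wins over the measured default, which matters for customers who must keep documents inside their own region regardless of latency.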
We come from a world where it was accepted to spend hours building, uploading and backing up old code just to release a small patch, and where a failed release could occupy developers for up to a whole day trying to revert. Those days are gone: we have created an environment where each developer can release new updates and patches without any of these steps.
We keep our code in Git and use a commit hook that parses the commit comment and acts on it. Through the AWS SDK we use the built-in Git support in OpsWorks to deploy the new version, upload static files to S3 and test the new release. For example, if a developer simply commits a change, it is automatically deployed to our dev environment for manual testing. If the comment also ends with a release command, the Git server automatically deploys a new version of the application and workers to OpsWorks and Elastic Beanstalk. The updated version is deployed and live on all servers around the world within 2 minutes, with zero downtime thanks to rolling deployment. Yes, this gives a lot of power to the developers, but it also minimizes the number of steps needed to deploy a release, which has resulted in very few hiccups. As an added bonus, customers are pretty impressed when we fix and deploy a bug patch within a few minutes, sometimes while we still have them on the phone.
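The decision made by the commit hook can be sketched in a few lines of Python. This is an illustration under stated assumptions: the actual release command is not shown here, so `#release` is a stand-in, and the target names are hypothetical labels for the dev environment, the OpsWorks app and the Elastic Beanstalk workers.

```python
import re

# Hypothetical release marker; it only counts at the very end of the comment.
RELEASE_PATTERN = re.compile(r"#release\s*$", re.IGNORECASE)


def plan_deployments(commit_message):
    """Return the list of deployment targets for one commit comment.

    Every commit goes to the dev environment. A comment that ends with the
    release marker is also rolled out to the app and the workers.
    """
    targets = ["dev"]
    if RELEASE_PATTERN.search(commit_message.strip()):
        targets += ["opsworks-app", "beanstalk-workers"]
    return targets
```

For instance, `plan_deployments("Fix signing bug #release")` includes the production targets, while a comment that merely mentions the marker in the middle of a sentence is only deployed to dev. The hook would then trigger the actual deployments through the AWS SDK for each returned target.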
Here is our recipe to maximize uptime and redundancy with low latency:
- Amazon OpsWorks with a stack for each region containing a PHP layer and ELB, with the code base in a private Git repository
- RDS MySQL with a replica in each region over redundant SSH tunnels
- Elastic Beanstalk Worker connected to an SQS queue for background jobs
- S3 bucket in each region for secure and fast storage of documents and videos
- CloudFront for static content such as images, jQuery and AngularJS
- Route 53 DNS with GeoLocation routing policy and health checks
Some resources to get you started:
Good luck with your setups and feel free to reach out to us at GetAccept if you want any help. We'd also like to discuss and debate the hosting environment above.