We recently announced the launch of our new Client Area and Site Tools. Our COO spoke about it from a business perspective and explained how important this project is for our future growth. I would like to give you the technical perspective and explain why we also see these new interfaces as a technical milestone for SiteGround.
SiteGround has always been first and foremost a service company. However, for many years, our business growth has been tightly correlated with our technical evolution. The more software we developed in-house, the more the quality of our service rose and the more our reputation grew. Not only that, but we have gained the self-confidence of a high-tech company that produces powerful and smart software solutions and is always among the first to implement the most innovative technologies.
So, when we realized that many of our business ideas were slowed down or made impossible by the limitations of our currently-used underlying platform, we had the courage to think bigger than what’s available on the market and dared to re-create everything from the ground up.
From a business perspective, we wanted to manage a platform (frontend and backend) that would work on any device, have a modern look, and be lightweight, fast, secure, and easily scalable. Translated into technical terms, we scoped the following objectives:
1. Speed and Advanced UX
We decided to use single-page applications as that is THE must-have technology for a faster web experience.
2. Scalability and Security
We decided to adopt the micro-services philosophy and make it all API-based, so the system could scale more easily and incidents would have a smaller damage scope.
When you have many services, however, authentication and authorization are a big problem. So we needed a secure, quick, and easy solution for the cross-origin resource-sharing problem, and we chose JSON Web Tokens for it.
Operating at scale when you have millions of sites hosted on hundreds of thousands of containers also requires orchestration, service discovery, and a reliable messaging layer, for which we introduced Consul and NSQ.
Finally, good observability, monitoring, and alerting are crucial given the complexity of the system. So we put Prometheus and Grafana into action.
1. Single-Page Applications with React and Redux
We have two main single-page applications. One is the Client Area app, and the other one is our Site Tools app. They are both used hourly by thousands of people and the number is growing. Moving some application logic to the users’ browsers, where JS is executed locally, made things faster. Part of this is because we have less data manipulation performed in the backend. The other part comes from the fact that our apps live on an object-caching solution with CDN in front of it. This makes it really fast for users all over the world to quickly load the app and start using it.
Unified style guide
Single-page applications allow you to work with ready-made, reusable components that make your interfaces follow certain standards. For ours, we wrote a style guide, which makes our interfaces look homogeneous and lets us write new code faster and more easily. The trade-off is that the style guide repo alone, with all its examples, is about 550 000 lines of code!
As our development effort is large and growing and the code we work with is quite voluminous, we needed to make sure our standards would be met and not regressed by changes. That is why we covered the code with visual and end-to-end (e2e) tests using Cypress. If a code change visually modifies a page significantly, we get a red flag and can even automatically stop builds and code deployment.
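To make the "red flag" idea concrete, here is a minimal Python sketch of a visual regression gate. It is not our Cypress setup: the function names, the representation of a screenshot as rows of grayscale pixel values, and the 1% threshold are all assumptions chosen for illustration.

```python
# Hypothetical sketch of a visual regression gate (not the actual Cypress setup).
# A screenshot is modeled as a list of rows of grayscale values (0-255).

def changed_ratio(baseline, candidate, tolerance=0):
    """Return the fraction of pixels that differ by more than `tolerance`."""
    same_shape = len(baseline) == len(candidate) and all(
        len(a) == len(b) for a, b in zip(baseline, candidate)
    )
    if not same_shape:
        raise ValueError("screenshots must have identical dimensions")
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total


def should_block_deploy(baseline, candidate, max_ratio=0.01):
    """Block the build when more than `max_ratio` of the pixels changed (assumed threshold)."""
    return changed_ratio(baseline, candidate) > max_ratio
```

A real pipeline would compare rendered screenshots rather than hand-built grids, but the decision has the same shape: measure the difference, compare it to a threshold, and stop the deployment if it is exceeded.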
2. RESTful APIs and Microservices
We now have over 100 API calls that can be used to manage a single site hosted on our servers. Things like installing WordPress, issuing a private SSL certificate, creating mail accounts, creating new MySQL databases, adding SSH keys, and even migrating sites between servers are all done via API calls. These APIs give us the following benefits:
Fosters the usage of different languages
We have software engineers that write in the following languages/frameworks for our platform:
- PHP (Symfony, Zend Framework)
- Perl (Dancer)
To make sure that all the moving parts will be able to talk to each other, we needed RESTful APIs to bridge the communication.
RESTful API design is well-known
The good thing about RESTful APIs is that pretty much any developer nowadays understands them and is familiar with the concept. This means that when we need to add new features, we don’t have to train developers or make them learn the whole system’s ins and outs before they can deliver. They can just focus on the new piece of code needed and integrate it into the system.
Our partners can use the APIs
The fact that it’s all API-based makes it easy to give access to partners and let them create sites, install a CMS or perform other functions on our platform. That makes us flexible and is a good foundation for our future business growth.
User access customization
The RESTful APIs apply the CRUD principle, which says that every object manipulated by the API should support four functions: create, read, update, and delete. Coupled with the uniform REST representation of resources, this allows for an easy segregation of actions and for custom access and user control.
Together with the JSON web tokens, we used the APIs to provide customized access levels in our new interfaces. It is possible now to allow an operator to create mail accounts for a domain name, but not have access to the existing mail accounts or any other tools. For a start, we have introduced only 3 user roles: site owner, collaborator and white-label client, but we have laid the foundation to add a lot of customization options easily in the future. Role-based access control allows us to provide granular access to specific services and features.
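The combination of CRUD actions and roles can be sketched in a few lines of Python. This is an illustration only: the role names echo the ones above, but the resource names, the permission table, and the `is_allowed` helper are hypothetical and do not describe our actual access rules.

```python
# Hypothetical sketch: map HTTP methods to CRUD actions and check them
# against a per-role permission table. All grants below are invented.

CRUD_BY_METHOD = {
    "POST": "create",
    "GET": "read",
    "PUT": "update",
    "PATCH": "update",
    "DELETE": "delete",
}

# role -> resource -> allowed CRUD actions (assumed values)
PERMISSIONS = {
    "site_owner": {
        "mail_accounts": {"create", "read", "update", "delete"},
        "databases": {"create", "read", "update", "delete"},
    },
    "collaborator": {
        # May create mail accounts but cannot see or change existing ones.
        "mail_accounts": {"create"},
    },
}


def is_allowed(role: str, method: str, resource: str) -> bool:
    """Return True if `role` may perform the CRUD action behind `method` on `resource`."""
    action = CRUD_BY_METHOD.get(method.upper())
    if action is None:
        return False
    return action in PERMISSIONS.get(role, {}).get(resource, set())
```

Because every resource exposes the same four actions, adding a new role or a new tool only means adding rows to the table, not writing new checks.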
Identify problems more easily & measure better
The more granular the construction, the easier it is to isolate a rotten part and replace or fix it. That’s one of the main benefits we see in the micro-services philosophy. We can more easily trace and fix issues with a specific function provided by a specific service. We can also better track the performance and usage of the different services so we can scale platform resources and prevent problems.
3. JSON Web Tokens
The moment we started working on single-page applications, we knew that we needed to scale our authentication and authorization setup somehow. Given the number of APIs that talk to each other and the users that also interact with these APIs (which are spread across different domains), authorization done the old way, via cookies, was neither optimal nor scalable. The two main problems we had were:
- Dealing with Cross-Origin Resource Sharing cookies and HTTP headers is not simple, and we wanted to avoid it. Combining JSON Web Tokens with our own APIs allows us to solve this issue.
- Scaling a backend system that checks a server-side session on every API request started looking terrible from a performance point of view.
After some research, we decided to use JSON Web Tokens for this part of the infrastructure. We configured a custom authentication and authorization system on our end. It allows us to benefit from things such as:
- Uniform authorization mechanism for different APIs
- Reduced server load and fewer DB hits
- Ability to use this solution across different domains
- Easier to achieve horizontal scalability
- A higher level of security for our users
- Single sign-on out of the box
4. Orchestration and messaging
The more we grow, the more servers (as a hardware unit) and containers (as a virtual unit) we operate. The Linux container is the main unit in our infrastructure, but the way we manage these containers is changing over time. Ordering new servers, deploying a container with the necessary software stack, adding that new container to all the databases that feed different services and APIs, all that and more requires orchestration. For this part, we have written a lot of code, specific to our platform and processes. Still, we heavily depend on well-established software projects to achieve our goals. We use HashiCorp Consul for service discovery and systems automation. Our messaging platform of choice is NSQ. We use Ansible for automating many software provisioning and configuration management tasks.
Two good examples are our reworked Let’s Encrypt and WordPress Automatic Updates systems. They used to run locally on every hosting server, and now they are based on a distributed approach. Servers are automatically registered into the system. Services use the automatic discovery provided by Consul. Then the above-mentioned APIs are used, and SSL issuance and renewal happen for every new domain name we register into the system. The same applies to each app that goes into the WordPress Automatic Updates system.
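The registration and discovery steps can be sketched against Consul’s HTTP API. The sketch below only builds the registration request and parses a health response, so it runs without a Consul agent. The service name, addresses, and the `/health` check path are hypothetical, and this is not our provisioning code.

```python
# Sketch of Consul service registration and discovery via its HTTP API.
# Service names, addresses and the /health path are hypothetical.
import json
import urllib.request


def build_registration(agent_url: str, name: str, address: str, port: int) -> urllib.request.Request:
    """Build (but do not send) a PUT to the agent's service registration endpoint."""
    payload = {
        "ID": f"{name}-{address}-{port}",
        "Name": name,
        "Address": address,
        "Port": port,
        # Consul polls this URL and marks the service unhealthy if the check fails.
        "Check": {"HTTP": f"http://{address}:{port}/health", "Interval": "10s"},
    }
    return urllib.request.Request(
        url=f"{agent_url}/v1/agent/service/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def healthy_endpoints(health_response: list) -> list:
    """Extract (address, port) pairs from a /v1/health/service/<name>?passing response."""
    endpoints = []
    for entry in health_response:
        service = entry["Service"]
        # Service.Address may be empty, in which case the node address applies.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints
```

Sending the request with `urllib.request.urlopen(build_registration(...))` would register the service with a local agent. A client can then pass the parsed JSON from the health endpoint to `healthy_endpoints` to find the instances that are currently passing their checks.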
5. Prometheus/Grafana for Observability
The way we deliver new services has also changed a lot. In the cloud-native world, we no longer care about individual hosts failing. No engineers in the NOC are called when this happens. There are no artisan sysadmins that treat machines and services with loving attention. Now we monitor the whole infrastructure with the following goals in mind:
- We have to know when things go wrong
- We must be able to debug and gain insight into the specific context
- We must be able to see changes over time and make predictions
- Data should be easily consumed by other systems and processes
To make the above more specific, we are now re-defining the “fail” scenarios and the alerts that we get for each. For example, we may get alerts that require human interaction when there is direct user impact, such as when a service that directly affects clients’ websites is degraded. But we may not need human intervention when a specific node has failed, because data is automatically redistributed to other nodes and there is no user impact, so no alert is needed in this case.
When we work on those alerts, we aim to go from the alert to the problematic sub-systems. For our engineers to do a quick analysis and offer a fast resolution, we must have enough data. We get it from distributed tracing, which allows us to follow specific users’ requests all the way from the browsers of our clients, through multiple APIs and different services, until they hit databases and/or storage engines. Each record about a request in the distributed tracing system gives us performance profiling info and a way to correlate issues with other events. We use distributed tracing to debug and optimize our code faster, and we are able to achieve higher feature velocity because of that.
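The alerting rule described above can be written as a small decision function. This Python sketch is illustrative only: the `Failure` fields and the action names are hypothetical, and the third case (a failure with no user impact that did not recover on its own) is an assumption that the text does not cover.

```python
# Hypothetical sketch of the alert-routing rule: page a human only on user impact.
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    component: str
    user_impact: bool      # does the failure affect clients' websites?
    auto_recovered: bool   # did the platform redistribute the load on its own?


def alert_action(failure: Failure) -> str:
    """Return "page", "ticket" or "none" for a detected failure."""
    if failure.user_impact:
        return "page"    # direct user impact always needs a human
    if failure.auto_recovered:
        return "none"    # e.g. a node died and its data was redistributed
    return "ticket"      # assumption: no impact yet, but follow up later
```

For example, `alert_action(Failure("storage-node-17", user_impact=False, auto_recovered=True))` returns `"none"`, which matches the failed-node case above.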
To achieve all the goals, we chose to use Prometheus and Grafana for profiling, metrics collection, logs analysis, and distributed tracing of events. We are now able to correlate information from multiple sources and reveal and predict deficiencies.
The amount of software we had to write to address this challenge and deliver the solutions scoped above is massive, but the work has been quite rewarding. To wrap it up, I’ll leave you with some numbers straight from the new software repositories and our JIRA:
- 2 539 604 lines of code
- 99% code test coverage
- 9250 JIRA tasks
- 199 560 words in JIRA stories
We are particularly proud of the code test coverage. Even though 2.5 million lines of code is impressive, we know that line count is not a great measure of success. The fact that virtually all our code is covered by tests is, however, a big achievement. From the very beginning, we set code test coverage as one of our main criteria for delivering quality software. If our new Client Area and Site Tools could not provide guarantees for quality, we could not deliver the awesome service our clients are used to. Rest assured, we are dedicated to providing the same top-notch hosting service, but this time with even better site management and collaboration tools, built on the world’s best and most innovative web technologies.