Common mistakes that cause Ruby on Rails app outages

Everybody makes mistakes. Some of them are caught early in the deployment pipeline: while writing code, testing it locally, or during code review. Unfortunately, some hide cleverly and pop up in the production environment.

The same is true for Ruby on Rails applications. In this article, I would like to share mistakes that I have made or encountered over time, and propose solutions for avoiding them. It’s much better to learn from others’ mistakes 🙂

1. Misconfiguration of number of threads and database connections

Let me start here with three simple questions:

  1. How many database connections does your application use on average?
  2. Where would you check this number?
  3. What is the maximum number of connections your database can handle?

To avoid potential production outages you should know the answers to all of them. To better illustrate the problem, let’s assume that we use puma in clustered mode as the web server for our Ruby on Rails app, hosted on one virtual machine. We set:

  • (maximum) number of threads to 16 [1],
  • number of workers (processes) to 2.

How many connections to the database would the application open? We have 2 workers and 16 threads each, so 2 * 16 = 32 connections in total. But wait… is that enough? Very often it isn’t. According to the gem’s README.md:

Be aware that additionally Puma creates threads on its own for internal purposes (e.g. handling slow clients). So even if you specify -t 1:1 (where the first number is the minimum number of threads and the second is the maximum), expect around 7 threads created in your application.

To be well-prepared, we would need to secure at least a few additional connection slots. 40 available database connections would be a good starting point for such an application. Otherwise, we could encounter the error below:

ActiveRecord::ConnectionTimeoutError - could not obtain a database connection within 5 seconds

and our first production outage.
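For reference, a puma configuration matching the example above could look like the sketch below (the values are illustrative); the pool size in config/database.yml should then be at least equal to the per-worker maximum number of threads, with some headroom left on the database side for Puma’s internal threads.

# config/puma.rb — a minimal sketch for the example above (illustrative values)
workers 2        # 2 worker processes
threads 5, 16    # 5 minimum, 16 maximum threads per worker

# config/database.yml should then use a pool of at least 16 per worker, e.g.:
#   production:
#     pool: 16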

Very often, after some time, when RPM (the number of requests per minute) grows, there is a need to adjust the number of workers and/or threads of a Ruby on Rails application. That’s why it is important to be able to provide accurate answers to the three questions from the beginning of this section.

Ideally, there should be monitoring in place that informs us automatically before we exceed the limit of available database connections.
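As a quick sanity check, the current state of the ActiveRecord connection pool can also be inspected from a Rails console (the stat method is available in Rails 5.1 and later):

# Rails console
ActiveRecord::Base.connection_pool.stat
# => {:size=>16, :connections=>3, :busy=>1, :dead=>0, :idle=>2, :waiting=>0, :checkout_timeout=>5}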

2. Incorrect parameter handling

JSON, String & Symbol, what a lovely trio 🙂 I very often ask myself the same question: how should I access elements of a parsed JSON response? Should I use the json['key'] or the json[:key] notation? Well, even though the answer seems obvious, one silly mistake and a production outage is ready:

json_response = "{\"name\": \"Igor\"}"
=> "{\"name\": \"Igor\" }"
parsed_json_response = JSON.parse(json_response)
=> {"name"=>"Igor"}
parsed_json_response["name"]
=> "Igor"
parsed_json_response[:name]
=> nil

Did you know that you can pass the symbolize_names option to the JSON.parse method?

symbolized = JSON.parse(json_response, symbolize_names: true)
=> {:name=>"Igor"}
symbolized['name']
=> nil
symbolized[:name]
=> "Igor"

Furthermore, there is the HashWithIndifferentAccess class defined in Rails that you can benefit from. As a result, you can access elements of a parsed response using both the String and the Symbol notation.

indifferent = JSON.parse(json_response).with_indifferent_access
=> {"name"=>"Igor"}
indifferent['name']
=> "Igor"
indifferent[:name]
=> "Igor"

To be even more sure, always remember to write tests with the correct data structures. I have found myself, multiple times, writing tests where mocks returned a Hash object instead of JSON (a String, to be precise). In consequence, I accessed its elements with the json[:key] notation. The tests were green, but the new feature did not work as expected.
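A minimal RSpec sketch of this advice, with hypothetical names: the stub returns a raw JSON string, exactly like a real HTTP response body would, so the test exercises the same parsing path as production code.

# spec/clients/profile_client_spec.rb — a sketch only, names are made up
require "json"

RSpec.describe "handling a stubbed HTTP response" do
  it "parses a raw JSON string, like production does" do
    http_client = double("HttpClient")
    # Stub with a String body, not a Hash, so JSON parsing runs in the test too.
    allow(http_client).to receive(:get).and_return('{"name": "Igor"}')

    parsed = JSON.parse(http_client.get)

    expect(parsed["name"]).to eq("Igor")
    expect(parsed[:name]).to be_nil # a Hash stub would have hidden this trap
  end
end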

Finally, always try to test/invoke written code as soon as possible, locally or, if that is not possible, on pre-production environment(s). The sooner the better.

3. Missing fallback responses for HTTP communication

How many internal and external services does your application depend on? In the era of microservices & SaaS solutions, the answer to this question is rarely zero.

Let me ask you another question. How many of those services could be turned off while the app could still be used by end users without any issues? I bet the answer to this question is closer to zero 🙂

When you write code responsible for fetching data from other services, always ask yourself a simple question: what will happen to my application when that service is down? Let me give you a hint: displaying a white screen with a 500 error code is not the best idea. Sadly, there is no silver bullet for this problem. I can share a few ideas which I have applied, though:

  1. If the data you fetch does not change very often, store it in the application cache to reduce the number of HTTP requests.
  2. Verify whether HTTP cache headers are set. They make the data cacheable by different caches (browser, nginx etc.).
  3. Rescue the error in your application and return a fallback response (but log the error). If you expect an array of n elements, return an empty array instead (see the sketch after this list).
  4. In case of an error, inform the user about the issue and (if possible) display a button to retry the failed request.
  5. Store failed requests and retry them asynchronously after a while. Allow the user to proceed.
  6. Introduce exponential backoff to retry failed requests before displaying an error to the user.
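Here is a minimal sketch of idea 3, assuming a hypothetical RecommendationsClient built on Net::HTTP from the standard library; the endpoint and error handling are illustrative, not a complete implementation.

# A hypothetical client that degrades gracefully when the upstream service is down.
require "net/http"
require "json"

class RecommendationsClient
  ENDPOINT = URI("https://recommendations.example.com/api/items") # hypothetical URL

  # Returns an array of items, or an empty array when the service fails.
  def fetch_items
    response = Net::HTTP.get_response(ENDPOINT)
    return [] unless response.is_a?(Net::HTTPSuccess)

    JSON.parse(response.body)
  rescue SocketError, Timeout::Error, JSON::ParserError => e
    Rails.logger.error("Recommendations service failed: #{e.message}")
    [] # fallback: the page renders without recommendations instead of a 500
  end
end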

As you can see, the solution needs to be chosen based on the concrete situation, but any of the above points seems better than a blank screen with a 500 error code.

4. Missing or incorrect timeout settings

Another point, another question. How many seconds need to pass before your application cuts off a request made over HTTP by a user’s browser?

Rails by default does not control the request lifecycle. The same is true for the puma web server. The Chromium browser has its socket (request) timeout set to 5 minutes. What does it mean for us as developers? Let’s go back to the app from point 1 with 2 workers and 16 threads per worker (32 threads in total; let’s skip the additional threads for the sake of simplicity).

If the app had a public API endpoint which, for some reason (like a heavy query on a database table with a missing index), became unresponsive, with response times growing to minutes instead of seconds, it would take only 32 simultaneous requests to that endpoint to prevent the application from accepting any other requests. All available threads would be occupied, waiting for the slow responses.

What could you do? Introduce the rack-timeout gem, which aborts requests that take too long. The gem raises an exception, the request is cut off, and the occupied thread can handle the next requests from the queue.

When introducing the gem it’s important to know what kind of HTTP server (like nginx) sits in front of the application and whether it has a timeout set. The application timeout should be lower than the request timeout set on the HTTP server. Otherwise, the server may cut off requests before the application finishes processing them.
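A minimal sketch, assuming a recent version of the gem and an nginx proxy_read_timeout of 30 seconds in front of the app (both values are illustrative):

# Gemfile — in Rails apps the gem inserts its middleware automatically
gem "rack-timeout"

# Recent versions of the gem read their settings from environment variables, e.g.:
#   RACK_TIMEOUT_SERVICE_TIMEOUT=15
# The application then aborts a request after 15 seconds, safely below the
# hypothetical nginx proxy_read_timeout of 30 seconds, so nginx never cuts
# the connection first.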

5. Missing DDoS/HTTP flood protection

How many HTTP requests with an incorrect login & password combination could I make to your application before my IP is blocked and I am unable to make further requests?

Even though mechanisms protecting from DDoS (Distributed Denial of Service) attacks can be implemented on different levels, thanks to the open-source community we can also do it directly in Ruby on Rails applications.

Let’s meet the rack-attack gem. Its documentation, together with a ready-to-use example configuration, is pleasant to read, so I will skip additional how-to-use explanations here. If you don’t have any anti-DDoS solution in place, I recommend introducing the gem. DDoS attacks have already caused too many production outages.
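For illustration only, a throttle in the spirit of the gem’s example configuration could look like the sketch below; the /login path and the limits are assumptions about your app.

# config/initializers/rack_attack.rb — illustrative limits
Rack::Attack.throttle("logins/ip", limit: 5, period: 60) do |req|
  # Count only POST requests to the (hypothetical) login endpoint, keyed by client IP.
  req.ip if req.path == "/login" && req.post?
end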

Summary

The above issues can happen in any Ruby on Rails application. I hope that you learned something today and, as a result, will make the applications you develop more stable and secure. That is our responsibility.

If you have encountered any other issues that caused outages, please share them in the comments. I would like to learn from you as well 🙂

Footnotes

  1. I wrote maximum in brackets because Puma offers autoscaling of threads based on current traffic, and the minimum number of threads may be lower than the maximum.
 

Igor Springer

I build web apps. From time to time I put my thoughts on paper. I hope that some of them will be valuable for you. To teach is to learn twice.

 
