Exporting significant SQL reports with ActiveRecord

August 29, 2018 – Nicolas Zermati 6-minute read

This article was written before Drivy was acquired by Getaround, and became Getaround EU. Some references to Drivy may therefore remain in the post.

A few months ago we faced a memory issue on some of our background jobs. Heroku was killing our dyno because it was exceeding its allowed memory. Thanks to our instrumentation of Sidekiq, it was easy to spot the culprit: the job was doing a fairly complex SQL request and outputting the query’s result into a CSV file before archiving that file.

In this article, I’ll explain what happened and detail the method we used to solve the problem. I had never seen or used this technique before, so I thought it would be nice to share it.

More context

We run a tiny framework, something closer to a convention, to run SQL queries and archive their results. With the framework’s noise removed, we had code like this:

rows = with_replica_database do
  ActiveRecord::Base.connection.select_rows(query)
end

CSV.generate do |csv|
  csv << header
  rows.each { |row| csv << row }
end

In this simplified example, there are:

  • with_replica_database: a helper that runs a piece of code against a replica database (a rough sketch follows this list),
  • query: our SQL query, as a String, and
  • header: a placeholder for the Array of our column names.
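
The exact implementation of with_replica_database doesn’t matter here. As a rough illustration only, Rails 6’s multi-database support (which didn’t exist when this was written) would let such a helper be as simple as:

def with_replica_database(&block)
  # Hypothetical sketch: assumes a reading/replica role configured in database.yml
  ActiveRecord::Base.connected_to(role: :reading, &block)
end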

We used select_rows as the results of the query didn’t really match any of our models. It is a reporting query that does too many joins, group bys, and subqueries. The query takes dozens of minutes to run. We could, and probably should, integrate that into our ETL, but that’s not the point…

The resulting CSV file wasn’t that big, maybe a hundred megabytes.

The issue

The memory consumption came from the many rows returned by the select_rows method. Each row is an array with many entries, since our CSV has many columns. Each entry can be a complex datatype that ActiveRecord converts into an even more complex Ruby object. We had many instances of Time with their TimeZone, BigDecimal, …

Since the query returns millions of rows, even though memory usage only grows linearly with the result size, it ends up being too high.

An impossible approach

At first I thought about paginating the results, much in the same way that find_each works. The problem with that was that for 10000 rows, paginated by 1000, it would take roughly 10 times as long as the same request without pagination.

Our query looked like this:

SELECT t.a, u.b, SUM(v.c) as c
FROM t
JOIN u ON u.id = t.u_id
JOIN v ON v.id = u.v_id
GROUP BY t.a, u.b

Just imagine t, u, and v being subqueries with unions, OR conditions, other GROUP BYs, and more poorly performing stuff. The sad part is the GROUP BY, which requires the engine to go through all the results in order to group rows correctly. Using pagination on this would be something like:

SELECT t.a, u.b, SUM(v.c) as c
FROM t
JOIN u ON u.id = t.u_id
JOIN v ON v.id = u.v_id
GROUP BY 1, 2
ORDER BY 1, 2
LIMIT 10000
OFFSET 1000000

So the fewer entries on a page, the less memory used on the client side but the more time spent in the database, because more requests will be made. The more entries on a page, the more memory used on the client side but the less time spent in the database, because fewer requests will be made.
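
To make that trade-off concrete, here is a hypothetical sketch of what the paginated export could have looked like; page_size and the LIMIT/OFFSET wrapping are illustrative, not our actual code:

page_size = 10_000
offset = 0

CSV.generate do |csv|
  csv << header
  loop do
    # Every iteration re-runs the expensive GROUP BY on the database side
    page = with_replica_database do
      ActiveRecord::Base.connection.select_rows(
        "#{query} ORDER BY 1, 2 LIMIT #{page_size} OFFSET #{offset}"
      )
    end
    break if page.empty?
    page.each { |row| csv << row }
    offset += page_size
  end
end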

In the end, this approach wouldn’t have been future-proof.

Focusing more on the problem

It was easy to try to find solutions to the “results do not fit in memory” problem because it is a well-known one. It is common with Rails for long lists and association preloading to cause memory issues. The quick fix is to use the find_each or in_batches methods.
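
For reference, that quick fix looks something like the following when the results do map onto a model. Booking and its columns are stand-ins, and csv is assumed to come from a surrounding CSV.generate block:

# find_each loads records in batches of 1,000 instead of all at once
Booking.where(state: :finished).find_each(batch_size: 1_000) do |booking|
  csv << [booking.id, booking.price]
end

# in_batches yields one ActiveRecord::Relation per batch instead of single records
Booking.in_batches(of: 1_000) do |bookings|
  bookings.each { |booking| csv << [booking.id, booking.price] }
end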

I realized that I didn’t actually need to load everything into memory: I was only interested in getting one row at a time, writing it into the CSV, and then forgetting about it, thanks to the garbage collector.

Solving the right problem

After acknowledging what the true issue was, it was possible to find something more efficient: streaming APIs.

CSV.generate do |csv|
  csv << header
  with_replica_database do
    # Grab the raw Mysql2::Client behind the ActiveRecord connection
    mysql2 = ActiveRecord::Base.connection.instance_variable_get(:@connection)
    # stream: true fetches rows lazily, cache_rows: false lets each row be
    # garbage-collected once written, and as: :array yields arrays like select_rows did
    rows = mysql2.query(query, stream: true, cache_rows: false, as: :array)
    rows.each { |row| csv << row }
  end
end

The idea was to bypass ActiveRecord and use the underlying MySQL client, which provides a stream option. I’m sure there are similar options for other databases.
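
On PostgreSQL, for instance, the pg gem exposes a single-row mode; an equivalent, untested sketch (reusing query and the csv block from above) might look like this:

with_replica_database do
  # Hypothetical PostgreSQL variant: rows are fetched one at a time instead of
  # being buffered client-side by the driver
  pg = ActiveRecord::Base.connection.instance_variable_get(:@connection)
  pg.send_query(query)
  pg.set_single_row_mode
  pg.get_result.stream_each { |row| csv << row.values }
end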

With that implementation we only make one request, so no pagination, but we don’t hold all the results in memory. We never needed all of them in memory at once in the first place anyway.

Conclusion

I would be very interested in using this feature with ActiveRecord’s ability to return models rather than rows. Maybe it is already possible, but I didn’t find it. If you have any further information on the subject, please let me know!

I hope you won’t have to use these lower-level APIs. But if you do encounter the same kind of memory issues, don’t throw money at it right away. Try this first ^^

And obviously, most of this could be avoided by tweaking the layout of the data and its relations. In our case, denormalization could make this easier, but we’re not ready to pay that cost - yet.

Edit: As nyekks mentioned on Reddit, Sequel seems to handle this better out of the box.
