Fetching millions of rows from PostgreSQL with Rails

Gustavo Diel
Published in Magrathea
Jul 10, 2019 · 2 min read


Let’s say you have a simple Rails application, and you want to process a large quantity of rows from a PostgreSQL instance. For instance, you want to iterate over every post the user has, so you may do something like this:

Post.where(user_id: 16)
==> SELECT "posts".* FROM "posts" WHERE "posts"."user_id" = 16

This SQL query might take anything from a few milliseconds to a couple of seconds on a regular-sized application. But what if you have millions of users, each of them has millions of posts, and each post carries lots of data? This is where that simple query might fail: the DB will try to process and return all the results at once, which may overload the connection, the DB machine, and even your own Rails console.

You could try to paginate the query with something like

Post.where(user_id: 16).limit(MAX_PAGE_SIZE).offset(offset)
==> SELECT "posts".* FROM "posts" WHERE "posts"."user_id" = 16 LIMIT MAX_PAGE_SIZE OFFSET offset

For the first pages it will work, but as the offset grows into the hundreds of thousands of rows, queries start hanging or even timing out. This happens because PostgreSQL processes OFFSET by scanning and discarding every row before the offset, so each page costs more than the one before it.
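The growing cost of deep offsets can be sketched in plain Ruby, with no database involved. This is an illustration only: the array stands in for a table, and the numbers are made up for the demo.

```ruby
# Simulates how PostgreSQL serves a LIMIT/OFFSET page: every row
# before the offset is scanned and discarded before the page is returned.
ROWS = (1..10_000).to_a
MAX_PAGE_SIZE = 100

# Returns the page plus how many rows the "database" had to walk through.
def fetch_page(rows, offset, limit)
  scanned = [offset + limit, rows.length].min
  page = rows.drop(offset).take(limit)
  [page, scanned]
end

_first_page, first_cost = fetch_page(ROWS, 0, MAX_PAGE_SIZE)
_last_page,  last_cost  = fetch_page(ROWS, 9_900, MAX_PAGE_SIZE)
# first_cost is 100, last_cost is 10_000: the deeper the page,
# the more rows get scanned just to be thrown away.
```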

The alternative: Streaming!

It is possible to use a raw PG connection in order to stream the results directly from the database, like this:

connection = ActiveRecord::Base.connection.raw_connection

# Feeds the DB with the query it will run,
# but does NOT execute it yet! Using a bound
# parameter ($1) also avoids SQL injection.
connection.send_query_params(
  'SELECT title FROM posts WHERE user_id = $1',
  [user_id]
)

# This line alone solves our problem: it sets the connection to
# single-row mode, so instead of sending the results all at once,
# the server sends them row by row, as requested by the application.
connection.set_single_row_mode

# get_result returns the next result from the DB, and stream_each
# yields each row as it arrives until there are no more rows to fetch.
connection.get_result.stream_each do |row|
  puts row['title']
end

# Drain any remaining results so the connection can be reused.
nil while connection.get_result
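The pull-based behaviour that single-row mode gives you can be sketched in plain Ruby with an Enumerator, no database required: rows are produced only when the consumer asks for them. The row contents here are made up for the demo.

```ruby
# Counts how many rows the producer actually materializes.
produced = 0

# The producer could emit a million rows, but it only runs
# when the consumer pulls from it.
rows = Enumerator.new do |yielder|
  (1..1_000_000).each do |i|
    produced += 1
    yielder << { 'title' => "Post #{i}" }
  end
end

# The consumer pulls three rows; the producer never builds the rest.
titles = rows.take(3).map { |row| row['title'] }
```

After this runs, `produced` is 3, not a million: just like single-row mode, nothing is materialized until the application asks for it.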

Conclusion

We now have a really simple but powerful script that can query billions of rows of data without degrading anybody's experience, and it only took half a day to write! We hope that streaming single rows from the DB through ActiveRecord's raw connection may help you in your future endeavours!
