Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Relation#load_async to schedule the query on the background thread pool #41372

Merged
merged 1 commit into from Feb 16, 2021

Conversation

casperisfine
Copy link
Contributor

@casperisfine casperisfine commented Feb 8, 2021

This is a followup to #40037 (see that PR for complete context)

Relation#load_async schedules the query on the background thread pool, and returns self.

[].freeze
else
relation = join_dependency.apply_column_aliases(relation)
@_join_dependency = join_dependency
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very happy with having to store this in an instance variable for later use, but apply_join_dependency can sometimes trigger a query on its own, so if we were to call it twice, we'd generate an extra query in some cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test for this case? Or can you elaborate a bit more what you mean by it "can sometimes trigger a query on its own"? I don't follow what scenarios this might happen in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is exactly one test case: https://buildkite.com/rails/rails/builds/74758#40f90c22-7273-4458-a0db-71ec3fbf3638/1055-1064

The extra query is triggered here:

klass.connection.distinct_relation_for_primary_key(relation)

Which mean that in such case load_async would actually trigger a query in the main thread.

There might be a way to refactor this out, but I chose to prioritize a smaller more comprehensible PR.

# Post.where(published: true).defer # => #<ActiveRecord::Relation>
def defer
unless loaded? || scheduled?
return self if connection.current_transaction.joinable?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning here is that some shared code might attempt to use .defer, and sometime might be called from inside a transaction, and sometimes outside.

So have have to make an API decision of either

  • Querying immediately
  • Just do nothing

I believe the later make more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new name (which I agree with), I think it's probably better that load_async will still load even if it can't do it async.

In the async-possible case, the resulting relation here will behave like it was already loaded (by blocking if necessary to make that true)... I can't think of specific examples, but my gut says an un-loaded relation will behave less similarly in some edge cases.

That said, I can also see use cases where a knowing caller might want to say "start an async query for this if you can, because I think it's going to be used later... but if you'd need to block, then we'll just wait and see whether it's actually needed". (Association preload and "all Relation-shaped instance variables that get copied from a controller into a view context" being variously-reasonable examples there.) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can find arguments for both behaviors too.

I think the most common case this will happen, will be in tests using transactional fixtures. So I'm thinking we should chose the solution that will make this fallbacks behave closer to production, in term of assert_queries and similar test assertions.

Using this reasoning, I'd say you are right and we should fallback by either straight calling load, or maybe by calling exec_main_query(async: false), as to simulate the query being performed, but the records not being instantiated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I pushed 2042a2914535551f165378e11da19fef98ee7fbb, which does call exec_main_query(async: false) if we're inside a transaction.

The pro is that the behavior is much closer to a query that would be properly scheduled asynchronously, as the records instantiation is still delayed.

The con is that it open the door for some weird behavior, as the main query is executed inside the transaction, but if you have preloads or something like that, and you iterate on the relation outside the transaction, they might be performed outside.

So maybe falling back to load might be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not possible to do an async query then we should go through the regular load path rather than a new future result path that isn't actually future. This makes the feature easier to debug and harder to make accidental mistakes.

I made this mistake in my demo app by calling to_a on the load_async (ie Dog.where(name: "fido #{i}").load_async.to_a). From a user experience perspective it was pretty confusing to debug why it said it was async but it was not actually async.

Additionally I made another mistake that revealed a bug with pluck and load_async. If you do Dog.where(name: "fido #{i}").load_async.pluck(:id) it results in duplicate queries. I now know this is not the correct way to use the feature but it's an easy mistake to make that results in incorrect behavior.

  Dog Load (0.3ms)  SELECT `dogs`.* FROM `dogs` WHERE `dogs`.`name` = 'fido 1'
  Dog Pluck (0.2ms)  SELECT `dogs`.`id` FROM `dogs` WHERE `dogs`.`name` = 'fido 1'

In the non-async code path (ie Dog.where(name: "fido #{i}")load.pluck(:login) ) Rails knows what to do with an already loaded relation and we don't get duplicate queries (see https://github.com/rails/rails/blob/main/activerecord/lib/active_record/relation/calculations.rb#L184-L186)

  Dog Load (0.3ms)  SELECT `dogs`.* FROM `dogs` WHERE `dogs`.`name` = 'fido 1'

If we go the previous path for relations that can't be loaded async, we'll have less edge cases to worry about. I think the mistake I made with pluck and to_a is an easy one to make and so in that case we should make sure the app behaves as it previously did rather than having potentially surprising non-async async behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pluck is a nice find:

    def pluck(*column_names)
      if loaded? && all_attributes?(column_names)
        return records.pluck(*column_names)
      end

Since load_async doesn't set @loaded = true but @scheduled = true, pluck doesn't know it could use the future.

I either need to go over all Relation methods to make sure they handle both loaded? and scheduled? states, or we could merge both states together.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this mistake in my demo app by calling to_a on the load_async [...] it was pretty confusing to debug why it said it was async but it was not actually async.

I think this is my fault, that part of the design was explained in the initial proof of concept, but I didn't mention it in the first actual PR.

Then wehn you do use the relation, either the query is completed and it just uses the results, or it is being executed and it waits for it to complete, or the thread pool was busy and it simply execute it in the foreground.

That third case is important, because of the way Ruby thread scheduling works, unless you do other unrelated IOs between the time you did call load_async and the time you actually need the result, it's entirely possible that the query have been sitting idle in the work queue without even having been sent to the database.

We could of course wait for the thread pool to execute it, but that would be a waste because at this stage we know that the main thread will be blocked on that result, and that it's free to execute the query itself.

So yes, it's very easy to misuse load_async, and we should be on the lookout for making easier to understand what's happening, but I do believe that behavior is the current one.

Now if the database clients had true native async APIs, we could send the query and return a response promise/future, but that's unlikely to happen any time soon.

Let me know if that behavior is still unclear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: pluck

Now load_async marks the Relation as loaded?, which fixes pluck and similar methods that behave differently based on wether the Relation was loaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something else we could do it to call Thread.pass right after we enqueue the query, this would increase the chance that the query get performed asynchronously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So interestingly enough, calling Thread.pass caused the test of that fallback behavior to fail, which proves that it's actually effective at producing the behavior you expected in this simple scenario.

@@ -33,7 +41,7 @@ def complete(asynchronous_queries_tracker)
attr_reader :current_session

def initialize
@current_session = nil
@current_session = NoSession
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought, I think I make sense to have defer work by default. It's good to cancel queries when we know we exited the context in which they are useful, but if we're in rails console or just a random rails runner script, there's no reason to not allow it.

[].freeze
else
relation = join_dependency.apply_column_aliases(relation)
@_join_dependency = join_dependency
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test for this case? Or can you elaborate a bit more what you mean by it "can sometimes trigger a query on its own"? I don't follow what scenarios this might happen in.

activerecord/lib/active_record/relation.rb Outdated Show resolved Hide resolved
@casperisfine casperisfine force-pushed the ar-relation-async-query branch 3 times, most recently from 022015d to 56ab574 Compare February 11, 2021 08:13
@casperisfine casperisfine changed the title Implement Relation#defer to schedule the query on the background thread pool Implement Relation#load_async to schedule the query on the background thread pool Feb 11, 2021
@casperisfine
Copy link
Contributor Author

I've put more thoughts in the @async_executor.

In #40037 I made it held by the ConnectionPool, because I wanted to guarantee some level of concurrency for each database, as in I wanted to prevent one database to use all the threads, making others database wait.

But now I don't think it's really worth the tradeoff. First the vast majority of Rails applications out there have a single database, so for them, a per pool executor or a global one don't matter.

Then for apps using horizontal sharding (like Shopify), you can have a large number of pools to similar shards, but you only ever query a single one during a request cycle. If you have say 30 shards, you'd end up with 30 executor with X threads each, most of them sitting idle the vast majority of the time, and I think we should rather keep the number of threads low, as I don't quite trust the scheduler to properly handle hundreds of idle threads.

eileencodes added a commit to eileencodes/rails that referenced this pull request Feb 11, 2021
While testing out rails#41372 in a demo application I noticed that the
`ASYNC` portion wasn't getting added to the log. I tracked it down to
this method not passing `async: true` to `execute_query` so the `sql`
subscriber was always getting `async: false` for the payload.

Before:

```
(0.3ms)  SELECT * FROM dogs where name = 'fido 1'
(0.1ms)  SELECT * FROM dogs where name = 'fido 2'
(0.1ms)  SELECT * FROM dogs where name = 'fido 3'
```

After:

```
ASYNC  (0.3ms)  SELECT * FROM dogs where name = 'fido 1'
ASYNC  (0.2ms)  SELECT * FROM dogs where name = 'fido 2'
ASYNC  (0.2ms)  SELECT * FROM dogs where name = 'fido 3'
```

I also verified that this fixes `Dog Load` to `ASYNC Dog Load` in the
new PR. It will be easier to add a test for this functionality when
we merge rails#41372.
@casperisfine casperisfine force-pushed the ar-relation-async-query branch 4 times, most recently from 753aefe to d0fc4a8 Compare February 12, 2021 15:13
activerecord/CHANGELOG.md Outdated Show resolved Hide resolved
… thread pool

This is built on previous async support added to Adapter#select_all.
@casperisfine
Copy link
Contributor Author

@eileencodes code squashed, rebased, and about to be green on CI!

@eileencodes eileencodes merged commit 48effc7 into rails:main Feb 16, 2021
@casperisfine
Copy link
Contributor Author

casperisfine commented Feb 16, 2021 via email

leequarella added a commit to leequarella/rails that referenced this pull request Feb 17, 2021
rails#41372 Added
`ActiveRecord::AsynchronousQueriesTracker::NullSession` which replaced a
use of `nil` in `AsynchronousQueriesTracker`.  This commit changes the
`finalize_session` method to match that change from `nil` and properly
handle cases where it is called with a `NullSession` present.
@zzak zzak mentioned this pull request Jun 27, 2021
records.empty?
else
!exists?
end

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious - why we dont want to use guard clause in that case?

#
# Post.where(published: true).load_async # => #<ActiveRecord::Relation>
def load_async
unless loaded?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of guard clause will simplify most of the conditional statements in the codebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion but we prefer to not use guard clauses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place where the core team explains why they don't want to use guard clauses? I did a quick search, but wasn't able to find anything and would love to hear the reasons.

casperisfine pushed a commit to Shopify/rails that referenced this pull request Mar 30, 2022
Followup: rails#41372

Something we knew we'd need when we implemented `Relation#load_async`
but that we chose to delay to have time to think about it.

Right now, Active Record async support is limited to "collection results",
but among the not so fast queries that would benefit from asynchronicity
you often find aggregates (e.g. `count`, `sum`, etc) as well as hand crafted
`find_by_sql` queries.

`load_async` was easy to add as an API, because `Relation` acts as a collection
so it was trivial to simply block whenever it was iterated while retaining total
API compatibility.

For aggregates and `find_by_sql`, we have no other choice but to return something
different in async mode, with its own API.

This proof of concept showcase what this API looks like for `Relation#count`:

```ruby
Post.where(published: true).count # => 2
promise = Post.where(published: true).async.count # => #<ActiveRecord::Promise status=pending>
promise.value # => 2
```

This API should be applicable to all aggregate methods, as well as all methods
returning a single record, or anything other than a `Relation`.
casperisfine pushed a commit to Shopify/rails that referenced this pull request Apr 14, 2022
Followup: rails#41372

Something we knew we'd need when we implemented `Relation#load_async`
but that we chose to delay to have time to think about it.

Right now, Active Record async support is limited to "collection results",
but among the not so fast queries that would benefit from asynchronicity
you often find aggregates (e.g. `count`, `sum`, etc) as well as hand crafted
`find_by_sql` queries.

`load_async` was easy to add as an API, because `Relation` acts as a collection
so it was trivial to simply block whenever it was iterated while retaining total
API compatibility.

For aggregates and `find_by_sql`, we have no other choice but to return something
different in async mode, with its own API.

This proof of concept showcase what this API looks like for `Relation#count`:

```ruby
Post.where(published: true).count # => 2
promise = Post.where(published: true).count(async: true) # => #<ActiveRecord::Promise status=pending>
promise.value # => 2
```

This API should be applicable to all aggregate methods, as well as all methods
returning a single record, or anything other than a `Relation`.
gmcgibbon added a commit to gmcgibbon/rails that referenced this pull request Nov 2, 2023
@rails rails deleted a comment from abdullaachilov Jan 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants