Dynamic ActiveRecord columns
January 18, 2019 - @kddeisz
When dealing with any kind of data transmission, the problem almost always boils down to the amount of data being sent. Engineers can optimize the speed of the wire and the compression of the data, but at the end of the day sending fewer things is almost always going to yield the biggest gains.
We see this discussed in web programming all the time with regard to the frontend of the application. Tools like webpack and rollup are pushing more and more toward code splitting as an effective way of improving initial page load times (i.e., sending less JavaScript down to the client until they actually need it for execution).
This post, however, is about backend technology. Specifically, it’s about sending less data over the wire when selecting data from the application database. Even more specifically, in Ruby on Rails
applications the ActiveRecord
ORM allows us to generate SQL queries. This tool is a sharp knife that can very easily end up selecting way more data than we need, resulting in large numbers of allocations, fragmentation, and overall memory bloat. One way to combat this is to use a feature of ActiveRecord
that involves dynamically creating columns in result hashes.
Joining strings
As a place to start looking at this technique, let’s say we have a User
class with first_name
and last_name
columns. We want a list of the full names of every user in the system, by combining the first and last names of every user into a full_name
string. We can perform this in ruby
just fine by running:
User.all.map do |user|
"#{user.first_name} #{user.last_name}"
end
This works well and achieves our result, but at the cost of way more memory than we needed. The problem is, User.all
by default will generate a query that looks like:
SELECT "users".* FROM "users"
That *
is the dangerous part. It tells the database to return every column that exists on the users
table. Each of those columns is then returned to ActiveRecord
, which dutifully converts each of them into ruby
objects that each require their own allocations. ActiveRecord
then returns the resulting allocated User
instances back to us as an array. Once we have it back, we then only use a tiny amount of the allocated memory to perform the task that we actually need to perform.
Remember what we said at the beginning:
sending fewer things is almost always going to yield the biggest gains
Let’s rewrite the query to send fewer things back to us and then work backward from there. Since we only need the first_name
and last_name
columns, let’s select only those columns:
SELECT "users"."first_name", "users"."last_name" FROM "users"
We can perform the exact same mapping as before, and this time we’ve required less data (proportional to the number of columns on the users
table) to be sent back from the database. We can achieve this by modifying our Relation
with the select
method, as in:
User.select(:first_name, :last_name).map do |user|
"#{user.first_name} #{user.last_name}"
end
One small caveat to the above approach: it works in this example because we’re only referencing the first_name
and last_name
attributes out of the resulting User
objects. If we were to instead reference an attribute that we didn’t select (e.g., admin
) then it would raise an error complaining about missing the attribute, as in:
ActiveModel::MissingAttributeError (missing attribute: admin)
Dynamic columns
We can do better than the above query though, by performing the string joining in the SQL query itself. One nice feature of ActiveRecord
is that when it encounters a column in a response from a database query that is not contained in the table that maps to the object performing the query, it treats it as it would any other column. This means we can do the following:
users = User.arel_table
concat =
Arel::Nodes::NamedFunction.new(
'CONCAT',
[users[:first_name], Arel::Nodes.build_quoted(' '), users[:last_name]]
)
User.select(concat.as('full_name')).map(&:full_name)
# => SELECT CONCAT("users"."first_name", ' ', "users"."last_name") AS full_name FROM "users"
A couple of things are going on here worth discussing:
- arel_table is a method that gives access to the underlying
Arel
DSL for query genereation. It was introduced in Rails 3 and in the latest versions of Rails is bundled withActiveRecord
. Arel::Nodes::NamedFunction
is a way of generating the string for SQL functions that are eventually passed intoActiveRecord::Relation
objects.- The real trick here happens with the
as
call, which names the result of the concatenation as a column, which results in theUser
objects having a dynamically-definedfull_name
method, which we can then call asuser.full_name
. In this case we’re mapping over the list using symbol’sto_proc
syntax (&:full_name
).
As a best practice, I like to write the actual definition of the method on the class as well so it looks less like magic. In order to mirror the behavior of missing attributes above, I tend to write methods like the following:
class User < ActiveRecord::Base
def full_name
return read_attribute(:full_name) if has_attribute?(:full_name)
raise ActiveModel::MissingAttributeError, 'missing attribute: full_name'
end
end
This can be further simplified with a little metaprogramming, as in the following:
module Ext
module AttrDynamic
def attr_dynamic(*names)
names.each do |name|
define_attr_dynamic(name)
end
end
private
def define_attr_dynamic(name)
define_method(name) do
return read_attribute(name) if has_attribute?(name)
raise ActiveModel::MissingAttributeError, "missing attribute: #{name}"
end
end
end
end
ActiveRecord::Base.extend(Ext::AttrDynamic)
Now I can go back into my User
class and just call the attr_dynamic
method, as in:
class User < ActiveRecord::Base
attr_dynamic :full_name
end
This indicates to future developers that this column is a dynamically-defined column that may or may not be present based on the manner in which the User
was selected, and so they should not be surprised when the full_name
method is referenced later in the code.
Counting
This approach really shines when getting counts of a parent-child table relationship. Let’s take as an example an app with users
and posts
:
class User < ActiveRecord::Base
has_many :posts
end
class Post < ActiveRecord::Base
belongs_to :user
end
Let’s say we want to know the number of posts by each user within a list of users, for instance on the admin page. In this case we could go and perform a count for each user:
User.all.map do |user|
user.posts.size
end
Immediately some combination of your tooling (for instance the bullet gem) or your coworkers should point out that this will result in an N+1 query problem (in that you make one query for the users and then one more for every users’ posts). You can see it in your logs as:
User Load (13.8ms) SELECT "users".* FROM "users"
(36.8ms) SELECT COUNT(*) FROM "posts" WHERE "posts"."user_id" = $1 [["user_id", 1]]
(0.8ms) SELECT COUNT(*) FROM "posts" WHERE "posts"."user_id" = $1 [["user_id", 2]]
(0.4ms) SELECT COUNT(*) FROM "posts" WHERE "posts"."user_id" = $1 [["user_id", 3]]
...
So, you’ll go into your code and add the appropriate eager-loading, just as your tooling and coworkers are telling you to:
User.includes(:posts).map do |user|
user.posts.size
end
Now when you rerun the code you’ll only get two generated queries, one select query from the users
table and one select query with a massive IN
statement for the posts
table:
User Load (3.0ms) SELECT "users".* FROM "users"
Post Load (1.7ms) SELECT "posts".* FROM "posts" WHERE "posts"."user_id" IN ($1, $2, $3, $4, $5, ...) [["user_id", 1], ["user_id", 2], ["user_id", 3], ["user_id", 4], ["user_id", 5], ...]
This works, but think about all of that data that you didn’t need to send back. You selected "posts".*
, which likely includes a full text column, and you only needed a count. Let’s apply the same technique we did above to cut down on all that waste:
class User < ActiveRecord::Base
has_many :posts
# Using `attr_dynamic` here from the above code example
attr_dynamic :posts_count
scope :with_posts_count, lambda {
select(arel_table[Arel.star], Post.arel_table[:id].count.as('posts_count'))
.left_joins(:posts).group(arel_table[:id])
}
end
class Post < ActiveRecord::Base
belongs_to :user
end
Now we can run achieve the same effect as above using both more readable and performant code, as in:
User.with_posts_count.map(&:posts_count)
# => SELECT "users".*, COUNT("posts"."id") AS posts_count FROM "users" LEFT OUTER JOIN "posts" ON "posts"."user_id" = "users"."id" GROUP BY "users"."id"
This results in an understandable named scope returning a column that is read from a method, all with minimal code and maximal performance.
Caveats
It would be a little misleading to end the post here, without attaching a slight bit of warning. There are times when dynamic columns are appropriate and times when they aren’t, and those times are almost exclusively going to determined through measurement. I would check out memory_profiler and ruby-prof, both are excellent tools for measuring the impact of code on memory and performance.
Additionally, you can and should measure the performance of your database. This post has discussed sending too much information back, which is a measurable quantity. To find out how, check your respective databases’ documentation. There are great ecosystems around all of the popular options (MySQL
, PostgreSQL
, Oracle
, etc.).
Also as an aside, most of the examples above could be even further optimized with some combinations of select_values
/pluck
. I’m assuming you have other things to do with the objects in your view layer besides just getting counts, which is why I’ve stuck with returning ActiveRecord::Base
instances.
Finally, GROUP BY
as in the last example has its own performance implications, and will largely depend on your schema and the indices you have in place. As always, measure before you go wildly changing code. In general speed and memory are an inverse correlation, so it will also depend on what is important to you and your business/application.
tl;dr
ActiveRecord
is a powerful tool that in inexperienced hands can select far more data than necessary. You can tell it to create method names from dynamic column definitions using SQL strings or Arel
. These dynamic methods can drastically cut down on the amount of memory allocated both by ruby
and by your database.