Table of contents
    blog cover

    Solving the "must appear in the GROUP BY clause" Error in Rails and PostgreSQL

    Software Engineer
    Software Engineer
    Ruby on Rails
    Ruby on Rails
    This blog will give an example of how to resolve this common error when using group_by in Active Record

    1. Example problem

    First, take a look at this SQL problem. Imagine you're working on a booking application where users can have multiple sessions. You have two models:
    // language: ruby
    # User
    # email: string
    class User < ApplicationRecord
      has_many :sessions
    end
    
    # Session
    # tag: string
    class Session < ApplicationRecord
      belongs_to :user
    end

    Now, you need to find users with the same email address and have sessions with a specific tag. This might be necessary for identifying potential duplicate accounts or for a data cleanup task.

    2. The wrong approach

    Your first attempt might look something like this:
    // language: ruby
    User.joins(:sessions)
      .where(sessions: { tag: Session::SPECIFIC_TAG })
      .group(:email)
      .having('COUNT(DISTINCT users.id) > 1')

    The generated SQL:
    // language: sql
    SELECT "users".*
    FROM "users"
    INNER JOIN "sessions" ON "sessions"."user_id" = "users"."id"
    WHERE "sessions"."tag" = 'SPECIFIC_TAG'
    GROUP BY "users"."email"
    HAVING COUNT(DISTINCT users.id) > 1

    However, running this query results in an error:
    // language: bash
    PG::GroupingError: ERROR:  column "users.id" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 1: SELECT "users".* FROM "users" INNER JOIN "sessions" ON "sessions"...

    This error occurs because of how SQL handles grouping and non-aggregated columns. When using a GROUP BY clause, every column in the SELECT statement must either:
    1. Be included in the GROUP BY clause, or
    2. Be used in an aggregate function (like COUNT, SUM, AVG, etc.)

    In our query, we're grouping by email, but Rails' default SELECT statement (SELECT "users".*) is trying to select all columns from the users table. This creates a conflict because columns like id, which are not in the GROUP BY clause, are being selected without being aggregated.

    3. Solution

    To solve this problem and efficiently retrieve the data we need, we can use PostgreSQL's array_agg function along with pluck. Here's the working query:

    // language: ruby
    User
      .joins(:sessions)
      .where(sessions: { tag: Session::SPECIFIC_TAG })
      .group(:email)
      .having('COUNT(DISTINCT users.id) > 1')
      .pluck("users.email, array_agg(users.id)")

    It will return an array of arrays, where each inner array contains:
    • The shared email
    • An array of user IDs that share that email (using PostgreSQL's array_agg function)
    // language: ruby
    [["[email protected]", [1, 99, 543]]

    You might wonder why we use pluck instead of select in our solution. Let's explore what happens when we try to use select:
    // language: ruby
    User
      .joins(:sessions)
      .where(sessions: { tag: Session::SPECIFIC_TAG })
      .group(:email)
      .having('COUNT(DISTINCT users.id) > 1')
      .select("users.email, array_agg(users.id) as user_ids")

    This query still works, but it comes with some caveats:
    1. Partial Objects: This query returns User objects, but they're partial objects. They only have the email attribute populated, plus a custom user_ids attribute that doesn't correspond to any column in the users table.
    2. Type Casting: The user_ids attribute will be a string representation of the PostgreSQL array, not a Ruby array. You'd need to parse it manually.
    3. ActiveRecord Expectations: Some ActiveRecord methods might not work as expected on these partial objects, potentially leading to subtle bugs.
    4. Performance: Creating ActiveRecord objects is more resource-intensive than just returning the raw data.
    // language: ruby
    [#<User:0x00000001309c3098 id: nil, primary_email: "[email protected]">]

    Remember, when dealing with complex queries involving grouping and aggregation, it's often helpful to think about exactly what data you need and how you can retrieve it most efficiently from the database.
    Created at 2024-07-10 11:17:29 +0700

    Related blogs