Getting Crazy with Analytic Functions and Group By

This topic has been sitting in my backlog for a long time and I finally decided to write it. Analytic functions are not so new anymore (they’ve been around since Oracle 8i), but they are still a very powerful tool.

This is based on a real case I had quite a few years ago. A client came to me with a question regarding a graph they had to generate.

But before I start, I wanted to tell something funny. When I decided to write this blog, I sat down to reproduce the data and the queries (from my memory, which is not great at best). I was almost done when I realized that there actually is a better way of doing it (than the one I actually used in the original case). So in this post I’m not going to present the solution I provided back then, but a better and easier-to-understand solution.

And now to the case:

The system had a table of cars. Each row had information of the car, including the make and model. Their need was to present a bar graph of the number of cars by make and model in a certain way:

  • First bar is the make with the highest number of cars
  • Second bar is the make with the 2nd highest number of cars
  • Third bar is the make with the 3nd highest number of cars
  • Forth bar is all the rest

But that’s not all, in each bar, they wanted to have 4 colors: the bottom color is the model with the highest number of cars, the one on top of that is the 2nd model, then the 3rd model and on top of that all the rest. Here is an example:

analytic

To build this graph, they had 2 options:

  1. Get a full list from the database: make, model, and number of cars. Then calculate all the information they needed for the graph (this included: the total number of cars, the top 3 makes with number of cars per make, and the top 3 models for each of these makes with number of cars per model).
  2. Work in iterations: first run a SQL to find the top 3 makes. Then run another SQL for each of these makes to find the top 3 models of each make. And besides that, they also need to get the total number of cars as well as the number of cars per make for the top 3.

They didn’t want to go for option 1, but option 2 required lots of round trips to the database, which is not great. So at that point they came to me asking if I can somehow reduce the number of round trips to the database. After thinking about it a little bit, I told them that using analytic functions I can probably provide all the information they need with one query.

Test case

For this example, I created a simple table with some information about cars. Use this short script to create the table if you’d like to play with it

create table cars
(id number,
make varchar2(15),
model varchar2(15));

declare
   id number;
   procedure insert_cars(make varchar2, model varchar2, num number)
   is
   begin
      for i in 1..num loop
         insert into cars values(id,make,model);
         id:=id+1;
      end loop;
      commit;
   end;
begin
   id:=1;
   insert_cars('Toyota','Corolla',100);
   insert_cars('Toyota','Camry',52);
   insert_cars('Toyota','Yaris',36);
   insert_cars('Toyota','Rav4',30);
   insert_cars('Toyota','Highlander',30);
   insert_cars('Ford','Focus',85);
   insert_cars('Ford','Fusion',41);
   insert_cars('Ford','Edge',37);
   insert_cars('Ford','Explorer',30);
   insert_cars('Ford','Fiesta',30);
   insert_cars('Ford','Expedition',20);
   insert_cars('Honda','Civic',75);
   insert_cars('Honda','Accord',69);
   insert_cars('Honda','CRV',56);
   insert_cars('Honda','Pilot',20);
   insert_cars('GMC','Acadia',200);
   insert_cars('Dodge','Caravan',100);
end;
/

Phase 1

First, let see again what information we need:

  • Number of cars per make (and rank them to get the top 3)
  • Number of cars per make and model (and rank them to get the top 3, but we are interested only in the models of the top 3 makes)
  • Total number of cars (in order to calculate the “others” bar size)

To get the number of cars per make and model, we can use the query below. The only “interesting” thing I did here is nested the analytic functions in the 4th column to sum the number of cars per make, all the rest is pretty straight forward:

select make,
       model,
       count(*) sum_model,
       sum(count(*)) over (partition by make) sum_make
from cars
group by make,model
order by make,model;

The result set looks like that:

MAKE            MODEL            SUM_MODEL   SUM_MAKE
--------------- --------------- ---------- ----------
Dodge           Caravan                100        100
Ford            Edge                    37        243
Ford            Expedition              20        243
Ford            Explorer                30        243
Ford            Fiesta                  30        243
Ford            Focus                   85        243
Ford            Fusion                  41        243
GMC             Acadia                 200        200
Honda           Accord                  69        220
Honda           CRV                     56        220
Honda           Civic                   75        220
Honda           Pilot                   20        220
Toyota          Camry                   52        248
Toyota          Corolla                100        248
Toyota          Highlander              30        248
Toyota          Rav4                    30        248
Toyota          Yaris                   36        248

Phase 2

The next thing is to add the total number of cars to the query. I used a “grouping sets” trick that I found here to do that. I added another column based on the grouping_id function so it will be clear which row is the grand total.

The last addition in this phase is a “rank” analytic function to find the top 3 models per make. I wanted to add the rank to find the top 3 makes, but that requires nesting another analytic function: rank(sum(count*)) and this is not allowed. I’ll add that rank in the next step.

select make,
       model,
       count(*) sum_model,
       sum(count(*)) over (partition by make) sum_make,
       grouping_id(make) is_grand_total,
       rank() over (partition by make order by count(*) desc) model_rank
from cars
group by grouping sets ((),(make,model))
order by make,model;

The result set:

MAKE            MODEL            SUM_MODEL   SUM_MAKE IS_GRAND_TOTAL MODEL_RANK
--------------- --------------- ---------- ---------- -------------- ----------
Dodge           Caravan                100        100              0          1
Ford            Edge                    37        243              0          3
Ford            Expedition              20        243              0          5
Ford            Explorer                30        243              0          4
Ford            Fiesta                  30        243              0          4
Ford            Focus                   85        243              0          1
Ford            Fusion                  41        243              0          2
GMC             Acadia                 200        200              0          1
Honda           Accord                  69        220              0          2
Honda           CRV                     56        220              0          3
Honda           Civic                   75        220              0          1
Honda           Pilot                   20        220              0          4
Toyota          Camry                   52        248              0          2
Toyota          Corolla                100        248              0          1
Toyota          Highlander              30        248              0          4
Toyota          Rav4                    30        248              0          4
Toyota          Yaris                   36        248              0          3
                                      1011       1011              1          1

Phase 3

The next step will be to add the rank for the makes. As I said I couldn’t do it in this query, so I simply converted it to an inline view and did the ranking in the outer query. Note that I added the grand_total to the rank order by clause, that’s because I don’t want the grand total to be first. The makes themselves should be first, so I can take ranks 1-3.

select make,
model,
sum_model,
sum_make,
is_grand_total,
model_rank,
dense_rank() over (order by is_grand_total,sum_make desc) make_rank
from
  (select make,
          model,
          count(*) sum_model,
          sum(count(*)) over (partition by make) sum_make,
          grouping_id(make) is_grand_total,
          rank() over (partition by make order by count(*) desc) model_rank
   from cars
   group by grouping sets ((),(make,model))
   order by make,model
  );

Result set:

MAKE            MODEL            SUM_MODEL   SUM_MAKE IS_GRAND_TOTAL MODEL_RANK  MAKE_RANK
--------------- --------------- ---------- ---------- -------------- ---------- ----------
Toyota          Camry                   52        248              0          2          1
Toyota          Corolla                100        248              0          1          1
Toyota          Highlander              30        248              0          4          1
Toyota          Yaris                   36        248              0          3          1
Toyota          Rav4                    30        248              0          4          1
Ford            Fusion                  41        243              0          2          2
Ford            Focus                   85        243              0          1          2
Ford            Fiesta                  30        243              0          4          2
Ford            Explorer                30        243              0          4          2
Ford            Expedition              20        243              0          6          2
Ford            Edge                    37        243              0          3          2
Honda           Pilot                   20        220              0          4          3
Honda           CRV                     56        220              0          3          3
Honda           Accord                  69        220              0          2          3
Honda           Civic                   75        220              0          1          3
GMC             Acadia                 200        200              0          1          4
Dodge           Caravan                100        100              0          1          5
                                      1011       1011              1          1          6

Phase 4

Finally, the last step will filter only the rows that are important for us: the grand total and the 3 top models of the 3 top makes:

select *
from
  (select make,
model,
sum_model,
sum_make,
is_grand_total,
model_rank,
dense_rank() over (order by is_grand_total,sum_make desc) make_rank
   from
     (select make,
             model,
             count(*) sum_model,
             sum(count(*)) over (partition by make) sum_make,
             grouping_id(make) is_grand_total,
             rank() over (partition by make order by count(*) desc) model_rank
      from cars
      group by grouping sets ((),(make,model))
     )
  )
where
   is_grand_total=1 or
   (model_rank<=3 and make_rank<=3)
order by make_rank,model_rank;

And the final result set:

MAKE            MODEL            SUM_MODEL   SUM_MAKE IS_GRAND_TOTAL MODEL_RANK  MAKE_RANK
--------------- --------------- ---------- ---------- -------------- ---------- ----------
Toyota          Corolla                100        248              0          1          1
Toyota          Camry                   52        248              0          2          1
Toyota          Yaris                   36        248              0          3          1
Ford            Focus                   85        243              0          1          2
Ford            Fusion                  41        243              0          2          2
Ford            Edge                    37        243              0          3          2
Honda           Civic                   75        220              0          1          3
Honda           Accord                  69        220              0          2          3
Honda           CRV                     56        220              0          3          3
                                      1011       1011              1          1          6

From this result set it's quite easy for the application to build the required graph. The first 3 rows are the top make and the top 3 models for this maker. The "others" part of the bar can be easily calculated by subtracting the three SUM_MODEL from the SUM_MAKE. The same should be done for the other 2 makes. The last bar for "others" will be calculated from the grand total (the row where IS_GRAND_TOTAL=1) minus the 3 SUM_MAKE for the top 3 makes.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s