Why
the group by Operator is not Necessary in Object Query Languages
Back to Description
of SBA and SBQL.
The operator group by occurs in SQL and is proposed in other
query languages, in particular, in the object-oriented query language OQL (by ODMG).
In relational query languages the operator allows one to formulate a lot of
useful queries that cannot be formulated otherwise. In particular, these
queries cannot be formulated in the relational algebra[1],
making one more evidence that the “relational completeness” of
query languages is only a pseudo-scientific buzzword, with no sense and meaning[2].
The operator was especially useful in connection with aggregate functions, for
example (c.f. Fig.30,
the relational case):
|
E.GB.1 |
For each department get its dept number, the
number of employees and the average salary of employees: |
|
SQL: |
select worksIn,
count(*), avg(sal) from Emp group by worksIn |
Semantics of this construct is very simple. The table Emp is horizontally subdivided into
groups according to the same vale of worksIn.
Then, for each such group the select
clause calculates the number of tuples count(*)
and the average salary avg(sal). The result is a table with three
columns, where the first can be named worksIn
and two next are unnamed.
In SQL the operator is extended by the having clause that is a conditional expression allowing one to
filter proper groups. The having
clause can coexist with the where
clause, for instance:
|
E.GB.2 |
For each department having more than 50
employees get its dept number and the average salary: |
|
SQL: |
select worksIn,
avg(sal) from Emp group by worksIn having count(*)
> 50 |
The operator is relatively simple for a single table. In case of several
joined tables it is not easy to use:
|
E.GB.3 |
For each department having more than 50
employees get its dept number, dept name and the average salary: |
|
SQL: |
select e.worksIn,
d.dname, avg(e.sal) from Emp e, Dept d where e.worksIn = d.d# group by e.worksIn, d.dname having count(e.*) > 50 |
There are some additional rules of using this operator, in particular,
all attribute names that occur in the select clause outside aggregate functions
must also occur in the group by
clause. For large SQL queries such far context dependency compromises
orthogonality and could be the reason of programmer’s bugs. The far
context dependency makes also some problem for static strong type checking;
however, current SQL has no such a feature.
Some professionals consider the group
by operator as a consequence of the flat (unnested) nature of relational
tables. The group by operator temporarily
subdivides the table into sub-tables, but such an operation is impossible in
the classical relational model. Hence, a nested relational model could change
the nature or the necessity of this operator. It is also considered
disadvantageous for performance, as its definition is imperative rather than
declarative. Hence many optimization methods become invalid if this operator is
used. Some new methods are developed, but in general, because the operator
requires sorting, the methods are not always as efficient as required.
The operator also leads to the well known semantic reefs where the
result calculated by the query processor is different from the result expected
by the programmer. The first reef is connected with empty groups, which are not
taken into account by the operator.
|
E.GB.4 |
Get the departments together with the number
of their employees: |
|
SQL: |
select worksIn,
count(*) from Emp group by worksIn |
Assume that the company has three departments, one with 10 employees,
second with 20 employees and third with no employees, and the programmer wants
to calculate the average number of employees in the departments. Obviously, the
average is 10, but if it would be calculated according to the result returned
by the above SQL statement, the average is 15. The empty department is not
taken into account. Another semantic reef is connected with null values. If in
the worksIn attribute nulls are
allowed, then the group by operator
collects all nulls within an additional group. Hence, the number of departments
is increased by one, one group has null as its department number. We can
imagine how many difficult bugs such a feature can generate, especially in the
maintenance phase, when the administrator is forced to add “null is
allowed” for definitions of some columns in the relational tables.
Checking all group by operators in
all applications of all clients that work with this database would not be an
easy job.
In SQL the group by operator
is tightly associated with aggregate functions sum, min, max, avg
and count. If the functions act on a
table that is not subdivided by group by
into subtables, then they act as usual. However, if the aggregate functions are
put in the context of the group by operator, then they calculate proper values
for groups rather then on an entire table. This is a kind of semantic
schizophrenia that is disadvantageous from the point of view of the perception of programmers and obvious
evidence of non-orthogonality of aggregate functions with other query
operators.
Despite the above well recognized disadvantages, ODMG OQL introduces the
group by operator. Unfortunately, the
syntax and especially semantics of this operator are not clear. It makes little
sense to repeat the definitions and examples presented in the “standard”,
as they are rather below the commonly accepted scientific and technical formal
specification standards. It seems that this operator is introduced with no
deeper motivation, as a clone of the corresponding SQL operator motivated by
the (false) statement that OQL is only a minor extension of SQL.
During the development of SBA and SBQL we have tried to develop a
“civilized” version of the operator that would satisfy the
following assumptions:
Unfortunately, despite many investigations, trials and dozens of examples
we did not find the solution that would satisfy these requirements. It is
possible that other researchers will find it, but actually we have lost any
hope. Our conclusion is that the group by
operator, motivated purely by physical implementation, makes no chances to be
integrated with other query operators in such a way that all assumptions
presented above would be fully satisfied.
Our attempts to specify this operator have led us to the conclusion that
for object-oriented models the operator is unnecessary, redundant and awkward.
It can be smoothly and directly substituted by dependent (navigational) join,
dot and other SBQL operators. As we have observed at the beginning, the
operator requires subdividing a table into subtables, i.e. it implicitly
assumes a relational model with nested tables. But our store models AS0-AS3
allow for arbitrary nesting of stored data structures. Nesting is also the
assumption concerning query results. Hence the conceptual motivation for the group by operator totally
disappears. It is unable to introduce to
a query language essential new quality. Moreover, a lot of complex queries that
in SQL require the group by operator
become incredible simpler in SBQL, without this operator.
Below we illustrate on examples how we can avoid the group by operator in SBQL (c.f. Fig.30, the
object-oriented case).
|
E.GB.5 |
(Compare E.GB.1) For each department get its
dept number, the number of employees and the average salary of employees: |
|
SBQL: |
Dept.(d#, count(employs), avg(employs.Emp.sal)) |
|
E.GB.6 |
(Compare E.GB.2) For each department having
more than 50 employees get its dept number and the average salary: |
|
SBQL: |
(Dept where count(employs) > 50). (d#,
avg(employs.Emp.sal)) |
|
E.GB.7 |
(Compare E.GB.3) For each department having
more than 50 employees get its dept number, dept name and the average salary:
|
|
SBQL: |
(Dept where count(employs) > 50). (d#, dname, avg(employs.Emp.sal)) |
Note that having clauses are
not necessary.
|
E.GB.8 |
(Compare E.GB.4) Get the average number of
employees in departments: |
|
SQL: |
avg( Dept.count( employs ) ) |
Note that unlike SQL there is no danger of the semantic reef causing
omitting empty departments. All the departments having no employees will return
count(employs) = 0, hence the average number of employees in departments
is calculated correctly. To deliver the
average number of employees in non-empty departments one can write:
|
E.GB.9 |
(Compare E.GB.4) Get the average number of
employees in non-empty departments: |
|
SQL: |
avg(( Dept
where exists( employs )).count( employs ) ) |
SBQL avoids also the semantic reefs related to null values, because in
SBQL null values are not supported. According to [Date86]
and [SLU98]
no consistent definition of null values is possible. Instead, in SBQL a null
value is represented as a collection with the cardinality [0..1].
In SBQL we can also address relational databases. Below we show that all
the above queries that require the group by operator in SQL can be formulated
in SBQL without this operator. The prescription for such SBQL queries is very
simple:
Sometimes naming suggested in the first step can be omitted. Below we
present examples that follow this prescription (c.f. Fig.30, the relational
case):
|
E.GB.10 |
(Compare E.GB.1) For each department get its
dept number, the number of employees and the average salary: |
|
SBQL: |
Dept.(d#, count(Emp where worksIn = d#), avg((Emp where worksIn = d#).sal))) |
|
SBQL: |
Dept join
((Emp where worksIn = d#) group as e) . (d#, count(e), avg(e.sal)) |
|
SBQL: |
(Dept.d# as d) join ((Emp where worksIn = d) group as e) . (d,
count(e), avg(e.sal)) |
|
E.GB.11 |
(Compare E.GB.2) For each department having
more than 50 employees get its dept number and the average salary: |
|
SBQL: |
((Dept as d) join ((Emp where worksIn = d.d#) group as e ) where count(e) > 50) . (d.d#, avg(e.sal)) |
|
E.GB.12 |
(Compare E.GB.3) For each department having
more than 50 employees get its dept number, dept name and the average salary:
|
|
SBQL: |
((Dept as d) join ((Emp where worksIn = d.d#) group as e ) where count(e) > 50) . (d.d#, d.dname, avg(e.sal)) |
Note that in this case (in contrast to SQL) including dname into the final result has required
a small obvious change only.
|
E.GB.13 |
(Compare E.GB.4) Get the average number of
employees in departments: |
|
SBQL: |
avg( Dept.count(Emp where d# = worksIn)) |
Note that in this case (ulike SQL) no semantic reef is occurring: empty
departments are taken into account during calculation of the average. One can
also create the semantic equivalent of E.GB.4, but this must be explicitly
determined by the programmer:
|
E.GB.14 |
(Compare E.GB.4) Get the average number of
employees in non-empty departments: |
|
SBQL: |
avg((Dept join
count(Emp where d# = worksIn) as c where
c > 0).c) |
Some queries in SQL that require the group
by operator are difficult to formulate. Consider, for instance, the
following SBQL query:
|
E.GB.15 |
(Compare E.GB.7) For each department having
more than 50 employees get its dept number, dept name and the average salary
of its clerks: |
|
SBQL: |
(Dept where count(employs) > 50). (d#, dname, avg(employs.(Emp where job = “clerk”). sal)) |
Note that in comparison to E.GB.7 the above query requires only
inserting the condition job =
“clerk”. In SQL this is not so easy, because count requires the group by operator acting on the entire Emp table, while avg
requires a similar grouping, but addressing only on a part of the Emp table selected by the condition job = “clerk”. This leads to
a very awkward SQL query or some sequence of queries, with the help of
materialized views. This can be considered another sign of the conceptual
weakness of the group by operator.
Below we give some quite easy (but a bit more sophisticated) examples in
SBQL that for sure would require the group
by option in SQL, but cannot be so easily formulated - not excluding that
due to the limitations of SQL they cannot be formulated at all (c.f. Fig.30, the
object-oriented case).
|
E.GB.16 |
For each location give the set of department
names that are located at it, the average salary of bosses of these departments
and the number of clerks that are (possibly) employed at these locations. |
|
SBQL: |
(distinct(Dept.loc) as X). ((Dept where X in
loc) group as Xdepts). ( (Xdepts.dname)
group as XdeptsNames, avg( Xdepts.boss.Emp.sal) as
XdeptsBossAvgSal,
count( Xdepts.employs.(Emp where job = “clerk”)) as XdeptsClerks# ) |
|
E.GB.17 |
For each salary range 0-99, 100-199,
200-299,…, etc. get the range, the number of employees getting the salary
from the range and the average salary of bosses of the employees getting the
salary from the range. Formulate the result as report messages. Ranges with
no employee belonging to should be omitted. |
|
SBQL: |
distinct( integerOf(Emp.sal/100) ) as R . ((100* R)
as Rmin join (Rmin+99) as Rmax join (Emp
where sal ≥ Rmin and sal < Rmax) group as Remps). ( count(Remps) + “
employees earn between “ + Rmin
+ “ and “ + Rmax + “; the average salary of
their bosses: “ +
avg(distinct(Remps.worksIn.Dept.boss.Emp).sal) ) |
The first line calculates the range discriminators R. The second line calculates the minimal and maximal salaries
within a particular range. The third line calculates identifiers of employees
that earn salary within the given range and forms from such employees a group
named Remps. Three next lines form
the output message for the given range using + as a concatenation of strings.
Some professionals can consider as a positive property of the group by operator some performance gain.
Because the operator is closely related to its physical implementation, it can
be tuned to be very efficient. This argument, however, works only for simple
cases, when the grouping concerns stored tables. For tables that are calculated
in a query the argument may be no more valid. Moreover, this argument reminds
the old argument that programming should be done in assembler, because it is
the most efficient and programs can be tuned by the programmers with no limits.
The current point of view that governs the development of computer languages
emphasizes human and economic aspects of programming processes, conceptual
closure (orthogonality) of programming options and achieving proper quality of
programs. Performance is very important, but it is the subject of internal
optimizations rather than the conceptual construction of the language. In
particular, query optimization methods developed for SBQL (and further methods)
give some promises that the overall performance of SBQL queries that in SQL
require grouping would be at least not worse. This of course requires testing
and very detailed analysis of situations in query processing that are
disadvantageous for performance.
As a side effect of our attempts to define the group by operator for object-oriented store models we concluded
that another grouping operator is necessary. This operator we call group as; see examples E.GB.12, E.GB.16
and E.GB.17. The operator groups a
result returned by a query under a single name; then the group can be
manipulated as a single element. The group
us operator is earlier known as the “nest” operator. Despite
the origin, the group us operator has
semantically almost nothing in common with the group by operator. The group
as operator is analogous to grouping some set of figures into a single
figure, as for instance, in Power Point.
The semantics of the group as
operator is very simple and it is explained in
the section devoted to algebraic
operators in SBQL.
Last modified: February 21, 2008
[1] There are special relational algebras that
include the group by operator. In our
opinion they are too opportunistic (in the negative sense) to consider them
seriously.
[2] The “relational completeness”
concept makes still a lot of misunderstanding, especially in industrial
communities not always realizing properly its meaning and limitations. As
“Turing completeness”, the “relational completeness”
introduces no valid yardstick of the universality of computer languages, including query languages and
programming languages. The term “completeness” is abused here and
has no expected technical meaning. Instead of Turing or relational
“completeness” we can talk on “pragmatic completeness”,
i.e. a language reasonably offering everything that programmers want for the
given application domain. In this relative sense no computer language is
“complete”.