Impedance mismatch is perceived by many professionals as a negative phenomenon in software construction that arises from an eclectic mix of two incompatible languages: a query language used to access and update a database, and a programming language used to build client applications acting on the database. A more careful look shows that this negative connotation is perhaps inadequate. Impedance mismatch is an inevitable consequence of a (quite reasonable) principle known as data independence. Misunderstanding the relationship between impedance mismatch and data independence has motivated some creativity within modern object-oriented programming languages (mostly Java), which some professionals could perceive as a medicine worse than the illness. Impedance mismatch also has a less discussed aspect concerning database models and transformations between database schemas. In the following we discuss all of the above aspects.
Impedance mismatch between query and programming languages
The concept of query languages developed in the 1970s assumed no pragmatic universality. However, because such universality is eventually inevitable in real applications, it was assumed that a query language is a “sublanguage” to be embedded in a universal programming language. A “sublanguage” determines only the access to a database; the rest of the application has to be programmed in a typical programming language. This assumption requires joining a query language with a programming language in such a way that:
· Queries can be used inside programs;
· Queries can be parameterized through values calculated within programs;
· Results of queries are to be passed to programs.
Differences between language concepts cause significant technical difficulties in accomplishing this kind of connection. Many programmers and computer professionals were also disappointed by the technical, aesthetic and conceptual degradation of the programming environment. This degradation is commonly referred to as impedance mismatch. The term denotes a bunch of disadvantageous features implied by the eclectic mix of a query language (in particular SQL) with a programming language (such as C, C++ or Java). Below we list and comment on these features.
· Syntax: In the same code the programmer must use two programming styles and follow two different grammars. Similar concepts are denoted differently (for instance, strings in C are written within “…”, in SQL within ‘…’) and different concepts are denoted similarly (for instance, in C = denotes an assignment, in SQL a comparison).
· Typing: Types and denotations of types assumed by query and programming languages differ, as a rule. This concerns atomic types such as integer, real, boolean, etc. Representations of atomic types in programming languages and in databases can differ significantly, even if the types are denoted by the same keyword, e.g. integer. A lossless conversion between such types may be impossible and may imply a performance overhead. This also concerns complex types, such as tables (a basic data type constructor in SQL, absent in programming languages). Popular programming languages introduce static (compile-time) type checking, which is impossible e.g. in SQL (because query languages are based on dynamic rather than static binding).
· Semantics and language paradigms: The semantic concepts of the two languages are totally different. Query languages are based on the declarative style (what is to be retrieved rather than how), while programming languages are based on the imperative style (a sequence of commands to a machine that accomplishes the what).
· Abstraction level: Query languages free the programmer from the majority of details concerning data organization and implementation, for instance, the organization of collections, the presence or absence of indices, etc. In programming languages these details are usually explicit (although they may be hidden by some libraries).
· Binding phases and mechanisms: Query languages are based on late (run-time) binding of all the names that occur in queries, while programming languages are based on early (compile- and link-time) binding. Thus, from the point of view of a program, queries are simply strings of characters.
· Name spaces and scope rules: Queries do not see names occurring in programs and vice versa. Because there must eventually be some intersection of these name spaces (e.g. program variables must parameterize queries), additional facilities, with their own syntax, semantics and pragmatics, are required. These facilities are a burden on the size and legibility of the program code. Moreover, in programming languages name scopes are organized hierarchically and processed by special rules based on stacks. These rules are ignored by a query language. This leads e.g. to problems with recursive procedure calls (a well-known example concerns SQL cursors, which severely reduce the possibility of recursion). Another disadvantage of separated name spaces concerns automatic refactoring of programs, which cannot be performed on queries.
· Collections: Databases store collections (e.g. tables) which are processed by queries. In programming languages collections are absent or severely limited. Hence collections returned by queries have no direct counterparts in a programming language and must be processed by special constructs with their own syntax and semantics.
· Null values: Databases and their query languages have specialized features for storing and processing null values. Such features are absent in programming languages, thus some substitutes must be introduced. For instance, if some value in a relational database can be null, mapping it to a programming language requires two variables: one for storing information about null and another one for storing the value.
· Iteration facilities: In query languages iterations are accomplished as macroscopic query operators, such as selection, projection, join, product, sum, intersect, etc. In programming languages iterations are coded explicitly as loops (for, while, repeat, etc.). Processing results of queries in a programming language requires special mechanisms, such as cursors and iterators.
· Persistence: Query languages address persistent data (stored on a disk or other long-term memory), while programming languages process only data stored in volatile operating memory. Joining query and programming languages requires special facilities for parameterizing queries with volatile variables and for transmitting persistent data to volatile memory and vice versa.
· Queries and expressions: There is some competence mismatch between queries and programming language expressions. Some queries look like expressions and vice versa, but there is a strict syntactic separation between them, which can be poorly understood by the programmer. For instance, in some query languages 2+2 is a query, but it is also an expression of a programming language. A query cannot be a parameter to a procedure, but an expression can. There can be other syntactic constraints, which cause a lot of chaos in the entire programming environment.
· References: If a query is to be used for updating, inserting or deleting, it must return references to stored data (i.e. data identifiers rather than values). According to the official semantics of the relational model, queries return tables of values, with no references. For updating constructs defined in a programming language such semantics is inconsistent; actually, it means that queries cannot be used for updating, or that they require a special mode of execution and/or special constructs in the programming language.
· Refactoring: Decisions concerning new names for data structures cannot be automatically propagated to queries, because – from the point of view of a programming language – queries are strings, sometimes not explicitly visible in the source program. Hence refactoring of queries must be done manually, with a lot of effort and a risk of inconsistencies.
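Several of the above points can be observed in one short fragment. The following sketch uses Python with its standard sqlite3 module; the table and its columns are invented for illustration. It shows a query passed as a plain string, its result consumed through a cursor loop, and a database null surfacing as the host-language value None:

```python
import sqlite3

# In-memory database with an invented Emp table; sal may be NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (name TEXT, sal INTEGER)")
conn.executemany("INSERT INTO Emp VALUES (?, ?)",
                 [("Smith", 2500), ("Jones", None)])

# The query is just a string: the host compiler cannot type-check it,
# rename refactorings do not reach it, and its names are bound at run time.
cur = conn.execute("SELECT name, sal FROM Emp")

# The collection returned by the query is consumed through a cursor loop;
# a database NULL arrives as the host-language value None and needs an
# explicit test, since the host language has no null-value machinery.
rows = []
for name, sal in cur:
    rows.append((name, "unknown" if sal is None else sal))
```

Note that renaming the sal column in the database would be detected only at run time, when the string finally reaches the server.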
The consequences of impedance mismatch concern not only the aesthetics and ergonomics of the programming environment. Impedance mismatch implies an additional programming layer, with its own syntax, semantics and pragmatics. Because of this layer, learning the language takes more time, programming is more error-prone, and programs are unnecessarily longer and less legible. This layer may also worsen the performance and maintainability of applications. If queries are strings, then there is no explicit support for creating reusable query components: none of the reusability features of the programming language (functions, methods, polymorphism) are available to support reuse. Passing parameters to queries written as strings (cf. the ODMG standard) is awkward and error-prone. Queries embedded in a program as strings are also more prone to injection attacks.
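The injection risk can be made concrete with a minimal sketch (again Python with sqlite3; the table and the hostile input are invented for illustration). A query built by string concatenation lets input data rewrite the query itself, while placeholder parameters keep data out of the query text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (login TEXT, secret TEXT)")
conn.execute("INSERT INTO Users VALUES ('alice', 's3cret')")

# Building the query by string concatenation: a crafted "login" value
# changes the structure of the query itself (classic SQL injection).
hostile = "x' OR '1'='1"
leaked = conn.execute(
    "SELECT secret FROM Users WHERE login = '" + hostile + "'").fetchall()

# Placeholder parameters keep the data out of the query text, so the
# hostile string matches no row -- but this is exactly the extra
# parameter-passing layer, with its own syntax, discussed above.
safe = conn.execute(
    "SELECT secret FROM Users WHERE login = ?", (hostile,)).fetchall()
```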
Some authors suggest that the source of impedance mismatch is the incompatibility of data models, occurring in particular when a relational database is accessed from an object-oriented programming environment (such as an object-oriented DBMS). This suggestion underlies the ODMG standard, with the conclusion that in this standard impedance mismatch no longer holds. Unfortunately, this is only partly true. Indeed, the mismatch is inevitable when the data models differ substantially, in particular between a relational system and an object-oriented programming language. However, even if both models are claimed to be “object-oriented”, the impedance mismatch still persists. There are significant differences between various object-oriented models. Actually, there are as many object models as there are object-oriented artifacts and proposals; no standard object model exists. Differences between the object models of Smalltalk, CORBA, UML, C++, SQL-99, ODMG, Java, C#, etc. are fundamental. Moreover, even if the model were exactly the same and the subject of some precise standard, some impedance mismatch could persist due to, e.g., differences in binding phases. We discuss these issues in the next subsection, devoted to the relationship between impedance mismatch and data independence.
To avoid impedance mismatch, a language should be integrated: queries should be smoothly connected with programming constructs and abstractions. This tendency is seen in products such as PL/SQL and T-SQL and in the SQL-99 and SQL-2003 standards. The SBQL language is designed according to this tendency: SBQL queries are integrated with updating constructs and programming abstractions. The typing system and strong type checker are the same for queries, updating constructs, procedures, functions, methods, parameters, views, transactions, etc. This significantly distinguishes SBQL from other query languages. In the following we discuss some pros and cons of the idea of integrated database programming languages.
Impedance mismatch and native queries
Recently some proposals known as native queries [CoRo06] have appeared that support some kind of queries within typical programming languages such as Java (db4o) and Comega [Comega07]. This mechanism is advocated as 100% typesafe, 100% compile-time checked and 100% refactorable. Unfortunately, such an approach introduces limitations to queries, in particular:
· Queries dressed in the syntax of a programming language are much less legible, even in comparison to SQL (which in our opinion is not very legible either). There is a lot of informational noise from the point of view of directly expressing the query goal.
· The expressive power of queries is far below what is acceptable; for instance, native queries have no joins, grouping, aggregate functions, sorting, quantifiers, set operators, etc. Actually, such queries cover only simple selections.
· It is impossible to formulate ad hoc queries, because of early binding.
· It is impossible to have server-side persistent abstractions based on queries, such as database views, stored procedures, triggers, etc.
· Native queries raise a performance problem for very large collections of objects. In [CoRo06] the problem is recognized (and claimed to be solved through changes to a compiler), but the solution seems not flexible enough. Native queries are very simple; more complex tasks must be accomplished by a sequence of several queries instead of just one. Because such a sequence is mixed with the regular code, finally generating a single composed query is practically impossible. This greatly reduces opportunities for optimization. In particular, essentially no optimization by rewriting rules is possible.
· Query optimization is not directly supported by native queries – they rely on SQL optimizers. To utilize an SQL optimizer, the mapping between objects and relations cannot be too complex; otherwise the developers have to solve non-trivial research problems, similar to the view updating problem, which is commonly recognized in the database domain as very difficult. This limitation can severely warp the initial conceptual object data model. In the majority of cases it leads to almost trivial mappings between Java or C# objects and relational tables. Then the mapping between native queries and SQL is equally trivial. In effect, native queries still work on the relational data model. In this way the object model becomes a “slave” of the relational model. Although relational tuples are wrapped as “objects”, this is far from true object-orientedness.
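To make the limitation tangible, here is a rough host-language analogue of the native-query style, sketched in Python over invented in-memory collections (this illustrates the style only, not the db4o or Comega API):

```python
# Invented in-memory "collections" standing in for stored objects.
experts = [{"name": "Smith", "works": [101, 102]},
           {"name": "Jones", "works": [103]}]

# Native-query style: a selection predicate written in the host language.
# This is type-checked and refactorable, but covers only simple selections.
selected = [e for e in experts if len(e["works"]) >= 2]

# Anything beyond selection (here, a trivial grouping/aggregation) has to
# be coded as explicit loops mixed with regular code; there is no single
# composed query that an optimizer could rewrite.
total_works = {}
for e in experts:
    total_works[e["name"]] = len(e["works"])
```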
The idea of native queries can work for some kinds of applications, but not for really very large databases. The LINQ project is in a better position, because it assumes extensions of the C# syntax. This approach can be perceived as the reverse of SBQL: LINQ extends the syntax and semantics of a programming language in the direction of query languages, while SBQL extends a query language in the direction of a programming language.
Java and C# were developed without database programming in mind. Extensions to these languages in this direction are of course possible (as proponents of JDO, Hibernate, native queries, etc. have shown), but with many limitations and awkward solutions. Only an integrated database query and programming language, developed with all database peculiarities taken into account, is able to fulfill all critical requirements of such software manufacturing environments. Native queries present too little (if any) progress in this direction.
Impedance mismatch and data independence
Impedance mismatch is an inherent consequence of the data independence principle, which assumes that a database is designed, administered, maintained, secured, catalogued, published and accessed independently of any application program acting on the database. Moreover, there is usually no assumption that database applications are to be written in a single programming language. On the contrary, data independence implicitly assumes that a database will be available to many programming languages, provided each of them implements a corresponding library (or “driver”). Integrated languages that make no distinction between querying and programming violate data independence. In the most common situation the idea of an integrated language implicitly assumes that database application programs will be written in a single programming language (or in a family of languages having the same typing and/or data representation system). It is not clear whether the software community is currently prepared for such a “monopoly” (which apparently violates our sense of democracy, free commercial competition, free invention and the need for diversity as a progress factor). However, as a matter of fact, many database applications are actually based on such a monopoly. It is of course possible that each database management system would work with its own integrated programming language; PL/SQL of Oracle and T-SQL of SQL Server are examples. However, when one attempts to connect database servers of different types (or even different cultures) in a single application, impedance mismatch seems inevitable. Actually, impedance mismatch can be avoided by well-developed wrappers to external resources. However, wrappers introduce many limitations (data models, performance, security, transaction processing, etc.), hence they might not be the best choice in many cases. We discuss this problem a bit later.
In the relationship between impedance mismatch and data independence there is no ideal solution. In particular, a tradeoff is necessary for the data independence principle. The principle was formulated at a time when databases (especially relational databases) contained pure data only. Current database servers, including relational ones, store many entities that must be prepared in a query and programming language. These entities include:
· Stored procedures and functions.
· Triggers, constraints and (business) rules.
· Stored classes, including methods that are defined within these classes, inheritance, and other features of object-orientedness.
· Database views, in particular, updatable database views.
· Definitions of workflow processes.
· Definitions of wrappers, mediators, adapters, integrators, exporters, importers and other interoperability or data distribution facilities.
These (persistent) entities are prepared during the database design phase or during database maintenance by a database server administrator. Some other entities are possible and currently under consideration, such as persistent threads, pre- and post-conditions, assertions, overloading views and so on. One can imagine that these entities could be written in many languages, but for several reasons such freedom would be disadvantageous or unrealistic. All such languages would have to be based on the same data structures (determined by the database model and types), and this limitation much reduces the freedom. The assumption that any programming language can be used for this purpose is unrealistic for at least two reasons: (1) the early binding assumed in popular languages (which would exclude many database features such as views, changes in the database schema, etc.); (2) severe problems with impedance mismatch. Hence, as a final conclusion, for a given DBMS all such active entities should be written in a single, integrated query and programming language that deals with persistence as a regular option. For these reasons the development of integrated database programming languages and their standards makes great sense. The ODRA project has shown that within this solution some level of interoperability with external incompatible databases (with no impedance mismatch) can be achieved by properly constructed and implemented wrappers.
Impedance mismatch between models and schemas
The impedance mismatch phenomenon is not limited to incompatibilities between a query language and a programming language. The literature discusses another form of impedance mismatch: the mismatch between an object-oriented conceptual schema and an equivalent relational schema. This mismatch is even more disadvantageous for the entire software life cycle than the previous one. To a large extent, just this kind of mismatch is the greatest motivation for object-oriented databases.
We illustrate the mismatch with an example. In Fig.52 we present a UML-like object-oriented database schema that is the result of the analysis and design phases of some application. Such a schema can be directly written in the ODRA system.
Fig.52. Object-oriented UML-like database schema (ODRA)
For simplicity, the schema omits atomic types, considering them obvious. The schema specifies obvious dependencies between persons, experts and companies, including the cardinalities of particular data. An expert has some set of competences. He or she can do many works for a company; each work may require many payments and many expertises. The Expert class inherits from the Person class. A person can have many names and addresses. A company can have many locations. Experts, works and companies are connected by two obvious associations. Note that this schema is very close to UML, but with essential differences that (we hope) do not disturb a smooth, seamless mapping from a UML class diagram to the database schema. In the schema, cardinalities are assigned to all data types (the cardinality [1..1] is omitted). Associations can only be binary, and association role names are substituted by pointer types (EW, WE, WC and CW).
According to our experience, an average programmer can understand this schema in a few minutes, and then he or she can write SBQL queries and programs. If an object database management system is unavailable, this schema must be transformed into a relational schema. One possible solution is shown in Fig.53.
Fig.53. Relational schema “equivalent” to the object schema from Fig.52
Now we see clearly that the “simplicity” of the relational schema is only apparent. The programmer must spend a lot of time to understand it, and would perhaps require additional informal information. One possible way to give this information is to refer to the original UML design diagram or a similar ER diagram. (Just such a diagram is presented in Fig.52.) In the diagram presented in Fig.53 some conceptual information is lost. This concerns, in particular, the boundaries of objects and cardinalities. For instance, the information that each company must possess at least one location cannot be inferred from Fig.53. The diagram is also littered with inessential information that must be explicitly filled in within the relational representation: for instance, originally there were no C#, E#, P# and W# attributes – they appeared as a result of relational normalization. The arrows leading from foreign to primary keys are not actually part of the relational schema – they are depicted as obvious, but may become much less obvious when (for various reasons) the names of foreign keys differ from the names of primary keys. This means that the relational schema may even be more complex than the equivalent object-oriented schema.
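The loss of conceptual information can be demonstrated on a fragment of such a relational rendering. In the sketch below (Python with sqlite3; the table and column names are our assumptions, with the artificial keys E#, W#, C# spelled E_id, W_id, C_id), the cardinality [1..*] of Location simply cannot be expressed in the DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A fragment of a hypothetical relational rendering of the object schema.
# Surrogate-key columns appear, and the Expert-Work-Company links
# become foreign keys.
conn.executescript("""
CREATE TABLE Company  (C_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Location (C_id INTEGER REFERENCES Company, place TEXT);
CREATE TABLE Expert   (E_id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE Work     (W_id INTEGER PRIMARY KEY,
                       E_id INTEGER REFERENCES Expert,
                       C_id INTEGER REFERENCES Company);
""")
# What the DDL cannot say: that every company must have at least one
# location. Nothing stops inserting a company with no Location row.
conn.execute("INSERT INTO Company VALUES (1, 'IBM')")
orphaned = conn.execute(
    "SELECT name FROM Company c WHERE NOT EXISTS "
    "(SELECT 1 FROM Location l WHERE l.C_id = c.C_id)").fetchall()
```

The constraint lost from the conceptual schema would have to be re-imposed by triggers or application code, i.e. outside the schema itself.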
SBQL is one of the few query languages prepared to address queries against both object-oriented and relational schemas. Now we can observe how this case of impedance mismatch between schemas influences the complexity of queries:
Get last names of experts that prepared at least 3 expertises for IBM:
SBQL for the object schema; note the use of a path expression (26 lexical units):
SBQL for the relational schema (70 lexical units):
SQL (72 lexical units):
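The original query listings are not reproduced on this page. To convey the flavor of the comparison, the sketch below shows what the relational formulation might look like under a hypothetical rendering of the schema (all table names, column names and data are our assumptions, not the original code; the lexical-unit counts above refer to the original listings). The joins and the grouping reintroduce, by hand, the structure that a path expression over the object schema traverses implicitly:

```python
import sqlite3

# Hypothetical relational rendering with sample data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Company (C_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Expert  (E_id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE Work    (W_id INTEGER PRIMARY KEY,
                      E_id INTEGER, C_id INTEGER);
CREATE TABLE Expertise (X_id INTEGER PRIMARY KEY, W_id INTEGER);
INSERT INTO Company VALUES (1, 'IBM');
INSERT INTO Expert  VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO Work    VALUES (10, 1, 1), (11, 2, 1);
INSERT INTO Expertise VALUES (100, 10), (101, 10), (102, 10), (103, 11);
""")

# Three joins plus grouping, just to follow what in the object schema
# is a single navigational path from experts to their expertises.
names = conn.execute("""
    SELECT e.last_name
    FROM Expert e
    JOIN Work w      ON w.E_id = e.E_id
    JOIN Company c   ON c.C_id = w.C_id
    JOIN Expertise x ON x.W_id = w.W_id
    WHERE c.name = 'IBM'
    GROUP BY e.E_id, e.last_name
    HAVING count(*) >= 3
""").fetchall()
```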
This example presents a schema impedance mismatch rule that can be formulated as follows:
The rule is a great argument in favor of object-oriented database models. Some estimations have shown that all the cases of impedance mismatch together cause application code addressing a relational database to be about three times longer than code addressing the original object-oriented database schema. This is obviously disadvantageous with respect to the time, money and quality of applications. Most disadvantageous is the cost of maintenance, which frequently requires reverse engineering in which the programmer must return to the original object-oriented analysis and design schema.
Another aspect of the above impedance mismatch case concerns its relationship to object-relational mappers (ORM) such as Hibernate. The problem we see is that the basis for an object-oriented schema in ORM is the relational schema rather than the original UML-like schema. Hence the question is what the relationship is between the final ORM database schema that is seen by, e.g., Hibernate programmers and the initial UML schema that was the result of the analysis and design phases. Because of several problems with ORM (in particular, SQL-based query optimization) we can suspect that an ORM schema is a rather trivial mapping of the relational schema and has little in common with the original UML schema. This means that the impedance mismatch persists when the programmer (for any reason) has to reconstruct the original UML schema. This can be especially painful and frustrating for maintenance activities, which frequently require referring to the original analysis and design documentation. Unfortunately, the Author of this page does not know of any serious research investigating how much effort (or time, money, quality decrease, frustration of programmers, etc.) is caused by this case of impedance mismatch.
Mediators, adapters, wrappers and virtual repositories
[Wied92] proposes the technology of mediators, which seems to be a method of avoiding impedance mismatch. A mediator adapts an external resource (incompatible in the sense of data model, semantics and access) to the requirements of the given applications. A similar idea (accomplished on a lower abstraction level) is realized in CORBA and in object request brokers based on that standard. The corresponding terms are adapter and skeleton, which change the application programming interface of some external resource to the interface determined by an expression in IDL (Interface Definition Language). In this way CORBA accomplishes transparent access to distributed heterogeneous resources, which can be implemented on different sites and media, with different representations of data, on different systems, etc. The idea is also known under the name wrapper, i.e. some software that converts an API (Application Programming Interface) into another API required by some canonical data model. There are other ideas that correspond to the previous ones, such as Web Services and Service Oriented Architecture (SOA).
More recently, the ideas of mediators, adapters, wrappers, etc. have been generalized under the name virtual repository. In a virtual repository the source information is not replicated or stored. The information is accessed from multiple applications by using some kind of links or mappers that redirect or map an original request to requests addressing particular data and service sources. After the request is accomplished, its result is mapped back from the original source format to the format expected by the given application. Hence a virtual repository presents a transparent middleware layer (similar to a CORBA bus) that makes it possible to write homogeneous programs (e.g. in Java) addressing distributed and heterogeneous resources without involving in the programs any peculiarities or details related to the organization of or access to the resources. Ideally, in this case the impedance mismatch no longer holds, as there is no need to mix two incompatible languages in one source code. All the work on access to external resources is done during the middleware building phase, before any application code is written. The application programmer uses external resources in such a way that they are indistinguishable from regular data structures (objects) in the regular working space of the application. In other words, the peculiarities of external resource organization and access are transparent to the user of a virtual repository.
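A wrapper in this sense can be sketched in a few lines (Python; the class names, the legacy field names and the “canonical” interface are all invented for illustration, not taken from any concrete middleware):

```python
class LegacySource:
    """Stands in for an external resource with its own, incompatible API."""
    def fetch_record(self, key):
        # Proprietary field names and representations (salary as a string).
        return {"NM": "Smith", "SAL": "2500"}

class CanonicalWrapper:
    """Adapts the legacy API to the canonical model the applications see."""
    def __init__(self, source):
        self.source = source

    def get_person(self, key):
        raw = self.source.fetch_record(key)
        # The mapping is done once, in the middleware layer; application
        # code never sees the proprietary names or representations.
        return {"name": raw["NM"], "salary": int(raw["SAL"])}

person = CanonicalWrapper(LegacySource()).get_person(1)
```

All conversions happen inside the wrapper, so to the application the external resource is indistinguishable from a regular object in its working space.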
In the ODRA project and in the European project eGov Bus we have collected a lot of research and experience concerning the idea of virtual repositories.
Fig. 54. Reference architecture of a virtual repository (ODRA)
Fig.54 shows an architectural variant of a Virtual Repository that can be developed in the ODRA system. It presents one configuration of the developed software units. Many other architectural combinations are possible, depending on the particular application in question.
A central part of the architecture consists of ODRA, an object-oriented DBMS. Existing resources (bottom of the figure) are extended by wrappers and contributory views (or importers/exporters) that convert data/services proprietary to particular existing applications into the format acceptable for ODRA. The application developers can install as many ODRA servers as necessary, addressing the same distributed sources. The integration view on the ODRA server allows for virtual integration of data and services supplied by distributed sources, supporting data and function abstractions. The virtual repository front-end will provide various APIs to access virtually integrated data, including workflow applications, Java applications, Web services applications, and others. These APIs are available from SBQL, a query and programming language of the ODRA system. A particular user works with his/her own client view that is a tailored part of the entire virtual repository schema. Among many other functions, the virtual repository allows for transparent access to external information resources and for unlimited transformations of complex document structures.
To accomplish such an architecture without the impedance mismatch effect several conditions must be satisfied, in particular:
· The front-end data model (seen by application programmers) must be rich enough to conceptually cover all the data models of the external sources. If the front-end data model is limited, some structures within existing sources have no direct mapping, hence they must be supported by many commands in the front-end programming language. This is a kind of impedance mismatch. For instance, if any source supports collections, then the front-end data model must support collections too.
· The front-end query and programming language must be at least on the same abstraction level as any of the application programming interfaces of the sources. For instance, if a source is accessible via a query language, then the front-end language must support a query language at least as powerful as the source query language. Otherwise the power of the source query language cannot be utilized in the front-end language, which implies (at least) severe performance problems and many front-end language statements needed to express a single source query. This is another sort of impedance mismatch. Such impedance mismatch is inevitable if one assumes Java as a front-end language and SQL-based sources (regardless of whether native Java queries or embedded query strings are used).
· The mapping between source data and front-end virtual data must be done by a sufficiently powerful tool. If the mapping capabilities are restricted, then the front-end schema is heavily burdened by the source schemas, hence the degree of abstraction and transparency is much decreased. In general, full mapping freedom is available only when the mapping tool has the algorithmic power of a universal programming language. Other mapping tools (e.g. some special XML files) inevitably introduce many limitations.
· The mapping should utilize the native query optimizers implemented in source databases (e.g. rewriting rules or indices). Otherwise the performance is usually unacceptable. In particular, a front-end query and programming language should give SQL optimizers every chance to work. Unfortunately, tests on some ORM solutions show significant performance problems. Bad performance undermines scalability and is thus unacceptable for the majority of business applications.
· Mapping is relatively easy when one considers data retrieval only. For mapping updates that act on front-end virtual objects and result in updates of the sources, the required mapping capabilities can be very complex, up to a programming nightmare. The problem is recognized in the database literature as the view updating problem. For this reason the mapping of updates severely restricts the conceptual distance between the source schemata and the target (front-end) schema. In the majority of solutions (especially concerning ORM) the mapping is practically isomorphic (1:1), i.e. trivial and hardly acceptable from the point of view of object-oriented conceptual modeling.
· There are many challenging issues in implementing such a virtual repository, such as mapping between incompatible atomic types, mapping of procedural capabilities implemented on the side of the sources, global indexing of data in heterogeneous distributed sources, distributed query optimization, replicated and redundant data in heterogeneous distributed resources, distributed transaction processing, mapping of administrative facilities (such as security and user privileges), etc. No general approach exists to solve them; each peculiarity of the mapping must be handled by a proprietary method.
In ODRA the mapping between source and target schemata is done by object-oriented virtual updatable views. In contrast to many other view proposals, ODRA views are defined in SBQL, which has full algorithmic power. There is no conceptual limit on the mapping; this especially concerns ORM. SBQL queries addressing an object-oriented front-end schema are modified by the view definitions (the query modification technique) and thus converted into SBQL queries addressing the corresponding relational database. Then these queries are optimized and converted into SQL. Due to this technique, no form of impedance mismatch (of those discussed above) can occur at the front-end level. Mapping of updates of virtual objects into updates of source structures can also be implemented within ODRA updatable views, without conceptual anomalies or warping of the updating intention (problems that plague many solutions concerning views). However, this issue is more critical than pure retrieval, especially concerning performance. Because for the majority of business applications performance dominates over a clean conceptual model, elegance of solution and minimal programming effort, solving the performance problem always means introducing some impedance mismatch of the kinds discussed above.
Last modified: January 15, 2008
 The term impedance mismatch has roots in electronics, where it denotes the difference between the impedance of a source and the impedance of a receiver, causing effective power to be wasted.
 Pragmatic universality requires from a query/programming language all the facilities that are nowadays expected by designers and programmers of such tools and that allow required programming tasks to be accomplished efficiently, in reasonable time, at reasonable cost and manpower, and with adequate quality. The concept has nothing in common with relational completeness or Turing completeness; these concepts are irrelevant when considering the pragmatic universality of query/programming languages.