Issues and Current Practice


Accessing Databases Through the World Wide Web: Issues and Current Practice

This paper examines the issues involved with connecting traditional databases to the World Wide Web.

This paper is signi?cant because it examines the interaction and synergy between a large, mature area of computer science (databases) and a rapidly expanding and largely unexplored area of computer science (Internet / WWW). Databases are the workhorse of most information systems applications. Internet / World Wide Web is the interface the public has come to associate with information searching and retrieval. These two sub-disciplines have just begun to collide, and both are improving to the point where they will soon be inseparable from the users point of view.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


Accessing Databases Through the World Wide Web: Issues and Current Practice
Michael L. Nelson CS 550 12/1/95

Abstract This paper examines the issues, current practices and trends for the future regarding accessing traditional databases through the World Wide Web (WWW). It is often easy to forget that the WWW has no data management facilities of its own; it is simply a “pointing” mechanism. Databases have historically focused on rich toolsets for describing, updating, and modelling data. WWW adds to this a platform independent, client-server access method that uses ubiquitous transport medium. The incorporation of these technologies has introduced a number of new challenges, and resurrected some older computing issues. However, the eventual result will be greater information access to a much wider user community.

1.0 Introduction
The World Wide Web (WWW) is a popular client-server information system architecture built on a number of Internet Protocols [1]. Although work on WWW began in the late 1980s, it was not until the arrival of the client program “NCSA Mosaic” in 1993 that interest in WWW really took off [2]. WWW servers have increased from 130 in June, 1993 to over 23,000 measured in June, 1995 [3]. In contrast, database management systems (DBMS) appeared in the mid 1960s with the most popular language for DBMSs, Structured Query Language (SQL), originally de?ned in the mid 1970s [4]. Given nearly 30 years of development and applicability to numerous disciplines, it is not surprising that database technology has achieved its current prevalence. Since these two technology areas, WWW and DBMS, are applicable to such wide audiences, it is not surprising that work is well underway on gatewaying between them. Each technology has strengths and weaknesses of its own, and the combination of the two introduces a new set of strengths and weaknesses.

2.0 Computer System Trends
This section examines a brief history of each technology, including the current trends that are leading up to the present synthesis. 2.1 Trends in Database Management Systems

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


There have been several language paradigms for DBMS. Each successive language not only adds functionality, but perhaps more importantly presents a different model for describing and modelling data and data ?ow. Highlights of this trend can roughly be summarized by the following models: - Network Data Model: the original model, this consists of 2 data structures: records and sets, and “links” (or a “network”) between these various data structures. - Hierarchical Data Model: closely related to the Network Data Model, the Hierarchical Data Model describes the familiar concept of hierarchies of information, which is familiar for most people. - Relational Data Model: A departure from previous models, the Relational Data Model concentrates on the relations (“works_for”, “visits_with”, etc.) that exist between data objects. The emphasis on the relations, and not just the data objects, gives greater insight to the real world being modelled. The Structured Query Language (SQL) and its derivations are chief language for relational DBMSs. - Entity-Relationship Data Model: the Entity-Relationship (ER) model, and the related Extended Entity-Relationship (EER) model are a popular method of describing database design at a high level. Generally, ER models are drafted for communication between the database designer and the database customer because it provides a simple yet rich toolset for expressing requirements. There is no ER DBMS; ER models are generally translated into relational or other models before they are implemented. - Object Oriented Data Model: Object-Oriented (OO) technology, with roots in OO languages such as Smalltalk and C++, has resulted in database hybrids. While some OO only prototype databases have been constructed, OO has had the biggest impact in DBMSs by adding OO concepts to relational DBMSs. OO databases (and all things OO) are currently popular. Unlike the relational data model, which eventually had a set of rules published to determine relational compliance [5], to date there is no de?nitive agreement on how much OO technology needs to be in a DBMS for it to be considered OO.

This progression of data models represents an evolution in methods of describing and modelling the real world. At ?rst, the models’ ability to describe data were highly constrained by their computer implementation. The progress in data model richness is analogous to the progress in power of programming languages (1st - 4th generation). Regardless of the data model chosen or the speci?c language for the DBMS, the database approach has always offered a number of advantages over a simple ?le system approach of managing data [4]: - Standards enforcement: by having a centrally de?ned format for all input data,

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


a DBMS can avoid “format drift” that naturally occurs without the presence of a central authority. - Reduced application development time: once the overhead for DBMS setup has been paid, development time for applications falls to 17-25% of non-DBMS methods. - Flexibility: changing the structure (to re?ect changing requirements) is much more manageable with the database approach. - Up-to-date information: by eliminating uncoordinated redundancy in data storage, transactions and updates can more accurately model the changing state of the real world. - Economies of scale: similar to the reduced application development time, once the startup costs have been paid, the data management ability scales nicely with the data input to the system. Clearly, the bene?ts of using the database approach are convincing. With arrival of personal computer relational DBMSs such as PARADOX, dBase IV, and Microsoft Access, DBMS use is no longer the province of large IS departments, but is now quite popular in the personal computing realm as well. 2.2 Trends in Internet / World Wide Web Internet and WWW, once restricted to academic and research communities, is also extremely popular in the broader computing communities as well. It is estimated that individuals using the Internet will increase from 3.3 million in 1995, to over 65 million in 1999 [6]. The Internet is a hierarchical network of networks that all utilize the Internet protocol suite (commonly referred to as TCP/IP, though the suite contains more than those 2 protocols) as the basis of services such as mail, ?le transfer and remote login [7]. The WWW is the collection of these and later application level protocols, such as the Hypertext Transfer Protocol (HTTP) [1]. What began as a text-only system around 1990, added multimedia capability with the advent of NCSA Mosaic in 1993 [2]. WWW and NCSA Mosaic were the right products at the right time. They bene?tted by the spread of networks supporting TCP/IP, and the ease of installation and use of WWW related products. Critical elements of WWW’s success include: - Client/Server architecture: by placing a handful of well de?ned and simple protocols in the center, it was easy to develop many different servers and clients for all the popular platforms (Mac, PC, Unix). - Backwards compatibility: WWW included some new features (HTTP), but obsoleted no servers. All FTP, gopher, and USENET resources remained relevant and reachable through the WWW. Protocols not directly supported could be easily gatewayed.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


- Single interface concept: previously, most resources and servers required separate client applications to access them. Even telnet resources implemented highly variable user interfaces. WWW successfully implemented the concept of a single interface to all resources. Why have separate e-mail, news, ftp, and database clients? Seemingly overnight the world was using, or at least talking about, WWW. After the honeymoon was over, mostly people found out several things: 1) a lot of the interesting data still exists only in traditional databases; and 2) the web’s ?le orientation and ease of data creation made people forget the advantages of the database approach (listed in 2.1). In short: many users quickly found themselves with an unmanageable, motley collection of ?les despite the clean presentation by the client interface.

3.0 Synthesis of DBMS and WWW
In 1989, reference [8] listed four areas that will have impact on the reigning relational data model: - integration with expert systems - object orientation - distributed-database technology - hypermedia technology WWW is accomplishing the latter, and to some extent the distributed-database area as well. The key technology for DBMS access through the WWW is the Common Gateway Interface (CGI) [9]. CGI allows for remote clients to execute programs on servers. Figure 1 illustrates the general architecture of WWW - CGI - DBMS interaction.

HTTP client X A

http server



Figure 1: General WWW - CGI - DBMS Access Client A sends an HTTP request to server B, possibly sending in arguments. Server B invokes the CGI program C, which handles the communication through the DBMS applications program interface (API) to speak to DBMS D. 3.1 CGI CGI programs can be written in any language (C, Perl, Tcl, Fortran, etc.). The server just has to give

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


them special permission to be invoked remotely (arbitrary remote execution would be a bad thing), and there are some simple formatting rules to follow so the client can parse the returning data. After that, CGI programs are similar to any other program. Depending on the implementation of the DBMS, the CGI may have to go through some protocol conversion or gateway machines if the DBMS is not on a TCP/IP network. The importance of a well de?ned API for the DBMS cannot be stressed enough. The CGI program can be written to speak directly to a speci?c database (Oracle, Ingress, Informix, etc.), or the CGI program can use a higher level API such as the Open Database Connectivity (ODBC) [10]. ODBC is an API put forth by Microsoft to provide a general, open database API that will hide the speci?c DBMS implementation from the programmer. ODBC was not begun with CGI programmers in mind, but ODBC holds much promise for them. An additional complication for CGI programs is that HTTP is stateless. Much like the Network File System (NFS), a stateless protocol has an initial appeal because of its simplicity, but in practice, people spend much of their time trying to workaround the stateless nature. The current practice is maintain state with Uniform Resource Locator arguments with the GET method, and with the hidden data ?elds of the POST method. The end result is the same: if the transaction requested is not a simple, canned transaction the higher level, logical transaction (“make airplane reservation”) has to carry around the baggage of several smaller, intermediate transactions (“choose Norfolk to Atlanta”, “choose Flight 123”, “choose seat 13A”, etc.). In addition, since the WWW client can query the DBMS but then the connection is closed, the user’s view of the data will go “stale” quickly and will re?ect only a snapshot of the DBMS, not the true state. The method being explored now to workaround the stateless nature involves “Server Push” and “Client Pull” documents [11]. There are 4 desirable properties of transactions [4]: - atomicity - consistency preservation - isolation - durability These are collectively known as the ACID properties. Traditionally, it is the responsibility of the DBMS to enforce these properties. However, adding the additional steps complicates the situation, by adding additional points of failure. The ACID properties may still hold for the DBMS, but for the larger logical transaction, it becomes more dif?cult. More information about how the WWW - CGI - DBMS interaction is handled can be found in [12], [13], [14] and [15]. 3.2 Security With increased accessibility and openness comes increased security threats. Looking at ?gure 1, the potential network transmissions occur at the points labelled X, Y, and Z. Traditional database security occurs with the DBMS D and transaction Z. This would likely involved password protection and possibly allowing connections from speci?ed hosts. Note that what was likely a interactive transaction on Z is now batch-like.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


The HTTP server B may have host level security in place (though this is a weak form of security). The CGI script C may enforce some password / user account level of security, but should be especially sensitive to attacks on the CGI program itself. Over?owing buffers and usurping control of the http server (much like the Internet Worm [16]) could compromise the DBMS as well as the http server. However, the most serious threat to security is that network mediums themselves are rarely secure. It would be not too dif?cult to arrange the system architecture so that transactions Y and Z are reasonably secure (moving them to the same machine, special connections from the http server to the DBMS, etc.), but in the general case transaction A is the most vulnerable if it goes off a local network. Anyone can “sniff” TCP/IP packets as they go by, so sensitive data (including passwords!) can be snooped by a third party as users and the rest of the system interact. Since a secure medium is rarely feasible, most attention is drawn to make that data itself secure. Proposed solutions include Secure Socket Layer (SSL) and Secure HTTP (SHTTP). WWW security is an extremely popular research area, and the interested reader should start with [17].

4.0 General Issues
“Database access through WWW” is a term likely to mean different things to different people. There are several levels of access that can be provided: 1. limited reads: some of the data can be viewed from WWW. The “click here for a canned transaction” examples are included here. 2. full reads: everything in the DBMS is viewable from WWW. This would typically have a “forms” interface that allows all query parameters of the original (non-WWW) interface. 3. limited update: can be combined with #1 or #2, but provides a limited facility for modifying the content of the database. 4. full updates: can also be combined with #1 or #2, but provides all the update capability expected from the original interface. Complexity increases moving from #1 - #4. This can be accomplished by providing access to straight SQL statements, or by providing an additional interface level that more closely models the real world. There are several implementations for the above-mentioned access levels: 1. snap shots of the database: some systems dump the database, process it into another format (either ?le based or another database). This method is often used if the main DBMS is not a “standard” database (most NASA databases fall into this category). This method limits the ability to update a database. 2. intermediate broker: an extension of the above, but could have some staging facility to allow database updates, even while lacking a direct interface 3. direct access: the database is queried and modi?ed as a result of WWW actions. This is obviously the most desirable state, since it eliminates the need to duplicate the database contents in some other format.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


Providing WWW access at any of the levels can take more resources to implement than other WWW projects. One should not simply provide WWW access to legacy information systems, but take the opportunity to rethink and redesign the information system and the requirements it ful?lls [18]. In 1993, there was a lot of talk about DBMS access using WWW, but really only one general package available, GSQL [19]. Now, there are nearly 50 products ranging from free products from research labs and universities, to companion packages from the large database companies (Oracle, IBM, etc.). There are too many to cover individually; the reader is referred to [20]. Figure 2 illustrates the convergence of WWW products and DBMS products since 1993.

strength is access

World Wide Web

well balanced system

strength is data management time


Figure 2: WWW / Database Convergence

5.0 Future Trends
There are a number of trends that are likely to have a large impact as they exchange technology in the nearly completed WWW / DBMS merger. Two are highlighted here. 5.1 Client Side Execution Two main technologies are being developed that focus on client side execution and participation. Java [21] and Virtual Reality Modelling Language [22] are analogous to what CGI is on the server side -- they have the client side involved in executing sub tasks. This could be used in interesting fashion with agents. 5.2 Agents Agents, or autonomous intelligent programs, are likely to have the signi?cant impact predicted for Arti?cial Intelligence predicted in [8]. The interest in this area is growing, and could be especially

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


useful when used in conjunction with languages such as Java to use agents to search, broker and retrieve other agents from other databases. A good summary of agent activity is available at: [23]. Figure 3 presents an interesting possible future information access architecture.

databases network



Figure 3: Agents Acessing Database on Behalf of User

6.0 Conclusions
DBMS are the workhorse of the information systems community. All of us are impacted directly or indirectly by developments in database technology. In addition, the growth of Internet and WWW technology is allowing more people easy access to information in general. The combination of the two technologies allows for their strengths to be combined in order to move to more seamless information access from the user’s perspective. Before this can be achieved however, the newly expanded transaction system needs some previously resolved issues in the area of concurrency, recovery, and security to be revisited. But the real victory has not been achieved. The fact that users and provides still discuss them as separate (albeit related) areas indicates that integration is not complete. Database technology bene?tted by borrowing Object-Oriented technology recently, from WWW technology presently, and is likely to bene?t from Arti?cial Intelligence in the form of agents in the near future.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


1.Berners-Lee, Tim; Cailliau, Robert; Groff, Jean-Francois; and Pollermann Bernd. “World-Wide Web: The Information Universe.” Electronic Networking: Research, Applications and Policy, vol. 2, no. 1, 1992, pp. 52-58. 2. Andreessen, Marc; Bina, Eric, “NCSA Mosaic: A Global Hypermedia System”, Internet Research: Electronic Networking Applications and Policy, vol. 4, no. 1, 1994, pp. 7-17. 3. Gray, Matthew, “Measuring the Growth of the Web”, available at: http://www.netgen.com/info/growth.html 4. Elmasri, Ramez; and Navathe, Shamkant: “Fundamentals of Database Systems”, BenjaminCummings, Redwood City, CA, 1994. 5. Codd, E. F.: “Is Your DBMS Really Relational?”, Computer World, 1985. 6. “Internet Use by Individuals”, Web Week, November 1995, p. 18. 7. Reynolds, J.; Postel, J.: “Of?cial Internet Protocols”, Internet RFC 1011, Information Sciences Institute, May, 1987. 8. Parsaye, K.; Chignell, M.; Khosha?an, S.; Wong, H.: “Intelligent Databases”, John Wiley and Sons, New York City, 1989. 9. CGI Group at NCSA, “The Common Gateway Interface”, available at: http://hoohoo.ncsa.uiuc.edu/cgi/ 10. Geiger, K.: “Inside ODBC”, Microsoft Press, Redmond, Washington, 1995. 11. “An Exploration of Dynamic Documents”, Netscape Communications Corp., available at: http://www.netscape.com/assist/net_sites/pushpull.html 12. Eichmann, D., T. McGregor and D. Danley, “Integrating Structured Databases Into the Web: The MORE System,” First International Conference on the World Wide Web, Geneva, Switzerland, May 25-27, 1994. 13. Leon Shklar, Kshitij Shah, and Chumki Basu: “Putting Legacy Data on the Web: A Repository De?nition Language”, Third International World Wide Web Conference: Technology, Tools and Application, Darmstadt, Germany, 10 - 13 April 1995. 14. Edwin Hasting, and Dilip H. Kumar, “Providing Customers Information Using the Web and CORBA”, Second International World Wide Web Conference: Mosaic and the Web, Chicago, IL, October 17-24, 1994, pp 973-983.

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95


15. Lin, Y.; Chung, J.-Y.; Leung, J.; Dillenberger, D.; and Bowen, N.: “Web Access to IBM Legacy Systems Data”, to be presented at the Fourth International World Wide Web Conference, Boston, MA, Decemember 10-14, 1995. 16. Denning, P.: “The Internet Worm”, RICIS, NASA Ames Research Center, NASA-CR181576, February 1989. 17. Stein, L. D.: “The World Wide Web Security FAQ”, Version 1.1.3, November 30, 1995, available at: http://www-genome.wi.mit.edu/WWW/faqs/www-security-faq.html 18. Miller, R. H.: “Web Interface Design: Learning From Our Past”, to be presented at the Fourth International World Wide Web Conference, Boston, MA, Decemember 10-14, 1995. 19. Ng, J.: “GSQL - a Gateway to SQL databases”, Presented at the Federal Webmasters Workshop, Bethesda, MD, July 13, 1995. Available at: http://ox.ncsa.uiuc.edu/w/gsql/ 20. Rowe, J.: “Accessing a Database Server via the World Wide Web: Existing Products”, October 1995, available at: http://cscsun1.larc.nasa.gov/~beowulf/db/existing_products.html 21. Sun Microsystems: “Java: Programming for the Internet”, 1995, available at: http://java.sun.com/ 22. Pesce, M.; Behlendorf, B: “The VRML Forum”, October 1995, available at: http://vrml.wired.com/ 23. Open Software Foundation, “Web Resources Related to Agent Technology”, September 1994, available at: http://www.osf.org/www/agents_info.html

Michael L. Nelson, m.l.nelson@larc.nasa.gov, CS 550, 12/1/95