GSPP Home Grad School GRI Home GRI Contact GRI Site Map Gallaudet University

 

RAGU SEARCH SYSTEM
Kevin Cole, Gallaudet University
January 2002


INTRODUCTION

The past several issues of the "Research at Gallaudet University" (RAGU) newsletter have been made available in Adobe Portable Document Format (PDF). Susan suggested that it would be nice to have a search engine that would allow people to search for specific articles within the newsletter. That's where this project got its start.

DECISIONS, DECISIONS

I thought about different approaches to the system and decided:
   · There would not be a lot of data per record.
   · The number of records would grow very slowly.
   · Therefore, the data entry didn't need to be fast.
   · I like cross-platform, non-proprietary solutions.
   · The system should not require any special user software.
   · All the software would reside on the server.
   · I wanted to learn a new trick.

SOFTWARE

I ended up going with four components:
   · The apache web server.
   · The PostgreSQL database system.
   · The PHP embedded scripting language.
   · Python as the CGI-bin interface.

RATIONALE

Apache's a given. Something like 90% of the Internet uses apache to provide web pages to users. (This is one of the few areas that Microsoft is upset that it doesn't control). It's available for several operating systems, is free, and like so many other good tools, the source code is available, and it's highly configurable.

The PostgreSQL database system has proven to be fairly dependable for other projects, including the College and Career Guide. I've set up the server so that it dumps the contents of all PostgreSQL databases to an ASCII file nightly, for Dennis's backup. Previous experience showed that PostgreSQL could be used interactively from a command line, or called from C or python. With the right ODBC components (freely downloadable) PostgreSQL can talk to MS Access. There are also point-and-click menu GUI systems for PostgreSQL that make it look more like MS Access, but I've never used them, so I don't know if they're any good.

I've looked at the perl programming language, and even have a few excellent books, but I haven't really put any time into learning it. It's a wonderful language, but there are a lot of tricks to using it well. It uses a lot of special characters to compress complex commands into very few characters. It's sort of like TECO on steroids.

I opted for python because it's a little more readable, it is an interpreted language, as opposed to a compiled language, which means that it was very easy to test small pieces of the code interactively. I could test concepts by typing them directly into the interpreter, rather than writing the code to a file, then feeding the file to the language. (This is the way that the original BASIC programming language worked.)

Although it was possible to do everything with python and PostgreSQL, some of the tasks could be handled directly inside an HTML file with the use of an embedded scripting language. JavaScript could do some of what I wanted but doesn't have any built-in way to reach a database (as far as I know). Another common scripting language known as PHP does have that capability.

METHODOLOGY

The data entry starting point is a PHP file. The actual source code of a PHP file looks like a regular HTML file with some weird scripts in the middle (vaguely like JavaScript). This code queries the RAGU table, extracts a unique identifier, and an article title for each record, and generates a form with radio buttons for the data entry person to select an article. (There's also a radio button for creating a new article record. The "new article" button is selected by default.) Unlike JavaScript, PHP is pre-processed which means that the document is scanned and processed by PHP BEFORE being displayed to the end-user. Therefore, when an end-user tries to "View Source", he/she will never see the actual scripting code. It is replaced by whatever the script outputs.

Once a selection is made, the form is handled as a CGI-bin submission and is handed off to a python program that generates an entry form for the selected article. If the article already exists, the form is "pre-printed" with all the old information, and the data entry person is offered the option of changing or deleting the record, or cancelling the transaction. If no record exists, then a blank form is displayed and the user can either add or cancel. There is some minimal error checking done during this phase, but I'm assuming that the data entry people will be in-house and trained by me.

The form generates a unique ID based on the year, issue and article numbers. The information is then passed to a second python program which takes the data from the form and uses PostgreSQL to update the table.

The search engine is a regular HTML file with a form that hands its data to a third python program, that compiles criteria for an SQL select query. The results of that query are then displayed as a new web page.

PROS AND CONS

The biggest advantages are that the system is very ASCII based, (no binary files), and both python and PHP are available for other computers and operating systems. This means the system is fairly portable, provided that python, PHP and a web server software that accepts plug-ins (e.g. apache among others) are installed on the host server.

Another advantage is that all the software is freely available for downloading, and as is typical in the Linux universe, bugs are caught and patched quickly. There is also a wealth of on-line help available, both in the form of downloadable documentation and mailing lists.

This was my first use of PHP and, though this program doesn't really benefit from it much, the pre-processing aspect of PHP holds a possible security benefit, since the program part of the web page "disappears" before the user sees it, unlike JavaScript.

The only real downside that I've seen so far was a biggie: This past weekend I attempted to call up my PHP programs from home, and none of them worked. I have no idea what happened. As a last resort, I restarted apache (that's the program that listens for incoming requests for web pages) and all was well again. I don't know if the problem was related to strange happenings over in EMG or if there's something nasty about PHP that I'm unaware of.

The system's not particularly speedy, but to me that didn't seem to be a design criteria.

There's also a slight kink with embedded HTML in the database entries. This is not due to a limitation in the system. The problem is that I hadn't though about inserting HTML into the titles or abstracts until the design was well under way, and when using several languages that have different meanings for special characters (like quotes and greater-than and less-than symbols), one has to be extra careful. I wasn't.

THE FUTURE

There are still some finishing touches to be put on the system, (e.g. additional text explaining how to use the system, etc.) but the system works well enough that I'll be using a similar system to move the publications database over to gri.gallaudet.edu.

In the past, to produce PDF output from a database, I had two choices: Either I could write raw Postscript and then use Ghostscript's converter ps2pdf, which is how the SAT student reports are produced or I could write a source document, with special embedded formatting commands for a document layout system called lout and then use lout to generate the final output, which was how the College and Career print edition came to life. (For you old-timers, lout is very similar to the old DECsystem-10 program RUNOFF.)

However, I recently discovered an an add-on package for python called ReportLab which enables python to produce PDF output directly, without first producing an intermediate document. I've only scratched the surface of that, but it looks like it could come in very handy for dynamically generating PDF documents based on user-input. In other words, folks could fill out a web form, punch a button, and have a PDF with their name and address plugged into all the right places. ReportLab doesn't appear to offer the fine-tuned detail of the raw Postscript method, but it may be speedier, and for jobs that don't require a lot of messing around, it may be just the ticket. For simpler jobs, I think it will ultimately prove to be better than lout at least.

It's also fairly easy to get python to send e-mail, or even generate a fax and send it (provided, of course, the system that is sending has a fax-capable modem). The new GRI server will have that built-in.

REFERENCES

For folks wishing to investigate these toys for themselves, the following web pages should prove helpful:

   · The RAGU data entry system AUTHORIZED USERS ONLY
   · A quick list of RAGU titles and abstracts
   · The RAGU search engine
   · Color-coded source code for this system
   · The official apache home page: (www.apache.org)
   · The official PostgreSQL home page: (www.postgresql.org)
   · The official PHP home page: (www.php.net)
   · The official Python home page: (www.python.org)
   · The official ReportLab home page: (www.reportlab.com)

NOTES

The pages currently still reside in my personal test directory, rather than in a system-wide advertised area. I'd like users to beat on the system a little, and expose bugs or design flaws before broadcasting my mistakes in prime-time.

The HTML used in these pages, and in many others on the GRI server use the latest and greatest HTML specifications, including style sheets. These new specs are designed to make HTML look more like XML, which has more formal requirements. One difference that HTML wizards will notice is that tags which have no closing tag (e.g. the tag used for line breaks) now MUST be ended with a space, a slash and a greater-than: e.g. <br /> not <br>. Another is that all tags and parameters must be lower case, and all parameters must be quoted, including numeric parameters: e.g. <table width="120"> not <TABLE WIDTH=120>.

These pages also make use of server-side includes (SSI) which allows pages to include one file within another. This makes it easy to generate a stock header and footer and apply them to all the web pages by embedding a single line that "calls" the header and another line which "calls" the footer. Used wisely, together with PHP and python, this allows for the system to send different pages back to the end-user, based on the browser being used. For example, if the end-user is using a text-only browser like Lynx, the image map menus are replaced by simple text menus.


Gallaudet Research Institute
Graduate School and Professional Programs
Gallaudet University
Hall Memorial Building, Fourth Floor, South Wing
800 Florida Avenue, NE
Washington, DC 20002-3695

Webmaster: Kevin Cole, GRI

| GSPP HOME | GRAD SCHOOL | GRI HOME | GRI CONTACT | GRI SITE MAP | GALLAUDET |

Valid XHTML 1.1 and CSS