25-Aug-87 07:12:52-MDT,12439;000000000000
Return-Path:
Received: from nems.ARPA by SIMTEL20.ARPA with TCP; Tue, 25 Aug 87 07:12:21 MDT
Received: by nems.ARPA id AA12811; Tue, 25 Aug 87 09:12:03 edt
Message-Id: <8708251312.AA12811@nems.ARPA>
Date: 25 Aug 87 09:11 EDT
From: science@nems.ARPA (Mark Zimmermann)
Subject: brwsr descriptive notes
To: gergely@drea-xx.ARPA, decvax!savax!mf@ucbvax.berkeley.edu,
    pbr%pco@bco-multics.ARPA, science@nems.ARPA, rthum@simtel20.ARPA,
    lsuc!dave%ai.toronto.edu@relay.cs.net

Appended below, if all goes well, is a short (ca. 11 kB) file describing
how to use the "brwsr.c" program to browse through indexed text files
... it also includes some of my standard philosophizing about free-text
indexing and browsing tools. Recipients of this note should also be
getting two C programs, ndxr.c and brwsr.c, in separate mailings later
today.... Let me know of any problems you experience! I have
successfully compiled and run brwsr.c and ndxr.c on Macintosh, VAX, and
Sun, but certainly there are many things that can and will go wrong ....

^z

The Browser Project -- by Mark^Zimmermann

SYNOPSIS: This note describes a project to develop tools for indexing,
browsing, and otherwise making use of massive free-text databases. It
also contains specific information about the C programs Ndxr and Brwsr
which implement a transportable version of the Browser system originally
developed in MacForth Plus for the Apple Macintosh.

FANTASIES: Imagine being able to remember the entire contents of all
the UNIX manuals and the other system documentation. Imagine that every
message you've seen on CompuServe for the past year is at your
fingertips. Imagine being able to browse through the source code of all
the programs you've ever written. No more nagging feeling that "I've
done this before, but I don't know where." No more apologies for asking
again for help in solving a problem that you know was answered online
last year.
No more hours spent finding the obscure paragraph documenting the
crucial half-remembered ROM routine you need to call. Imagine
researching a paper and having hundreds of relevant citations at your
fingertips. That's the power that Brwsr tries to provide you!

MOTIVATION: The volume of electronically available information has
grown explosively in recent years. Today, quite inexpensively, one can
download or acquire in electronically-readable form many megabytes of
valuable information -- technical reference manuals, on-line
discussions of all sorts of topics, news wire-service cables,
dictionaries and encyclopaedias, etc. A researcher can accumulate
thousands of bibliographic citations -- authors, titles, institutional
affiliations, abstracts, key words, and full text of articles. Magnetic
hard disks, and now optical disks, allow quick access to vast libraries
of information on specialized topics of great interest.

But these large volumes of information aren't really useful! Typical
information retrieval systems fall short in one or more crucial areas:

- they require "clean" input data, in structured, homogeneous formats
- they break down when input files exceed a few megabytes
- they are intolerably slow in responding to simple queries
- they do not allow easy, interactive "browsing" of the data
- they are not integrated with writing or programming systems
- they require exotic, expensive hardware

GOALS: I am working to develop tools to aid people in browsing,
indexing, retrieving, and using massive quantities of free-text
information. The tools I want to develop are not "artificial
intelligence" computer projects, except perhaps in a few ways (e.g.,
maybe some very limited natural-language understanding, pattern
recognition, and deductive reasoning). Rather, I want to use computer
systems to do what they currently do best -- simple, boring, repetitive
tasks.
The human users are then free to do what they do best -- hypothesis
formulation, pattern recognition, and common-sense understanding of
situations.

The Brwsr program described below is the first of (I hope) many
programs to enable people to make better use of big disorganized
databases. Future programs will focus on clustering and correlating
terms in the inverted indices, and on giving people new,
non-alphanumeric, ways to move around in "dataspaces".

REQUIREMENTS: Tools must meet four basic requirements in order to be
successful:

- Responsiveness: the user should not have to wait for more than a few
  seconds for an answer to a simple query -- otherwise, trains of
  thought get disrupted and alternatives are never explored;
- Transparency: the user should not have to learn an arcane or
  convoluted query language -- the computer system should have a
  simple, clear, obvious user interface, so that the human's mental
  effort can be devoted to problem-solving, not tool-manipulation;
- Seamlessness: the user should not have to translate or reformat
  information to move it into or out of tools -- rather, data must be
  able to flow directly from sources through tools into notes and then
  into finished products;
- Portability: the user should not have to give up old tools and learn
  new ones as progress in computer technology renders current
  state-of-the-art hardware obsolete -- instead, tools must be
  transportable to new systems which will be introduced every year.

As the volume of information handled becomes larger, these requirements
become ever more important. As computer power becomes cheaper, the
requirements become easier to meet. I believe that the curves have now
crossed. Current computer technology can support the development of
responsive, transparent, seamless, portable tools. The current volume
of raw data available demands such tools.

BRWSR: Here, very briefly, I will describe the Brwsr program.
It is an implementation, meant to be portable to almost any computer
system, of my MacForth Plus "Browser" program, without the Macintosh
user interface features. However, the Brwsr program retains all the
other features of the Macintosh Browser:

- easy access to a complete inverted index for a huge file
- easy access to a key-word-in-context display for a chosen index item
- easy access to the full text of the document for a chosen item
- easy proximity searching restrictions
- easy note-taking

To use Brwsr, one first must have generated an index using the Ndxr
program. The index files created will have names x.k and x.p where "x"
represents the name of the original document/database text file that
has been indexed. The Ndxr program currently indexes at a rate of over
2 megabytes/hour on a Macintosh Plus, and over 6 megabytes/hour on a
Sun workstation.

Upon running Brwsr, type a "?" command to get a synopsis of the
available commands. Suppose you have a file named "stuff" that has been
indexed which you wish to browse. Begin with the command ":stuff" to
open that file. Brwsr tells you how many total characters, words, and
unique words are in the indexed file. Hit the Return key a few times to
see the first few words in the alphabetized index. (They will probably
be numbers, which come before the letters in the alphabet.) Type a word
and hit Return to jump the index to that word. Suppose you are
interested in UNIX; the display for that word might be something like
"314 UNIX", meaning that the word "UNIX" came up 314 times in your
indexed database file "stuff".

Type "=" and you move down into the key-word-in-context display for the
word "UNIX". In this level of the program you can see every occurrence
of the indexed word with half a line of contextual information on
either side of it. Hit Return a few times to see several lines of this
display. If you want to see more lines at once, say 15 lines, type
".15".
If you want to go up to a previously displayed line, say 5 lines
earlier, type "-5". To jump down to a later occurrence of "UNIX", say
100 lines down, type "+100".

When you find a line that looks interesting, type "=" again and you
will begin to view the actual, unfiltered text of the document in the
region where your chosen index item occurred. To see 20 lines around
the target line, type "-10" and then ".20".

If the item looks interesting and you would like to "clip" it out for
further use elsewhere, you can open a file for taking notes. Say you
want to call your current notes file "stuffnotes": type the Brwsr
command ">stuffnotes". From this point onwards, all the lines output by
the Brwsr will be copied to the "stuffnotes" file as well as displayed
on the screen. Add annotations to the notes file by prefacing them with
a "'" mark, e.g., "'this is a comment". Close off the notes file by
entering the Brwsr command ">" without a file name.

At any time, you may jump back to the top-level index display by simply
typing a word not prefaced by a Brwsr command character.

Suppose you decide to browse the indexed word "C". Type "C" and get the
display of how many times "C" occurs in your database; perhaps it comes
up 5000 times, far too many for convenient browsing. You can restrict
the occurrences of C based on proximity to other words of interest. Go
back to UNIX by typing "UNIX", and you'll again see the display line
"314 UNIX". Create an empty working subset of the whole database by
entering the Brwsr command "*". This causes the display to appear now
as "0/314 UNIX", meaning that no occurrences of "UNIX" are yet included
in the subset. Type "&" to add the neighborhoods of every occurrence of
"UNIX" to your working subset. The display now reads "314/314 UNIX"
meaning that all of the 314 times that "UNIX" appears are now "good",
as far as you are concerned.
Then go back to "C" by typing that command, and you should see
something like "81/5000 C" meaning that only 81 out of the 5000 times
that "C" appears in the database file are within a few words of "UNIX".
You can type "=" just as before to jump down into a key-word-in-context
display of the 81 "good" occurrences of "C". And, just as before, you
can type "=" for any of those 81 occurrences in order to move into the
full text of the database in that spot.

To go back to browsing the whole database, type "**". When working in a
subset, to add a larger neighborhood than a few words around a chosen
term, type "&&" or "&&&" instead of just "&". Any number of word
neighborhoods can be combined into a working subset just by
sequentially calling for them ... they add up in boolean "OR" fashion.

When finished browsing, type ":" without a file name and answer "y" to
the question "Quit?" in order to exit the program.

FOR FURTHER INFORMATION: Contact me, Mark^Zimmermann, 9511 Gwyndale
Drive, Silver Spring, MD 20910. I'd love to discuss the subjects of
information retrieval, browsing, post-processing, etc., with anybody
who cares to. Phone me at 301-565-2166, send arpanet messages to
"science (at) nems.arpa", or on CompuServe to [75066,2044].

The Macintosh Browser program is available online in various places,
including the Sumex archives and the MAUG Macintosh Users' Forum on
CompuServe. Alternatively, send me a formatted Macintosh disk and a
self-addressed stamped envelope and I'll be happy to mail a copy to
you. Work is underway to further develop the Browser at MicroDynamics,
Ltd. in Silver Spring, and possibly enhanced versions will be available
commercially from them in late 1987.

DISCLAIMER: Any programs you get from me are supplied with no
guarantees whatsoever. I have never had data lost due to Brwsr crashes,
and am making every effort to make the program bug-free, but I cannot
be responsible for any problems which the program may cause you or your
computer system.
My employer has no responsibility for and makes no warranties about
anything I do or say, either! Various words used in the above are
probably trademarks of various organizations.

^z - 1987 March 16 - Revised 1987 Apr 20, May 14, Aug 19.
-------
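
POSTSCRIPT: The working-subset commands described in the BRWSR section
("*" to start an empty subset, "&" to add word neighborhoods, and
displays such as "81/5000 C") can be pictured with a few lines of C.
This is only a minimal sketch under stated assumptions, not the code in
brwsr.c. It keeps a small text in memory instead of reading the x.k and
x.p index files, it assumes that words are runs of letters and digits
folded to upper case, and every name in it (load_text, clear_subset,
add_neighborhood, count_total, count_in_subset, MAX_WORDS, MAX_LEN, and
the "radius" parameter standing in for "a few words") is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORDS 1024  /* hypothetical cap, for this sketch only */
#define MAX_LEN   32    /* hypothetical longest word kept, plus NUL */

static char words[MAX_WORDS][MAX_LEN];  /* the text as upper-cased words */
static unsigned char good[MAX_WORDS];   /* 1 = position is in the subset */
static int nwords;

/* Split text into upper-cased alphanumeric words (an assumed word rule). */
void load_text(const char *text)
{
    nwords = 0;
    memset(good, 0, sizeof good);
    while (*text != '\0' && nwords < MAX_WORDS) {
        int n = 0;
        while (*text != '\0' && !isalnum((unsigned char)*text))
            text++;
        while (*text != '\0' && isalnum((unsigned char)*text)) {
            if (n < MAX_LEN - 1)  /* overlong words are truncated */
                words[nwords][n++] = (char)toupper((unsigned char)*text);
            text++;
        }
        if (n > 0)
            words[nwords++][n] = '\0';
    }
}

/* Like "*": start over with an empty working subset. */
void clear_subset(void)
{
    memset(good, 0, sizeof good);
}

/* Like "&": add the neighborhood of every occurrence of key.  Calls
   accumulate, so neighborhoods combine in boolean OR fashion. */
void add_neighborhood(const char *key, int radius)
{
    int i, j;

    assert(radius >= 0);
    for (i = 0; i < nwords; i++) {
        if (strcmp(words[i], key) != 0)
            continue;
        for (j = i - radius; j <= i + radius; j++)
            if (j >= 0 && j < nwords)
                good[j] = 1;
    }
}

/* The "5000" in a display like "81/5000 C": all occurrences of key. */
int count_total(const char *key)
{
    int i, n = 0;

    for (i = 0; i < nwords; i++)
        if (strcmp(words[i], key) == 0)
            n++;
    return n;
}

/* The "81": occurrences of key that fall inside the working subset. */
int count_in_subset(const char *key)
{
    int i, n = 0;

    for (i = 0; i < nwords; i++)
        if (good[i] && strcmp(words[i], key) == 0)
            n++;
    return n;
}
```

To imitate the session in the text, call load_text() on some text, then
clear_subset(), then add_neighborhood("UNIX", 4); after that,
count_in_subset("C") and count_total("C") give the two numbers of a
"good/total" display. Keys must be given in upper case because the
sketch folds case when it loads the text. A larger radius plays the
part of "&&" or "&&&".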