Gerard Salton
Professor
gs@cs.cornell.edu
Ph.D. Harvard University, 1958
Natural-language text processing is a rapidly expanding field of research and development. Large masses of machine-readable text now exist that can be cheaply stored on high-density optical storage media and rapidly retrieved on demand. Furthermore, sophisticated methods are available for analyzing document texts, formulating appropriate user queries, conducting rapid file searches, and ranking the retrieved items in decreasing order of importance to the users.
At Cornell, we design and operate large, general-purpose text processing environments where texts can be handled without restrictions as to size or subject matter. In the absence of knowledge bases that would be useful for unrestricted text databases, we use corpus-based text analysis systems that determine the meaning of words and expressions by a refined context analysis using statistical and probabilistic criteria. Using the corpus-based approaches, we are able to determine text similarity with a high degree of accuracy. There are two main applications:
- The automatic generation of structured text collections (hypertext) where semantically similar pieces of text are automatically linked. Hypertext representations of large databases provide flexible browsing capabilities for general-purpose text access.
- The automatic retrieval of interesting text excerpts in response to available search queries.
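The similarity computation behind both applications can be illustrated with a small term-vector example. This is a minimal sketch, not the Smart system's implementation: it assumes a simple tf-idf weighting with cosine similarity, and the function names (`tokenize`, `build_idf`, `weight`, `cosine`, `rank`) and the particular idf formula are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(token_lists):
    """Inverse document frequency per term (assumed form: log(1 + N/df))."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(1 + n / d) for t, d in df.items()}


def weight(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the collection are ignored."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, documents):
    """Return (score, document index) pairs in decreasing order of similarity."""
    token_lists = [tokenize(d) for d in documents]
    idf = build_idf(token_lists)
    q = weight(tokenize(query), idf)
    scored = [(cosine(q, weight(toks, idf)), i)
              for i, toks in enumerate(token_lists)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "dogs chase cats in the park",
        "stock markets fell sharply",
    ]
    for score, index in rank("cat on a mat", docs):
        print(f"{score:.3f}  {docs[index]}")
```

The weights come only from term statistics of the collection itself, which is the sense in which such methods are corpus-based: no hand-built knowledge base is consulted.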
We have done extensive work with an automated encyclopedia consisting of about 25,000 articles (the Funk and Wagnalls New Encyclopedia). We are also processing the TREC collection, which consists of about 800,000 full-text documents (over 2 gigabytes of text) covering a number of different subject areas.
A sophisticated search and retrieval service exists, as well as a text linking system capable of relating different text sections, paragraphs, and sentences. The main test vehicle continues to be the current version of the Smart text analysis and retrieval system, operating under UNIX on Sun SPARCstations and Sun-4 terminal equipment.
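Linking related passages can be sketched in the same spirit. The example below is an illustration under stated assumptions, not a description of the actual linking system: it uses raw term-frequency vectors, and the helper names (`term_vector`, `cosine`, `link_passages`) and the similarity threshold of 0.3 are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Raw term-frequency vector for one passage."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counters (missing terms count as 0)."""
    dot = sum(c * b[t] for t, c in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def link_passages(passages, threshold=0.3):
    """Return (i, j, similarity) for each passage pair at or above the threshold.

    The threshold value is an assumption chosen for this illustration.
    """
    vectors = [term_vector(p) for p in passages]
    links = []
    for i, j in combinations(range(len(passages)), 2):
        similarity = cosine(vectors[i], vectors[j])
        if similarity >= threshold:
            links.append((i, j, similarity))
    return links


if __name__ == "__main__":
    passages = [
        "information retrieval ranks documents",
        "retrieval of documents by ranking information",
        "gardening tips for spring",
    ]
    for i, j, similarity in link_passages(passages):
        print(f"passage {i} <-> passage {j}: {similarity:.3f}")
```

The same function applies at any granularity (sections, paragraphs, or sentences), since it only needs a list of text units to compare.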
University Activities
- Member, Engineering College Library Committee
Professional Activities
- Associate Editor, ACM Transactions on Information Systems
- Program Committee: SIGIR 94, Seventeenth International Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994; EP '94, Electronic Publishing, Darmstadt, Germany, 1994; Information Retrieval and Genomics, National Library of Medicine, Bethesda, Maryland, May 1994; Multimedia-Hypermedia and Virtual Reality, Moscow, September 1994.
Lectures
- Automatic Construction of Hypertext Links, Federal Institute of Technology (ETH) Zurich, Switzerland, June 1993.
- Progress in Information Retrieval Research, University of Konstanz, Germany, June 1993.
- Hypertext and Information Retrieval, ASIS National Meeting, Columbus, Ohio, October 1993.
- Automatic Text Utilization in Large Full Text Databases. Computer Science Colloquium, Ohio State University, Columbus, Ohio, October 1993.
- Automatic Information Retrieval. Lecture Course at Hypertext-93, Seattle, Washington, November 1993.
- Full Text Information Retrieval. Microsoft Corporation, Seattle, Washington, November 1993.
- Automatic Text Utilization. Workshop on Information and Genomics, National Library of Medicine, Bethesda, Maryland, May 1994.
Publications
- Approaches to Passage Retrieval in Information Systems. Proceedings 16th Annual International Conference on Research and Development in Information Retrieval (SIGIR-93), Association for Computing Machinery, New York (1993), 49-58 (with J. Allan and C. Buckley).
- Selective Text Utilization and Text Traversal. Proceedings Hypertext-93, Association for Computing Machinery, New York (November 1993), 131-144 (with J. Allan).
- Automatic Structuring and Retrieval of Large Text Files. Communications of the ACM, 37:2 (February 1994), 97-108 (with J. Allan and C. Buckley).
- Text Retrieval Using the Vector Processing Model. Proceedings Third Annual Symposium on Document Analysis and Information Retrieval, University of Nevada, Las Vegas, Nevada (April 1994), 9-22 (with J. Allan).
Software
- The Smart text analysis and retrieval system is made available free of charge for research purposes. Several hundred copies of Smart (version 11) have been distributed and are used around the world.
If you have questions or comments please contact:
www@cs.cornell.edu.
Last modified: 9 November 1994 by Denise Moore
(denise@cs.cornell.edu).