Syllabus 17:610:551 Information Retrieval

Paul B. Kantor

Spring 2003

 

I. Catalog Description

TBD

II. Pre- and/or Corequisites

610:550 or at least one course in computer programming and permission of the instructor

III. Course Objectives

To understand how modern information retrieval systems, such as those found on the Web, work, and to understand the important issues in the design, study, and evaluation of such systems.

IV. Organization of the Course

The course will follow a lecture/discussion format, and every class will include some kind of a "small group" breakout session either in the classroom or in the computer lab. The major sections are:

    1. Introduction and overview
    2. Relation of IR to indexing and cataloging
    3. Representation of texts by numbers and vectors; Basic notions of probability for IR
    4. Design for usability; principles of user-centered design
    5. Vector space models and other linear classifiers; probabilistic models; inference models
    6. Evaluation of systems; precision, recall and utility
    7. Bases for representation: terms; phrases; word pairs; n-grams; natural language
    8. Collection-based indexes; LSI; PIRCS; network structures
    9. Collection descriptions; collection fusion; meta-crawlers
    10. Non-textual materials: images and sound; moving images
    11. Evaluation in context: users, communities, goals and measures
    12. Collaborative systems: passive and active models
    13. Presentation of term projects
    14. Presentation of term projects

V. Major Assignments

    1. Small tasks. One or two simple tasks will be assigned each week to ensure that neither students nor teachers are surprised by a backlog of poorly understood concepts at the end of the semester. These are referred to below as "knowledge crumbs" (KC). Note: These problems are "easy" in the sense that once you get the point you can do them in less than 5 minutes. The point, of course, is to help you get the point.
    2. Readings. Each week there will be one assigned reading from materials either available on the Web or distributed in class the prior week. Note: We are trying to avoid an expensive "Professor Packet" through this process, but it is still experimental.
    3. Texts. The following books will each be of some use to some of you, but because of the diversity of interests and preparation, no single book seems suitable for all the students in any given year.

 

*Baeza-Yates, R. & Ribeiro-Neto, B. (1999) Modern information retrieval. New York: ACM Press. This text takes a computer-science perspective. It is more current than Frakes & Baeza-Yates (1992), and broader than Witten, Moffat, and Bell. If you select only one CS-type text, this would be the better choice.

Frakes, W.B. & Baeza-Yates, R., eds. (1992) Information retrieval: data structures and algorithms. Englewood Cliffs, NJ: Prentice Hall. This is a collection of chapters, which vary widely. It is a good place to learn some basics of how IR systems are built. Much of the code is available at ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/

*Hersh, W.R. (1995) Information retrieval: A health care perspective. New York: Springer Verlag.

General introductory text. This is accessible to all students in this course. Bill Hersh (MD) has concentrated his evaluation efforts on health care applications, but covers the general principles, and pays particular attention to the problem of designing good evaluation experiments.

van Rijsbergen, C.J. (1979) Information retrieval, 2nd ed. London: Butterworths. This is an important early text on the subject, from an IR+CS perspective. Parts are quite mathematical. This is the origin of the so-called "F-measure" which has become popular among computer scientists. The great thing is that Keith van Rijsbergen has made it available for FREE, at http://www.dcs.gla.ac.uk/Keith/Preface.html

Salton, G. (1989) Automatic text processing: The transformation, analysis and retrieval of information by computer. Reading, MA: Addison-Wesley. Gerry Salton was the founder of automated IR in the US, and his academic descendants have done much of the good work. This book is quite different from the earlier book by Salton and McGill (1983), and perhaps less useful in introducing fundamental ideas.

Salton, G. & McGill, M. (1983) Introduction to modern information retrieval. New York: McGraw-Hill. Of course this is quite old now, but it introduces many of the ideas that are otherwise only to be found in the papers and technical reports of Salton's group at Cornell.

*Sparck Jones, K. & Willett, P. eds. (1997) Readings in Information Retrieval. San Francisco: Morgan Kaufmann. A collection of significant research papers in the evolution of information retrieval from its roots in librarianship. The editors have written good historical introductions to each section. The mathematical rigor and clarity of the papers varies widely, and in many cases it is not easy to reconstruct the "commonly accepted meaning" of the terms used by the author(s). This is a very good place to look for papers on which to report to the class, and any other papers contained in this collection will be acceptable for that purpose.

Witten, I., Moffat, A., & Bell, T. (1994) Managing gigabytes: Compressing and indexing documents and images. New York: Van Nostrand Reinhold.

 

Online resource: http://www.searchenginewatch.com/

 

    4. Presentation of a paper. Each student is required to select one paper from an extended list that will be made available on the web site, to study that paper thoroughly, and to present the key results of that paper to the entire class on a date to be selected. The presentation may include slides, handouts, or web pages. The presentation should make clear: (1) what problem the paper addresses; (2) what relation it has to prior cited literature; (3) what idea it proposes to solve or improve the problem; (4) what was done to implement that idea; (5) what results were found; and (6) what suggestions were made for further work. Note: We have adopted this approach to presenting readings because several years of experience has established that when I explain the papers they tend to become less accessible, rather than more. When you explain them to each other, they become clear. Of course, if the presentation does not make the paper clear, ASK QUESTIONS.
    5. Term project: System building OR system evaluation. Students are encouraged to work in teams of 2 for this assignment, although a student may elect to work alone. Choose one of the two options below.
    A. BUILD A SYSTEM
    1. Select a body of texts. These may either be the body of texts that the class assembles in KC-1 below, or they may be assembled in some other way. There must be at least 80 texts in the collection, but they may be as short as abstracts if desired. They may contain tags only if that tagging is used to index or retrieve on the collection.
    2. Select a way to represent those texts. This may be simple (such as term occurrence) or more complex (using term frequency, term position, or distance between terms).
    3. Select a way to match simple queries (sets of terms) to those representations to produce a ranked list of the 10 documents most likely to be relevant. (A minimal sketch of such a representation-and-matching pipeline appears after option B below.)
    4. Give the system some kind of interface (this may be web based, but it could be Unix command line if you like).
    5. Get at least one other person from this class to use the system in the intended way, to ensure that you are not totally fooling yourself.
    6. Write up a coherent explanation of what you have done, including the code.
    7. Present the system to the class in one of the last two class meetings of the semester.
    B. EVALUATE A SYSTEM
    1. Select a system to evaluate.
    2. Describe the system in terms of the functions that it supports for a user.
    3. Describe, as well as you can determine, how it must represent documents internally in order to be able to support those functions.
    4. Describe, as well as you can, how it must match queries to documents in order to support these functions.
    5. Prepare a "tasket" (a basket of tasks) that seems suitable for the system and submit the tasks to the system. Process the results to produce all of the following measures of effectiveness: precision at 10 documents for each task; average precision versus recall for each task; averages of each of those measures over the tasket; cumulated number of useful documents versus number presented for each task; and the same cumulation aggregated over the whole tasket. (This is most easily done in a spreadsheet, of course, since it will draw the graphs for you from the data; a sketch of these computations also appears after option B below.)
    6. Write up a coherent explanation of your evaluation, including both "objective" measures derived from the tasket and user-oriented measures based on your own use and observation of the system.
    7. Present your analysis to the class in one of the last two meetings of the semester.
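
For teams taking option A, the following is a minimal, purely illustrative sketch of steps 2-4: a bag-of-words (term occurrence) representation with TF-IDF-weighted cosine matching that returns a ranked top-10 list. The tokenizer, the weighting scheme, and the sample documents are assumptions made for illustration; any representation and matching rule that meets the requirements above is acceptable.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into terms (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Represent each document as a term-frequency vector and compute IDF weights."""
    vectors = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for vec in vectors:
        df.update(vec.keys())
    n = len(docs)
    idf = {term: math.log(n / count) for term, count in df.items()}
    return vectors, idf

def search(query, vectors, idf, k=10):
    """Rank documents by TF-IDF-weighted cosine similarity to the query; return the top k ids."""
    q = Counter(tokenize(query))
    scores = []
    for doc_id, vec in enumerate(vectors):
        dot = sum(q[t] * vec[t] * idf.get(t, 0.0) ** 2 for t in q)
        norm = math.sqrt(sum((vec[t] * idf.get(t, 0.0)) ** 2 for t in vec)) or 1.0
        scores.append((dot / norm, doc_id))
    return [doc_id for score, doc_id in sorted(scores, reverse=True)[:k] if score > 0]

if __name__ == "__main__":
    # Three toy "texts"; a real collection must contain at least 80.
    docs = ["information retrieval systems rank documents",
            "the web is searched by crawlers and indexed",
            "precision and recall measure retrieval effectiveness"]
    vectors, idf = build_index(docs)
    print(search("retrieval on the web", vectors, idf))
```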
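
For teams taking option B, here is a similarly hypothetical sketch of the measures in step 5, assuming each task's result has already been reduced to a ranked list of relevance judgments (True = useful). The judgment lists below are made-up data; a spreadsheet remains the easiest way to produce the graphs.

```python
def precision_at_k(judgments, k=10):
    """Fraction of the first k presented documents judged useful."""
    return sum(judgments[:k]) / k

def average_precision(judgments, total_relevant):
    """Mean of the precision values at each rank where a useful document appears."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(judgments, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def cumulated_useful(judgments):
    """Cumulated number of useful documents versus number of documents presented."""
    counts, total = [], 0
    for relevant in judgments:
        total += relevant
        counts.append(total)
    return counts

if __name__ == "__main__":
    # One (judgment list, total relevant documents) pair per task in the tasket.
    tasket = [
        ([True, False, True, False, False, True, False, False, False, False], 4),
        ([False, True, True, True, False, False, False, True, False, False], 5),
    ]
    p10 = [precision_at_k(j) for j, _ in tasket]
    ap = [average_precision(j, n) for j, n in tasket]
    print("P@10 per task:", p10, "average over the tasket:", sum(p10) / len(p10))
    print("AP per task:", ap, "average over the tasket:", sum(ap) / len(ap))
    # Aggregate the per-task cumulation curves rank by rank for the whole tasket.
    curves = [cumulated_useful(j) for j, _ in tasket]
    print("Cumulated useful documents over the tasket:", [sum(col) for col in zip(*curves)])
```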

I think that 12-15 pages is a good length for the paper. Keep graphs small, so that each takes up no more than one quarter of a page. Plan to make a 10-15 minute presentation on the work, telling what you did, why you did it that way, what you found out, and what you would have changed after you saw how it came out.

Note: In previous years we have sometimes used class projects that involved studying multiple subjects using a system of interest. Because of the increased vigilance that we (Rutgers) are exercising to protect human subjects in experimental or research studies, this will not be done this semester. However, if you are interested in evaluation of real systems using real users, you should visit the relevant Web site at Rutgers to learn more about the notion of Human Subjects Review.


VI. Methods of Assessment

Grades will be assigned for each of the activities, with the following weights:

Knowledge Crumbs 25%

Presentation of a Reading 20%

Term Project 45%

Participation 10% *

TOTAL 100%

The mapping between numerical scores and letter grades, and the numerical value assigned to each letter grade, is as follows:

96-100 -> A+ -> 98

90-95 -> A -> 93

83-89 -> B+ -> 86

78-82 -> B -> 80

72-77 -> C+ -> 75

65-71 -> C -> 68

Work that does not achieve a grade of C will be assigned an honorary score of 50% if it is submitted on time. If it is not submitted, it will be assigned a courtesy score of 10%.

Late Work: Work that is submitted late will experience a "decay of its grade" at the rate of 5% per day. Thus an A paper received 3 days late receives a score of A -> 93% x 95% x 95% x 95% = 79.7% -> B. And so forth. I will not count weekend days, if you don't count them against me in grading papers.
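
As a check on the arithmetic above, a short sketch (the A -> 93 value comes from the grade table in this section):

```python
def late_score(numeric_grade, days_late):
    """Apply the 5% multiplicative per-day decay to a numeric grade."""
    return numeric_grade * (0.95 ** days_late)

print(round(late_score(93, 3), 1))  # 79.7, which maps back to a B
```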

* Participation. We will have small group sessions each week. The first grouping will be random. Thereafter, I will ask each of you to give me a ranked list of those in your group, with the person you would most like to meet with again listed first, and so on. In succeeding weeks we will keep re-arranging groups to try to satisfy those preferences. These rankings will also count for half of the grade for participation. This is part of an ongoing experiment to determine whether peer assessment is consistent with instructor assessment in my courses. You are free to not submit a list of preferences, in which case you will surely be assigned to a group at random the following week.

 

Knowledge Crumbs