Syllabus 17:610:551 Information Retrieval
Paul B. Kantor
II. Pre- and/or Corequisites
610:550 or at least one course in computer programming and permission of the instructor
III. Objectives of the Course
To understand how modern information retrieval systems, such as those found on the Web, work, and to understand the important issues in the design, study, and evaluation of such systems.
IV. Organization of the Course
The course will follow a lecture/discussion format, and every class will include some kind of "small group" breakout session, either in the classroom or in the computer lab. The major sections are:
V. Major Assignments
*Baeza-Yates, R. & Ribeiro-Neto, B. (1999) Modern information retrieval. New York: ACM Press. This text takes a computer-science perspective. It is more current than Frakes & Baeza-Yates (1992), and broader than Witten, Moffat, and Bell. If you select only one CS-type text, this would be the better choice.
Frakes, W.B. & Baeza-Yates, R., eds. (1992) Information retrieval: data structures and algorithms. Englewood Cliffs, NJ: Prentice Hall. This is a collection of chapters, which vary widely. It is a good place to learn some basics of how IR systems are built. Much of the code is available at ftp://sunsite.dcc.uchile.cl/pub/users/rbaeza/irbook/
*Hersh, W.R. (1995) Information retrieval: A health care perspective. New York: Springer-Verlag.
General introductory text. This is accessible to all students in this course. Bill Hersh (MD) has concentrated his evaluation efforts on health care applications, but covers the general principles, and pays particular attention to the problem of designing good evaluation experiments.
van Rijsbergen, C.J. (1979) Information retrieval, 2nd ed. London: Butterworths. This is an important early text on the subject, from an IR+CS perspective. Parts are quite mathematical. This is the origin of the so-called "F-measure" which has become popular among computer scientists. The great thing is that Keith van Rijsbergen has made it available for FREE, at http://www.dcs.gla.ac.uk/Keith/Preface.html
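For reference, the F-measure that grew out of van Rijsbergen's book combines precision and recall into a single number. A minimal sketch of the now-standard formulation (the numbers in the example are illustrative, not from the text):

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure: a beta-weighted combination of precision and recall.
    beta=1 gives the familiar F1, the harmonic mean of the two."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative case: a system retrieves 30 documents, of which 20 are
# relevant, and there are 40 relevant documents in the collection.
precision = 20 / 30
recall = 20 / 40
print(round(f_measure(precision, recall), 3))  # 0.571
```

Note that the harmonic-style mean punishes imbalance: a system with high recall but poor precision (or vice versa) scores well below the arithmetic average of the two.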
Salton, G. (1989) Automatic text processing: The transformation, analysis and retrieval of information by computer. Reading, MA: Addison-Wesley. Gerry Salton was the founder of automated IR in the US, and his academic descendants have done much of the good work. This book is quite different from the earlier book Salton and McGill, and perhaps less useful in introducing fundamental ideas.
Salton, G. & McGill, M. (1983) Introduction to modern information retrieval. New York: McGraw-Hill. Of course this is quite old now, but it introduces many of the ideas that are otherwise only to be found in the papers and technical reports of Salton's group at Cornell.
*Sparck Jones, K. & Willett, P. eds. (1997) Readings in Information Retrieval. San Francisco: Morgan Kaufmann. A collection of significant research papers in the evolution of information retrieval from its roots in librarianship. The editors have written good historical introductions to each section. The mathematical rigor and clarity of the papers varies widely, and in many cases it is not easy to reconstruct the "commonly accepted meaning" of the terms used by the author(s). This is a very good place to look for papers on which to report to the class, and any other papers contained in this collection will be acceptable for that purpose.
Witten, I., Moffat, A. & Bell, T. (1994) Managing gigabytes: Compressing and indexing documents and images. New York: Van Nostrand Reinhold.
I think that 12-15 pages is a good length for the paper. Keep graphs small, so that each takes up no more than 1/4 of a page. Plan to make a 10-15 minute presentation on the work, telling what you did; why you did it that way; what you found out; and what you would have changed, after you saw how it came out.
Note: In previous years we have sometimes used class projects that involved studying multiple subjects using a system of interest. Because of the increased vigilance that we (Rutgers) are exercising to protect human subjects in experimental or research studies, this will not be done this semester. However, if you are interested in evaluation of real systems using real users, you should visit the relevant Web site at Rutgers to learn more about the notion of Human Subjects Review.
Grades will be assigned for each of the activities, with the following weights:
Knowledge Crumbs 25%
Presentation of a Reading 20%
Term Project 45%
Participation 10% *
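The course score is simply the weighted sum of the component scores. A minimal sketch, using the weights above (the component scores in the example are hypothetical, for illustration only):

```python
# Weights from the grading policy above.
WEIGHTS = {
    "knowledge_crumbs": 0.25,
    "reading_presentation": 0.20,
    "term_project": 0.45,
    "participation": 0.10,
}

def course_score(scores):
    """Weighted sum of component scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical component scores for illustration.
example = {
    "knowledge_crumbs": 93,
    "reading_presentation": 86,
    "term_project": 80,
    "participation": 75,
}
print(course_score(example))  # 83.95
```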
Mathematically, the mapping between letter grades and numbers is as follows:
96-100 -> A+ -> 98
90-95  -> A  -> 93
83-89  -> B+ -> 86
78-82  -> B  -> 80
72-77  -> C+ -> 75
65-71  -> C  -> 68
Work which does not achieve a grade of C will be assigned an honorary score of 50% if it is submitted on time. If it is not submitted, it will be assigned a courtesy score of 10%.
Late Work: Work which is submitted late will experience a "decay of its grade" at the rate of 5% per day. Thus an A paper received 3 days late receives a score of A -> 93% x 95% x 95% x 95% = 79.7% -> B. And so forth. I will not count weekend days, if you don't count them against me in grading papers.
* Participation. We will have small group sessions each week. The first grouping will be random. Thereafter, I will ask each of you to give me a ranked list of those in your group, with the person you would most like to meet with again listed first, and so on. In succeeding weeks we will keep re-arranging groups to try to satisfy those preferences. These rankings will also determine half of the participation grade. This is part of an ongoing experiment to determine whether peer assessment is consistent with instructor assessment in my courses. You are free not to submit a list of preferences, in which case you will surely be assigned to a group at random the following week.