C4: Text Representation and Text Operations
Objectives
This module provides a basic understanding of text representation and text operations in Information Retrieval (IR). After completing this module, students will know the basic terms and methods used in text representation and text operations.
Description
The topics covered in this module include:
- Text formats
- Metadata and Markup languages such as SGML, HTML, and XML
- Information theory and text properties, including Zipf's law and Heaps' law
- Document preprocessing, including lexical analysis, stop-word removal, stemming, index term selection, and the use of thesauri
- Inverted indices
- Pattern matching
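As a rough illustration of how two of the topics above fit together, the following minimal Python sketch runs a simplified preprocessing step (lexical analysis and stop-word removal only; stemming is omitted) and then builds an inverted index. The stop-word list, the function names, and the two sample documents are assumptions made for this example; they are not taken from the class notes or the textbook.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lexical analysis: lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Tokenize the text and drop stop words (no stemming in this sketch)."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample documents keyed by document id.
docs = {
    1: "The retrieval of text documents",
    2: "Text operations in information retrieval",
}
index = build_inverted_index(docs)
```

Looking up a term such as `index["text"]` returns the ids of the documents that contain it. A fuller pipeline would typically add stemming and index term selection before the postings are built.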
Class Notes
- Text Representation and Text Operations
[flash] | [Windows Media] | [.ppt] | [.doc] | [.pdf]
- Information Theory, Natural Language Models, and Text Properties
[flash] | [Windows Media] | [.ppt] | [.doc] | [.pdf]
Practice Exercise
- Matching Exercise (opens in a new window)
Suggested Readings and Resources
In the Books
- Chapters 6 and 7 of the textbook Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison-Wesley, 1999. ISBN: 0-201-39829-X.
