C10: Text Categorization and Clustering
Objectives
This module provides an overview of text categorization and clustering. After completing this module, students should understand categorization tasks and methods, clustering techniques, and important issues regarding data.
Description
This module covers text categorization and clustering algorithms. Major topics include:
- Text Categorization Tasks and Methods
- Clustering Algorithms
- Machine Learning Classification
- Rocchio
- Naive Bayes
- Decision Trees
- Nearest Neighbor (k-NN)
- K-means
- Self-Organizing Map (SOM)
- Search Algorithms
Class Notes
- Text Clustering [flash] | [Windows Media] | [.ppt] | [.doc] | [.pdf]
- Text Categorization [flash] | [Windows Media] | [.ppt] | [.doc] | [.pdf]
Practice Exercises
- Exercise: Text Clustering
  [Online Exercise]
  Exercise and Answer Keys [.doc] | [.pdf]
- Exercise: Text Categorization
  [Online Exercise]
  Exercise and Answer Keys [.doc] | [.pdf]
Suggested Readings and Resources
In the Book
- Chapter 3.2 of the textbook Information Retrieval (The Information Retrieval Series), by David A. Grossman and Ophir Frieder, ISBN: 1-4020-3004, Publisher: Springer, 2004.
- Chapter 4 of the textbook Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, by Peter Jackson and Isabelle Moulinier, Editor: Prof. Ruslan Mitkov, ISBN: 90-272-4989, Publisher: John Benjamins Publishing, 2002.
On the Web
- Information Retrieval: Data Structures and Algorithms, Editors: William B. Frakes and Ricardo Baeza-Yates, ISBN: 0-13-463837-9, Publisher: Prentice-Hall PTR, 1992.
- A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
- Decision Trees
- An Improved k-Nearest Neighbor Algorithm for Text Categorization
- http://www.cs.ualberta.ca/%7Ezaiane/courses/cmput690/slides/Chapter8/index.htm
- http://www1.imim.es/~eblanco/seminars/docs/clustering/index_types.html
