Thursday, May 7, 2009

Chapter 2 Review, Continued, Part 2 -- "Automatic Discovery of Similar Words"

(A direct continuation of yesterday's post, w/r/t Senellart & Blondel on "Automatic Discovery of Similar Words" in Survey of Text Mining II. The references that they cite, and that I discuss in this post, are listed at the end of the post.)

In Chapter 2's review of previous methods and the associated literature, Senellart & Blondel start with the banal and get progressively more interesting.

The one thing I found interesting about the first model that Senellart & Blondel discuss is that it is the "inverse" of the usual. Typical word-frequency / document representations build a matrix with the word frequencies as the rows (the dimensions) and the documents as the columns. This yields an m x n (word x document) matrix, where typically m >> n, so the vectors (the columns) are the documents.

In contrast, the authors begin with the document vector space model, first used by Chen & Lynch, which takes the documents as the dimensions and the terms as vectors within that document space; i.e., a term's vector components depend on whether or not the term is used within a given document.
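To make the two representations concrete, here is a minimal sketch in Python. The documents are my own toy example, not taken from the chapter: the same occurrence counts can be read column-wise as document vectors in term space (the usual view) or row-wise as term vectors in document space (the view Chen & Lynch's model takes).

from collections import Counter

docs = [
    "gold silver truck shipment",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
counts = [Counter(toks) for toks in tokenized]

# term_vectors[w][j] = number of times term w occurs in document j
term_vectors = {w: [c[w] for c in counts] for w in vocab}

# Read row-wise: each term is a vector whose dimensions are the documents.
print(term_vectors["truck"])   # [1, 1, 1]
print(term_vectors["silver"])  # [1, 2, 0]

# Read column-wise: each document is a vector whose dimensions are the terms.
doc_vectors = [[term_vectors[w][j] for w in vocab] for j in range(len(docs))]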

Simple cosine similarity typically gives good results: terms are similar if their corresponding vectors are similar, i.e., if the terms are used in (more or less) the same documents. Chen & Lynch, as cited, also use a cluster measure, which is better attuned to the non-orthogonality of the vector space. Note that while Chen & Lynch obtained good results, they applied their method to a heavily annotated corpus, with emphasis on the metadata (keywords, countries, organizations, authors, etc.). One would expect high-quality results when the algorithms are applied to such a carefully refined corpus; even applying this method to the keywords alone should produce a good word-association map.
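As a concrete illustration of the similarity measure (the cluster measure is omitted here), a minimal cosine-similarity sketch; the term vectors are made up for illustration, not taken from Chen & Lynch's corpus.

import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two terms that occur in largely the same documents have high cosine similarity.
truck  = [1, 1, 1, 0]
lorry  = [1, 1, 0, 0]
silver = [0, 2, 0, 3]

print(round(cosine(truck, lorry), 3))   # 0.816 -- used in overlapping documents
print(round(cosine(truck, silver), 3))  # 0.32  -- little document overlap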

Senellart & Blondel recount (in Section 2.2) two methods for building a thesaurus of infrequent words. The interesting thing to consider from their review of Crouch's work and that of Salton, Yang, & Yu is the possibility of combining the two methods for improved thesaurus-building, where low-frequency words build the thesaurus classes.
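To fix the idea only (this is not Crouch's actual algorithm, just a rough sketch of the notion of low-frequency words forming thesaurus classes), here is a toy example that keeps the low-frequency terms and groups together those that co-occur in the same documents. The data and the frequency cutoff are arbitrary placeholders of my own.

from collections import Counter, defaultdict

docs = [
    ["gold", "bullion", "shipment", "truck"],
    ["silver", "delivery", "truck"],
    ["gold", "bullion", "arrived"],
    ["shipment", "truck", "arrived"],
]

MAX_DOC_FREQ = 2  # "low frequency" cutoff -- an arbitrary choice for illustration

doc_freq = Counter(w for d in docs for w in set(d))
low_freq = {w for w, df in doc_freq.items() if df <= MAX_DOC_FREQ}

# Group low-frequency terms by the exact set of documents they appear in.
by_doc_set = defaultdict(set)
for w in low_freq:
    doc_set = frozenset(i for i, d in enumerate(docs) if w in d)
    by_doc_set[doc_set].add(w)

thesaurus_classes = [ws for ws in by_doc_set.values() if len(ws) > 1]
print(thesaurus_classes)  # e.g. [{'gold', 'bullion'}, {'silver', 'delivery'}]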

Noting that the best results for similar words come from "light syntactic analysis" (the authors' term) combined with syntactic context, Senellart & Blondel devote substantial attention to Grefenstette's work with SEXTANT (Semantic EXtraction from Text via Analyzed Networks of Terms), which gives good, interesting results on noun similarity. Grefenstette's book (see cite below) is worth a follow-up.
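To give a flavor of the SEXTANT approach as I read it (this is my own toy sketch, not Grefenstette's implementation): each noun is described by the syntactic contexts that a light parse attaches to it, and two nouns are judged similar when they share many contexts. The context pairs below are hand-written stand-ins for parser output, and plain Jaccard overlap stands in for Grefenstette's weighted similarity measure.

def jaccard(a, b):
    """Share of contexts the two nouns have in common."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# (relation, word) pairs that light syntactic analysis might attach to each noun
contexts = {
    "car":   {("subject-of", "drive"), ("object-of", "park"), ("modified-by", "fast")},
    "truck": {("subject-of", "drive"), ("object-of", "park"), ("modified-by", "heavy")},
    "idea":  {("object-of", "discuss"), ("modified-by", "good")},
}

print(round(jaccard(contexts["car"], contexts["truck"]), 2))  # 0.5 -- similar usage
print(round(jaccard(contexts["car"], contexts["idea"]), 2))   # 0.0 -- no shared contexts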

Next posting will address Senellart's method for graph-based synonym extraction.


References cited by Senellart & Blondel, and identified in this post:

Chen, H., & Lynch, K.J. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man, and Cybernetics, 22(5): 885-902, 1992.
Crouch, C.J. An approach to the automatic construction of global thesauri. Information Processing & Management, 26(5): 629-640, 1990.
Grefenstette, G. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Boston, MA, 1994.
Salton, G., Yang, C.S., & Yu, C.T. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science, 26(1): 33-44, 1975.
