Wednesday, May 13, 2009

Python Textbook: "Cloud Computing with Python"

I have been away from the book review for the past few days while pushing out the bare-bones companion website for Cloud Computing with Python. The book is designed as an introductory Python textbook, with the additional features of instructional material on loading Python programs onto the Google cloud and a moderate review of cloud computing. The website itself is much more cloud-rich, and I'll endeavor to keep it current.

The book will meet the needs of three different groups:
1) College professors / students who are teaching / taking a first-semester programming course, where the language of choice is Python,
2) College professors / students who are teaching / learning Python as a "second language," and
3) Self-study IT folks, techno-geeks, and even "hands-dirty" managers, who want to get current with BOTH Python AND cloud computing. A "one-stop shopping" place.

More to come.

Friday, May 8, 2009

Chapter 2 (Part 3), Senellart & Blondel - Automatic Discovery of Similar Words

In Section 2.3 we get to the meat of Senellart & Blondel's work: a graph-based method for determining similar words, using a dictionary as the source. Their method uses a v×v matrix, where v is the number of words in the dictionary. They compare their method and results with that of Kleinberg, who proposed a method for identifying good Web hubs and authorities, and with the ArcRank and WordNet methods. They test the four methods on four words: disappear, parallelogram, sugar, and science. By and large, their method performs second-best, after WordNet. They propose improving their results by taking a larger subgraph.
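Because the setup is easy to picture in code, here is a loose, minimal sketch -- not the authors' exact procedure -- of the general idea as I read it: build a directed dictionary graph whose vertices are words and whose edges run from each headword to the words in its definition, extract the neighborhood subgraph of a query word, and rank candidates with a Kleinberg-style hubs / authorities iteration. The toy dictionary and all names below are my own invention.

```python
# Loose sketch only (not the chapter's exact method): dictionary graph as a
# v x v adjacency matrix, neighborhood subgraph of a query word, and a
# Kleinberg-style hubs/authorities iteration as a crude similarity ranking.
import numpy as np

toy_dictionary = {
    "vanish":    ["disappear", "cease", "exist"],
    "disappear": ["vanish", "pass", "sight"],
    "fade":      ["disappear", "gradually"],
    "cease":     ["stop", "exist"],
    "sight":     ["ability", "see"],
}

words = sorted({w for k, v in toy_dictionary.items() for w in [k] + v})
index = {w: i for i, w in enumerate(words)}
v = len(words)

# v x v adjacency matrix: A[i, j] = 1 if word j occurs in the definition of word i.
A = np.zeros((v, v))
for head, definition in toy_dictionary.items():
    for w in definition:
        A[index[head], index[w]] = 1.0

def neighborhood(word):
    """Indices of the query word plus its in- and out-neighbors."""
    i = index[word]
    nbrs = set(np.nonzero(A[i])[0]) | set(np.nonzero(A[:, i])[0]) | {i}
    return sorted(nbrs)

def hubs_authorities(sub, iters=50):
    """Kleinberg's iteration on a subgraph adjacency matrix."""
    n = sub.shape[0]
    hub, auth = np.ones(n), np.ones(n)
    for _ in range(iters):
        auth = sub.T @ hub
        hub = sub @ auth
        auth /= np.linalg.norm(auth) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return hub, auth

nodes = neighborhood("disappear")
sub = A[np.ix_(nodes, nodes)]
hub, auth = hubs_authorities(sub)
ranking = sorted(zip([words[i] for i in nodes], auth + hub), key=lambda t: -t[1])
print(ranking)  # crude ranking of candidate similar words for "disappear"
```

A real dictionary of course gives a far larger and denser v×v matrix, and the chapter's own scoring of the subgraph differs from this crude ranking; the sketch is only meant to show how naturally the dictionary becomes a directed graph.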

The most interesting result is not so much the specific method as the demonstration that a dictionary, and the resulting vector space model, can be seen as a (possibly weighted) directed graph. This is significant, because graph theory offers many powerful theoretical tools that can be brought to bear on this class of problems.

Overall, a good read, and some food for thought.

The Kleinberg reference is:
Kleinberg, J.M. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5): 604-632, 1999.

Thursday, May 7, 2009

Chapter 2 Review, Continued, Part 2 -- "Automatic Discovery of Similar Words"

(A direct continuation of my previous post on Senellart & Blondel's "Automatic Discovery of Similar Words" in Survey of Text Mining II. The references that they cite, and that I discuss in this post, appear at the end of the post.)

In Chapter 2's review of previous methods and the associated literature, Senellart & Blondel start with the banal and get progressively more interesting.

The one thing I found interesting in the first model that Senellart and Blondel discuss is that it is the "inverse" of the usual. Typical word-frequency / document representations build a matrix with the word frequencies as the rows, or dimensions, and the documents as the columns. This yields an m×n (word × document) matrix, where typically m>>n. Thus the vectors (columns) are the documents.

In contrast, the authors begin with the document vector space model, first used by Chen & Lynch, which represents the documents as dimensions and the terms as vectors within the document space; i.e., a term's vector values depend on whether or not the term is used within a given document.

Simple cosine similarity typically gives good results: terms are similar if their corresponding vectors are similar, i.e., if the terms are used in (more or less) the same documents. Chen & Lynch, as cited, also use a cluster measure, which is more attuned to the non-orthogonality of the vector space. Note that while Chen & Lynch obtained good results, they applied their method to a heavily annotated corpus, with emphasis on the metadata (keywords, countries, organizations, authors, etc.). One would expect high-quality results when the algorithms are applied to such a carefully refined corpus; applying this method to keywords alone should produce a good word-association map.
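As a concrete, hedged illustration of terms-as-vectors-in-document-space, here is a minimal sketch; the tiny corpus and all names are invented, and this is plain cosine similarity between term vectors, not Chen & Lynch's cluster measure.

```python
# Minimal sketch: terms as vectors in document space, compared by cosine
# similarity. The toy corpus is invented for illustration.
import numpy as np

docs = [
    "sugar cane sugar beet harvest",
    "cane harvest season",
    "graph theory and matrix methods",
    "matrix methods for graph problems",
]
vocab = sorted({w for d in docs for w in d.split()})
# Rows = terms, columns = documents: entry [i, j] is the count of term i in doc j.
T = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

i, j = vocab.index("sugar"), vocab.index("cane")
print(cosine(T[i], T[j]))                      # high: used in the same documents
print(cosine(T[i], T[vocab.index("graph")]))   # zero: disjoint document sets
```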

Senellart & Blondel recount (in Section 2.2) two means of building a thesaurus of infrequent words. The interesting thing to take from their review of both Crouch's work and that of Salton, Yang, & Yu is the possibility of combining the two methods to improve thesaurus building, with low-frequency words forming the thesaurus classes.
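Purely as a thought experiment -- not Crouch's or Salton, Yang, & Yu's actual algorithms -- here is a minimal sketch of that combined idea under my own assumptions: keep only low-document-frequency terms, then group them into rough "thesaurus classes" by thresholded cosine similarity of their document-occurrence vectors.

```python
# Thought-experiment sketch: rough thesaurus classes built from low-frequency
# terms, grouped by cosine similarity of document-occurrence vectors.
# Corpus, cutoff, and threshold are invented for illustration.
import numpy as np

docs = [
    "anneal quench alloy ingot",
    "quench anneal furnace",
    "ledger audit invoice",
    "invoice ledger remittance",
]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[1.0 if w in d.split() else 0.0 for d in docs] for w in vocab])

df = X.sum(axis=1)                                        # document frequency
low_freq = [i for i in range(len(vocab)) if df[i] <= 2]   # "infrequent" cutoff

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Greedy single-pass grouping: a term joins the first class whose seed term it
# resembles closely enough; otherwise it starts a new class.
classes = []
for i in low_freq:
    for cls in classes:
        if cosine(X[i], X[cls[0]]) >= 0.5:
            cls.append(i)
            break
    else:
        classes.append([i])

print([[vocab[i] for i in cls] for cls in classes])
```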

Noting that the best results for similar words come from "light syntactic analysis" (the author's term) together with syntactic context, Senellart & Blondel devote substantial attention to Grefenstette's work with SEXTANT (Semantic EXtraction from Text via Analyzed Networks of Terms), which produced good, interesting results on noun similarity. The Grefenstette reference (cited below) is worth a follow-up.

My next posting will address Senellart & Blondel's method for graph-based synonym extraction.


References cited by Senellart & Blondel, and identified in this post:

Chen, H., & Lynch, K.J. Automatic construction of networks of concepts characterizing document databases. IEEE Trans. SMC, 22(5): 885-902, 1992.
Crouch, C.J. An approach to the automatic construction of global thesauri. Info. Proc. & Mgmnt., 26(5): 629-640, 1990.
Grefenstette, G. Explorations in Automatic Thesaurus Discovery, Kluwer Academic Publishers, Boston, MA, 1994.
Salton, G., Yang, C.S., & Yu, C.T. A theory of term importance in automatic text analysis. Journal of the Am. Soc. for Info. Science, 26(1): 33-44, 1975.

Tuesday, May 5, 2009

"Automatic Discovery of Similar Words" - Chapter 2 in Survey of Text Mining II

This post begins a review of "Automatic Discovery of Similar Words," by Pierre Senellart and Vincent D. Blondel, published as Chapter 2 in Berry and Castellanos' Survey of Text Mining II.

This is an excellent and useful chapter, in that it:
1) Addresses the broad issue of computational methods for discovering "similar words" (including synonyms, near-synonyms, and thesauri-generating techniques) from large data corpora,
2) Illustrates the different leading mathematical methods, giving an excellent overview of the state of the art,
3) Competently discusses how different methods perform in domain-specific vs. broad corpora, and also addresses related methods for determining semantic similarity and/or distance.

Their unique contribution is the dictionary graph, a graph-based method for discovering similar words. This is particularly interesting and useful, as graph-theoretic methods are growing in importance. (See my February 2, 2009 Graph Theory post.)

Time constraints require that this review be done in multiple posts. More tomorrow.

Monday, May 4, 2009

Follow-on Thoughts: Clustering Algorithm Improvements for Text-based Data Mining

A good night's sleep is excellent for clearing away mental cobwebs, and has given me more perspective on Chapter 1, "Cluster-Preserving Dimension Reduction Methods," by Howland and Park in Survey of Text Mining II: Clustering, Classification, and Retrieval (ed. by Berry & Castellanos).

If you will, please recall with me that the Howland & Park work proposed a two-step dimensionality reduction method. They successfully reduced over 20,000 "dimensions" (of words found in the overall corpus collection) to four dimensions, and maintained excellent classification results.

Note also that their work was applied to a very structured domain (medical abstracts), and confined to five very distinct classes (heart attack, colon cancer, diabetes, oral cancer, tooth decay). We would expect some - but still very limited - overlap between colon cancer and oral cancer, and between oral cancer and tooth decay. (It would be interesting to have reports on the centroid distances of these areas.)

Using their two-stage dimensionality reduction and clustering, they achieved results in the 98-99% range, which was an improvement (typically 10% or more) over simple centroid or k-nearest-neighbor clustering.

When we back off a bit and think this through, these results are not at all surprising.

First, the key terms identifying the concept areas should indeed be very distinct. The five selected topics should be VERY separable in a well-defined concept-space.

We know from experience that overly large feature vectors do far more harm than good, especially when unweighted. All the oddball, one-time-only terms do nothing except weaken the results that really depend on relatively few keywords.

Just about any method can and should remove those term "dimensions" that (a rough filtering sketch follows this list):
1) Are very sparse throughout the entire training / testing corpus, AND
2) Are either so rare that their frequency of occurrence is not a reliable indicator of concept presence, OR
3) Have their presence - although sparse - distributed over multiple concept classes.
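Here is a minimal, hedged sketch of that kind of pruning. The thresholds, function names, and toy data are all invented; any real system would tune these against held-out data.

```python
# Rough sketch of pruning unreliable term dimensions from a term-document
# count matrix: drop terms that are extremely sparse, and drop sparse terms
# whose few occurrences are smeared across several classes. Thresholds invented.
import numpy as np

def prune_terms(X, doc_labels, min_df=3, max_class_spread=2):
    """X: terms x documents count matrix; doc_labels: class label per document.
    Returns indices of term rows worth keeping."""
    keep = []
    classes = sorted(set(doc_labels))
    labels = np.array(doc_labels)
    for i in range(X.shape[0]):
        df = int((X[i] > 0).sum())                  # document frequency
        if df < min_df:
            continue                                 # too sparse to trust
        class_hits = sum(1 for c in classes if (X[i][labels == c] > 0).any())
        if df < 2 * min_df and class_hits > max_class_spread:
            continue                                 # sparse AND spread over classes
        keep.append(i)
    return keep

rng = np.random.default_rng(1)
X = (rng.random((50, 40)) > 0.9).astype(float)       # toy sparse term-document counts
labels = [d % 5 for d in range(40)]                   # five toy classes
print(len(prune_terms(X, labels)), "of", X.shape[0], "terms kept")
```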

Yes, this is coming at the problem from a top-down perspective, which betrays my bias immediately -- but experience as well as theory has shown me that a little top-down knowledge, astutely coupled with some algorithms, can cut down processing time and improve the reliability of results.

Be that as it may, they get the anticipated results.

Which brings me to identifying the key questions that I believe truly confront the text mining / knowledge discovery community. These are:
1) Correct apportionment of a concept / entity across an ontology / taxonomy, and
2) The "mixtures problem."

I wrote sufficiently about the first area in the previous blog, and touched briefly on the second. But it is this second one that requires substantive attention by the entire community if we are to have useful, reliable tools.

The term "mixtures problem" comes from the area of physical and analytical chemistry. Again, I'm showing my bias, but bear with me - this is an important analogy.

When we want to determine the composition of a chemical mixture, say a beaker full of organic solvents, we do a variety of tests -- not just one! (This in itself should be a good heads-up for designing comprehensive text-mining tool suites.) For example, we use NMR to identify the presence of various elements. We use mass spectrometry to determine the presence of various "fragments" within the chemical compounds. And we use a huge RANGE of spectroscopic and chromatographic methods (e.g., gas chromatography, infrared spectroscopy, etc.) to identify characteristic base chemical units (e.g., benzene rings, methyl groups, etc.).

Carrying this over to the text-mining arena immediately suggests that a suite of algorithms is useful, especially at the front end. Even the crudest of algorithms - e.g., simple string matching - can alert us to certain entities, and this in turn directs us toward certain kinds of processes, or toward aspects of context that will modify the processes we select.

But back to analytical chemistry.

Suppose that we do our analyses, and get a set of the various chemical "base units" that are present. We can even identify full compounds by matching spectral peak patterns.

Moreover, using some fancy software, we can approximate the relative fraction of each compound in the mixture.

At first glance, this would seem to be a linear optimization problem.

And to a very reasonable and realistic extent, it is.
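To make the analogy concrete, here is a minimal sketch of the linear version of the mixtures problem, assuming we already know each pure compound's spectral signature (all data below is invented); non-negative least squares recovers approximate fractions.

```python
# Minimal sketch of the (linear) mixtures problem: given known "pure" spectra
# as columns of S, estimate non-negative fractions x with S @ x ≈ the observed
# mixture spectrum. All spectra here are invented toy data.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(42)
n_channels, n_compounds = 200, 3
S = rng.random((n_channels, n_compounds))            # pure-compound signatures
true_fractions = np.array([0.5, 0.3, 0.2])
mixture = S @ true_fractions + 0.01 * rng.standard_normal(n_channels)  # noisy blend

est, residual = nnls(S, mixture)                     # non-negative least squares
est /= est.sum()                                     # normalize to fractions
print(np.round(est, 3))                              # close to [0.5, 0.3, 0.2]
```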

The strongest nonlinear effects -- the degree to which, say, one methyl group attached to a benzene ring affects the spectral signature of another attached group -- are within a given compound. These are well-known, and they form part of the distinct "spectral signature" that allows us to characterize a compound uniquely.

There will, however, be some degree to which the presence of one compound can affect the "signature" of another compound. This introduces a second-order, or nonlinear effect.

So back now to textual data mining.

We know that any document of interest will typically contain multiple concepts and entities. It stands to reason that when we classify the document, we should accept a range of match strengths against multiple concept classes. I am talking about concepts from MULTIPLE, very DISTINCT taxonomies / ontologies, as well as the "vertical spread" that we anticipate for matching levels of abstraction within a taxonomy.

We need to think through our method for this matching.

We can do something very simplistic; e.g., find the "best match," then reduce the feature vector by an amount proportional to that match, then find the "next best," and so on.

We can do a linear optimization spread, which essentially performs the above but uses all the potential matches simultaneously.
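Here is a toy, hedged sketch of both ideas, assuming each concept class is summarized by a centroid vector. Everything here -- vectors, names, and the NNLS stand-in for "linear optimization spread" -- is my own invention for illustration; a real system would need proper weighting, normalization, and far better concept models.

```python
# Toy sketch: match one document vector against several concept centroids,
# (a) greedily, deflating the document after each match, and (b) simultaneously,
# via non-negative least squares. All vectors are invented.
import numpy as np
from scipy.optimize import nnls

concepts = {
    "oral_cancer":  np.array([1.0, 0.8, 0.1, 0.0]),
    "colon_cancer": np.array([0.9, 0.1, 0.8, 0.0]),
    "tooth_decay":  np.array([0.1, 0.9, 0.0, 0.7]),
}
doc = np.array([1.0, 0.9, 0.2, 0.3])      # a document touching several concepts

def greedy_deflate(doc, concepts, rounds=3):
    residual, scores = doc.astype(float).copy(), {}
    for _ in range(rounds):
        name, c = max(concepts.items(),
                      key=lambda kv: residual @ kv[1] / np.linalg.norm(kv[1]))
        strength = max(residual @ c / (c @ c), 0.0)   # projection coefficient
        scores[name] = scores.get(name, 0.0) + strength
        residual = residual - strength * c             # deflate and repeat
    return scores

def simultaneous(doc, concepts):
    names = list(concepts)
    C = np.column_stack([concepts[n] for n in names])
    x, _ = nnls(C, doc)                                # all matches at once
    return dict(zip(names, np.round(x, 3)))

print(greedy_deflate(doc, concepts))
print(simultaneous(doc, concepts))
```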

And I'm certain that there are other methods, open to our ingenuity. (If you have good references, please send them along; I'll attempt to read / review / cite them, although probably with the same long time-delay that I've had in putting this review together.)

Where the real challenge comes now is not so much in defining the method, as in defining our test criteria.

If we are going to have a good method for matching a document across a set of concepts / entities, we need to have ground-truthed data that has an agreed-upon evaluation.

Yes, there are certain evaluation collections (e.g., TREC) that involve multiple concepts, etc. However, as we approach this more rigorously, we need to look more closely at second-order effects as we set up our evaluation data.

I mentioned that second order effects could happen -- but were not likely to be too severe -- in chemical mixtures.

My suspicion is that some of these effects can be VERY severe in text data. That is because we are dealing with a many-to-many (word-to-concept) matching situation which is not subject to the laws of nature, but to human preferences and predilections in writing and reporting style.

Good quantification should be a starting point.

Again, if you know of good work in this area, please share.

I will say that -- if memory serves me right -- this was NOT an issue that I saw addressed in this book, lovely though each of the chapters is. Nor have I seen it much in the technical material that has crossed my path over the last few years.

Comments, anyone?

Sunday, May 3, 2009

Survey of Text Mining II: Cluster-Preserving Dimension Reduction Methods (Chapter 1)

Some time ago, I promised a colleague a review of an excellent book, Survey of Text Mining II: Clustering, Classification, and Retrieval, edited by Michael W. Berry and Malu Castellanos. Overall, this book would serve well as the basis for a one-semester graduate course in specialized methods for (textual) data analytics. It presupposes an expert's (or at least a solid journeyman's) understanding of basic algorithms along with the issues of textual data mining / analytics. Each chapter presents a new and interesting insight. Taken together, the chapters reasonably span the "space" of relevant algorithms: their usefulness, their limitations, and some means for honing them for improvement.

In that context, this book is useful both as a primary text for a graduate-level course, and also as a valuable reference to those doing research and development in this area.

What is missing, though, is a sense of how these various algorithms play into the larger picture. Of necessity, the book confines itself to the specific subjects of clustering, classification, and retrieval. This means that some "bigger picture" issues are not identified or brought together. Thus, this book should be used in the context of a greater understanding that an astute professor would provide to students, or that an expert practitioner would bring to his or her own evaluation.

In this blog, over the next several days, I'll be doing a detailed, chapter-by-chapter review. The review that I submit to my colleague will contain a summary of the most salient points.

This blog entry addresses Chapter 1: "Cluster-Preserving Dimension Reduction Methods for Document Classification," by Peg Howland and Haesun Park.

Howland and Park state the challenge that they are addressing as: "To be useful, the lower dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction..."

They begin by identifying their initial dataspace representation as the classic m×n term-document matrix, where entry (i, j) is the weighted frequency of term i in document j. They note that cluster structure is already apparent in this term-document matrix, and they desire a dimensionality reduction method that preserves this clustering structure.

Their overall goal is to identify a computationally low-cost dimensionality reduction mechanism that preserves the cluster structure found in the original document collection.

To this end, they evolve a two-stage approach: linear discriminant analysis (LDA) combined with generalized singular value decomposition (GSVD), applied as a second stage after an initial application of either principal components analysis (PCA) or latent semantic indexing (LSI). (They establish the mathematical equivalence of the PCA and LSI stages using trace optimization.) Finally, they combine the two perspectives (PCA and LSI) by making a factor-analysis approximation in the first stage.
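For readers who think in code, here is a rough sketch of the two-stage spirit only, using off-the-shelf pieces (TF-IDF weighting, truncated SVD as an LSI stand-in, then ordinary LDA and a nearest-neighbor classifier). This approximates the flavor of the pipeline, NOT Howland & Park's actual LDA/GSVD formulation, and the corpus, labels, and parameters are all invented.

```python
# Rough sketch of the two-stage spirit (LSI-style reduction, then LDA), not the
# authors' LDA/GSVD algorithm. Uses scikit-learn; corpus and labels are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = [
    "myocardial infarction chest pain troponin",
    "acute heart attack ecg troponin",
    "colon cancer polyp screening colonoscopy",
    "colorectal tumor colonoscopy biopsy",
    "tooth decay caries fluoride enamel",
    "dental caries enamel lesion fluoride",
]
labels = ["heart", "heart", "colon", "colon", "tooth", "tooth"]

pipeline = make_pipeline(
    TfidfVectorizer(),                            # weighted term-document matrix
    TruncatedSVD(n_components=3),                 # stage 1: LSI-style reduction
    LinearDiscriminantAnalysis(n_components=2),   # stage 2: discriminant projection
    KNeighborsClassifier(n_neighbors=1),          # classify in the reduced space
)
pipeline.fit(docs, labels)
print(pipeline.predict(["fluoride treatment for caries"]))   # expect "tooth"
```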

Their mathematics is precise, and their tracing of intellectual antecedents is excellent. Time permitting, I'll add a brief summary of their more important points.

But skipping now to their final result: they propose to overcome the computationally expensive limits of the LDA/GSVD method by applying it AFTER a factor analysis. They test their results on a very carefully controlled set of MEDLINE abstracts. The training and testing sets each comprise five medically very distinct categories (heart attack, colon cancer, diabetes, oral cancer, tooth decay), with 250 document abstracts per category in each of the training and testing sets.

The dimensionality m×n was approximately 22,000×1,250, or roughly 20:1.

The two-stage (centroid followed by LDA/GSVD) classification accuracies for their training data range from 81.5 to 88.7%, depending on selection of parameters and centroid method. This is a slight improvement over centroid-alone methods (77.8 - 83.8%). When they apply their methods to "test corpora" of 200 abstracts selected randomly from the overall test corpus of 1250 abstracts, they achieve excellent results: 98.5 - 99%.

Clearly, if we desire to perform straightforward dimensionality reduction on an m×n data set, where m>>n, they offer a useful approach.

Now, let us (using the editorial, if not the imperial, "we") briefly touch on potential limitations.

There is absolutely nothing wrong with their mathematics or methodology. They have a good process; it is well-formed and well-demonstrated. But let's first keep in mind that their intention is to address data corpora where m>>n, and that SVD-based methods are constrained by this. What happens when we move to very, very large data corpora?

Similarly, their corpora -- for all that they unarguably need a clean training /test set -- are very simplistic. At a glance, we would expect easy discrimination between medical abstracts dealing with "heart attacks" vs. "colon cancer," etc. The limitation of their application is that realistic classifications will involve:
1) Multiple classifications (or concepts) per document;
2) A hierarchy of classifications;
3) (Desirably) a weighted set of classifications across a hierarchy, i.e., positioning a document at the right "levels" of a hierarchical classification set, where the distribution of classification weightings gives a sense of how the document addresses the generality / specificity of a topical area.

Most clearly, the method they devise seems useful in well-defined domains, but it is not clear how well it would extend to "blurrier" domains: to corpora in which the language is not as technically precise or well-structured, to corpora with differing reporting styles, and so on.

The overall question of clustering becomes much more pertinent when we look at complex knowledge structures, expressed either as taxonomies or ontologies.

Suppose that we have a medical ontology. (I'm specifying an ontology rather than a taxonomy, because in a moment I'll bring in objects that inherit from multiple parent classes -- in other words, object-oriented thinking, which can be used to model ontologies.) Suppose that we have multiple high-level ontological structures within this medical ontology. One of these is the "cancer" structure, which includes both oral cancer and colon cancer. (Both of these are classes within the Howland & Park study.) Suppose that we also have an "oral diseases" structure within the same large medical ontology, which includes both oral cancer and tooth decay.

A document on oral cancer should match not only the oral cancer "ontological node," but also match (with perhaps less strength) the higher-level "cancer" and "oral diseases" nodes. In other words, it should match, with a weighted spread, to a structure of potential classifications.
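A toy, hedged sketch of that weighted spread, assuming a simple parent map and a fixed decay factor (both invented; real ontologies and weighting schemes would be far richer):

```python
# Toy sketch: spread a document's match score from a leaf concept up through its
# ancestors in a small ontology (a DAG), attenuating by a fixed decay factor.
# The ontology, decay factor, and scores are invented for illustration.

parents = {                       # child -> list of parent concepts
    "oral_cancer": ["cancer", "oral_diseases"],
    "colon_cancer": ["cancer"],
    "tooth_decay": ["oral_diseases"],
    "cancer": ["disease"],
    "oral_diseases": ["disease"],
    "disease": [],
}

def spread(leaf_scores, decay=0.5):
    """Propagate leaf match strengths upward, keeping the max contribution per node."""
    weighted = dict(leaf_scores)
    frontier = list(leaf_scores.items())
    while frontier:
        node, score = frontier.pop()
        for parent in parents.get(node, []):
            contribution = score * decay
            if contribution > weighted.get(parent, 0.0):
                weighted[parent] = contribution
                frontier.append((parent, contribution))
    return weighted

# A document that matches "oral_cancer" strongly and "tooth_decay" weakly:
print(spread({"oral_cancer": 0.9, "tooth_decay": 0.2}))
# expected weights: oral_cancer 0.9, tooth_decay 0.2, oral_diseases 0.45,
#                   cancer 0.45, disease 0.225
```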

This does not diminish the value of the Howland and Park work, but does suggest that we need to think through the big picture requirements before selecting a particular mathematical method.

Second, we need to think through the fact that many high-value documents are characterized by participation in multiple classes (taxonomic or ontological) simultaneously. In this case, the documents will likely defy "simple" classification. They will also defy "simple" clustering. By their very nature, the value that these documents bring will be the association of multiple concept classes - whether spread over an ontological/taxonomic structure (e.g., comparing two related topics), or whether creating associations across disparate ontologies / taxonomies.

THIS is really the challenge that we need to address.

We can go on forever finding improved means for simple classification, but that obscures the objective of real textual data mining / text analysis. We are looking for the associations between concepts, not simple classification.

The best way to do this (and the subject of my knowledge discovery patent, forgive me for tooting my own horn, but this patent addressed this issue precisely), is to go back to the very core of AI thinking in terms of representation layers. (This is the basis of brain signal processing as well.) To achieve useful results, we need to transform data from lower to higher representation levels. We need to accomplish successive stages of data abstraction.

Dimensionality reduction is really not the same thing as data abstraction. It is close, and it is a useful method, but it is not the same.

This is not to disparage the method of Howland and Park. We simply need to hold it in context. Preferably, we use higher-order methods to identify desired "clusters" (actually, to identify known concepts and entities), and train classifiers / clustering methods to provide weighted maps of concepts / entities to documents as a first-level set of metadata.

The clustering methods (centroid, k-nearest-neighbor) offered in the Howland & Park work are examples of unsupervised learning. The follow-on methods (LDA/GSVD) are also unsupervised. Unsupervised methods have their role in the universe of text analytics, but we should not be ignoring the value of supervised methods, which essentially bring in a control mechanism to guide processes towards desired outcomes.

Yes, the latter does impose a "world view" on the results. Thus, in addition to supervised methods that produce weighted, hierarchical classifications, we also need novelty detection methods.

Thus, we hold the Howland & Park method in context. A good mathematical tool. Well-thought-out. Potentially useful, if we carefully control the conditions under which it is used. But let us also keep our eye on the big picture, and our attention tuned to the full set of necessary capabilities.

This set of comments -- meant to place the reviewed works in appropriate context -- will be similar for most of the chapters in this book. For that reason, once again -- the book is useful. But it needs to be presented (or considered) within appreciation for the overall challenges of the problem space. It is a set of (mostly) useful tools; each of which addresses a unique, particular, and relatively small set of the big problem. "Big picture thinking" is still required.

Monday, February 2, 2009

Graph Theory -- Becoming an "Organizing Framework"

Something I've been noting -- both on my own and in conversations with Jenn Sleeman, who's in touch with the academic world at UMBC -- is that graph theory is growing in centrality as a fundamental organizing framework for many current and emerging computational processes.

Specifically, anything more complex than a simple "tuple" (RDF or OWL, etc.) needs to be matched against a graph or partial graph.

One good "integrative" paper is Understanding Belief Propagation and its Generalizations by J.S. Yedidia, W.T. Freeman, and Y. Weiss (2002). One of their points [made right after Eqn. 18, w/r/t Belief Propagation and Graph Theory] is:
"In a practical computation .. one starts with the nodes at the edge of aaph ...
in general, it is easy to see that each message need only be computed once for
singly connected graphs [my note: not all will be singly-connected]. That means
that the whole computation takes a time proportional to the number of links in
thegraph, which is dramatically less than the exponentially large time that
would be required to compute marginal probabilities naively."
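To make the "time proportional to the number of links" point concrete, here is a minimal sketch -- not from the paper; all potentials, sizes, and names are invented -- comparing naive marginalization with message passing on a simple chain, where the two agree exactly:

```python
# Minimal sketch: exact marginals on a pairwise Markov chain, computed (a) by
# brute-force enumeration, O(k^n), and (b) by forward/backward message passing,
# O(n * k^2), i.e., time proportional to the number of links. Toy data only.
import itertools
import numpy as np

def naive_marginal(phis, psis, node):
    """Marginal of one node by enumerating all joint assignments."""
    n, k = len(phis), phis[0].shape[0]
    marg = np.zeros(k)
    for assignment in itertools.product(range(k), repeat=n):
        p = 1.0
        for i in range(n):
            p *= phis[i][assignment[i]]
        for i in range(n - 1):
            p *= psis[i][assignment[i], assignment[i + 1]]
        marg[assignment[node]] += p
    return marg / marg.sum()

def bp_marginal(phis, psis, node):
    """Marginal of one node by passing one message per link in each direction."""
    n, k = len(phis), phis[0].shape[0]
    fwd = [np.ones(k) for _ in range(n)]   # message arriving at i from the left
    bwd = [np.ones(k) for _ in range(n)]   # message arriving at i from the right
    for i in range(1, n):
        fwd[i] = psis[i - 1].T @ (phis[i - 1] * fwd[i - 1])
    for i in range(n - 2, -1, -1):
        bwd[i] = psis[i] @ (phis[i + 1] * bwd[i + 1])
    marg = phis[node] * fwd[node] * bwd[node]
    return marg / marg.sum()

rng = np.random.default_rng(0)
n, k = 6, 3
phis = [rng.random(k) + 0.1 for _ in range(n)]           # node potentials
psis = [rng.random((k, k)) + 0.1 for _ in range(n - 1)]  # link potentials
print(naive_marginal(phis, psis, 2))   # exponential in n
print(bp_marginal(phis, psis, 2))      # linear in n -- same answer on a tree
```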
There is more -- connecting with Ising, Bethe, and Kikuchi free energy models.

I particularly enjoy the connection with the Kikuchi Cluster Variation Method (CVM), which began in the 1950's as a relatively obscure approach to a more accurate assessment of free energy.