Landauer et al. An introduction to latent semantic analysis. Discourse processes (1998)
LSA produces measures of word-word, word-passage and passage-passage relations that are well correlated with several human cognitive phenomena involving association or semantic similarity.
…the similarity estimates derived by LSA are not simple contiguity frequencies, co-occurrence counts, or correlations in usage, but depend on a powerful mathematical analysis that is capable of correctly inferring much deeper relations (thus the phrase “Latent Semantic”), and as a consequence are often much better predictors of human meaning-based judgments and performance…
LSA uses as its initial data not just the summed contiguous pairwise (or tuple-wise) co-occurrences of words but the detailed patterns of occurrences of very many words over very large numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary wholes. Thus it skips over how the order of words produces the meaning of a sentence to capture only how differences in word choice and differences in passage meanings are related.
It is this dimensionality reduction step, the combining of surface information into a deeper abstraction, that captures the mutual implications of words and passages. Thus, an important component of applying the technique is finding the optimal dimensionality for the final representation.
LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse. It is not a traditional natural language processing or artificial intelligence program; it uses no humanly constructed dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, or morphologies, or the like, and takes as its input only raw text parsed into words defined as unique character strings and separated into meaningful passages or samples such as sentences or paragraphs.
(basically, you set up a matrix of words, and the count of their occurrences in passages of text (say, in sentences of a paragraph, etc…), and can then crunch the matrix to find correlations between different combinations of words) – DN