Latent Semantic Analysis
Lately, people in the know about search engines have been using a buzzword: LSI, or Latent Semantic Indexing. For example, you can query Google for the phrase "women portal" like this: ~women portal
http://www.google.com/search?num=100&hl=en&lr=&q=%7Ewomen+portal
The tilde (~) before the search phrase indicates that you want latent semantic analysis turned on: Google will try to look for logical extensions of the supplied phrase, to put it simply. In reality, it is a lot more complicated than that.
One thing I like about the Internet is the way it offers a level playing field to everyone, whether you are a 900 lb gorilla or a timid 6 lb Chihuahua. We have been tinkering with LSA for some time now, about a couple of years. Our agenda is much simpler in nature: to deliver the right page from our thousands of pages of content for a given search phrase. You may have noticed, on our main page and elsewhere, a search box with some mumbo jumbo about Natural language navigation.
To tell you the truth, without much technicality or hype, LSA (Latent Semantic Analysis) is simply a behind-the-scenes process by which a computer program figures out the concept behind a phrase and identifies the matching content. Different writers use different words or phrases to describe the same idea or concept. Even the most painstaking editing before publication will not weed out each writer's individual bias toward language that means different things to different people. Editors can enforce a consistent style and voice across an entire website, but they can do little to bring homogeneity to the choice of words.
For those who are technically inclined, this is what is called synonymy: many words exist to describe a single idea. Its counterpart, polysemy, describes a word or phrase with multiple meanings, again a problem for our search engine approach. Some of the words people use to search might end up retrieving the wrong page.
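To make this concrete, here is a toy sketch in Python of how a naive word-for-word match trips over synonymy, and how a synonym list patches it. The pages and the synonym table below are made up for illustration; they are not our actual content or dictionary.

# Naive keyword search: a page matches only if every query word
# appears in it verbatim.
pages = {
    "page1": "a portal for women covering health and careers",
    "page2": "a portal where female readers discuss parenting",
}

def naive_search(query):
    words = query.lower().split()
    return [pid for pid, text in pages.items()
            if all(w in text.split() for w in words)]

# A tiny, invented synonym table standing in for a real dictionary.
SYNONYMS = {"women": {"women", "woman", "female", "lady", "ladies"}}

def synonym_search(query):
    # Each query word matches if the page contains any of its synonyms.
    groups = [SYNONYMS.get(w, {w}) for w in query.lower().split()]
    return [pid for pid, text in pages.items()
            if all(group & set(text.split()) for group in groups)]

print(naive_search("women portal"))    # ['page1']  - misses page2
print(synonym_search("women portal"))  # ['page1', 'page2']

The second page talks about the same concept, but a literal match never finds it; that gap is exactly what the synonym expansion closes.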
Human language is probably one of the most complicated things for a computer to handle. The subtle nuances of a language are not easy to quantify in objective terms. People often intuitively "arrive" at the intended meaning of the written word from the positions of the words in relation to each other. Cues like modulation of the voice and emphasis placed on syllables, which exist in speech, do not exist on a written page.
The only cues left to analyze are the relative positions of the words and their frequency of occurrence on each article page. Most of our pages contain thousands of occurrences of common words, which receive less weight than the unique primary keyword phrases. These weighted phrases, along with their synonyms, are factored into our search engine for classification.
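For the technically curious, here is a rough sketch of that kind of weighting in Python. It uses the textbook TF-IDF formula as a stand-in; the three toy documents are invented, and our production scheme is not necessarily this exact formula.

import math

docs = [
    "women portal health careers women".split(),
    "sports news scores and more news".split(),
    "health tips and fitness advice".split(),
]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)          # frequency within the page
    df = sum(1 for d in docs if term in d)   # pages containing the term
    return tf * math.log(len(docs) / df)     # rarer terms weigh more

print(tf_idf("women", docs[0]))  # ~0.44: rare in the collection, weighted up
print(tf_idf("and", docs[1]))    # ~0.07: common everywhere, weighted down

The idea is simply that a word appearing on nearly every page tells you little about any one page, while a word concentrated in a few pages is a strong signal.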
To cut a long story short, we decided to use an extensive dictionary of English words to support our version of LSA. Sometimes it is genuinely thrilling to see our internal search engine deliver the most appropriate article for a search phrase with relative ease. Equally, it is sometimes stumped by a contrived phrase, though that happens relatively rarely.
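For a flavour of what the math underneath looks like, here is a toy Python sketch of the classic LSA step: a truncated SVD of a term-document count matrix, which pulls co-occurring synonyms close together in a low-rank "concept" space. The five-word vocabulary and the counts are invented for illustration; they are not our actual index.

import numpy as np

terms = ["woman", "female", "lady", "portal", "sports"]
# Rows are terms, columns are four documents; entries are raw counts.
A = np.array([
    [2., 0., 1., 0.],   # woman
    [1., 2., 0., 0.],   # female
    [0., 1., 1., 0.],   # lady
    [1., 1., 1., 0.],   # portal
    [0., 0., 0., 3.],   # sports
])

# Keep only the top k singular directions: the "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # each row: one term in concept space

def similarity(t1, t2):
    v1 = term_vecs[terms.index(t1)]
    v2 = term_vecs[terms.index(t2)]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(similarity("woman", "female"))  # near 1: they keep the same company
print(similarity("woman", "sports"))  # near 0: they never share a document

Notice that "woman" and "female" end up pointing the same way in concept space even though no single rule ever declared them synonyms; the truncation infers it from the company the words keep.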
The technology for negotiating the vast realms of human language is still nascent, and our LSA is still in beta.
To go back to the first example, where we used Google to look for semantically related words matching the phrase "women portal": you should see many occurrences of lady, woman, female and so on in the results. That feature, too, is in beta …