Hakia: semantic search... set to music
By Nate Anderson

The intersection of search and samba

If you ever wondered what search queries sound like when set to music, Hakia has your answer. Employees at the beta semantic search engine have put together a band and recorded an album of songs based on actual user queries, things like "Weapons of Mass Instruction." Think samba meets lounge music meets beat poetry and you get the idea. It's a little... odd.

But Hakia doesn't mind being odd. The company was born out of a desire to do search differently. Search engines generally don't understand either content on the Web or the content of user queries; they work through keyword analysis, link weighting, and other statistical methods that allow an engine to produce more or less relevant results without ever needing to understand the implicit question in the search query.

Hakia is about semantic search, though, which requires understanding. It's a more difficult problem than traditional search, which already seems hard enough, but Hakia executives believe that only understanding can provide the next great leap forward in search. Ars talked with Hakia's CEO, Dr. Riza C. Berkan, and Melek Pulatkonak, the company's COO, about how Hakia aims to stake out territory in a province already claimed by plenty of other homesteaders.

Seeking understanding

Here's the pitch: Hakia allows people to search for "painkillers" and "headache," but can then turn up results (say, for "Tylenol") that don't use either of those words. The engine recognizes the concepts that lie behind the search terms and attempts to match those rather than keywords. Pretty nifty if it works, but will it allow Hakia to carve a chunk out of Google's kingdom when Microsoft and Yahoo are both having trouble gaining ground?

"I don't know," Berkan says. "We just want to improve the capability of search in general." This is essentially what the company has been doing for the past 36 months.
The technology behind what Hakia wants to do is so complex that the company's scientists and engineers have spent years simply working out the fundamental research behind it.

Hakia was founded in 2004 and has raised $16 million so far to fund its operations. Unusually, none of this comes from venture capital; it all comes from institutional investors. Hakia has one office in Turkey that does some research and development work, but the rest of the staff is located in New York City, where the band occasionally gets together and plays "Weapons of Mass Instruction" at The Knitting Factory.

If the company can make semantic search pay off, it will accomplish a couple of difficult feats at once. First, it will provide a useful search interface to long-tail content. The more specialized a Web query gets, the harder it is for an engine that relies on counting and weighting links (like Google, for instance) to return useful results. For esoteric topics, there may simply not be enough links to make a purely statistical approach work. But if the search engine can actually understand the content of webpages in a basic way, it can evaluate their worthiness on its own, without counting links.

In a recent blog entry, Hakia's chief architect, Kartal Guner, compared long-tail queries to the hidden, underwater mass of an iceberg: these queries far outnumber the small group of popular searches that represent the visible tip of that iceberg. "Long-tail is more like a black hole," he says, shifting metaphors, "seemingly infinite, dark, cold, and merciless against popularity algorithms. These are the unpopular, longer than usual, complex, unique, and personal queries. These are the ones that need precision, accuracy, and relevancy." These are the queries Hakia hopes to answer best.

A second, related result would be that a new article on the ins and outs of collectible Boba Fett figurines could become a top result on that topic the moment it becomes a part of Hakia's index.
Again, understanding the content is key. With other search engines, it could take days or weeks for a new page to filter to the top of certain queries because the page does not initially have incoming links from trusted sources.

"Search is at a primitive stage," says Pulatkonak when we talk. She claims that traditional indexing has reached its limit as a technology. Search engines now need to move toward understanding, going from keyword search to sentence analysis. Those that don't will, in the end, only be good for answering the most popular queries as users recognize the value of semantic search.

But creating the backend technology to power a search engine that knows what it's reading is a daunting job. Here's how Hakia handles it.

The concept tree, the QDEX, and SemanticRank

The sentence-level analysis of Internet information is the key to Hakia's engine. The process begins by breaking down each sentence into "knowledge bits" and then applying a proprietary concept tree to the entire sentence with the goal of understanding what the words mean and how they fit together.

Just like a map of the US shows the links between cities and towns, a concept tree shows the links between ideas. Every concept in the tree has a parent, children, and siblings, making this tree the equivalent of a California redwood. Berkan calls the process of creating the tree "almost like rewriting Webster's dictionary." It's painstaking work, and "hundreds of linguists" have worked on it for the last few years. This tree is applied to each sentence, helping the engine to understand the event, the agent, and the theme of the sentence, which provides a basic understanding of what was written. All of this processing falls under the heading of "ontological semantics," or "OntoSem" in Hakia-speak (for more information, take a look at Dr. Victor Raskin's work; Raskin is currently an adviser to Hakia).
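
To make the parent, children, and siblings structure concrete, here is a minimal Python sketch of a toy concept tree. It is not Hakia's proprietary tree or its OntoSem processing, neither of which is public; the `Concept` class, the `is_a` helper, and the tiny drug taxonomy are all hypothetical and exist only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Concept:
    """One node in a toy concept tree (hypothetical, not Hakia's data model)."""

    name: str
    parent: Concept | None = field(default=None, repr=False)
    children: list[Concept] = field(default_factory=list, repr=False)

    def add_child(self, name: str) -> Concept:
        """Create a child concept, link it to this node, and return it."""
        child = Concept(name, parent=self)
        self.children.append(child)
        return child

    def siblings(self) -> list[Concept]:
        """Other concepts that share this node's parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def ancestors(self) -> list[Concept]:
        """Walk upward from the parent to the root."""
        chain = []
        node = self.parent
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain


def is_a(node: Concept, ancestor_name: str) -> bool:
    """True if some ancestor of `node` carries the given name."""
    return any(a.name == ancestor_name for a in node.ancestors())


# A made-up fragment of a taxonomy, just large enough to show the links.
root = Concept("thing")
drug = root.add_child("drug")
painkiller = drug.add_child("painkiller")
tylenol = painkiller.add_child("Tylenol")
aspirin = painkiller.add_child("aspirin")
antibiotic = drug.add_child("antibiotic")
```

In this toy tree, `is_a(tylenol, "painkiller")` is true even though the string "painkiller" never appears in the word "Tylenol," which is the kind of link a concept-based match would need in order to connect a painkiller query to a page about Tylenol.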
Those curious about how concept trees work in Hakia's system can see an OntoSem example using the word "bow." The word is parsed into its various meanings (a weapon, part of a musical instrument, etc.), and each is broken down into various syntactic and semantic structures that show how the word might operate in sentences and how it connects to other concepts. Imagine doing this for the 100,000+ entries in Hakia's dictionary, and it's easy to see why it has taken several years to develop.

Web pages are analyzed using OntoSem and the results are passed to the QDEX algorithms. QDEX replaces the typical "index" used by many other search engines; instead of keywords, it stores knowledge bits. Berkan describes QDEX as the "middleware that enables full-scale semantic analysis," and the software goes to work on web pages gathered by Hakia's spidering system.

Once the text in these pages has gone through OntoSem processing and the words and concepts are all tagged, QDEX algorithms skim information and attempt to generate "all possible queries that can be asked to this content." Even for a simple web page, this could obviously result in millions of different queries based on the listed concepts. But humans reading the page would be likely to ask only a few questions about the information it contains, and one of the main jobs of QDEX is to reduce the vast number of possible queries into the much smaller number of likely human queries. QDEX then stores these possible queries that point back to the page or sentence where the answer can be found.

All of this work is done off-line, of course; it takes far too much processing power for real-time use on vast amounts of data. On its web site, Hakia describes this technology as providing "great flexibility in a search engine platform for utilizing semantically rich data and multiple-thread processing of equivalent queries. Otherwise, deep semantic analysis is virtually impossible over a vast amount of textual data area."
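
The general shape of that pipeline (enumerate candidate queries offline, prune them to the likely ones, and store pointers back to the source sentence) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the tagged sentences, the `example.org` locations, the pairwise query enumeration, and the "lead with the subject" pruning rule are all invented here, and none of it reflects how QDEX is actually implemented.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical output of the tagging step: (page, sentence number) -> concepts.
TAGGED_SENTENCES = {
    ("example.org/nixon", 0): ["Richard Nixon", "resign", "1974"],
    ("example.org/nixon", 1): ["Richard Nixon", "born", "California"],
}


def candidate_queries(concepts):
    """Enumerate every ordered pair of concepts as a candidate query."""
    return permutations(concepts, 2)


def is_likely(query, subjects):
    """Toy pruning rule (an assumption): keep queries that lead with a subject."""
    return query[0] in subjects


def build_index(tagged, subjects):
    """Offline step: map each surviving query to the sentences that answer it."""
    index = defaultdict(list)
    for location, concepts in tagged.items():
        for query in candidate_queries(concepts):
            if is_likely(query, subjects):
                index[query].append(location)
    return index


def lookup(index, query):
    """Query-time step: a plain dictionary lookup, with no analysis of the page."""
    return index.get(tuple(query), [])


INDEX = build_index(TAGGED_SENTENCES, subjects={"Richard Nixon"})
```

Each three-concept sentence yields six ordered pairs, and the pruning rule keeps two of them, so the expensive enumeration happens once at indexing time while `lookup` stays cheap at query time. A real system would need a far richer notion of "likely" than this one-line rule.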
With the QDEX in place, the search engine is ready to tell users about Richard Nixon, or puppy dogs. The QDEX has already generated all of the necessary queries when it analyzed the webpage, but these now need to be ranked for display. Hakia uses a final technology called SemanticRank to comb through the stored queries and decide which ones best fit the search terms (which have themselves been parsed with OntoSem technology). Hakia claims that "no keyword matching or Boolean algebra [is] involved" in the process; it's all about concept-matching.

Well, mostly. SemanticRank does take credibility and age of information into account, though the company stresses that it does not do this by counting link referrals (popularity). Berkan tells Ars that a true semantic search should never require page rank; it should be about the quality of the information. Hakia evaluates credibility by analyzing the way a page is written (does it use proper English, for instance), who authored it, and what site it's on. Medical claims on blogs will never make it above information from the NIH, Berkan says, no matter how popular they are, because the system understands that the NIH is a credible institution.

Still in beta

But the search is still in beta, and with reason. In my initial work on this piece, I did a search (based on the Hakia-suggested example at the beginning of the article) for "What painkiller can I take for a headache?" It returned as its first result something about a prisoner's psychotic episodes and painkiller addiction. Result number two was a news link titled "Man jailed for blowing up toilet." Link three pointed to an online anorexia forum with a post about a headache. The fourth link did point to the UK's National Health Service, but the link was bad, and an empty page was the result.

When it comes to answering the query "the essence of capitalism," though, Hakia does a fine job, arguably better than Google.
Any formal evaluation of the search engine's usefulness will need to wait. The current search algorithm is being updated every six weeks (it's in Beta 15 now), and Hakia hopes to move beyond the beta stage by the end of 2007. The test above was conducted under Beta 14, which makes it easier to see how the system is evolving. A new search on the same question about headaches returns slightly better results, though the first several links aren't especially helpful. Berkan says that the meaning-based capabilities (Hakia's secret sauce) "have not fully propagated into the system yet," but they will be in place by the end of the year.

To a user, all of this background mumbo-jumbo remains invisible. What users do see is a search results page with a spartan design clearly borrowed from Google. Just below the search box is a small green head in profile. This head offers suggestions, pats on the back for good searches, and suggested top answers. Below this begins the search results proper, and here Hakia provides some useful features.

The best of these is something called "galleries." These are generated automatically by the system whenever it encounters a query with many different sorts of available information. A search for Richard Nixon brings up a gallery containing information on his biography, speeches, and statistics, films about Nixon, criticism and commentary of the man and his policies, along with photographs, a bibliography, and more. It's a bit like running a whole set of searches at once. Because these don't rely on any hand coding, it's simple for the system to generate galleries on any topic with enough different references, like Hungary or cats. Even the category headings are generated automatically by the system.

Not the only ones with a good idea

The company recognizes that taking on the industry leaders requires time.
Berkan says that Hakia wants to operate in the long tail anyway and that it has a chance to start building its reputation as an engine that can produce good results for even obscure queries.

But it's not alone in what it's trying to do. Well-funded competitors like Powerset are also hard at work on semantic search, and then there's the 800-pound search gorilla: Google. Some reports suggest that the company has been hard at work on semantic search technologies for some time and has already begun to deploy them in a limited way.

For users, the promised benefit is more relevant search results, but there's something in all this for advertisers as well. Because Hakia's engine should understand what a search query is really about, advertisers can target campaigns more accurately. If it works, Hakia shouldn't show hot dog ads for the query, "how hot can my dog get?"

More targeted ad campaigns should produce more cash for Hakia, but the company's own meaning-based advertising system won't be ready until 2008. For now, they use advertising.com to show the ads that appear on the right side of the results page, and the query "how hot can my dog get?" does, in fact, turn up an ad for the Vienna Beef online store.

But the first company that gets this right stands to make a small mountain of cash. While Hakia claims to be interested only in the "search for better search" at the moment, it's a good bet that its investors want something more. Should Hakia's tech truly produce better long tail results, they may see that wish fulfilled.

Posted at: http://arstechnica.com/articles/culture/hakia-semantic-search-set-to-music.ars