Seven Trends that Influence Search Technologies

Arumugham S

The Semantic Web is a project that intends to create a universal medium for information exchange by giving meaning (semantics), in a manner understandable by machines, to the content of documents on the Web. Directed by the Web's creator, Tim Berners-Lee of the World Wide Web Consortium, the Semantic Web extends the World Wide Web through the use of standards, markup languages and related processing tools.

Most people are capable of using the web to, say, find the Swedish word for "car", renew a library book, or find the cheapest DVD and buy it. However, if you ask a computer to do the same thing, it would not know where to start. That is because web pages are designed to be read by humans, not machines. The Semantic Web is a project aimed at making web pages understandable to computers, so that they can search websites and perform actions in a standardized way.

The potential benefit is that computers could harness the enormous network of information and services on the web. Your computer could, for example, automatically find the nearest dentist to where you live and book an appointment for you that fits in with your schedule.

A lot of the things that could be done with the Semantic Web could also be performed without it, and indeed already are being done in some cases. However, the Semantic Web provides a standard that makes such services far easier to implement.

The Semantic Web builds on the standards and tools of XML, XML Schema, RDF, RDF Schema and OWL. The OWL Web Ontology Language Overview describes the function and relationship of each of these components of the Semantic Web (a small illustrative sketch follows the list):

  • XML provides a surface syntax for structured documents, but imposes no semantic constraints on the meaning of these documents;
  • XML Schema is a language for restricting the structure of XML documents;
  • RDF is a simple data model for referring to objects ("resources") and how they are related. An RDF-based model can be represented in XML syntax;
  • RDF Schema is a vocabulary for describing properties and classes of RDF resources, with a semantics for generalization-hierarchies of such properties and classes; and
  • OWL adds more vocabulary for describing properties and classes: among others, relations between classes (e.g. disjointness), cardinality (e.g. "exactly one"), equality, richer typing of properties, and characteristics of properties (e.g. symmetry) and enumerated classes.
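
To make the RDF and RDF Schema layers concrete, here is a minimal sketch in Python (no libraries; every URI and resource name is invented for illustration) of the underlying data model: resources identified by URIs, related by properties, with a small generalization hierarchy on top:

    # A toy illustration of the RDF data model: statements are
    # (subject, predicate, object) triples whose terms are URIs or literals.
    # All URIs below are made up for this example.
    EX = "http://example.org/"

    triples = {
        (EX + "Dentist", EX + "subClassOf", EX + "Doctor"),    # an RDF Schema-style hierarchy
        (EX + "drSmith", EX + "type",       EX + "Dentist"),
        (EX + "drSmith", EX + "locatedIn",  EX + "Springfield"),
        (EX + "drSmith", EX + "phone",      "+1-555-0100"),    # a literal value
    }

    def objects(subject, predicate):
        """Return every object linked to `subject` by `predicate`."""
        return {o for s, p, o in triples if s == subject and p == predicate}

    def is_instance_of(resource, cls):
        """Check the type of a resource, following subClassOf links upwards."""
        seen = set(objects(resource, EX + "type"))
        frontier = list(seen)
        while frontier:
            current = frontier.pop()
            for parent in objects(current, EX + "subClassOf"):
                if parent not in seen:
                    seen.add(parent)
                    frontier.append(parent)
        return cls in seen

    print(is_instance_of(EX + "drSmith", EX + "Doctor"))  # True: a dentist is a doctor
    print(objects(EX + "drSmith", EX + "locatedIn"))      # {'http://example.org/Springfield'}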

The intent is to enhance the usability and usefulness of the Web and its interconnected resources through:

  • Documents "marked up" with semantic information (an extension of the HTML <meta> tags used in today's web pages to supply information for web search engines using web crawlers). This could be machine-readable information about the human-readable content of the document (such as the creator, title, description, etc. of the document) or it could be purely metadata representing a set of facts (such as resources and services elsewhere in the site). Note that anything that can be identified with a Uniform Resource Identifier (URI) can be described, so the Semantic Web can reason about people, places, ideas, cats, etc.; a minimal sketch of such document metadata appears after this list;
  • Common metadata vocabularies (ontologies) and maps between vocabularies allow document creators to know how to mark up their documents so that agents can use information in the supplied metadata (so that Author in the sense of 'the Author of the page' will not be confused with Author in the sense of a book that is the subject of a book review);
  • Automated agents to perform tasks for users of the Semantic Web using this metadata; and
  • Web-based services (often with agents of their own) to supply information specifically to agents (e.g. a Trust service that an agent could ask if some online store has a history of poor service or spamming).
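
As a complement, the sketch below (again plain Python; the page URL is invented, and the property names are merely borrowed from the Dublin Core vocabulary for illustration) shows the kind of machine-readable facts that "marking up" a document adds alongside its human-readable content, including the distinction between the author of a page and the book the page is about:

    # A toy sketch of document metadata expressed as machine-readable facts.
    # The page URL is invented; the property names borrow from Dublin Core.
    page = "http://example.org/review-of-some-book.html"
    DC = "http://purl.org/dc/elements/1.1/"

    metadata = [
        (page, DC + "title",   "A Review of Some Book"),
        (page, DC + "creator", "A. Reviewer"),                         # the author of the page
        (page, DC + "subject", "http://example.org/books/some-book"),  # the book the page is about
    ]

    def to_ntriples(facts):
        """Serialize (subject, predicate, object) facts in a simple N-Triples-like form."""
        lines = []
        for s, p, o in facts:
            obj = "<" + o + ">" if o.startswith("http://") else '"' + o + '"'
            lines.append("<" + s + "> <" + p + "> " + obj + " .")
        return "\n".join(lines)

    print(to_ntriples(metadata))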

Current projects in web searching all revolve around Semantic Web technologies. The idea is to look at the entire web like a book: the technologies supported by the Semantic Web will help you read and find information just the way you browse a book.

Clustering

Clustering is fast becoming a popular topic on forums and blogs, as both Google and MSN use it for their search results. The truth is that search software, web-mining software and especially computational-language software have been using this fairly basic technique for a very long time, in fact since the 1970s.

It is a statistical technique used to identify groups in a multi-dimensional space. The idea is simple: to organize or discover a set of clusters for a given document set. Similarity between documents in different clusters must be minimized, while similarity between documents within a cluster must be maximized. In a partitioning approach, the documents are divided into non-overlapping groups.
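
To make "similarity" concrete, each document can be represented as a term-frequency vector and compared with the cosine measure. The toy sketch below (plain Python; the sample documents are invented) computes the quantity that a partitioning method tries to maximize within clusters and minimize between them:

    from collections import Counter
    from math import sqrt

    def vectorize(text):
        """A crude bag-of-words vector: lowercase term frequencies."""
        return Counter(text.lower().split())

    def cosine(a, b):
        """Cosine similarity between two sparse term-frequency vectors."""
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    d1 = vectorize("clustering groups similar documents together")
    d2 = vectorize("document clustering groups similar pages")
    d3 = vectorize("caching stores web pages closer to the client")

    print(cosine(d1, d2))  # same topic: relatively high
    print(cosine(d1, d3))  # different topic: low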

Partitioning methods yield a set of X clusters, with each document assigned to its respective cluster. Each cluster is represented by a centroid, which holds the definition of that cluster. Different types of algorithms belonging to this group include:

  • The single pass method, where the first object is taken as the centroid of the first cluster. For each subsequent object, the similarity S to each existing centroid is calculated, using the same similarity measure as for the clusters or centroids. If S is greater than a specified threshold value, the object is added to that cluster and the centroid is re-calculated; otherwise the object starts a new cluster. It goes through the data set only once, hence its name (a sketch follows this list);
  • The hierarchical agglomerative clustering method is the most commonly used. The two closest objects are merged into a cluster; then the next two closest points are found and merged, a point being either an object or a cluster. This is repeated until only a single cluster remains. Within this method, there are variants such as the second matrix approach and the NN matrix;
  • The Single Link Method (SLINK) works by joining the two most similar objects that are not yet in the same cluster;
  • The Complete Link Method (CLINK) measures inter-cluster similarity by the least similar pair of objects, one drawn from each cluster; and
  • The Group Average Method uses the average similarity between the members of the groups being compared. In cluster-based retrieval, if a large set of documents can be divided up into N coherent clusters, a query K need only be compared with the representation of each of the N clusters, and then against every document in the most relevant clusters.
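
A minimal sketch of the single pass method described above, in Python, using the same bag-of-words cosine idea; the documents and the threshold value are arbitrary choices for illustration:

    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def single_pass(docs, threshold=0.15):
        """Assign each document to the most similar existing centroid if the
        similarity S exceeds the threshold; otherwise start a new cluster."""
        clusters = []  # each cluster: {"centroid": Counter, "members": [text, ...]}
        for doc in docs:
            vec = Counter(doc.lower().split())
            best, best_sim = None, 0.0
            for cluster in clusters:
                sim = cosine(vec, cluster["centroid"])
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is not None and best_sim >= threshold:
                best["members"].append(doc)
                # re-calculate the centroid as the sum of member vectors,
                # which is equivalent to the mean under the cosine measure
                best["centroid"] += vec
            else:
                clusters.append({"centroid": vec.copy(), "members": [doc]})
        return clusters

    docs = [
        "clustering groups similar documents",
        "document clustering groups similar pages",
        "caching stores pages near the client",
        "edge caching reduces network latency",
    ]
    for cluster in single_pass(docs):
        print(cluster["members"])

Because it sees each document only once, the result depends on the order of the input and on the chosen threshold, which is why the single pass method is cheap but not always stable.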

Clustering is widely used in bioinformatics and work of a scientific nature. Even Windows NT uses a clustering algorithm and has done so for some time now. Research is still ongoing in the field of classification (clustering).

Caching

Caching is the process by which web pages are stored closer to the client (browser). Technologies have been developed to cache both static and dynamic content. Akamai is one of the trendsetters. EdgeComputing for Java is built on Akamai EdgeSuite, a content delivery solution that leverages the globally distributed Akamai Platform to deliver Web content and applications via more than 15,000 servers in over 1,000 networks in 65+ countries.

EdgeComputing for Java supports the execution of Java Server Pages (JSP), Servlets and JavaBeans on the edge of the Internet, thus avoiding network latency and the need for costly infrastructure over-provisioning, while improving the performance and reliability of mission-critical enterprise applications. To adapt an application for EdgeComputing for Java, applications are separated into two layers: a centralized origin layer and a distributed edge layer. The edge layer is deployed on to the Akamai network and is composed of presentation and business components optimized for the edge.
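
The division of labour between an edge cache and a central origin can be sketched roughly as follows (plain Python; the class, the fetch_from_origin callable, the TTL and the URLs are illustrative assumptions, not Akamai's actual API):

    import time

    class EdgeCache:
        """A toy edge-style cache: serve content locally if it is fresh,
        otherwise fall back to the (slower, centralized) origin."""

        def __init__(self, fetch_from_origin, ttl_seconds=60):
            self.fetch_from_origin = fetch_from_origin  # callable: url -> content
            self.ttl = ttl_seconds
            self.store = {}                             # url -> (content, fetched_at)

        def get(self, url):
            entry = self.store.get(url)
            if entry is not None:
                content, fetched_at = entry
                if time.time() - fetched_at < self.ttl:
                    return content                      # cache hit: no trip to the origin
            content = self.fetch_from_origin(url)       # cache miss or stale entry
            self.store[url] = (content, time.time())
            return content

    # Illustrative origin: in reality this would be a request to the central servers.
    def fetch_from_origin(url):
        return f"<html>content of {url}</html>"

    cache = EdgeCache(fetch_from_origin, ttl_seconds=30)
    print(cache.get("http://example.org/index.html"))   # fetched from the origin
    print(cache.get("http://example.org/index.html"))   # served from the edge cache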

With over 4 billion indexed and meticulously sorted web files, images and messages dating back to the 1980s, Google, a four-and-a-half-year-old California-based company, has indisputably become the world's largest information powerhouse. Wielding a mixture of superior technology and purposeful business marketing, the one-time university student project utterly transformed the perception of Internet searching and triggered ramifications never seen before. One of Google's strengths is its caching technology.

There are a number of projects to improve caching technologies. The search engine vendors are promoting some of these projects.

Distributed Computing

Distributed computing is an aspect of computer science that deals with the coordination of multiple computers in remote physical locations in order to accomplish a common objective or task. In distributed computing, the hardware, programming languages, operating systems and other resources of each computer may vary drastically. Clustering has much in common with distributed computing, but the main difference is the practical physical accessibility of the machines that are working together.

Organizing the interaction between the computers is of prime importance. In order to be able to use the widest possible range and types of computers, the protocol or communication channel should not depend on information that may not be understood by certain machines. Special care must also be taken that messages are indeed delivered correctly and that invalid messages are rejected, since such messages could otherwise bring down the system and perhaps the rest of the network.
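
The point about rejecting invalid messages can be illustrated with a small sketch (plain Python over JSON; the message fields and task names are invented): each node validates an incoming message against the agreed protocol before acting on it, so a malformed message is dropped rather than allowed to bring the node down:

    import json

    # Fields every message must carry, and the Python types they must have.
    REQUIRED_FIELDS = {"sender": str, "task": str, "payload": dict}

    def parse_message(raw_bytes):
        """Decode and validate a message; return it, or None if it is invalid."""
        try:
            message = json.loads(raw_bytes.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return None                                   # not even well-formed JSON
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(message.get(field), expected_type):
                return None                               # missing or wrongly typed field
        return message

    def handle(raw_bytes):
        message = parse_message(raw_bytes)
        if message is None:
            print("rejected invalid message")             # reject, do not crash
            return
        print(message["sender"], "asked for task", message["task"])

    handle(b'{"sender": "node-7", "task": "index-shard", "payload": {"shard": 42}}')
    handle(b'\xff\x00 not a message')                     # garbage is rejected safely
    handle(b'{"sender": "node-9"}')                       # incomplete message is rejected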

Another important factor is the ability to send software over to another computer in a portable way so that it may execute and interact with the existing network. Obviously, this may not always be possible when using differing hardware and resources, so other methods must be used, such as cross-compiling or manually porting this software.

Distributed computing remains one of the biggest challenges for most search engine vendors. While there has been great progress, the scope for improvement is still large. Distributed computing will influence the way large search engines work.

Non-textual Searches

The biggest challenge for search engines is to search and identify non-textual data. Today, most search engines use crude but effective algorithms based on HTML tags and the content of a specific page. The fact remains that most search engines will identify an apple as an orange, because the information the search engine links to is obtained from, say, the name of the file or the content of the web page, and not from what is actually pictured in the image. This is hardly a solution striving for maximum accuracy.
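
To see why this goes wrong, consider a naive image index built only from filenames and surrounding text, which is roughly what the paragraph above describes (plain Python; the filenames and captions are invented):

    # A toy filename-and-caption image index: this is the kind of textual proxy
    # most engines rely on, and it says nothing about the pixels themselves.
    images = [
        {"file": "orange.jpg",   "caption": "my cat sitting on an orange blanket"},
        {"file": "IMG_0042.jpg", "caption": "apple orchard in autumn"},
    ]

    def search_images(query):
        """Return images whose filename or caption mentions the query term."""
        q = query.lower()
        return [img["file"] for img in images
                if q in img["file"].lower() or q in img["caption"].lower()]

    print(search_images("orange"))  # ['orange.jpg'], even if the photo shows a cat
    print(search_images("apple"))   # ['IMG_0042.jpg'], matched on the caption, not the image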

Finally, looking into the future, how many of these ideas can be extended to video retrieval? Combining the audio track from videos with the images that are being displayed may not only provide additional sources of information on how to index the video, but also provide a tremendous amount of (noisy) training data for training object recognition algorithms en masse.

Even with the variety of research topics discussed previously, we are still only scratching the surface of the myriad issues that artificial intelligence technologies can address with respect to web search. One of the most interesting aspects of working with web data is the insight and appreciation that one can get for large data sets.

In contrast to traditional approaches, which solely make use of standard term lexicons to make spelling corrections, the Google spelling corrector takes a Machine Learning approach that leverages an enormous volume of text to build a very fine-grained probabilistic context sensitive model for spelling correction.

This allows the system to recognize far more terms than a standard spelling correction system, especially proper names that commonly appear in web queries but not in standard lexicons. For example, many standard spelling systems would suggest that the text “Beeg Shahi” be corrected to “Big Shah”, being completely ignorant of the proper name and simply suggesting common terms with a small edit distance to the original text. By contrast, the Google spelling corrector does not attempt to correct the text “Beeg Shahi”, since this term combination is recognized by its highly granular model.
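
The flavour of such a corpus-driven, context-sensitive corrector can be conveyed with a toy sketch in Python. It is a deliberate simplification in the spirit of the approach, not Google's actual system, and the "query log" it learns from is invented: candidates within one edit of the typed word are scored by how often they follow the previous word, and a word the model already knows is left alone:

    from collections import Counter
    import string

    # A tiny stand-in "query log"; a real system learns from terabytes of text.
    corpus = "gulf war news gulf war map golf car rental golf car hire".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def edits1(word):
        """All strings within one edit (delete, replace or insert) of `word`."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
        inserts = {a + c + b for a, b in splits for c in letters}
        return deletes | replaces | inserts | {word}

    def correct(previous_word, word):
        """Pick the known candidate that most often follows `previous_word`."""
        if word in unigrams:
            return word                 # a term the model knows is left alone
        candidates = [w for w in edits1(word) if w in unigrams]
        if not candidates:
            return word                 # no plausible correction: leave it as typed
        return max(candidates,
                   key=lambda w: (bigrams[(previous_word, w)], unigrams[w]))

    print(correct("gulf", "ar"))  # 'war': after 'gulf' the model prefers 'war'
    print(correct("golf", "ar"))  # 'car': the same typo after 'golf' resolves differently

The same typo is resolved differently depending on the word that precedes it, which is the essence of context sensitivity.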

Context Sensitiveness

More interesting, however, is the fact that by employing a context-sensitive model, the system will correct the text in a more intelligent way. Such fine-grained context sensitivity can only be achieved by analyzing very large quantities of text.

The Open Directory Project (ODP) (http://dmoz.org/) is a large Open Source topic hierarchy into which web pages have been classified manually. The hierarchy contains roughly 500,000 classes/topics.

Since this is a useful source of hand-classified information, we sought to build a query classifier that would identify and suggest categories in the ODP that would be relevant to a user query. At first blush, this would appear to be a standard text classification task. It becomes more challenging when we consider that the “documents” to be classified are user queries, which have an average length of just over two words.

Moreover, the set of classes from the ODP is much larger than in any previously studied classification task, and the classes are non-mutually exclusive, which can create additional confusion between topics. Despite these challenges, we have roughly four million pre-classified documents available, giving us quite a substantial training set. Hence, context-sensitive models are another area that influences searching technologies.
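
A much-simplified sketch of such a query classifier is shown below (Python, assuming scikit-learn is available; the categories and training snippets are invented stand-ins for the millions of ODP-classified pages): a Naive Bayes model over bags of words, applied to "documents" that are only a couple of words long:

    # A toy query-to-topic classifier in the spirit described above.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Stand-ins for pages already filed under ODP-style categories.
    training_text = [
        "dentist dental clinic appointment teeth cleaning",
        "dental surgery tooth extraction dentist directory",
        "cheap dvd player reviews buy electronics online",
        "dvd movie store online shopping discounts",
        "swedish english dictionary translate words language",
        "learn swedish vocabulary grammar language course",
    ]
    training_labels = [
        "Health/Dentistry", "Health/Dentistry",
        "Shopping/Electronics", "Shopping/Electronics",
        "Reference/Dictionaries", "Reference/Dictionaries",
    ]

    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(training_text, training_labels)

    # Queries average only a couple of words, which is what makes this hard.
    for query in ["book a dentist", "swedish word for car", "cheapest dvd"]:
        print(query, "->", classifier.predict([query])[0])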

Responsiveness to Spam

Believe it or not, spam is the biggest issue that most web-based searching techniques face. The plethora of search optimization tricks has resulted in many methods that are close to spamming. Search engines and searching techniques need to combat this effectively; otherwise, the effectiveness of web querying, which depends on the accuracy of the results returned, will be questioned. Google and others are trying their best to understand what is relevant to them and their users. Research in this direction is also quite important.