Information Discovery on the Web: Issues & Problems

Sunil Kr. Pandey, R.B. Mishra

Web information discovery (WID) presents a wonderfully rich and varied set of problems. Efficient web search systems are essential for relevant information discovery on the Web. In order to best serve the needs of users, a search system must find and filter the most relevant information matching a user’s query, and then present that information in a manner that makes it most readily available to the user. Moreover, the task of information discovery and presentation must be done in a scalable fashion to serve the enormous volume of user queries issued every day. In addressing the problem of information discovery on the Web, there are a number of challenges. We begin by briefly outlining some of the issues that arise in web information discovery and that showcase its differences from research traditionally done in Information Retrieval (IR), and then focus on more specific problems.

Introduction

The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – turned out to be the biggest challenge for web search engines in their quest to index and retrieve this content. Web page authors created content in dozens of (natural) languages and thousands of dialects, thus demanding many different forms of stemming and other linguistic operations. Because publishing was now open to tens of millions, web pages exhibited heterogeneity at a daunting scale, in many crucial aspects. First, content creation was no longer the privy of editorially trained writers; although this represented a tremendous democratization of content creation, it also resulted in a tremendous variation in grammar and style (and in many cases, no recognizable grammar or style). Indeed, web publishing in a sense unleashed the best and worst of desktop publishing on a planetary scale, so that pages quickly became riddled with wild variations in colors, fonts, and structure. Some web pages, including the professionally created home pages of some large corporations, consisted entirely of images (which, when clicked, led to richer textual content) – and therefore, no indexable text.

The democratization of content creation on the web meant a new level of granularity in opinion on virtually any subject. This meant that the web contained truth, lies, contradictions, and suppositions on a grand scale. This gives rise to the question: Which web pages does one trust? In a simplistic approach, one might argue that some publishers are trustworthy and others not – begging the question of how a search engine is to assign such a measure of trust to each website or web page. There may be no universal, user-independent notion of trust; a web page whose contents are trustworthy to one user may not be so to another.

In traditional publishing this is not an issue: users self-select sources they find trustworthy. But when a search engine is the only viable means for a user to become aware of (let alone select) most content, this challenge becomes significant. A further distinction concerns how pages are produced: static web pages are those whose content does not vary from one request for that page to the next, whereas dynamic pages are typically mechanically generated by an application server in response to a query to a database.

Web Information Discovery (WID)

Evolution
The Web is unprecedented in many ways: unprecedented in scale, unprecedented in the almost-complete lack of coordination in its creation, and unprecedented in the diversity of backgrounds and motives of its participants. Each of these contributes to making web search different – and generally far harder – than searching “traditional” documents.

The invention of hypertext (envisioned by Vannevar Bush in the 1940s and first realized in working systems in the 1970s) preceded the emergence of the World Wide Web (referred to here simply as the Web) in the 1990s. Web usage has since shown tremendous growth, to the point where it now claims a good fraction of humanity as participants, by relying on a simple, open client-server design:

  • The server communicates with the client via a protocol (http, the hypertext transfer protocol) that is lightweight and simple, asynchronously carrying a variety of payloads (text, images, and – over time – richer media such as audio and video files) encoded in a simple markup language called HTML (hypertext markup language).
  • The client – generally a browser, an application within a graphical user environment – can ignore what it does not understand.

Each of these features has contributed enormously to the growth of the Web. The basic operation is as follows: A client (such as a browser) sends an http request for a URL to a web server. The browser specifies a URL (for uniform resource locator) such as http://www.skpsoft.com/skpandey.htm. In this example URL, the string http refers to the protocol to be used for transmitting the data. The string www.skpsoft.com is known as the domain and specifies the root of a hierarchy of web pages (typically mirroring a file system hierarchy underlying the web server). In this example, /skpandey.htm is a path in this hierarchy to a file skpandey.htm that contains the information to be returned by the web server at www.skpsoft.com in response to this request. The HTML-encoded file skpandey.htm holds the hyperlinks and the content, as well as formatting rules for rendering this content in a browser. Such an http request thus allows us to fetch the content of a page, something that will prove to be useful to us for information discovery and indexing documents.
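The sketch below shows this basic fetch operation in Python using only the standard library. It is a minimal illustration rather than a production crawler component; the example URL is simply the one used in the text and may not actually resolve.

    # Minimal sketch: fetching the raw HTML of a page over http, the basic
    # operation an information-discovery system repeats for every URL it visits.
    from urllib.request import urlopen

    def fetch_page(url, timeout=10.0):
        """Return the decoded HTML of `url`, or None if the request fails."""
        try:
            with urlopen(url, timeout=timeout) as response:
                charset = response.headers.get_content_charset() or "utf-8"
                return response.read().decode(charset, errors="replace")
        except OSError:              # covers URLError, timeouts, connection errors
            return None

    if __name__ == "__main__":
        html = fetch_page("http://www.skpsoft.com/skpandey.htm")  # example URL from the text
        if html is not None:
            print(html[:200])        # first 200 characters of the fetched document

A real crawler would add politeness delays, robots.txt handling, and duplicate detection on top of this primitive.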

The designers of the first browsers made it easy to view the HTML markup tags on the content of a URL. This simple convenience allowed new users to create their own HTML content without extensive training or experience; rather, they learned from example content that they liked. As they did so, a second feature of browsers supported the rapid proliferation of web content creation and usage: Browsers ignored what they did not understand. This did not lead to the creation of numerous incompatible dialects of HTML. What it did promote was amateur content creators who could freely experiment with and learn from their newly created web pages without fear that a simple syntax error would “bring the system down.”

Publishing on the Web became a mass activity that was not limited to a few trained programmers, but rather open to tens and eventually hundreds of millions of individuals. For most users and for most information needs, the Web quickly became the best way to supply and consume information on everything from rare ailments to subway schedules. The mass publishing of information on the Web is essentially useless unless this wealth of information can be discovered and consumed by other users.

Early attempts at making web information “discoverable” fell into two broad categories:

  • Full-text index search engines such as AltaVista, Excite, and Infoseek; and
  • Taxonomies populated with web pages in categories, such as Yahoo!

The former presented the user with a keyword search interface supported by inverted indexes and ranking mechanisms. The latter allowed the user to browse through a hierarchical tree of category labels. Although this is at first a convenient and intuitive metaphor for finding web pages, it has a number of drawbacks: accurately classifying web pages into taxonomy tree nodes is for the most part a manual editorial process, which is difficult to scale with the size of the Web. Moreover, for the taxonomy to be useful it should contain only “high-quality” web pages, with only the best web pages for each category; yet just discovering these and classifying them accurately and consistently into the taxonomy entails significant human effort. Furthermore, for a user to effectively discover web pages classified into the nodes of the taxonomy tree, the user’s idea of what subtree(s) to seek for a particular topic must match that of the editors performing the classification. This quickly becomes challenging as the size of the taxonomy grows. Given these challenges, the popularity of taxonomies declined over time, even though variants (such as About.com and the Open Directory Project) sprang up with subject-matter experts collecting and annotating web pages for each category.

The first generation of web information discovery transported classical search techniques to the Web, focusing on the challenge of scale. The earliest web search engines had to contend with indexes containing tens of millions of documents, a few orders of magnitude larger than any prior information discovery system in the public domain. Indexing, query serving, and ranking at this scale required harnessing together tens of machines to create highly available systems, again at scales not witnessed hitherto in a consumer-facing search application. The first generation of web search engines was largely successful at solving these challenges while continually indexing a significant fraction of the Web, all the while serving queries with sub-second response times. However, the quality and relevance of web search results left much to be desired owing to the idiosyncrasies of content creation on the Web. This necessitated the invention of new ranking and spam-fighting techniques to ensure the quality of the search results. Although classical IR techniques continue to be necessary for web search, they are not by any means sufficient. A key aspect is that whereas classical techniques measure the relevance of a document to a query, there remains a need to gauge the authoritativeness of a document based on cues such as which website hosts it.

Web Information Discovery: Techniques and their limitations

Several basic techniques for sampling web pages – used, for example, to characterize the Web and to estimate what a search engine has actually indexed – are covered below:

  • Random searches: Begin with a search log of web searches and sample pages from the results returned for queries drawn at random from this log.

  • Random IP addresses: A second approach is to generate random IP addresses and send a request to a web server residing at the random address, collecting all pages at that server. The biases here include the fact that many hosts might share one IP (due to a practice known as virtual hosting) or not accept http requests from the host where the experiment is conducted. Furthermore, this technique is more likely to hit one of the many sites with few pages, skewing the document probabilities; we may be able to correct for this effect if we understand the distribution of the number of pages on websites.

  • Random walks: If the web graph were a strongly connected directed graph, we could run a random walk starting at an arbitrary web page. This walk would converge to a steady state distribution from which we could in principle pick a web page with a fixed probability. This method has a number of biases. First, the Web is not strongly connected so that, even with various corrective rules, it is difficult to argue that we can reach a steady-state distribution starting from any page. Second, the time it takes for the random walk to settle into this steady state is unknown and could exceed the length of the experiment.

  • Random queries: This approach is noteworthy for two reasons: It has been successfully built upon for a series of increasingly refined estimates, and conversely it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements. The idea is to pick a page (almost) uniformly at random from a search engine’s index by posing a random query to it. It should be clear that picking a set of random terms from (say) Webster’s Dictionary is not a good way of implementing this idea. For one thing, not all vocabulary terms occur equally often, so this approach will not result in documents being chosen uniformly at random from the search engine. For another, there are a great many terms in web documents that do not occur in a standard dictionary such as Webster’s. To address the problem of vocabulary terms not in a standard dictionary, we begin by amassing a sample web dictionary. This could be done by discovering a limited portion of the Web, or by discovering a manually assembled representative subset of the Web. We then pose a conjunctive query with two or more words chosen randomly from this dictionary and pick a page at random from the engine’s results, as sketched below.
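A minimal sketch of this random conjunctive-query idea follows, assuming a hypothetical search_engine(query) function that returns a list of result URLs; the sample pages used to build the lexicon and the engine interface are both assumptions made for illustration.

    import random

    def build_lexicon(sample_pages):
        """Collect a web lexicon from the text of an already-crawled sample of pages."""
        lexicon = set()
        for text in sample_pages:
            lexicon.update(word.lower() for word in text.split())
        return list(lexicon)

    def random_result_page(lexicon, search_engine, k=2):
        """Pose a k-term conjunctive query and pick one of its results at random."""
        terms = random.sample(lexicon, k)
        query = " AND ".join(terms)              # conjunctive query
        results = search_engine(query)
        return random.choice(results) if results else None

As the text notes, the resulting sample is only approximately uniform, and later work corrects for the remaining biases.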

A sequence of research has built on this basic paradigm to eliminate some of these issues; there is no perfect solution yet, but the level of sophistication in statistics for understanding the biases is increasing. The main idea is to address biases by estimating, for each document, the magnitude of the bias. From this, standard statistical sampling methods can generate unbiased samples. In the checking phase, the newer work moves away from conjunctive queries to phrase and other queries that appear to be better-behaved. Finally, newer experiments use other sampling methods besides random queries. The best known of these is document random walk sampling, in which a document is chosen by a random walk on a virtual graph derived from documents. In this graph, nodes are documents; two documents are connected by an edge if they share two or more words in common. The graph is never instantiated; rather, a random walk on it can be performed by moving from a document d to another by picking a pair of keywords in d, running a query on a search engine and picking a random document from the results.
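The document random-walk idea described above can be made concrete as follows; fetch_text(url) and search_engine(query) are assumed helper functions, and the sketch performs exactly the step the text describes: pick a random keyword pair from the current document, query the engine, and jump to a random result.

    import random

    def random_walk_step(current_url, fetch_text, search_engine):
        """One step of the walk on the virtual graph of documents sharing two or more words."""
        words = list({w.lower() for w in fetch_text(current_url).split() if w.isalpha()})
        if len(words) < 2:
            return current_url                   # dead end: stay where we are
        w1, w2 = random.sample(words, 2)         # a random keyword pair from document d
        results = search_engine(w1 + " AND " + w2)
        return random.choice(results) if results else current_url

    def sample_document(seed_url, steps, fetch_text, search_engine):
        url = seed_url
        for _ in range(steps):
            url = random_walk_step(url, fetch_text, search_engine)
        return url

Because only documents containing the two chosen words can be reached, the walk moves along exactly the edges of the virtual graph without ever materializing it.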

Web Information Discovery: Results Evaluation

Even when advances are made in the ranking of search results, proper evaluation of these improvements is a non-trivial task. In contrast to traditional IR evaluation methods using manually classified corpora such as the TREC collections, evaluating the efficacy of web search engines remains an open problem. Recent efforts in this area have examined interleaving the results of two different ranking schemes and using statistical tests based on the results users clicked on to determine which ranking scheme is “better”. There has also been work along the lines of using decision-theoretic analysis (i.e., maximizing users’ utility when searching, considering the relevance of the results found as well as the time taken to find those results) as a means for determining the “goodness” of a ranking scheme. Commercial search engines often make use of various manual and statistical evaluation criteria in evaluating their ranking functions. Still, principled automated means for large-scale evaluation of ranking results are wanting, and their development would help improve commercial search engines and create better methodologies to evaluate IR research in broader contexts.
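The sketch below illustrates the interleaving idea in a deliberately simplified form: merge two rankings, record which ranker contributed each result, and credit clicks back to the contributing ranker. This is a toy variant for illustration, not any particular engine's production method.

    import random

    def interleave_rankings(ranking_a, ranking_b):
        """Merge two ranked lists; team[url] records which ranker contributed each URL."""
        merged, team, ia, ib = [], {}, 0, 0
        while ia < len(ranking_a) or ib < len(ranking_b):
            pick_a = ia < len(ranking_a) and (ib >= len(ranking_b) or random.random() < 0.5)
            if pick_a:
                url, owner, ia = ranking_a[ia], "A", ia + 1
            else:
                url, owner, ib = ranking_b[ib], "B", ib + 1
            if url not in team:                  # skip duplicates contributed by both rankers
                merged.append(url)
                team[url] = owner
        return merged, team

    def credit_clicks(team, clicked_urls):
        """Count clicks per ranker for one session of the interleaved list."""
        wins = {"A": 0, "B": 0}
        for url in clicked_urls:
            if url in team:
                wins[team[url]] += 1
        return wins

In practice the per-query click credits are aggregated over many users and sessions, and a significance test determines whether one ranker is reliably preferred.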

Information Discovery on the Web: Technical Aspects

A central task of web information discovery (ID) is to identify which pages are of high quality and relevant to a user’s query. There are many aspects of web ID that differentiate it from, and make it somewhat more challenging than, the traditional problems exemplified by the TREC competition. The pages on the web contain links to other pages, and by analyzing this web graph structure it is possible to determine a more global notion of page quality. Notable early successes in this area include the PageRank algorithm, which globally analyzes the entire web graph and provided the original basis for ranking in the Google search engine, and Kleinberg’s HITS algorithm, which analyzes a local neighborhood of the web graph containing an initial set of web pages matching the user’s query. Since that time, several other link-based methods for ranking web pages have been proposed, including variants of both PageRank and HITS, and this remains an active research area in which there is still much fertile research ground to be explored. Besides just looking at the link structure in web pages, it is also possible to exploit the anchor text contained in links as an indication of the content of the web page being pointed to. Especially since anchor text tends to be short, it often gives a concise human-generated description of the content of a web page. By harnessing anchor text, it is possible to have index terms for a web page even if the page contains only images. Determining which terms from anchors and surrounding text should be used in indexing a page presents other interesting research avenues.
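For concreteness, the following is a small self-contained sketch of the PageRank iteration on a toy link graph; the damping factor of 0.85 and the toy graph are illustrative assumptions, and real implementations work on sparse representations of billions of pages.

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if not outlinks:                         # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
            rank = new_rank
        return rank

    if __name__ == "__main__":
        toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
        print(pagerank(toy_graph))   # "c", linked to by both other pages, gets the highest score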

Discovery of Multimedia Data from the Web

With the proliferation of digital still and video cameras, camera phones, audio recording devices, and mp3 music, there is a rapidly increasing number of non-textual “documents” available to users. One of the challenges faced in the quest to organize and make useful all of the world’s information is the process by which the contents of these non-textual objects should be indexed. An equally important line of study is how to present the user with intuitive methods by which to query and access this information. The difficulties in addressing the problem of non-textual object retrieval stem largely from the sheer diversity of such objects. In addressing this diversity, we presently give three basic approaches to the task of retrieving images and music.

  1. Content Detection: For images, this method means that the individual objects in the image are detected, possibly segmented, and recognized. The image is then labeled with detected objects. For music, this method may include recognizing the instruments that are played as well as the words that are said/sung, and even determining the artists. Of the three approaches, this is the one that is the furthest from being adequately realized, and involves the most signal processing.

  2. Content Similarity Assessment: In this approach, we do not attempt to recognize the content of the images (or audio clips). Instead, we attempt to find images (audio tracks) that are similar to the query items. For example, the user may provide an example image (or audio snippet) of the type of results they are interested in finding, and based on low-level similarity measures, such as (spatial) color histograms or audio frequency histograms, similar objects are returned (a minimal sketch of such a histogram comparison appears after this list). Systems such as these have often been used to find images of sunsets, blue skies, etc., and have also been applied to the task of finding similar music genres.

  3. Using Surrounding Textual Information: A common method of assigning labels to non-textual objects is to use information that surrounds these objects in the documents in which they are found. For example, when images are found in web documents, there is a wealth of information that can be used as evidence of the image contents: the site on which the image appears (for example, an auction site or a site about music, news, or sports), how the image is referred to, the image’s filename, and even the surrounding text all provide potentially relevant information about the image.
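Below is a minimal sketch of the low-level similarity measure mentioned in approach 2: a coarse RGB color histogram compared by histogram intersection. Pixels are given as (r, g, b) tuples; in practice they would be read from image files with an imaging library, and real systems add spatial and perceptual refinements.

    def color_histogram(pixels, bins=4):
        """Quantize each RGB channel into `bins` levels and return pixel proportions."""
        hist = [0.0] * (bins ** 3)
        step = 256.0 / bins
        for r, g, b in pixels:
            ri = min(int(r / step), bins - 1)
            gi = min(int(g / step), bins - 1)
            bi = min(int(b / step), bins - 1)
            hist[ri * bins * bins + gi * bins + bi] += 1
        total = sum(hist) or 1.0
        return [count / total for count in hist]

    def histogram_intersection(h1, h2):
        """Similarity in [0, 1]; 1.0 means identical color distributions."""
        return sum(min(a, b) for a, b in zip(h1, h2))

Two photographs of sunsets tend to have heavily overlapping orange and red bins and therefore a high intersection score, which is why such measures work reasonably well for the "sunsets and blue skies" queries mentioned above.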

Web Data Harnessing

One of the most interesting aspects of working with web data is the insight and appreciation that one can get for large data sets. This has been exemplified by Banko and Brill in the case of word sense disambiguation; two practical examples of the same lesson are spelling correction and query classification.

  • Spelling Correction: In contrast to traditional approaches, which solely make use of standard term lexicons to make spelling corrections, the Google spelling corrector takes a machine learning approach that leverages an enormous volume of text to build a very fine-grained, probabilistic, context-sensitive model for spelling correction (a toy sketch of the underlying frequency-based idea appears after this list). This allows the system to recognize far more terms than a standard spelling correction system, especially proper names, which commonly appear in web queries but not in standard lexicons.

  • Query Classification: Classification becomes more challenging when we consider that the “documents” to be classified are user queries, which have an average length of just over two words. Despite these challenges, roughly four million pre-classified documents are available, giving quite a substantial training set. A variety of different approaches have been tried that explore many different aspects of the classifier model space: independence assumptions between words, modeling word order and dependencies for two- and three-word queries, generative and discriminative models, boosting, and others.
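The following toy sketch illustrates the frequency-driven idea behind data-rich spelling correction: generate candidates within edit distance one and rank them by how often they occur in a large corpus or query log. It is not Google's actual system; term_counts is an assumed frequency table built from such data.

    import string

    def edits1(word):
        """All strings at edit distance one from `word`."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [l + r[1:] for l, r in splits if r]
        transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
        replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
        inserts = [l + c + r for l, r in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word, term_counts):
        """Return the most frequent known candidate, or the word itself."""
        candidates = [w for w in edits1(word) | {word} if w in term_counts]
        return max(candidates, key=term_counts.get) if candidates else word

    # usage (counts are made up): correct("informaton", {"information": 120000, "informant": 3000})

Because the frequency table is harvested from web-scale text and query logs rather than a hand-built lexicon, proper names and new terms are handled just as readily as dictionary words.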

WID: Issues in indexing the World Wide Web

An ideal search engine would give a complete and comprehensive representation of the Web. Unfortunately, such a search engine does not exist. There are technical and economic factors that prevent these engines from indexing the whole Web every day.

On the economic side, it is very expensive to crawl the whole Web. Such a challenge can only be met with the use of server farms consisting of hundreds, if not thousands, of computers. On the technical side, the challenge starts with finding all the relevant documents in an environment whose size nobody knows. It is therefore difficult to measure the part of the Web that a certain search engine covers.

  • Size of the databases, Web coverage: Search engine sizes are often compared by their self-reported numbers. Google claims to have indexed approx. 8 billion documents and Yahoo claims that its total index size is 19 billion Web documents, which seems to be highly exaggerated. Estimates show that this engine has indexed approx. 5–7 billion documents, while competitor MSN – which does not report numbers – lies between 4 and 5 billion. Some studies tried to measure the exact index sizes of the search engines and their coverage of the indexable Web (the overlap-sampling idea behind such relative estimates is sketched after this list). They suggest that the data published by the search engines is usually reliable, and some indices are even bigger than the engines claim. To determine the Web coverage of search engines, one first has to discover how large the Web actually is. This is very problematic, since there is no central directory of all Web pages. The only possibility is to estimate the size based on a representative sample. A recent study found that the indexable Web contains at least 11.5 billion pages, not including the Invisible Web. Another important fact is that search engines should not index the entire Web. An ideal search engine should know all the pages of the Web, but there are contents such as duplicates or spam pages that should not be indexed. So the size of its index alone is not a good indicator of the overall quality of a search engine, but it seems to be the only factor by which the competitors can be compared easily.

  • Up-to-dateness of search engines’ databases: Search engines should not only focus on the sizes of their indices, but also on their up-to-dateness. The contents on the Web change very fast, and therefore new or updated pages should be indexed as fast as possible. Search engines face problems in keeping up to date with the entire Web; because of its enormous size and the different update cycles of individual websites, adequate crawling strategies are needed. The big search engines MSN, HotBot, Google, AlltheWeb, and AltaVista all had some pages in their databases that were current or one day old. The newest pages in the databases of the smaller engines Gigablast, Teoma, and Wisenut were considerably older, at least 40 days. When looking for the oldest pages, results differed a lot more and ranged from 51 days (MSN and HotBot) to 599 days (AlltheWeb). This shows that a regular update cycle of 30 days, as usually assumed for all the engines, is not used; all tested search engines had older pages in their databases. In a recent study by Lewandowski, Wahlig and Meyer-Bautor, the three big search engines Google, MSN and Yahoo are analysed. The question is whether they are able to index current contents on a daily basis. 38 daily updated web sites are observed within a period of six weeks. Findings include that none of the engines is able to keep current with all pages analysed, that Google achieves the best overall results, and that only MSN is able to update all pages within a time-span of less than 20 days. Both other engines have outliers that are considerably older.

  • Web content: Web documents differ significantly from documents in traditional information systems. On the Web, documents are written in many different languages, whereas other information systems usually cover only one or a few selected languages and index documents using a controlled vocabulary, which makes it possible to search for documents written in different languages with just one query. Another difference is the use of many different file types on the Web. Search engines today not only index documents written in HTML, but also PDF, Word, or other Office files. Each file format presents certain difficulties for the search engines, and in the overall ranking all file formats have to be considered. There are some characteristics which often coincide with certain file formats, such as the length of PDF files, which are often longer than documents written in HTML. The length of documents on the Web varies from just a few words to very long documents; this has to be considered in the rankings. Another problem is the document structure. HTML and other typical Web documents are only vaguely structured. There is no field structure similar to that of traditional information systems, which makes it a lot more difficult to allow for exact search queries.

  • The Invisible Web: The Invisible Web is defined as the part of the Web that search engines do not index. This may be due to technical reasons or barriers made by website owners, e.g. password protection or robots exclusions. The Invisible Web is an interesting part of the Web because of its size and its data, which is often of high quality. Sherman and Price say the Invisible Web consists of “text pages, files, or other often high-quality authoritative information available via the World Wide Web that general purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages”. Surely the most interesting part of the Invisible Web is the databases that are available via the Web, many of which can be used free of charge. Search engines can index the search forms of these databases but are not able to get beyond them; the content of the databases itself remains invisible to the search engines. But in many cases, databases offer a large amount of quality information. Commercial database vendors such as Lexis-Nexis are omitted because they protect their contents, which are only available to paying customers. But other databases can be used for free. For example, the databases of the United States Patent and Trademark Office (like many other patent databases) contain millions of patents and patent applications in full text, but search engines are not able to index these valuable contents. There are different solutions for this. One is to integrate the most important of these databases manually. Google, for example, does this for patent data, but only when one searches for a patent number; above the regular hits, Google displays a link to the USPTO database. Another solution is a kind of meta search engine that integrates not only regular Web search engines but also Invisible Web databases. Finally, another solution comes from the webmasters themselves: they convert their databases to regular HTML pages. A well-known example of this is the Amazon website. Originally a large database of books, each database record is converted into HTML and can be found in the search engine indices. Today, it is unclear to what extent this method is used and whether the Invisible Web is still such a big problem as it used to be some years ago. Regarding the size of the Invisible Web, it is surely much smaller than proposed by Bergman in 2001. He said the Invisible Web was 400 to 500 times larger than the Surface Web, but his calculations were based on some of the largest Invisible Web databases, which included sites such as the National Climatic Data Center (NOAA) and NASA EOSDIS, both of which are databases of satellite images of the earth. For each picture included, its size in kilobytes was added. As a result, Bergman concluded that the NOAA database contains 30 times more data than Lexis-Nexis, which is a purely textual database. But this says nothing about the amount of information. In conclusion, Bergman’s figures seem highly overestimated; other authors assume that the Invisible Web is 20 to 50 times larger than the Surface Web.
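As promised above, here is a sketch of a classical overlap-sampling way of comparing two engines' index sizes without trusting self-reported numbers: sample pages from each engine, check how many are also indexed by the other, and take the ratio of the two overlap fractions. The sampling and "is this URL indexed?" helpers are assumptions; obtaining near-uniform samples is exactly the hard part discussed in the earlier section on sampling techniques.

    def relative_index_size(sample_from_a, sample_from_b, indexed_in_a, indexed_in_b):
        """Estimate size(A) / size(B) from two cross-overlap fractions."""
        frac_a_in_b = sum(1 for url in sample_from_a if indexed_in_b(url)) / len(sample_from_a)
        frac_b_in_a = sum(1 for url in sample_from_b if indexed_in_a(url)) / len(sample_from_b)
        if frac_a_in_b == 0:
            return float("inf")                  # no overlap observed: estimate is unbounded
        # |A n B| / |B| divided by |A n B| / |A| cancels the overlap and leaves |A| / |B|
        return frac_b_in_a / frac_a_in_b

Note that this only yields relative sizes; turning them into absolute numbers still requires an independent estimate of how large the indexable Web itself is.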

Web Information Discovery: Users’ perspective

The users of Web search engines are very heterogeneous and the engines are used by laypersons, as well as by information professionals or experts in certain fields. Apart from studies discussing the common user behaviour, there are some studies that discuss the behaviour of certain user groups. But there are no scientific investigations that discuss how real information professionals in intelligence departments or management consultancies use Web search engines. Instead, most user studies focus on the typical lay user.

The main findings of these studies are that the users are not very sophisticated. Only half of the users know about Boolean operators and only slightly more (59 percent) know about advanced search forms. But knowing them does not mean that they are used: only 14 percent say that they use them, and in a laboratory test in the same study, the use of the advanced search forms was even lower. In studies based on transaction log analysis, Spink and Jansen found that Boolean operators are used in only one out of ten queries. Half of the Boolean queries are ill-formed; when plus and minus signs are used (which are generally preferred by users), the fraction of ill-formed queries rises to two thirds.
Users seldom look at results beyond the first search results page, which means that results not among the top 10 are nearly invisible to the general user. There is also a tendency for users to look only at the results that can be seen without scrolling. Within one search session, users look at five documents on average, and each document is examined only briefly. Sessions are usually terminated when one suitable document is found. A typical search session lasts less than 15 minutes.
Issues

WID: Issues in link-based ranking algorithms

Link-based ranking algorithms are dominant in today’s search engines, and it is often forgotten that these approaches face some difficulties and introduce a certain bias into the results.

Firstly, they are based on a certain quality model: quality is equated with authority (a notion used by Kleinberg) or (link) popularity. Other quality factors are disregarded, and the algorithms rest solely on a quality model adapted from citation indexing. The reasons for this lie mainly in the link structure of the Web, which can be exploited relatively easily, in the reliance on well-established bibliometric methods, and in the plausibility of the basic assumption. In link-based ranking algorithms, every link is counted as a vote for the linked page. But there are several reasons for linking to a certain page, so links cannot be seen as analogous to citing literature. Some links are put in place purely for navigational purposes; some do point to content, but only as a deterrent example. Link-based ranking algorithms cannot differentiate between these and links pointing to good content. Other links are placed as a favour or for promotional purposes. There is no strict border between “good” link exchange and manipulation, and therefore it is difficult for search engines to find links that should not be counted.

WID: Issues in Web Crawling

We consider a Web crawler that has to download a set of pages, with each page p having size S_p measured in bytes, using a network connection of capacity B, measured in bytes per second. The objective of the crawler is to download all the pages in the minimum time. A trivial solution to this problem would be to download all the Web pages simultaneously, but this is not feasible in practice, so the crawler must decide in which order to schedule its downloads. Some major issues in web crawler scheduling are considered below:

  • Strategies with no extra information: These strategies use only the information gathered during the current crawling process.

  • Breadth-first: Under this strategy, the crawler visits the pages in breadth-first ordering. It starts by visiting all the home pages of all the “seed” Web sites, and the queue of pending pages is kept in such a way that newly discovered pages go at the end. This is the same strategy tested by Najork and Wiener, whose experiments showed that it tends to capture high-quality pages first.

  • Backlink-count: This strategy crawls first the pages with the highest number of known links pointing to them, so the next page to be crawled is the one most linked to from the pages already downloaded. This strategy was described by Cho et al.

  • Batch-pagerank: This strategy calculates an estimate of Pagerank, using the pages seen so far, every K pages downloaded. The next K pages to download are the pages with the highest estimated Pagerank. We used K = 100,000 pages, which in our case gives about 30 to 40 Pagerank calculations during the crawl. This strategy was also studied by Cho et al., and it was found to be better than backlink-count. However, Boldi et al. showed that approximations of Pagerank using partial graphs can be very inexact.

  • Partial-pagerank: This is like batch-pagerank, but in between Pagerank re-calculations, a temporary Pagerank is assigned to each new page using the sum of the Pagerank of the pages pointing to it, divided by the number of out-links of those pages.

  • OPIC: This strategy is based on OPIC [APC03], which can be seen as a weighted backlink-count strategy. All pages start with the same amount of “cash”. Every time a page is crawled, its “cash” is split among the pages it links to. The priority of an uncrawled page is the sum of the “cash” it has received from the pages pointing to it (a minimal sketch of this cash-splitting rule appears after this list). This strategy is similar to Pagerank, but it has no random jumps and the calculation is not iterative – so it is much faster.

  • Larger-sites-first: The goal of this strategy is to avoid having too many pending pages on any single Web site, and to avoid ending the crawl with only a small number of large Web sites left, which can leave the crawler idle because of the “do not overload” rule. The crawler uses the number of un-crawled pages found so far as the priority for picking a Web site, and starts with the sites with the largest number of pending pages. This strategy was introduced in 2004 and was found to be better than breadth-first.

  • Strategies with historical information: These strategies use the Pagerank of a previous crawl as an estimation of the Pagerank in the current crawl, and start with the pages that had a high Pagerank in the last crawl. This is only an approximation because Pagerank can change: Cho and Adams report that the average relative error for estimating the Pagerank four months ahead is about 78%. Also, a study by Ntoulas et al. reports that “the link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created”. A number of strategies deal with the pages found in the current crawl that were not found in the previous one:

    • Historical-pagerank-omniscient : New pages are assigned a Pagerank taken from an oracle that knows the full graph.
    • Historical-pagerank-random : New pages are assigned a Pagerank value selected uniformly at random among the values obtained in the previous crawl.
    • Historical-pagerank-zero:  New pages are assigned Pagerank zero, i.e., old pages are crawled first, then new pages are crawled.
    • Historical-pagerank-parent: New pages are assigned the Pagerank of the parent page (the page in which the link was found) divided by the number of out-links of the parent page.
  • Strategy with all the information (Omniscient): This strategy can query an “oracle” which knows the complete Web graph and has calculated the actual Pagerank of each page. Every time the omniscient strategy needs to prioritize a download, it asks the oracle and downloads the page with the highest ranking in its frontier. Note that this strategy is bound by the same restrictions as the others, and can only download a page if it has already downloaded a page that points to it.
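The OPIC-style “cash” rule from the list above is simple enough to sketch directly; fetch_outlinks(url) is an assumed helper that downloads a page and returns the URLs it links to, and the politeness (“do not overload”) and re-crawling concerns of a real crawler are ignored here.

    def opic_crawl(seeds, fetch_outlinks, max_pages=1000):
        """Crawl by always picking the uncrawled page with the most accumulated cash."""
        cash = {url: 1.0 for url in seeds}          # seed pages start with equal cash
        crawled = set()
        while cash and len(crawled) < max_pages:
            url = max(cash, key=cash.get)           # highest-priority (richest) pending page
            amount = cash.pop(url)
            crawled.add(url)
            outlinks = list(fetch_outlinks(url))
            if not outlinks:
                continue
            share = amount / len(outlinks)          # split this page's cash among its out-links
            for link in outlinks:
                if link not in crawled:
                    cash[link] = cash.get(link, 0.0) + share
        return crawled

Because the priority of a page is just the cash it has accumulated, the ordering approximates a link-popularity ranking without the iterative computation that Pagerank requires.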

This article covers only the basic concepts. For core technical aspects and consultancy, you can consult the following references or e-mail skphind@yahoo.co.uk.


References

  • Soumen Chakrabarti, Mining the Web, Morgan Kaufmann Publishers.
  • Kevin Chen-Chuan Chang, Lei-da Chen, Advances in Web and Network Technologies, and Information Management, Springer.
  • Patrick van Bommel, Information Modeling for Internet Applications, Idea Group Inc. (IGI).
  • Woojong Suh, Web Engineering, Idea Group Inc. (IGI).

About the Authors

Sunil Kr. Pandey
Asst. Professor
Department of Computer Science,
School of Management Sciences (SMS),
Varanasi (UP),
India.
E-mail: skphind@rediffmail.com

R.B. Mishra
Professor
Department of Computer Engineering,
Institute of Technology (IT),
Banaras Hindu University (BHU),
Varanasi (UP),
India.

 







