History and Working of Web Crawler

Md. Abu Kausar

Web Crawler

A web crawler is a program or automated script that browses the World Wide Web in a systematic, automated manner. The WWW has a graph structure: the links present in a web page can be used to open other web pages. The Web can be viewed as a directed graph with each web page as a node and each hyperlink as an edge, so the search operation can be summarized as a traversal of this directed graph. By following the link structure of the Web, a crawler can reach many new web pages starting from a single page, moving from page to page by exploiting this graph structure. Such programs are also known as robots, spiders, and worms. Web crawlers are designed to retrieve web pages and insert them into a local repository. Crawlers essentially create a replica of all visited pages, which a search engine later processes and indexes to support quick searches. A search engine's job is to store information about a large number of web pages, which it retrieves from the WWW. These pages are retrieved by a web crawler, an automated web browser that follows every link it sees.
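As a concrete illustration of this graph view (not part of the original text), the short Python sketch below represents a handful of hypothetical pages as an adjacency list and visits them breadth-first, the same way a crawler follows links outward from a seed page.

from collections import deque

# A toy directed graph: each page (node) maps to the pages it links to (edges).
# The page names are purely illustrative.
toy_web = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": ["page_a", "page_d"],
    "page_d": [],
}

def traverse(seed):
    """Breadth-first traversal of the directed graph from a single seed node."""
    visited = set()
    queue = deque([seed])
    order = []
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        for neighbour in toy_web.get(node, []):
            if neighbour not in visited:
                queue.append(neighbour)
    return order

print(traverse("page_a"))  # ['page_a', 'page_b', 'page_c', 'page_d']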

The History of Web Crawler

The first Internet “search engine”, a tool called “Archie” (shortened from “Archives”), was developed in 1990. It downloaded the directory listings from specified public anonymous FTP (File Transfer Protocol) sites into local files, around once a month. In 1991, “Gopher” was created, which indexed plain-text documents. The “Jughead” and “Veronica” programs helped users explore these Gopher indexes. With the introduction of the World Wide Web in 1991, many of these Gopher sites became web sites properly connected by HTML links. In 1993, the “World Wide Web Wanderer” became the first crawler. Although this crawler was initially used to measure the size of the Web, it was later used to retrieve URLs that were then stored in a database called Wandex, the first web search engine. Another early search engine, “Aliweb” (Archie-Like Indexing for the Web), allowed users to submit the URL of a manually constructed index of their site.

The index contained a list of URLs and a list of user-written keywords and descriptions. The network overhead of crawlers initially caused much controversy, but this issue was resolved in 1994 with the introduction of the Robots Exclusion Standard, which allowed web site administrators to block crawlers from retrieving part or all of their sites. Also in 1994, “WebCrawler” was launched as the first “full text” crawler and search engine. WebCrawler permitted users to search the full content of documents rather than the keywords and descriptions written by web administrators, reducing the possibility of confusing results and allowing better search capabilities. Around this time, commercial search engines began to appear, with Infoseek, Lycos, AltaVista, Excite, Dogpile, Inktomi, Ask.com, and Northern Light being launched from 1994 to 1997. Also introduced in 1994 was Yahoo!, a directory of web sites that was manually maintained, though it later incorporated a search engine. During these early years Yahoo! and AltaVista maintained the largest market share. In 1998, Google was launched and quickly captured the market. Unlike many of the search engines at the time, Google had a simple, uncluttered interface, unbiased search results that were reasonably relevant, and fewer spam results. These last two qualities were due to Google's use of the PageRank algorithm and of anchor-term weighting.
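As a side note on how crawlers honour the Robots Exclusion Standard in practice, the sketch below uses Python's standard-library urllib.robotparser; the site address and user-agent name are placeholders for illustration only.

from urllib.robotparser import RobotFileParser

# Hypothetical site and user agent, used only for illustration.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()  # fetch and parse the site's robots.txt file

if robots.can_fetch("MyCrawler", "https://www.example.com/private/page.html"):
    print("robots.txt allows crawling this URL")
else:
    print("robots.txt disallows crawling this URL")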

While early crawlers dealt with relatively small amounts of data, modern crawlers, such as the one used by Google, need to handle a substantially larger volume of data due to the dramatic increase in the size of the Web.

Working of Web Crawler

The working of a web crawler begins with an initial set of URLs known as seed URLs. The crawler downloads the web pages for the seed URLs and extracts the new links present in the downloaded pages. The retrieved web pages are stored and well indexed in a storage area so that, with the help of these indexes, they can later be retrieved as and when required. The URLs extracted from a downloaded page are checked to determine whether their corresponding documents have already been downloaded. If they have not, the URLs are assigned to the web crawler again for further downloading. This process is repeated until no more URLs remain for downloading. A crawler may download millions of pages per day to complete its target. Figure 2 illustrates the crawling process.

[Figure 2: The web crawling process]

The working of a web crawler can be summarized in the following steps:

  1. Select a starting seed URL or URLs.
  2. Add it to the frontier.
  3. Pick a URL from the frontier.
  4. Fetch the web page corresponding to that URL.
  5. Parse that web page to find new URL links.
  6. Add all the newly found URLs to the frontier.
  7. Go to step 3 and repeat until the frontier is empty.

Thus a web crawler recursively keeps inserting new URLs into the database repository of the search engine. The major function of a web crawler is therefore to insert new links into the frontier and to select a fresh URL from the frontier for further processing after every recursive step.
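The following is a minimal, single-threaded Python sketch of this loop, using only the standard library; the names crawl and LinkExtractor, the page limit, and the example seed URL are illustrative rather than part of the original text. A production crawler would also add politeness delays, robots.txt checks, URL normalization, and parallel fetching.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags while parsing a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=50):
    frontier = deque(seed_urls)   # URLs waiting to be fetched
    visited = set()               # URLs already processed
    repository = {}               # local repository: URL -> raw HTML

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()              # step 3: pick a URL from the frontier
        if url in visited:
            continue                          # already downloaded, skip it
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue                          # skip pages that fail to download
        repository[url] = html                # step 4: store the fetched page
        parser = LinkExtractor()
        parser.feed(html)                     # step 5: parse out new links
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)     # step 6: grow the frontier
    return repository

# Example (hypothetical seed URL):
# pages = crawl(["https://www.example.com/"])

Using a first-in, first-out frontier gives a breadth-first crawl; replacing the deque with a priority queue turns the same skeleton into a focused crawler that downloads the most promising URLs first.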

About the Author

Md. Abu Kausar
Md. Abu Kausar received his Master's degree in Computer Science from G. B. Pant University of Agriculture & Technology, Pantnagar, India, in 2006 and an MBA (IT) from Symbiosis, Pune, India, in 2012. He has received the MCTS certification. At present, he is pursuing a Ph.D. in Computer Science at Jaipur National University, Jaipur, India. He has 9 years of experience as a software developer and teacher.







