Enhancement in Web Crawler using Weighted Page Rank Algorithm based on VOL

Extended Architecture of Web Crawler


Master's Thesis, 2014

92 Pages


Excerpt


TABLE OF CONTENTS

Acknowledgement

Abstract

List of Tables

List of Figures

List of Abbreviations

List of Symbols

CHAPTER 1: INTRODUCTION
1.1 Working of Web
1.2 Problems with Search Engine Crawling
1.3 Thesis Organization

CHAPTER 2: SEARCH ENGINE
2.1 Types of Search Engine
2.1.1 Crawler based Search Engine
2.1.2 Directories
2.1.3 Hybrid Search Engine
2.1.4 Meta Search Engine
2.2 Structure and Working of Search Engine
2.2.1 Gathering also called “Crawling”
2.2.2 Maintaining Database / Repository
2.2.3 Indexing
2.2.4 Querying
2.2.5 Ranking
2.3 Examples of Search Engines
2.4 Important Facts about Search Engines

CHAPTER 3: WEB CRAWLERS
3.1 Introduction
3.2 Basic Crawling Terminology
3.2.1 Seed
3.2.2 Frontier
3.2.3 Parser
3.3 Working of Basic Web Crawler
3.4 Designing Issues of Web Crawler
3.5 Parallel Crawlers
3.5.1 Issues related with Parallel Crawlers
3.5.2 Advantages of Parallel Crawlers
3.6 Controlling the access of Spiders
3.7 Blocking the access of Bots
3.8 Examples of Web Crawlers
3.8.1 Open Source Web Crawlers

CHAPTER 4: CRAWLING ALGORITHMS
4.1 Desired Features of Good Crawling Algorithm
4.2 Blind Traversing Approach
4.2.1 Breadth First Crawler
4.2.2 Drawbacks of Breadth First Crawler
4.3 Best First heuristic Approach
4.3.1 Naive First Approach
4.3.2 Page Rank Algorithm
4.3.3 Weighted Page Rank Algorithm
4.3.4 Page Rank based on Visits of Links Algorithm
4.3.5 Weighted Page Rank based on Visits of Links Algorithm
4.3.6 Improvement in Weighted Page Rank based on Visits of Links Algorithm

CHAPTER 5: LITERATURE REVIEW

CHAPTER 6: PROBLEM STATEMENT
6.1 Problem Formulation
6.2 Objectives

CHAPTER 7: METHODOLOGY
7.1 Structure of an Object of Page
7.2 Proposed Architecture of Web Crawler
7.3 URL Normalization or Standard Normalization (STN) of URLS
7.4 Removing Redundancy with MD5 Hashing Algorithm
7.5 How to calculate Domain Age or Age of Webpage / Website
7.6 How to calculate Relevancy of a newly created Webpage
7.7 How to calculate Weighted Page Rank (WPR)
7.8 How to calculate Weighted Page Rank based on Visits of Links (WPRVOL)
7.9 Improved Searching with WPRVOL

CHAPTER 8: RESULTS AND DISCUSSIONS
8.1 Crawling results for Seed http://www.vvguptaandco.com
8.2 Crawling results for Seed http://www.ovhconsulting.com
8.3 Crawling results for Seed http://www.infopathankot.com
8.4 Snapshot of saved Visits of Links
8.5 Snapshot of WPRVOL Scores of Pages
8.6 Snapshot of reduced execution time while searching
8.7 Comparison of Weighted Page Rank with Page Rank

CHAPTER 9: CONCLUSION AND FUTURE SCOPE
9.1 Conclusion
9.2 Future Scope

REFERENCES

ACKNOWLEDGEMENT

First of all, I would like to thank the Supreme power, The GOD, who guided me to work on the right path of life. Without his grace, this would not have been possible.

Words are often too few to reveal one's deep regards. A work like this is never the outcome of the efforts of a single person. I take this opportunity to express my profound sense of gratitude and respect to all those who helped me through the duration of this thesis.

I would like to place on record my deep sense of gratitude to my guide, Mr. Sashi Tarun, Assistant Professor, Computer Science Engineering Department, ARNI UNIVERSITY, Kathgarh (H.P.), India, for his stimulating guidance, continuous encouragement and supervision throughout the course of the present work.

It brings me great happiness to thank Mr. Rajiv Ranjan, H.O.D., Computer Science Engineering, ARNI UNIVERSITY, Kathgarh (H.P.), India, for his generous cooperation in providing the facilities needed for the completion of my thesis.

I am extremely thankful to ARNI UNIVERSITY, Kathgarh (H.P), India, for providing me infrastructural facilities to work in, without which this work would not have been possible.

Finally, a note of thanks to my parents, who have been the ultimate source of support. I can never thank them enough for their constant guidance, inspiration, and encouragement.

ABSTRACT

As the World Wide Web grows rapidly day by day, the number of web pages around the world is increasing into the billions. To make searching easier for users, search engines came into existence. Web search engines are used to find specific information on the WWW; without them, it would be almost impossible for us to locate anything on the Web unless we knew a specific URL address. Every search engine maintains a central repository or database of HTML documents in indexed form. Whenever a user query arrives, searching is performed within that database of indexed web pages. The repository of a search engine cannot accommodate each and every page available on the WWW, so it is desirable that only the most relevant and important pages are stored in the database, in order to increase the efficiency of the search engine. This database of HTML documents is maintained by special software called a "Crawler". A Crawler is software that traverses the Web and downloads web pages. Web Crawlers are also known as "Spiders", "Robots", "Bots", "Agents" and "Automatic Harvesters / Indexers". Broad search engines as well as many specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Since the Web is a distributed, dynamic and rapidly growing information resource, a crawler cannot download all pages; it is almost impossible for a crawler to crawl every page on the World Wide Web. Crawlers crawl only a fraction of the web pages, so a crawler should ensure that the fraction of pages crawled consists of the most relevant and most important ones, not just random pages. The crawler is an important module of a search engine, and its quality directly affects the searching quality of the search engine. In our work, we propose an extended architecture of the web crawler of a search engine that crawls only relevant and important pages from the WWW, which leads to reduced server overheads. With our proposed architecture we also optimize the crawled data by removing pages that are rarely or never browsed. A crawler needs a very large memory space in its database for storing page content; by not storing irrelevant and unimportant pages, and by removing pages that are never accessed, we save a considerable amount of memory space, which in turn speeds up searches (queries) on the database. In our approach, we propose to use the Weighted Page Rank based on Visits of Links (WPRVOL) algorithm to sort the search results, which reduces the search space for users by placing the links of the most visited pages at the top of the search results list.

LIST OF TABLES

7.1 Syntactically Different URLs

8.1 Saved pages results for Weighted Page Rank (WPR) Algorithm and Page Rank (PR) Algorithm

LIST OF FIGURES

2.1 General Structure of Search Engine

2.2 Working of General Search Engine

3.1 Components and Working of Web Crawler

3.2 Flow chart of Web Crawler

3.3 Architecture of Parallel Crawler

4.1 Breadth First Crawler

4.2 Best First Crawler

4.3 Web Interlinked Structure

7.1 Page Object Structure

7.2 Pages Tree Structure.

7.3 Proposed Architecture of Web Crawler

7.4 Block Diagram of WPR comparison Module for Pages

7.5 Block Diagram of WPRVOL comparison Module for Pages

7.6 URL Normalization Process

7.7 Block Diagram of Removing Redundancy with Hash

8.1 Fetching Pages for vvguptaandco.com

8.2 Calculating Domain age and relevancy for vvguptaandco.com

8.3 Saving Relevant Pages to Database for vvguptaandco.com

8.4 Saving Pages on basis of WPR Criteria for vvguptaandco.com

8.5 No. of unique Pages and md5 rejected pages for vvguptaandco.com

8.6 Fetching Pages for ovhconsulting.com

8.7 Calculating Domain age and relevancy for ovhconsulting.com

8.8 Saving Relevant Pages to Database for ovhconsulting.com

8.9 Saving Pages on basis of WPR Criteria for ovhconsulting.com

8.10 No. of unique Pages found for ovhconsulting.com

8.11 No. of STN rejected and MD5 rejected Pages for ovhconsulting.com

8.12 Fetching Pages for infopathankot.com

8.13 Calculating Domain age, Relevancy and saving page on relevancy basis for infopathankot.com

8.14 Calculating WPR of Pages for infopathankot.com

8.15 Saving pages on WPR basis for infopathankot.com

8.16 Unique pages for infopathankot.com

8.17 STN and MD5 Rejected Pages for infopathankot.com

8.18 Visits of Links data on crawlnsearch.com server

8.19 VOL enhancement started

8.20 WPRVOL Score computation part 1

8.21 WPRVOL Score computation part 2

8.22 Pages retained in database after WPRVOL Threshold check

8.23 Pages deleted in database after WPRVOL Threshold check

8.24 Snapshot of Reduced Execution time while searching

8.25 No. of fetched pages vs. No of saved Pages

LIST OF ABBREVIATIONS

illustration not visible in this excerpt

LIST OF SYMBOLS

illustration not visible in this excerpt

CHAPTER 1 INTRODUCTION

According to an Internet World Stats survey[1], as of June 30, 2012, 2,405,518,376 (approximately 2.4 billion) people use the internet[1]. Among them, 44.8% of internet users are from Asia. According to the Pew Research Center's Internet and American Life Project survey report, 59% of adults use web search engines on a daily basis[2]. On average, 75% of the traffic to a website is generated by search engines[3]. The internet keeps expanding more and more every day. The World Wide Web (commonly termed the Web) is a system of interlinked hypertext documents accessed via the Internet. With a web browser, a user views web pages that may contain text, images, videos and other multimedia, and navigates between them using hyperlinks.

Difference between the Web and the Internet: One can easily be confused into thinking that the World Wide Web and the Internet are the same thing, but in fact the two are quite different. The Internet is a collection of interconnected computer networks, linked by copper wires, fiber-optic cables, wireless connections, etc. In contrast, the Web is a collection of interconnected documents and other resources, linked by hyperlinks and URLs. The World Wide Web is one of the services accessible via the Internet, along with various others including e-mail, file sharing and online gaming. However, "the Internet" and "the Web" are commonly used interchangeably in non-technical settings.

1.1 Working of Web

To view a web page on the World Wide Web, the procedure begins either by typing the URL of the page into a web browser or by following a hyperlink to that page or resource. The web browser then initiates a series of communication messages, behind the scenes, in order to fetch and display it. First, the server-name portion of the URL is resolved into an IP address using the global, distributed Internet database known as the domain name system, or DNS. This IP address is necessary to contact and send data packets to the web server. The browser then requests the resource by sending an HTTP request to the web server at that particular address. In the case of a typical web page, the HTML text of the page is requested first and parsed immediately by the web browser, which then makes additional requests for images and any other files that form a part of the page. When a user needs any information, he or she searches for it on the Web. All this searching within the Web is performed by special engines known as web search engines. Web search engines are used to find specific information on the World Wide Web. Without search engines, it would be almost impossible for us to locate anything on the Web unless we knew a specific URL address.
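
To make the sequence above concrete, the following minimal Python sketch performs the same two steps a browser carries out behind the scenes: a DNS lookup for the server name, followed by an HTTP request for the page. The URL shown (example.com) is only a placeholder, not part of this work.

```python
# A minimal sketch (not a browser implementation) of the two steps described
# above: resolve the host name via DNS, then fetch the page over HTTP.
import socket
from urllib.parse import urlparse
from urllib.request import urlopen

url = "http://example.com/"               # placeholder URL; any reachable page works
host = urlparse(url).hostname

# Step 1: the DNS lookup that happens behind the scenes.
ip_address = socket.gethostbyname(host)
print(f"{host} resolved to {ip_address}")

# Step 2: the HTTP request for the resource; a real browser would then parse
# the returned HTML and issue further requests for images, scripts, etc.
with urlopen(url, timeout=10) as response:
    html = response.read().decode("utf-8", errors="replace")
print(f"Fetched {len(html)} characters of HTML")
```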

Most people use search engines such as Google, Bing, Yahoo!, AltaVista, Ask, Baidu or Yandex to find what they are looking for on the World Wide Web. Hence it is worth knowing how these search engines actually work and how they present information to the user initiating a search. When you ask a search engine to locate information, it actually searches through the index it has created rather than through the Web itself. Different search engines produce different rankings because not every search engine uses the same algorithm to search through its indices. Many leading search engines use a form of software program called a crawler, spider or bot to find information on the World Wide Web and store it in huge databases for search results. Some spiders record every word on a web site for their respective indexes, while others only record certain keywords listed in title tags or meta tags. Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site.

A Crawler is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Early crawlers/spiders were unable to index pictures or read text contained within graphics. Now, however, Google can even index the content of Flash files, though other search engines may not be able to[4]. WebCrawler was the Internet's first search engine that performed keyword searches in both the names and the texts of pages on the World Wide Web. It won quick popularity and loyalty among surfers looking for information. WebCrawler[5] was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application, WebCrawler, could automatically scan the individual sites on the Web, register their content and create an index that surfers could query with keywords to find web sites relevant to their interests.

1.2 Problems with Search Engine Crawling

The World Wide Web is growing rapidly day by day; the number of web pages around the world is increasing into the billions. According to worldwidewebsize.com, as of 9 June 2014 the World Wide Web contains at least 4.96 billion pages[6]. To make searching for information much easier for users, web search engines came into existence. As mentioned earlier, web search currently generates more than 75% of the traffic to web sites. The main problem search engines have to deal with is the size of the Web, which is currently in the order of thousands of millions of pages. This large size leads to low coverage, with no search engine indexing more than one third of the publicly available Web. We would like to develop a scheduling policy for downloading pages from the Web which guarantees that, even if we do not download all of the known pages, we still download the most valuable ones. As the number of pages grows, it will be increasingly important to download the better ones first, as it will be impossible to download them all[7].

1.3 Thesis Organization

This section discusses the organization of this thesis, which is as follows: Chapter 2 reviews the types and working of search engines and their components. Chapter 3 describes web crawlers and their architecture. Chapter 4 describes the crawling algorithms used by web crawlers.

Chapter 5 constitutes the literature review of the work done in the field of search engines and crawling.

Chapter 6 describes the problem statement, i.e. the issues related to crawling algorithms.

Chapter 7 gives the methodology of our work. It presents the proposed architecture and the assumptions taken into consideration during the implementation.

Chapter 8 gives the results obtained after implementation and discusses those results. Chapter 9 concludes the thesis work and outlines future work and improvements that can be done to enhance the performance of the thesis work.

CHAPTER 2 SEARCH ENGINE

Finding key information in the gigantic World Wide Web is like finding a needle lost in a haystack. For this purpose we would use a special magnet that would automatically, quickly and effortlessly attract that needle for us; in this scenario the magnet is the "Search Engine". Even a blind squirrel finds a nut occasionally, but few of us are determined enough to search through millions, or billions, of pages of information to find our "nut". So, to reduce the problem to a more or less manageable solution, web "search engines" were introduced a few years ago[8]. On the basis of their working[9], search engines are categorized into the following groups.

- Crawler-Based Search Engines
- Directories
- Hybrid Search Engines
- Meta Search Engines

2.1 Types of Search Engine

2.1.1 Crawler-Based Search Engine

It uses automated software programs, known as spiders, crawlers, robots or bots, to survey and categorize web pages. A spider finds a web page, downloads it and analyses the information presented on the web page. The web page is then added to the search engine's database. When a user performs a search, the search engine checks its database of web pages for the key words the user searched for. The results (a list of suggested links to visit) are listed on pages in order of which is "closest" (as determined by the "bots"). Examples of crawler-based search engines are: Google (www.google.com), Bing (www.bing.com) and Ask Jeeves (www.ask.com).

2.1.2 Directories

A "directory" uses human editors who decide what category a site belongs to. They place websites within specific categories or subcategories in the directory's database. By focusing on particular categories and subcategories, a user can narrow the search to those records that are most likely to be relevant to his or her interests. The human editors comprehensively check the website and rank it, based on the information they find, using a pre-defined set of rules. Examples are: Dmoz (www.dmoz.org), VMOptions (www.vmoptions.in) and Yahoo (dir.yahoo.com).

2.1.3 Hybrid Search Engines

Hybrid search engines use a combination of both crawler-based results and directory results. A hybrid search engine differs from a traditional text-oriented search engine such as Google, or a directory-based search engine such as Yahoo, in that it operates by comparing a set of metadata (the primary corpus being metadata derived from a web crawler or from taxonomic analysis of all internet text) with a user search query. Examples of hybrid search engines are: Yahoo (www.yahoo.com) and Google (www.google.com).

2.1.4 Meta Search Engines

Meta Search Engines are also known as Multiple Search Engines or Metacrawlers. Meta search engines query several other Web search engine databases in parallel and then combine the results in one list. Examples of Meta search engines include: Metacrawler (www.metacrawler.com), Dogpile (www.dogpile.com).

2.2 Structure and Working of Search Engine

The basic structure of any crawler-based search engine is shown in Fig. 2.1. The main steps in any search engine are described in the following subsections.

illustration not visible in this excerpt

Fig. 2.1: General Structure of Search Engine[10]

2.2.1 Gathering also called “Crawling”

Every engine relies on a crawler module to provide the grist for its operation. This operation is performed by special software called "Crawlers". Crawlers are small programs that browse the Web on the search engine's behalf, similarly to how a human user would follow links to reach different pages. The programs are given a starting set of URLs, whose pages they retrieve from the Web. The crawlers extract URLs appearing in the retrieved pages and pass this information to the crawler control module. This module determines what links to visit next and feeds the links to visit back to the crawlers.

2.2.2 Maintaining Database / Repository

All the data of the search engine is stored in a database, as shown in Fig. 2.1. All searching is performed through that database, and it needs to be updated frequently. During and after the crawling process, search engines must store all the new useful pages that they have retrieved from the Web. The page repository (collection) in Fig. 2.1 represents this possibly temporary collection. Sometimes search engines maintain a cache of the pages they have visited beyond the time required to build the index. This cache allows them to serve out result pages very quickly, in addition to providing basic search facilities.

2.2.3 Indexing

Once the pages are stored in the repository, the next job of the search engine is to make an index of the stored data. The indexer module extracts all the words from each page and records the URL where each word occurred. The result is a generally very large "lookup table" that can provide all the URLs pointing to pages where a given word occurs. The table is of course limited to the pages that were covered in the crawling process. As mentioned earlier, text indexing of the Web poses special difficulties due to its size and its rapid rate of change. In addition to these quantitative challenges, the Web calls for some special, less common kinds of indexes. For example, the indexing module may also create a structure index, which reflects the links between pages.
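
As a rough illustration of such a lookup table (not the indexer of any particular search engine), the following Python sketch builds a toy inverted index that maps each word to the set of URLs on which it occurs; the sample pages are assumed data.

```python
# A toy inverted index of the kind described above: for every word, record the
# URLs of the pages in which it occurs. Real indexers also store positions,
# term frequencies, link structure and so on.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> plain text extracted from that page."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

# Hypothetical crawled pages, just to show the shape of the data.
pages = {
    "http://example.com/a": "web crawler downloads web pages",
    "http://example.com/b": "search engine ranks pages by importance",
}
index = build_index(pages)
print(index["pages"])   # both URLs, since "pages" occurs in each document
```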

2.2.4 Querying

This section deals with user queries. The query engine module is responsible for receiving and filling search requests from users. The engine relies heavily on the indexes, and sometimes on the page repository. Because of the Web's size, and the fact that users typically enter only one or two keywords, result sets are usually very large.

illustration not visible in this excerpt

Fig. 2.2: Working of General Search Engine[11]

2.2.5 Ranking

Since a user query returns a large number of results, it is the job of the search engine to display the most appropriate results to the user. To achieve this, the results are ranked. The ranking module therefore has the task of sorting the results such that those near the top are the most likely to be what the user is looking for. Once the ranking is done by the ranking component, the final results are displayed to the user. This is how any search engine works.
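
The sketch below (illustrative only, with assumed index entries and scores) shows the querying and ranking steps together: the pages matching every query word are taken from a small inverted index and then sorted by a precomputed importance score, so the most promising results appear at the top of the list.

```python
# Illustrative only: answer a query against a small inverted index and order the
# matching URLs by a precomputed importance score (for example a PageRank-style
# value), so that the results shown first are the most likely to be useful.
index = {                                               # assumed index entries
    "pages": {"http://example.com/a", "http://example.com/b"},
    "crawler": {"http://example.com/a"},
}
scores = {"http://example.com/a": 0.42, "http://example.com/b": 0.91}   # assumed scores

def search(query, index, scores):
    words = query.lower().split()
    # Pages containing every query word (a simple conjunctive query).
    matches = set.intersection(*(index.get(w, set()) for w in words)) if words else set()
    # Ranking step: highest score first.
    return sorted(matches, key=lambda url: scores.get(url, 0.0), reverse=True)

print(search("crawler pages", index, scores))
```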

2.3 Examples of Search Engines

There are a number of web search engines available in the market. The list of the most important and famous search engines can be listed as below:

1. Google
2. Yahoo
3. Bing
4. Ask
5. AltaVista
6. Yandex
7. Baidu
8. MSN
9. Khoj
10. AOL Search
11. YouTube

And there are so many more search engines available in the market to assist the internet users.

2.4 Important Facts about Search Engines

Web search engines are the key, or main gate, to the World Wide Web. Their evolution and components are a very important area of study. The essential components of a search engine are the Crawler, Parser, Scheduler and Database. Some of the most widely used search engines are Google, Bing, Yahoo, etc.

CHAPTER 3 WEB CRAWLERS

As is clear from the previous chapter, one of the most essential jobs of any search engine is the gathering of web pages, also called "Crawling". This crawling procedure is performed by special software called "Crawlers" or "Spiders". In this chapter, we discuss the detailed structure of a web crawler.

3.1 Introduction

A Crawler is a program that visits web sites and reads their pages and other information in order to create entries for a search engine index. The major search engines on the Web all have such a program, which is also known as a "spider" or a "bot". Crawlers are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed. Crawlers apparently gained the name because they crawl through a site a page at a time, following the links to other pages on the site until all pages have been read[12]. Crawlers are programs that automatically fetch and download web pages. Once all the pages are fetched and saved in a repository, the job appears to be done. However, the Web is a dynamic entity evolving at a rapid rate, so there is a continuous need for crawlers to help applications stay current as pages and links are added, deleted, moved or modified.

A Web crawler may also be called a Web Spider, an Ant, an Automatic Indexer, a Worm, a Wanderer, an Automatic Harvester, an Internet Bot, a Search Bot, a Simple Bot, a Robot or a Web Scooter. The crawler for the AltaVista search engine and its web site is called Scooter. Scooter adheres to the rules of politeness for web crawlers that are specified in the Standard for Robot Exclusion (SRE). It asks each server which files should be excluded from being indexed. It does not (or cannot) go through firewalls. It uses a special algorithm for waiting between successive server requests so that it does not affect response time for other users[13]. The first thing to understand is what a web crawler is and how it works. A search engine spider is a program that most search engines use to find what is new on the Internet. Google's web crawler is known as Googlebot, and Bing uses Bingbot; Yahoo, after contracting with Microsoft, also started using Bingbot. There are many types of web spiders in use, but for now we are only interested in the bot that actually crawls the web and collects documents to build a searchable index for the different search engines. The program starts at a website and follows every hyperlink on each page. So we can say that everything on the web will eventually be found and spidered, as the so-called "spider" crawls from one website to another. Search engines may run thousands of instances of their web crawling programs simultaneously, on multiple servers. When a web crawler visits one of your pages, it loads the site's content into a database. Once a page has been fetched, the text of the page is loaded into the search engine's index, which is a massive database of words. All of this may sound too technical, but it is important to understand the basics of how a web crawler actually works.

So there are basically three steps involved in the crawling procedure. First, the search bot starts by crawling the pages of your site. Then it indexes the words and content of the site, and finally it visits the links (web page addresses or URLs) that are found in your site. When the spider does not find a page, that page will eventually be deleted from the index. However, some spiders will check a second time to verify that the page really is offline[14].

There are many applications for web crawlers. One is business intelligence, whereby organizations collect information about their competitors and potential collaborators. Another is to monitor web sites and pages of interest, so that a user or community can be notified when new information appears in certain places. There are also malicious applications of crawlers, for example those that harvest email addresses to be used by spammers or collect personal information to be used in phishing and other identity theft attacks. The most widespread use of crawlers is, however, in support of search engines. In fact, crawlers are the main consumers of Internet bandwidth; they collect pages for search engines to build their indexes. Well known search engines such as Google, Yahoo! and MSN run very efficient universal crawlers designed to gather all pages irrespective of their content. Other crawlers, sometimes called preferential crawlers, are more targeted; they attempt to download only pages of certain types or topics[8].

3.2 Basic Crawling Terminology

Before we discuss the working of crawlers, it is worth explaining some of the basic terminology related to crawlers[15].

3.2.1 Seed Page

By crawling, we mean traversing the Web by recursively following links from a starting URL or a set of starting URLs. This starting URL set is the entry point through which any crawler begins its search procedure, and it is known as the "Seed Page". The selection of a good seed is the most important factor in any crawling process.

3.2.2 Frontier (Processing Queue)

The crawling method starts with a given URL (the seed), extracting links from it and adding them to a list of un-visited URLs. This list of un-visited links or URLs is known as the "Frontier". Each time, a URL is picked from the frontier by the crawler scheduler. The frontier is implemented using a queue or priority queue data structure. Maintenance of the frontier is also a major functionality of any crawler.
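
The following sketch shows two common ways such a frontier can be implemented with Python's standard library (the URLs are placeholders): a FIFO queue for plain breadth-first crawling, and a priority queue for crawlers that attach a score to every URL.

```python
# Two simple frontier implementations: a FIFO queue (URLs crawled in discovery
# order) and a priority queue (the scheduler always picks the best-scoring URL).
from collections import deque
import heapq

# FIFO frontier for breadth-first crawling.
fifo_frontier = deque(["http://example.com/"])            # seeded with a start URL
fifo_frontier.append("http://example.com/about")          # link discovered later
next_url = fifo_frontier.popleft()                        # picked by the scheduler

# Priority frontier: scores are negated because heapq is a min-heap.
priority_frontier = []
heapq.heappush(priority_frontier, (-0.9, "http://example.com/important"))
heapq.heappush(priority_frontier, (-0.1, "http://example.com/minor"))
neg_score, best_url = heapq.heappop(priority_frontier)    # highest score comes out first
```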

3.2.3 Parser

Once a page has been fetched, we need to parse its content to extract information that will feed and possibly guide the future path of the crawler. Parsing may imply simple hyperlink/URL extraction, or it may involve the more complex process of tidying up the HTML content in order to analyze the HTML tag tree. The job of the parser is to parse the fetched web page, extract the new URLs from it and return the un-visited ones to the frontier.
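
As a minimal illustration of hyperlink extraction (using the standard html.parser module and a placeholder page), the sketch below collects the href targets of anchor tags and resolves relative links against the page's own URL so that they are ready to be added to the frontier.

```python
# A minimal link extractor: parse fetched HTML, collect anchor href values, and
# turn relative links into absolute URLs for the frontier.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

parser = LinkParser("http://example.com/index.html")       # placeholder page URL
parser.feed('<a href="/about.html">About</a> <a href="http://example.org/">Ext</a>')
print(parser.links)   # ['http://example.com/about.html', 'http://example.org/']
```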

3.3 Working of Basic Web Crawler

From the beginning, a key motivation for designing Web crawlers has been to retrieve web pages and add them or their representations to a local repository. Such a repository may then serve particular application needs such as those of a Web search engine.

illustration not visible in this excerpt

Fig. 3.1: Components and Working of Web Crawler[16]

In its simplest form, a crawler starts from a seed page and then uses the external links within it to reach other pages. The structure of a basic crawler is shown in Fig. 3.1. The process repeats with the new pages offering more external links to follow, until a sufficient number of pages are identified or some higher-level objective is reached. Behind this simple description lies a host of issues related to network connections and the parsing of fetched HTML pages to find new URL links[17].

A common web crawler implements a method composed of the following steps:

1. Acquire URL of processed web document from processing queue.
2. Download web document.
3. Parse document’s content to extract set of URL links to other resources and update processing queue.
4. Store web document for further processing.

The basic working of a web-crawler can be discussed as follows:

1. Select a starting seed URL or URLs.
2. Add it to the frontier.
3. Now pick the URL from the frontier.
4. Fetch the web-page corresponding to that URL.
5. Parse that web-page to find new URL links.
6. Add all the newly found URLs into the frontier.
7. Go to step 3 and repeat while the frontier is not empty.

Thus a crawler recursively keeps adding newer web pages to the database repository of the search engine. We can see that the main function of a crawler is to add new links to the frontier and to select a new URL from the frontier for further processing after each recursive step.
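
The seven steps above can be condensed into a short Python sketch. The fetch_page and extract_links helpers are assumed to be supplied (for example along the lines of the urlopen and LinkParser sketches given earlier in this chapter); this illustrates only the generic loop, not the crawler proposed in this thesis.

```python
# A compact sketch of the seven-step crawling loop, with fetching and parsing
# delegated to assumed helper functions.
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    frontier = deque(seed_urls)                  # steps 1-2: seed the frontier
    visited, repository = set(), {}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()                 # step 3: pick a URL from the frontier
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)               # step 4: fetch the corresponding page
        except Exception:
            continue                             # skip unreachable pages
        repository[url] = html                   # store the page for indexing
        for link in extract_links(url, html):    # step 5: parse the page for new links
            if link not in visited:
                frontier.append(link)            # step 6: add new URLs to the frontier
    return repository                            # step 7: loop until the frontier is empty
```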

A web crawler starts by parsing a specified web page, noting any hypertext links on that page that point to other web pages. It then parses those pages for new links, and so on, recursively. Web crawler software does not actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve web pages at a fast enough pace[8].

A crawler resides on a single machine. It simply sends HTTP requests for documents to other machines on the Internet, just as a web browser does when the user clicks on links. All the crawler really does is automate the process of following links. Web crawling can be regarded as processing items in a queue. When the crawler visits a web page, it extracts links to other web pages, puts these URLs at the end of a queue, and continues crawling from a URL that it removes from the front of the queue.

The working of a crawler can also be shown in the form of a flow chart (Fig. 3.2), which depicts the seven steps given earlier. Such crawlers are called sequential crawlers because they follow a sequential approach.

illustration not visible in this excerpt

Fig. 3.2: Flow Chart of Web Crawler[17]

3.4 Designing Issues of Web Crawler

The crawler module retrieves pages from the Web for later analysis by the indexing module. As discussed above, a crawler module typically starts off with an initial set of URLs called the seed. Roughly, it first places the seed in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler gets a URL (in some order), downloads the page, extracts any URLs in the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. Given the enormous size and the change rate of the Web, many issues arise, including the following[17]:

- What pages should the crawler download? In most cases, the crawler cannot download all pages on the Web. Even the most comprehensive search engine currently indexes only a small fraction of the entire Web. Given this fact, it is important for the crawler to carefully select the pages and to visit "important" pages first, by prioritizing the URLs in the queue properly, so that the fraction of the Web that is visited (and kept up-to-date) is more meaningful.

- How should the crawler refresh pages? Once the crawler has downloaded a significant number of pages, it has to start revisiting the downloaded pages in order to detect changes and refresh the downloaded collection. Because Web pages are changing at very different rates, the crawler needs to carefully decide what page to revisit and what page to skip, because this decision may significantly impact the “freshness" of the downloaded collection. For example, if a certain page rarely changes, the crawler may want to revisit the page less often, in order to visit more frequently changing ones.
- How should the load on the visited Web sites be minimized? When the crawler collects pages from the Web, it consumes resources belonging to other organizations. For example, when the crawler downloads page p on site S, the site needs to retrieve page p from its file system, consuming disk and CPU resource. Also, after this retrieval the page needs to be transferred through the network, which is another resource, shared by multiple organizations. The crawler should minimize its impact on these resources. Otherwise, the administrators of the Web site or a particular network may complain and sometimes completely block access by the crawler.
- How should the crawling process be parallelized? Due to the enormous size of the Web, crawlers often run on multiple machines and download pages in parallel. This parallelization is often necessary in order to download a large number of pages in a reasonable amount of time. Clearly these parallel crawlers should be coordinated properly, so that different crawlers do not visit the same Web site multiple times, and the adopted crawling policy should be strictly enforced. The coordination can incur significant communication overhead, limiting the number of simultaneous crawlers. In the next section, the concept of parallel crawlers is given in detail along with their design issues.

3.5 Parallel Crawlers

As the size of the Web grows, it becomes more and more difficult to retrieve the whole or a significant portion of the Web using a single sequential crawler. Therefore, many search engines run multiple crawling processes in parallel, so that the download rate is maximized. We refer to this type of crawler as a parallel crawler. Parallel crawlers work simultaneously to grab pages from the World Wide Web and add them to the central repository of the search engine.

illustration not visible in this excerpt

Fig. 3.3: Architecture of Parallel Crawler[18]

The parallel crawling architecture is shown in Fig. 3.3. Each crawler has its own database of collected pages and its own queue of un-visited URLs. Once the crawling procedure finishes, the collected pages of every crawler are added to the central repository of the search engine.

Parallel crawling architecture no doubt increases the efficiency of any search engine, but there are certain problems or issues related to the usage of parallel crawlers[18]. These are discussed in the following section:

3.5.1 Issues related with Parallel Crawlers

- Risk of Multiple copy download: When multiple crawlers run in parallel to download pages, it is possible that different processes download the same page multiple times. One process may not be aware that another process has already downloaded the page. Clearly, such multiple downloads should be minimized to save network bandwidth and increase the crawler’s effectiveness.
- Quality: Every crawler wants to download "important" pages first, in order to maximize the "quality" of the downloaded collection. However, in a parallel crawler environment, each crawling process may not be aware of the whole image of the Web that the processes have collectively downloaded so far. For this reason, each process may make a crawling decision solely based on its own image of the Web (the part that it has downloaded itself) and thus make a poor crawling decision. Ensuring the quality of the downloaded pages in a parallel crawler environment is therefore a major issue.
- Communication bandwidth: In order to prevent overlap, or to improve the quality of the downloaded pages, crawling processes need to periodically communicate to coordinate with each other. However, this communication may grow significantly as the number of crawling processes increases.

Thus we can see that it is not easy to implement a parallel crawling environment in a search engine, so a lot of care has to be taken before designing any parallel crawling structure.

3.5.2 Advantages of Parallel Crawler

Although parallel crawling architecture is challenging to implement, we believe that a parallel crawler has many important advantages compared to a single-process, sequential crawler.

- Scalability: The size of the Web is huge. Because of this size, it is often useful to run a parallel crawler; a single-process crawler simply cannot achieve the required download rate in a given time span.
- Network-load Dispersion: Multiple crawling processes of a parallel crawler may run at geographically distant locations, each downloading “geographically-adjacent” pages. For example, a process in Sweden may download all European pages, while another one in India crawls all Asian pages. In this way, one can easily disperse the network load to multiple regions. In particular, this dispersion might be necessary when a single network cannot handle the heavy load from a large-scale crawl.
- Reduction in Network-load: In addition to the dispersing load, a parallel crawler may actually reduce the network load also. For example, assume that a crawler in India retrieves a page from Europe. To be downloaded by the crawler, the page first has to go through the network in Europe, then the Europe-to-Asia inter-continental network and finally the network in India. Instead, if a crawling process in Europe collects all European pages, and if another process in Asia crawls all pages from Asia, the overall network load will be reduced, because pages go through only “local” networks.

Thus we can conclude that a parallel crawler architecture is a better option than a single-crawler architecture. Since the number of web pages around the globe is huge, only parallel crawlers can crawl the whole web in a short span of time, keeping in mind the issues listed above.

3.6 Controlling the access of Spiders

The first thing a spider is supposed to do when it visits your website is look for a file called "robots.txt". This file contains instructions for the spider on which parts of the website to index and which parts to ignore. The only way to control what a spider sees on your site is by using a robots.txt file. All spiders are supposed to follow certain rules, and the major search engines do follow these rules for the most part. Fortunately, the major search engines such as Google and Bing are finally working together on standards[19].
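
A crawler can perform this robots.txt check with the standard urllib.robotparser module, as in the minimal sketch below; the site and the user-agent name MyCrawlerBot are placeholders, not real services.

```python
# Sketch of the politeness check described above: read the site's robots.txt
# and ask whether our user agent is allowed to crawl a given URL.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")    # placeholder site
rp.read()                                       # downloads and parses robots.txt

if rp.can_fetch("MyCrawlerBot", "http://example.com/private/page.html"):
    print("Allowed to crawl this URL")
else:
    print("robots.txt disallows this URL for MyCrawlerBot")
```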

3.7 Blocking the access of Spamming bots

Some bots do not follow or obey the robots exclusion protocol implemented with "robots.txt" on the web server, and start crawling the private / confidential sections of the website as well. Such bots are called spamming robots or spam bots. Some spamming bots are like a nightmare and sometimes eat up the whole bandwidth of a website. These can be blocked with Hypertext Access (.htaccess) file directives[19].

3.8 Examples of Web Crawlers

The following is a list of published crawler architectures for general-purpose crawlers (excluding focused web crawlers), with a brief description that includes the names given to the different components and outstanding features[20]:

- Bingbot is the name of Microsoft's Bing web crawler. It replaced Msnbot.
- FAST Crawler is a distributed crawler, used by Fast Search & Transfer, and a general description of its architecture is available.
- Googlebot is described in some detail, but the reference is only about an early version of its architecture, which was based on C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether the URL had been previously seen. If not, the URL was added to the queue of the URL server.
- PolyBot is a distributed crawler written in C++ and Python, which is composed of a “crawl manager”, one or more “downloaders” and one or more “DNS resolvers”. Collected URLs are added to a queue on disk, and processed later to search for seen URLs in batch mode. The politeness policy considers both third and second level domains (e.g.: www.example.com and www2.example.com are third level domains) because third level domains are usually hosted by the same Web server.
- RBSE was the first published web crawler. It was based on two programs: the first program, "spider" maintains a queue in a relational database, and the second program "mite", is a modified www ASCII browser that downloads the pages from the Web.
- WebCrawler was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
- WebFountain is a distributed, modular crawler similar to Mercator but written in C++. It features a "controller" machine that coordinates a series of "ant" machines. After repeatedly downloading pages, a change rate is inferred for each page, and a non-linear programming method must be used to solve the equation system for maximizing freshness. The authors recommend using this crawling order in the early stages of the crawl and then switching to a uniform crawling order, in which all pages are visited with the same frequency.
- WebRACE is a crawling and caching module implemented in Java, and used as a part of a more generic system called eRACE. The system receives requests from users for downloading web pages, so the crawler acts in part as a smart proxy server. The system also handles requests for "subscriptions" to Web pages that must be monitored: when the pages change, they must be downloaded by the crawler and the subscriber must be notified. The most outstanding feature of WebRACE is that, while most crawlers start with a set of "seed" URLs, WebRACE is continuously receiving new starting URLs to crawl from.
- World Wide Web Worm was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the grep Unix command.
- Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.

3.8.1 Open Source Web Crawlers

- DataparkSearch is a crawler and search engine released under the GNU General Public License.
- GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
- GRUB is an open source distributed search crawler that Wikia Search used to crawl the web.
- Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
- ht://Dig includes a Web crawler in its indexing engine.
- HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
- ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on Website Parse Templates using computer's free CPU resources only.
- mnoGoSearch is a crawler, indexer and a search engine written in C and licensed under the GPL (*NIX machines only)
- Norconex HTTP Collector is a web spider, or crawler, written in Java, that aims to make the lives of Enterprise Search integrators and developers easier (licensed under GPL).
- Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
- Open Search Server is a search engine and web crawler software released under the GPL.
- PHP-Crawler is a simple PHP and MySQL based crawler released under the BSD License. Easy to install, it became popular for small MySQL-driven websites on shared hosting.
- Scrapy, an open source web crawler framework, written in Python (licensed under BSD).
- Seeks, a free distributed search engine (licensed under Affero General Public License).
- tkWWW Robot, a crawler based on the tkWWW web browser (licensed under GPL).
- YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).

CHAPTER 4 WEB CRAWLING ALGORITHMS

In the previous chapter, we discussed the concepts of web crawlers, their working and the functionality of their components. The most important component of a web crawler is the Scheduler, whose job is to pick the next URL from the list of un-visited URLs for further processing. In this chapter, we describe the techniques or algorithms currently used by crawlers to implement the scheduling process while crawling the web.

Looking at the basic working of crawlers, it is clear that at each stage the crawler selects a new link from the frontier for processing. The selection of the next link from the frontier is therefore a major aspect, and it depends entirely on the crawling algorithm being used. The job of selecting the next link from the frontier is comparable to the selection of a job by the CPU from the ready queue (CPU scheduling) in operating systems[17].

The basic algorithm executed by any web crawler takes a list of seed URLs as its input and repeatedly executes the following steps:

- Remove a URL from the URL list
- Determine the IP address of its host name
- Download the corresponding document
- Extract any links contained in it

For each of the extracted links, ensure that it is an absolute URL and add it to the list of URLs to download, provided it has not been encountered before. If desired, process the downloaded document in other ways (e.g., index its content).

Any basic Web crawling algorithm requires a number of functional components:

- A component (called the URL frontier) for storing the list of URLs to download
- A component for resolving host names into IP addresses
- A component for downloading documents using the HTTP protocol
- A component for extracting links from HTML documents and
- A component for determining whether a URL has been encountered before or not.

The simplest crawling algorithm uses a queue of URLs yet to be visited and a fast mechanism for determining whether it has already seen a URL. This requires huge data structures; a simple list of 20 billion URLs contains more than a terabyte of data. The crawler initializes the queue with one or more "seed" URLs. A good seed URL will link to many high-quality web sites, for example www.dmoz.org, www.vmoptions.in or www.wikipedia.org. Crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue. When the crawler fetches the page, it scans the contents for links to other URLs and adds each previously unseen URL to the queue. Finally, the crawler saves the page content for indexing. Crawling continues until the queue is empty[17].
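
One simple way to keep the "already seen" check fast and compact is to store a fixed-size digest of each URL rather than the full string, as in the sketch below. This is an illustration only; Chapter 7 of this thesis applies MD5 hashing in a related way to detect duplicate page content.

```python
# Store 16-byte MD5 digests of URLs instead of the full URL strings, so the
# seen-set stays much smaller than a plain list of URLs.
import hashlib

seen_digests = set()

def is_new_url(url):
    digest = hashlib.md5(url.encode("utf-8")).digest()   # fixed-size fingerprint
    if digest in seen_digests:
        return False
    seen_digests.add(digest)
    return True

print(is_new_url("http://example.com/a"))   # True  (first time this URL is seen)
print(is_new_url("http://example.com/a"))   # False (already seen)
```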

4.1 Desired features of Good Crawling Algorithm:

- Speed: If each HTTP request takes one second to complete (some will take much longer or fail to respond at all), the simple crawler can fetch no more than 86,400 pages per day. At this rate, it would take 634 years to crawl 20 billion pages. In practice, crawling is carried out using hundreds of distributed crawling machines. A hashing function determines which crawling machine is responsible for a particular URL; if a crawling machine encounters a URL for which it is not responsible, it passes it on to the machine that is responsible for it (a minimal sketch of this idea follows this list). Even hundredfold parallelism is not sufficient to achieve the necessary crawling rate, so each crawling machine also exploits a high degree of internal parallelism, with hundreds or thousands of threads issuing requests and waiting for responses.
- Politeness: Unless care is taken, crawler parallelism introduces the risk that a single Web server will be bombarded with requests to such an extent that it becomes overloaded. Crawler algorithms are designed to ensure that only one request to a server is made at a time and that a politeness delay is inserted between requests.
- Excluded content: Before fetching a page from a site, a crawler must fetch that site’s robots.txt file to determine whether the webmaster has specified that some or the entire site should not be crawled.
- Duplicate content: Identical content is frequently published at multiple URLs. Simple checksum comparisons can detect exact duplicates, but when the page includes its own URL, a visitor counter, or a date, more sophisticated fingerprinting methods are needed. Crawlers can save considerable resources by recognizing and eliminating duplication as early as possible, because unrecognized duplicates can contain relative links to whole families of other duplicate content. Search engines avoid some systematic causes of duplication by transforming URLs to remove superfluous parameters such as session IDs and by case-folding URLs from case-insensitive servers.
- Continuous crawling: Carrying out full crawls at fixed intervals would imply slow response to important changes in the Web. It would also mean that the crawler would continuously re-fetch low-value and static pages, thereby incurring substantial costs without significant benefit. For example, a corporate site’s 2002 media releases section rarely, if ever, requires re-crawling.
- Spam rejection: Primitive spamming techniques, such as inserting misleading keywords into pages that are invisible to the viewer—for example, white text on a white background, zero-point fonts, or meta tags—are easily detected. Modern spammers create artificial Web landscapes of domains, servers, links, and pages to inflate the link scores of the targets they have been paid to promote. Search engine companies use manual and automated analysis of link patterns and content to identify spam sites that are then included in a blacklist. A crawler can reject links to URLs on the current blacklist and can reject or lower the priority of pages that are linked to or from blacklisted sites.
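
The sketch below illustrates two ideas from the Speed and Politeness items above, under assumed values for the number of machines and the delay: a hash of the host name assigns each URL to one crawling machine, and a per-host delay is enforced between successive requests to the same server.

```python
# Hash-based assignment of URLs to crawling machines, plus a per-host politeness
# delay; NUM_MACHINES and POLITENESS_DELAY are assumed example values.
import hashlib
import time
from urllib.parse import urlparse

NUM_MACHINES = 4
POLITENESS_DELAY = 2.0          # seconds between requests to the same host
last_request_time = {}          # host -> timestamp of the previous request

def responsible_machine(url):
    """Deterministically map a URL's host to one of the crawling machines."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_MACHINES

def wait_politely(url):
    """Sleep if the previous request to this host was made too recently."""
    host = urlparse(url).hostname or ""
    elapsed = time.time() - last_request_time.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)
    last_request_time[host] = time.time()

print(responsible_machine("http://example.com/page1"))   # same host -> same machine
print(responsible_machine("http://example.com/page2"))
```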

Two major approaches used for crawling are the Blind Traversing Approach and the Best-First Heuristic Approach.

4.2 Blind Traversing Approach:

In this approach, we simply start with a seed URL and apply the crawling process as stated earlier. It is called blind because no criteria are applied when selecting the next URL from the frontier; links are crawled in the order in which they are encountered in the frontier (in serial order). One algorithm widely used to implement the blind traversing approach is the Breadth First algorithm. It uses a FIFO queue data structure to implement the frontier. It is a very simple and basic crawling algorithm, since it traverses the graphical structure of the WWW breadth-wise.

4.2.1 Breadth First Crawler

A Breadth-First crawler is the simplest strategy for crawling. This algorithm was explored as early as 1994 in the WebCrawler as well as in more recent research[21]. It uses the frontier as a FIFO queue, crawling links in the order in which they are encountered. The problem with this algorithm is that when the frontier is full, the crawler can add only one link from a crawled page. The Breadth-First crawler is illustrated in Fig. 4.1. The Breadth-First algorithm is usually used as a baseline crawler; since it does not use any knowledge about the topic, it acts blindly, which is why it is also called a Blind Search algorithm. Its performance is used to provide a lower bound for the more sophisticated algorithms.

illustration not visible in this excerpt

Fig. 4.1: Breadth First Crawler[17]

4.2.2 Drawbacks of Breadth First Crawler

In the real WWW structure, there are millions of pages linked to each other. The repository of any search engine cannot accommodate all pages, so it is desirable to always store the most suitable and relevant pages in the repository. The problem with the blind Breadth First algorithm is that it traverses URLs in the sequential order in which they were inserted into the frontier. This may be acceptable when the total number of pages is small, but in real life a lot of useless pages can produce links to other useless pages, and storing and processing such links in the frontier is a waste of time and memory. Ideally, we should select a useful page from the frontier for processing each time, irrespective of its position in the frontier. The Breadth First approach, however, always fetches the first link from the frontier, irrespective of its usefulness, and is therefore not desirable.

4.3 Best First Heuristic Approach:

To overcome the problems of the blind traversing approach, a heuristic approach called Best-First crawling has been studied by Cho et al. [1998] and Hersovici et al. [1998]. In this approach, given a frontier of links, the next link for crawling is selected on the basis of some estimation, score or priority, so that the best available link is opened and traversed every time. The estimation value for each link can be calculated by different predefined mathematical formulas, based purely on the needs of the specific engine. The following web crawling algorithms use the heuristic approach:

4.3.1 Naive First Approach

One Best First approach uses a relevancy function rel() to compute the lexical similarity between the desired keywords and each page, and associates that value with the corresponding links in the frontier. After each iteration, the link with the highest rel() value is picked from the frontier; that is, the best available link is traversed every time, which is not possible in the Breadth First approach. Since any link with the highest relevancy value can be picked from the frontier, most Best First algorithms use a priority queue as the data structure.

illustration not visible in this excerpt

Fig. 4.2: Best First Crawler[17]
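
A minimal sketch of this naive best-first idea follows, with an assumed keyword-overlap rel() function and assumed candidate pages: each discovered link is pushed onto a priority queue keyed by its relevancy score, so the scheduler always pops the most promising link first.

```python
# Best-first frontier: score candidate links with a toy rel() function and keep
# them in a priority queue so the highest-scoring link is traversed next.
import heapq

def rel(query_keywords, page_text):
    """Very rough lexical similarity: fraction of query keywords found in the page."""
    words = set(page_text.lower().split())
    hits = sum(1 for kw in query_keywords if kw.lower() in words)
    return hits / len(query_keywords) if query_keywords else 0.0

query = ["web", "crawler"]
frontier = []                    # entries are (-score, url); negation gives a max-heap

# Links found while parsing some page, with the text used to score them (assumed data).
candidates = {
    "http://example.com/crawler": "how a web crawler downloads pages",
    "http://example.com/recipes": "ten easy dinner recipes",
}
for url, text in candidates.items():
    heapq.heappush(frontier, (-rel(query, text), url))

neg_score, best_url = heapq.heappop(frontier)
print(best_url, -neg_score)      # the most relevant link is traversed first
```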

4.3.2 Page Rank Algorithm

Page Rank was developed at Stanford University by Larry Page and Sergey Brin (PhD research scholars, and later founders of Google) in 1996. PageRank was developed as a possible model of user surfing behavior. The PageRank of a page represents the probability that a random surfer (one who follows links randomly from page to page) will be on that page at any given time. Page Rank uses the hyperlink structure of the Web[22]. It is based on the concept that if a page has important links pointing towards it, then its links towards other pages are also to be considered important. It considers the backlinks in deciding the rank score: if the sum of the ranks of all the backlinks of a page is large, then the page has a large rank. A simplified version of Page Rank is given below:

PR(u) = c * Σ_{v ∈ B(u)} PR(v) / Nv

Notations used are:

- u and v represents the web pages.
- B(u) is the set of pages that point to u.
- PR(u) and PR(v) are rank scores of page u and v respectively.
- Nv indicates the number of outgoing links of page v.
- c is a factor applied for normalization.

In Page Rank, the rank of a page P is evenly divided among its outgoing links. Later, Page Rank was modified based on the observation that not all users follow the direct links on the WWW. It therefore provides a more advanced way to compute the importance or relevance of a web page than simply counting the number of pages that link to it[23]. If a backlink comes from an important page, then that backlink is given a higher weightage than backlinks coming from non-important pages. Thus, the modified version of Page Rank is given as:

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}$$

Where d is a damping factor, usually set to 0.85. d can be thought of as the probability of users following the links, and (1 - d) can be regarded as the page rank distribution from non-directly linked pages. We assume several web pages T1 … Tn point to page u; T1 is an incoming link page to page u and C(T1) is the number of outgoing links from page T1. The simplified formula is:

$$PR(u) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

The Page Rank algorithm is used by the famous search engine Google and is the most frequently used algorithm for ranking billions of web pages. During the processing of a query, Google's search algorithm combines pre-computed Page Rank scores with text matching scores to obtain an overall ranking score for each web page[22].
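
To make the iteration concrete, the following short sketch computes the damped PageRank formula above on a small hypothetical link graph; the graph, the iteration count and the initial ranks are illustrative assumptions only.

```php
<?php
// Iterative PageRank for a small hypothetical link graph.
// $graph[$v] lists the pages that v links to (its outlinks).
$graph = [
    'A' => ['B', 'C'],
    'B' => ['C'],
    'C' => ['A'],
    'D' => ['C'],
];

$d     = 0.85;                                   // damping factor
$pages = array_keys($graph);
$pr    = array_fill_keys($pages, 1.0);           // initial rank of every page

for ($iter = 0; $iter < 50; $iter++) {
    $next = array_fill_keys($pages, 1 - $d);     // the (1 - d) term
    foreach ($graph as $v => $outlinks) {
        $share = $pr[$v] / max(count($outlinks), 1);  // PR(v) / C(v)
        foreach ($outlinks as $u) {
            $next[$u] += $d * $share;            // each outlink receives an equal share
        }
    }
    $pr = $next;
}

print_r($pr);                                    // e.g. C accumulates the highest rank
```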

More recently Page Rank has been used to guide crawlers [Cho et al. 1998] and to assess page quality.

The Page Rank algorithm uses Web Structure Mining (WSM) for finding the inlinks or backlinks of a web page.

4.3.3 Weighted Page Rank Algorithm

In 2004, Wenpu Xing et al.[24] discussed a new approach known as the Weighted Page Rank algorithm (WPR). This algorithm is an extension of the Page Rank algorithm. WPR takes into account the importance of both the inlinks and the outlinks of the pages and distributes rank scores based on the popularity of the pages.

WPR performs better than the conventional Page Rank algorithm in terms of returning a larger number of relevant pages for a given query. According to the authors, the more popular a web page is, the more links other web pages tend to have to it or the more it is linked to by them. The Weighted Page Rank algorithm assigns larger rank values to more important (popular) pages instead of dividing the rank value of a page evenly among its outlink pages; each outlink page gets a value proportional to its popularity (its number of inlinks and outlinks). The popularity derived from the number of inlinks and outlinks is recorded as Win(v,u) and Wout(v,u).

Win(v,u), given in the following equation, is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v.

$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p}$$

Where Iu and Ip represent the number of inlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v. Wout(v,u), given in the following equation, is the weight of link (v,u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v.

$$W^{out}(v,u) = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

Where Ou and Op represent the number of outlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v. Considering the importance of pages, the original Page Rank formula is modified as in the following equation:

$$WPR(u) = (1 - d) + d \sum_{v \in B(u)} WPR(v) \, W^{in}(v,u) \, W^{out}(v,u)$$

Notations used are:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- WPR(u) and WPR(v) are the rank scores of pages u and v respectively.
- Win(v,u) is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages (i.e. outlinks) of page v.
- Wout(v,u) is the weight of link (v,u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages (i.e. outlinks) of page v.

illustration not visible in this excerpt

Fig. 4.3: Web Interlinked Structure[24]

In our approach, we propose to use Weighted Page Rank to guide crawlers and to assess page quality. The Weighted Page Rank algorithm uses Web Structure Mining (WSM) for finding the inlinks and outlinks of a web page.
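
As a rough sketch of how the quantities above fit together, the following PHP fragment derives Win(v,u) and Wout(v,u) from a small hypothetical link graph and iterates the WPR formula; the graph, the iteration count and the initial values are assumptions for illustration only.

```php
<?php
// Sketch of Weighted Page Rank on a hypothetical link graph.
// $out[$v] = outlinks of v; inlinks and the weights Win/Wout are derived from it.
$out = [
    'A' => ['B', 'C'],
    'B' => ['C'],
    'C' => ['A'],
    'D' => ['B', 'C'],
];

// Build the inlink sets B(u).
$in = array_fill_keys(array_keys($out), []);
foreach ($out as $v => $links) {
    foreach ($links as $u) {
        $in[$u][] = $v;
    }
}

// Win(v,u) = I_u / sum of I_p over the reference pages (outlinks) of v.
// Wout(v,u) = O_u / sum of O_p over the reference pages of v.
function wIn(string $v, string $u, array $in, array $out): float
{
    $den = 0;
    foreach ($out[$v] as $p) {
        $den += count($in[$p]);
    }
    return $den > 0 ? count($in[$u]) / $den : 0.0;
}
function wOut(string $v, string $u, array $out): float
{
    $den = 0;
    foreach ($out[$v] as $p) {
        $den += count($out[$p]);
    }
    return $den > 0 ? count($out[$u]) / $den : 0.0;
}

$d   = 0.85;
$wpr = array_fill_keys(array_keys($out), 1.0);
for ($iter = 0; $iter < 50; $iter++) {
    $next = [];
    foreach ($in as $u => $backlinks) {
        $sum = 0.0;
        foreach ($backlinks as $v) {                       // v in B(u)
            $sum += $wpr[$v] * wIn($v, $u, $in, $out) * wOut($v, $u, $out);
        }
        $next[$u] = (1 - $d) + $d * $sum;
    }
    $wpr = $next;
}
print_r($wpr);
```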

4.3.4 Page Rank based on Visits of Links Algorithm

In 2011, Gyanendra Kumar et al.[25] proposed a new algorithm that considers the user's browsing behavior, since most of the ranking algorithms proposed earlier are either link or content oriented and do not take user usage trends into account. In their paper they propose a page ranking mechanism for search engines called Page Ranking based on Visits of Links (PRVOL), which builds on the basic ranking algorithm of Google, i.e. Page Rank, and takes the number of visits of the inbound links of web pages into account. This concept is very useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which reduces the search space to a large extent. As the authors describe, in the original Page Rank algorithm the rank score of a page p is evenly divided among its outgoing links; in other words, each inbound link brings an equal share of rank value from the base page p. They therefore proposed an improved Page Rank algorithm that assigns more rank value to the outgoing links that are most visited by users. In this manner the rank value of a page is calculated based on the visits of its inbound links.

The modified version of Page Rank based on VOL is given in the following equation:

$$PR_{VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \frac{L_u \, PR_{VOL}(v)}{TL(v)}$$

Notations used are:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- PRVOL(u) and PRVOL(v) are the rank scores of pages u and v respectively.
- Lu is the number of visits of the link pointing to page u from v.
- TL(v) denotes the total number of visits of all links present on v.

The Page Rank based on VOL algorithm uses Web Structure Mining (WSM) for finding the inlinks of a web page and Web Usage Mining (WUM) to find the visits of links.

4.3.5 Weighted Page Rank based on Visits of Links Algorithm

In 2012, Neelam Tyagi et al.[26] noted that the original Weighted PageRank algorithm assigns larger rank values to more important (popular) pages, where each outlink page gets a value proportional to its popularity (its number of inlinks and outlinks), recorded as Win(v,u) and Wout(v,u) respectively. They proposed an improved Weighted Page Rank algorithm that assigns more rank value to the outgoing links that are most visited by users and that receive higher popularity from their number of inlinks. The popularity of outlinks, which is considered in the original algorithm, is not considered here. The advance in the new algorithm is that it captures the user's usage trends: the user's browsing behavior is measured by the number of hits (visits) of links.

The modified version, WPRVOL, is given in the following equation:

$$WPR_{VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \frac{WPR_{VOL}(v) \, W^{in}(v,u) \, L_u}{TL(v)}$$

Notations used are:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- WPRVOL(u) and WPRVOL(v) are the rank scores of pages u and v respectively.
- Win(v,u) is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages (i.e. outlinks) of page v.
- Lu is the number of visits of link which is pointing page u from v.
- TL(v) denotes total number of visits of all links present on v.

We propose to use this algorithm for the optimization of crawled data: pages that are never browsed or least browsed can be identified and deleted so that only quality data remains in the search engine database, which leads to enhanced searching in terms of both quality and speed.

The Weighted Page Rank based on VOL algorithm uses Web Structure Mining (WSM) for finding the inlinks of a web page and Web Usage Mining (WUM) to find the visits of links.

4.3.6 Improvement in Weighted Page Rank based on Visits of Links Algorithm

In 2014, Sachin Gupta et al.[27] discussed an improvement to the Weighted Page Rank based on Visits of Links algorithm by adding a timing factor, User Attention Time (UAT), to the earlier WPRVOL formula and by storing the visits-of-links data directly on the search engine server. Here UAT, the User Attention Time, is the time a user spends on a web page; it can also be called the reading time of the page. This improvement helps the algorithm find more relevant information for the user's query, so the concept is more useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which reduces the search space for the user to a large extent.

The improved version of WPRVOL is given in the following equation:

illustration not visible in this excerpt

Notations used are:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- WPRVOL(u) and WPRVOL(v) are the rank scores of pages u and v respectively.
- Win(v,u) is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages (i.e. outlinks) of page v.
- UAT(u) is the user attention time i.e. the time user spends on that webpage. It can be called Reading Time.
- Lu is the number of visits of link which is pointing page u from v.
- TL(v) denotes total number of visits of all links present on v.

The Weighted Page Rank based on VOL algorithm uses Web Structure Mining (WSM) for finding the inlinks of a web page and Web Usage Mining (WUM) to find the visits of links.

CHAPTER 5 LITERATURE REVIEW

Carlos Castillo et al.[7] present a comparative study of strategies for Web crawling. They show that a combination of breadth-first ordering with largest-sites-first is a practical alternative, since it is fast, simple to implement, and able to retrieve the best ranked pages at a rate that is closer to optimal than the other alternatives. Their study was performed on a large sample of the Chilean Web, crawled using simulators so that all strategies were compared under the same conditions, with actual crawls used to validate the conclusions. They also explored the effects of large-scale parallelism in the page retrieval task and of multiple-page requests in a single connection for effective amortization of latency times.

Lawrence Page et al.[44] argue that the importance of a Web page is an inherently subjective matter that depends on the reader's interests, knowledge and attitudes, but that there is still much that can be said objectively about the relative importance of Web pages. Their paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. They compare PageRank to an idealized random Web surfer, show how to efficiently compute PageRank for large numbers of pages, and show how to apply PageRank to search and to user navigation.

S. Brin et al.[22] present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and to produce much more satisfying search results than existing systems; the prototype has a full-text and hyperlink database of at least 24 million pages. Engineering a search engine is a challenging task: search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms, and they answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research had been done on them, and due to the rapid advance of technology and web proliferation, creating a web search engine today is very different from doing so three years earlier. The paper provides an in-depth description of a large-scale web search engine, the first such detailed public description known to date.

Junghoo Cho et al.[42] studied in what order a crawler should visit the URLs it has seen in order to obtain more "important" pages first. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time.

They define several importance metrics, ordering schemes and performance evaluation measures for this problem, and experimentally evaluate the ordering schemes on the Stanford University Web. Their results show that a crawler with a good ordering scheme can obtain important pages significantly faster than one without.

Ricardo Baeza-Yates et al.[43] compare several page ordering strategies for Web crawling under several metrics. The objective of these strategies is to download the most "important" pages "early" during the crawl. As the coverage of modern search engines is small compared to the size of the Web, and it is impossible to index all of the Web for both theoretical and practical reasons, it is relevant to index at least the most important pages. They use data from actual Web pages to build Web graphs and execute a crawler simulator on those graphs; as the Web is very dynamic, crawling simulation is the only way to ensure that all the strategies considered are compared under the same conditions. They propose several page ordering strategies that are more efficient than breadth-first search and than strategies based on partial PageRank calculations.

Wenpu Xing et al.[24] observed that, with the rapid growth of the Web, users easily get lost in its rich hyper structure. Providing relevant information to users to cater to their needs is the primary goal of website owners; therefore, finding the content of the Web and retrieving the users' interests and needs from their behavior have become increasingly important. Web mining is used to categorize users and pages by analyzing user behavior, the content of the pages, and the order of the URLs that tend to be accessed. Web structure mining plays an important role in this approach. Two page ranking algorithms, HITS and PageRank, are commonly used in web structure mining; both treat all links equally when distributing rank scores, and several algorithms have been developed to improve their performance. The Weighted PageRank algorithm (WPR) is an extension to the standard PageRank algorithm: it takes into account the importance of both the inlinks and the outlinks of the pages and distributes rank scores based on the popularity of the pages. The results of their simulation studies show that WPR performs better than the conventional PageRank algorithm in terms of returning a larger number of relevant pages for a given query.

Gyanendra Kumar et al.[25] proposed a new algorithm that considers the user's browsing behavior, since most earlier ranking algorithms are either link or content oriented and do not take user usage trends into account. They devised a page ranking mechanism for search engines called Page Ranking based on Visits of Links (PRVOL), which works on the basic ranking algorithm of Google, i.e. Page Rank, and takes the number of visits of the inbound links of web pages into account. In the original Page Rank algorithm the rank score of a page p is evenly divided among its outgoing links; their improved algorithm instead assigns more rank value to the outgoing links that are most visited by users, so that the rank value of a page is calculated based on the visits of its inbound links. This concept is very useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which reduces the search space to a large extent.

Neelam Tyagi et al.[26] proposed an algorithm to find more relevant information for the user's query. The WWW consists of billions of web pages, and a huge amount of information is available within them. To retrieve the required information from the World Wide Web, search engines perform a number of tasks based on their respective architectures. When a user submits a query, the search engine generally returns a large number of pages in response, and to help users navigate the result list, various ranking methods are applied to the search results. Most of the ranking algorithms in the literature are either link or content oriented and do not consider user usage trends. Here, a page ranking mechanism called the Weighted PageRank Algorithm based on Visits of Links (VOL) is devised for search engines; it works on the basis of the Weighted PageRank algorithm and takes the number of visits of the inbound links of web pages into account. The original WPR is an extension of the standard PageRank algorithm that takes into account the importance of both the inlinks and outlinks of the pages and distributes rank scores based on the popularity of the pages. This concept is very useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which reduces the search space to a large extent.

Pavalam S. M. et al.[28] state that, with the advent of internet technology, data has exploded to a considerable amount, and large volumes of data can be explored easily through search engines to extract valuable information. Web crawlers are an indispensable part of a search engine: programs that, starting from the search term, traverse hyperlinks, index pages, parse the files and add new links to their queue, repeating the process until no further pages remain. The web crawler also revisits links that have already been indexed in order to keep them up to date. Their paper briefly reviews the concepts of the web crawler, its architecture and its different types, lists the software used by various mobile systems, explores the ways web crawlers are used in mobile systems, and reveals possibilities for further research.

Yaduvir Singh[29] covers the different types of web crawlers that exist today. His paper also covers the fundamental reasons why web crawlers exist and what their significance is. He further explains how web crawlers really work, the different difficulties in implementing them, and the solutions to those difficulties.

S. Balaji et al.[30] proposed a framework and algorithm, TOPCRAWL, for mining. Web mining systems automatically extract information from existing web documents. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of the engine. Such a web crawler may interact with millions of hosts over a period of weeks or months, so issues of robustness, flexibility and manageability are of major importance. Given some URLs, the crawler should retrieve the web pages of those URLs, parse the HTML files, add new URLs to its queue and go back to the first phase of this cycle; it can also retrieve other information from the HTML files while parsing them. The proposed TOPCRAWL algorithm is a new crawling method which emphasizes topic relevancy and outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. The method also tries to present the results in a community format and makes use of a new combination of ideas and techniques to identify and exploit the navigational structures of websites, such as hierarchies, lists or maps. Comparisons with existing focused crawling techniques reveal that the new method leads to a significant increase in recall whilst maintaining precision.

Dhiraj Khurana et al.[31] note that in a large distributed system like the Web, users find resources by following hyperlinks from one document to another. When the system is small and its resources share the same fundamental purpose, users can find resources of interest with relative ease; however, as the Web has become enormous, navigation is difficult. WebCrawler, the Web's first comprehensive full-text search engine, is a tool that assists users in their Web navigation by automating the task of link traversal, creating a searchable index of the web, and fulfilling searchers' queries from the index. Conceptually, WebCrawler is a node in the Web graph that contains links to many sites on the net, shortening the path between users and their destinations.

Andri Mirzal[32] presents a simple web search engine for indexing and searching HTML documents using the Python programming language. Because Python is well known for its simple syntax and strong support for the main operating systems, the author hopes it will be beneficial for learning information retrieval techniques, especially web search engine technology.

Animesh Tripathy et al.[33] present a web mining architecture and describe efficient techniques for achieving high performance. As the WWW is growing rapidly and data is stored in a distributed manner, there is a need to develop a search-engine-based architectural model for people to search through the Web. Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. The crawler is an important module of a web search engine, and its quality directly affects the searching quality of the engine. Given some URLs, the crawler should retrieve the web pages of those URLs, parse the HTML files, add new URLs to its queue and go back to the first phase of this cycle; it can also retrieve other information from the HTML files while parsing them. The paper describes the design of a web crawler that uses the Page Rank algorithm for distributed searches and can be run on a network of workstations. The crawler scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to various crawling applications.

Sachin Gupta et al.[27] proposed an improvement to the Weighted Page Rank based on Visits of Links algorithm using User Attention Time and by storing the visits-of-links data directly on the search engine server. Searching the web can be considered as a process in which the user enters a query and the search system returns a set of the most relevant pages in response; however, the returned results are often not very relevant to the user's query, and the ranking of the pages is not efficient with respect to the user's requirements. In order to improve the precision of web page ranking, after analyzing algorithms such as Page Rank, Page Rank based on VOL, Weighted Page Rank and the Weighted Page Rank algorithm based on VOL, they proposed adding one more factor, "user attention time" (the total time spent reading the web page), to the Weighted Page Rank based on VOL algorithm. This factor signifies the importance of a web page for a user and thus helps increase the accuracy of web page ranking.

One more enhancement to the WPRVOL algorithm is made by storing the "number of visits of links" data directly on the search engine server instead of storing it as a log on the client's web server, as had been done in the earlier literature. The proposed improvement helps the algorithm find more relevant information for the user's query, so this concept is very useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behaviour, which reduces the search space for the user to a large extent.

Lay-Ki Soon et. al.[36] proposed URL signature to be implemented in web crawling, aiming to avoid processing duplicated web pages for further web crawling. The experimental result indicates that URL signature is able to reduce the processing of duplicated web pages significantly for further web crawling at a negligible cost compared to the one without URL signature.

Farha R. Qureshi et al.[35] observed that standard URL normalization was proposed to avoid processing duplicate web pages, but it fails to do so when syntactically different URLs lead to similar web pages. This proves that considering only the URL is not enough and additional information is needed. They therefore proposed the URL/body-content signature, which considers not only the URL but also the body text, and which works well in reducing the processing of duplicate pages.

Saurabh Pakhidde et al.[36] proposed a new crawler architecture. Generally, a web crawler rejects a page whose URL does not contain the search keyword while searching for information on the World Wide Web. Their proposed architecture scans the web pages, parses them, checks their relevancy by assigning each page a page weight, and then arranges them with the most relevant page first, the second page next, and so on according to the page weight, so that the most relevant information or site addresses appear at the top of the results. Thus more accuracy is gained while searching for information on the network.

Prashant Dahiwale et al.[37] proposed a crawler that follows a completely new crawling strategy to compute the relevance of a page. It analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the maximum content and highest relevance. This crawler is best suited where a true analysis of data is needed, such as business analysis.

Sachin Gupta et al.[19] describe the World Wide Web as a giant dynamic network and a repository of interconnected documents and other resources, linked by hyperlinks. Web bots or crawlers are used to recursively traverse and download web pages for search engines in order to create and maintain their web indices, so search engines largely rely on bots to collect information from the web. Moreover, the need to maintain up-to-date pages causes bots to traverse websites repeatedly. Because of this, resources like CPU cycles, disk space and network bandwidth become overloaded, which may lead to the crashing of a website and an increase in web traffic. To minimize the negative effects of this traffic on websites, the access as well as the behaviour of bots may be regulated at an individual web server by implementing the Robots Exclusion Protocol through "robots.txt", a mechanism for WWW servers to indicate to bots which parts of the server should not be accessed. This protocol thus aids in controlling bot activity; bots that follow the Robots Exclusion Protocol are called ethical bots. Not all bots are benign, however, and not all bots follow the Robots Exclusion Protocol: some simply ignore "robots.txt". These bots are called spamming bots; they go through a website looking for web forms and email addresses to send spam, and they also probe the website for security vulnerabilities. Spammers use such bots to harvest email addresses for sending spam. These spamming bots also take up an enormous amount of bandwidth, which can cause websites to exceed their bandwidth limit and be taken down temporarily; this is especially troublesome for mirror sites that host many gigabytes of data. To prevent the bandwidth of a website from being eaten up and private information such as email addresses from being harvested for spam, the access of such bots needs to be blocked, which can be done with the ".htaccess" file.

CHAPTER 6 PROBLEM STATEMENT

Here are some reasons to use a web crawler:

- To maintain mirror sites for popular Web sites.
- To test web pages and links for valid syntax and structure.
- To monitor sites to see when their structure or contents change.
- To search for copyright infringements.
- To build a special-purpose index. For example, one that has some understanding of the content stored in multimedia files on the Web.

In the earlier section, we provided a literature survey. Our literature survey indicates the following:

6.1 Problem Formulation

i. As the World Wide Web grows larger day by day, the crawling module (web crawler) of a search engine can crawl only a fraction of the web pages on the World Wide Web. A crawler should therefore ensure that the fraction of pages crawled consists of the most relevant and important ones, not just random pages.
ii. Crawling unimportant data from the World Wide Web leads to server overhead and network payload, critically affecting server performance.
iii. General purpose search engines retrieve and download unwanted information, making the search space too large for the user. The search space therefore needs to be reduced, and the user deserves to see the most valuable pages at the top of the result list, as this saves the user's network bandwidth and valuable time.
iv. The execution time of the search engine (i.e. the time in which records are retrieved for users) needs to be reduced.
v. The crawler needs a very large database memory space for storing page content etc.; by removing irrelevant pages, a lot of memory space can be saved, which will eventually speed up the searches (queries) from the database.

6.2 Objectives

Thus, the problem taken up for this research work is divided into the following objectives:

i. To propose effective or enhanced crawling in terms of finding important and relevant pages to crawl, using the Weighted Page Rank algorithm.
ii. To reduce the overhead on the server and the network payload while crawling the contents.
iii. To optimize the crawled data by removing never or least browsed (accessed) web pages from the repository using the Weighted Page Rank based on Visits of Links algorithm.
iv. To reduce the execution time of the search engine by removing irrelevant pages during crawling. A reduction in execution time means faster searching for users, so the user does not have to browse longer to get the required information.
v. To reduce the search space for the user to a large extent by displaying the most important pages at the top of the result list, where the importance of a page is evaluated on the basis of user browsing behaviour, resulting in more accurate results.

CHAPTER 7 METHODOLOGY

We propose a new crawler architecture here, which is a hybrid of various technologies used earlier as well as new algorithms.

1. Our crawler like other crawlers starts with a seed url or multiple seeds.
2. It fetches the page corresponding to that seed url and creates an object of that page and stores it into a pages objects queue.
3. After that, the outlinks of the seed page are fetched from that object and their objects are stored in the queue.
4. Similarly, the outlinks of the pages found at step 3 are fetched from their page objects and their objects are stored in the queue.
5. This recursive process goes on until the maximum number of pages has been fetched, a threshold depth level is reached, or the required higher level objective is achieved (a minimal loop is sketched below).

Note: Queue in our approach will store the objects of fetched pages. We can also say this queue is a pages objects array.
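
The following is a minimal, illustrative sketch of this fetch-and-queue loop in PHP. The page limit, depth limit and seed are example values only, and a real crawler would add politeness delays, robots.txt handling and better error handling.

```php
<?php
// Minimal sketch of the crawl loop (steps 1-5 above): fetch the seed, queue its
// page object, then repeatedly fetch the outlinks of queued pages until the
// page limit or depth limit is reached. Limits and the seed are illustrative.
const MAX_PAGES = 50;
const MAX_DEPTH = 3;

function fetchPageObject(string $url): ?array
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return null;
    }
    // Extract absolute outlinks from the HTML with DOMDocument.
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $outlinks = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#', $href)) {
            $outlinks[] = $href;
        }
    }
    return ['url' => $url, 'content' => $html, 'outlinks' => array_unique($outlinks)];
}

$seed  = 'http://www.infopathankot.com';        // seed URL (example from Chapter 8)
$queue = [];                                    // the pages objects queue
$work  = [[$seed, 0]];                          // [url, depth] pairs still to fetch
$seen  = [];

while ($work && count($queue) < MAX_PAGES) {
    [$url, $depth] = array_shift($work);
    if (isset($seen[$url]) || $depth > MAX_DEPTH) {
        continue;
    }
    $seen[$url] = true;
    $page = fetchPageObject($url);
    if ($page === null) {
        continue;
    }
    $queue[] = $page;                           // store the page object
    foreach ($page['outlinks'] as $link) {      // enqueue outlinks for the next level
        $work[] = [$link, $depth + 1];
    }
}
echo count($queue) . " page objects collected\n";
```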

7.1 Structure of an Object of Page

illustration not visible in this excerpt

Fig. 7.1: Page Object Structure

The page object stores the url, title, meta, content, inlinks, inlinks_count, outlinks, outlinks_count, url_hash, content_hash, domain_age, relevancy, wpr, wpr_vol and email fields (a minimal class sketch follows the field descriptions below).

- Url field will store the unique address for a page that is accessible on the Internet.
- Title field will store the title of the page.
- Meta field will store the content of the meta elements of the page.
- Content field will store the content of body element (<body>) of page.
- Inlinks field will store url of web pages that point towards this page.
- Inlinks_count field will store the value of total no. of inlinks of this page.
- Outlinks field will store url of web pages that point to other web pages from this page.
- Outlinks_count field will store the value of total no. of outlinks of this page.
- Url_hash field will store the signature / hash of url of page, generated with md5 hashing algorithm.
- Content_hash field will store the signature of body content of page, generated with md5 hashing algorithm.
- Domain_Age field will store the age of domain name of page.
- Relevancy field will store the relevancy weight or score of page.
- Wpr field will store the score or rank of page, computed using Weighted Page Rank (WPR) Algorithm.
- Wpr_vol field will store the score or rank of page, computed using Weighted Page Rank based on Visits of links (WPRVOL) Algorithm.
- Email field will store the single email address or multiple email addresses if available on that page.
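
A minimal PHP sketch of such a page object is shown below; the property defaults are assumptions, and only the two hash fields are filled automatically here.

```php
<?php
// Minimal sketch of a page object holding the fields listed above.
// Field names follow the description; defaults are assumptions.
class PageObject
{
    public $url;
    public $title = '';
    public $meta = '';
    public $content = '';
    public $inlinks = [];
    public $inlinks_count = 0;
    public $outlinks = [];
    public $outlinks_count = 0;
    public $url_hash = '';
    public $content_hash = '';
    public $domain_age = 0;      // in days
    public $relevancy = 0.0;
    public $wpr = 0.0;
    public $wpr_vol = 0.0;
    public $email = [];

    public function __construct(string $url, string $content = '')
    {
        $this->url          = $url;
        $this->content      = $content;
        $this->url_hash     = md5($url);          // MD5 signature of the URL
        $this->content_hash = md5($content);      // MD5 signature of the body content
    }
}

// Example: the queue is simply an array of PageObject instances.
$queue   = [];
$queue[] = new PageObject('http://www.example.com/index.html', '<body>...</body>');
```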

Pages will be fetched for crawling in the breadth first order displayed in fig. 7.2, where A is the seed URL and pages B, C, D are its outlinks. After A, its outlinks B, then C and then D will be fetched. Similarly, E, F and G are the outlinks of page B; after D, the outlinks of B (E, F, G) will be fetched for processing, and so on.

illustration not visible in this excerpt

Fig. 7.2: Pages Tree Structure

7.2 Proposed Architecture of Web Crawler

illustration not visible in this excerpt

Fig. 7.3: Proposed Architecture of Web Crawler

Once we have the array of page objects in computer memory (the queue), we next remove redundancy by removing duplicate content, duplicate URLs and near-duplicate URLs using Standard Normalization (STN) of URLs and a hashing algorithm. We use the MD5 hashing algorithm to compute the signature of the URL and the signature of the body content of each page. STN of URLs removes redundancy by finding syntactically identical pages, since syntactically identical URLs are considered equivalent. After URL normalization, we have a new page objects queue with canonical or normalized URLs that is supplied for signature comparison.

In the next step, we first compare the signature of the page URL from a page object with the signatures of all page URLs stored earlier in the database. If that signature already exists in the database, the page (page object) with that URL is ignored or discarded. If the URL signature does not exist in the database, the signature of the body content from that page object is compared with the body content signatures of pages stored earlier in the database. If the page content signature exists in the database, that page (page object) is discarded; otherwise the page object is stored in a new queue, which is an optimized version of the earlier queue. We call this queue the optimized queue.

In the next step, we calculate the domain age of the URLs from the page objects in the queue and compare the domain age of each page with a threshold value. We have taken the threshold value for a domain to be 365 days. If the domain age is less than this assumed threshold, we check the relevancy of the page: if the relevancy is less than the threshold value (which is 3, as assumed by Prashant Dahiwale et al. in[37]), we discard that page object; otherwise, if the relevancy is greater than the threshold, we store the page details from its object directly to the database.

The relevancy of a page is calculated using the approach of Prashant Dahiwale et al. for predicting the relevancy of pages in their rank crawler[37].

If the domain age of the page is more than the threshold, another score (rank) of the page is computed using the Weighted Page Rank (WPR) algorithm[24]. If the score falls below WPR_THRESHOLD (the threshold value of WPR), we discard that page object; if the score is above the threshold, the page is added to the repository. We have set 0.15 as the minimum threshold for the weighted page rank comparison of pages, since 0.15 is the minimum WPR value a page has if it has zero inlinks or zero outlinks.

illustration not visible in this excerpt

Fig. 7.4: Block Diagram of WPR comparison Module for pages
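
The following condensed PHP sketch shows only this branching logic, with the thresholds mentioned above; the helper functions here are placeholder stubs standing in for the modules described in sections 7.5 to 7.7.

```php
<?php
// Condensed sketch of the storage decision described above. The thresholds are
// taken from the text; the helper functions are placeholder stubs standing in
// for the domain age, relevancy and WPR modules (sections 7.5 - 7.7).
const DOMAIN_AGE_THRESHOLD = 365;   // days
const RELEVANCY_THRESHOLD  = 3;     // assumed from Prashant Dahiwale et al. [37]
const WPR_THRESHOLD        = 0.15;  // minimum WPR of a page with no inlinks/outlinks

function domainAgeInDays(string $url): int    { return 2740; }  // stub (see 7.5)
function relevancyScore(array $page): float   { return 4.0; }   // stub (see 7.6)
function weightedPageRank(array $page): float { return 0.30; }  // stub (see 7.7)
function saveToDatabase(array $page): bool    { echo "saved {$page['url']}\n"; return true; }

function decidePage(array $page): bool
{
    if (domainAgeInDays($page['url']) < DOMAIN_AGE_THRESHOLD) {
        // New domain: inlink-based ranking is unreliable, so use the relevancy score.
        return relevancyScore($page) > RELEVANCY_THRESHOLD && saveToDatabase($page);
    }
    // Established domain: keep the page only if its WPR exceeds the threshold.
    return weightedPageRank($page) > WPR_THRESHOLD && saveToDatabase($page);
}

decidePage(['url' => 'http://www.example.com/about.html']);   // example call
```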

While storing the pages in the database, the proposed system sends an email containing our Vol-Analytics code (a client-side script) to the website administrator, asking them to add the script to their website. Once the web admin adds our script to their website, our search engine starts tracking the visits-of-links activity of that website.

The Vol-Analytics script is developed using AJAX and runs in the background. It sends details about the web page, such as the page's own address, its caller address, the number of times the page was called from that particular resource, and its website id, to our server, where they are tracked and stored in our database.

Our VOL-Analytics Code is:

illustration not visible in this excerpt

OR

To protect the code, JavaScript code obfuscation can be applied. Obfuscation is done so that webmasters or others cannot easily tamper with our Vol-Analytics code, and so that malicious attempts against our server (where all tracking of visits of links is done and stored in the database) can be reduced. This obfuscation can be performed using a JavaScript code obfuscator API.

Obfuscated Version of above code is:

illustration not visible in this excerpt

This code relies on the HTTP referer for its working. The HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage that linked to the resource being requested; by checking the referer, the new webpage can see where the request originated[38]. Our code runs in the background and sends the website id (web id), page URL and caller URL to our server, where we track and store them in our database so that we have the visits of links of various web pages from various resources.
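
The actual Vol-Analytics script and its obfuscated version are not reproduced in this excerpt. Purely as an illustration of the server side of this tracking, the hypothetical PHP endpoint below receives the web id, page URL and caller URL and increments a visit counter; the script name, database credentials, and the table and column names are all assumptions.

```php
<?php
// Hypothetical server-side endpoint (e.g. vol_track.php) that records visits of links.
// The client-side Vol-Analytics script is assumed to POST the website id, the page's
// own URL and the caller (referer) URL; table/column names and credentials are assumptions.
// A unique key on (web_id, page_url, caller_url) is assumed for the upsert below.
$webId     = $_POST['web_id']     ?? '';
$pageUrl   = $_POST['page_url']   ?? '';
$callerUrl = $_POST['caller_url'] ?? ($_SERVER['HTTP_REFERER'] ?? '');

if ($webId === '' || $pageUrl === '') {
    http_response_code(400);
    exit('missing parameters');
}

$db = new PDO('mysql:host=localhost;dbname=crawler', 'user', 'password');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Increment the visit counter for this (page, caller) pair, inserting it if new.
$stmt = $db->prepare(
    'INSERT INTO visits_of_links (web_id, page_url, caller_url, visits)
     VALUES (:web_id, :page_url, :caller_url, 1)
     ON DUPLICATE KEY UPDATE visits = visits + 1'
);
$stmt->execute([
    ':web_id'     => $webId,
    ':page_url'   => $pageUrl,
    ':caller_url' => $callerUrl,
]);

http_response_code(204);   // the tracking script needs no content back
```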

At the start, we harvest email addresses from the pages; the email address is stored in the page object. A website administrator receives only one email on behalf of a website or single domain.

After 45 days, a cron job on our server executes another script which calculates the rank of the stored web pages using the Weighted Page Rank based on Visits of Links (VOL) algorithm[26]. This algorithm uses Web Usage Mining (WUM) and can only work if visits-of-links data is available.

If the rank of some pages is less than WPR_VOL_THRESHOLD (the WPRVOL threshold value), we optimize our database by deleting those pages. This step keeps in the database only those pages that are actually being accessed by users and removes pages that are never or least browsed (accessed). The WPRVOL algorithm thus takes the users' browsing behaviour on the web into account and removes never-accessed pages as decided by our threshold.

illustration not visible in this excerpt

Fig. 7.5: Block Diagram of WPR_VOL comparison Module for pages

We keep 0.15 as the minimum WPR_VOL_THRESHOLD value, since this is the value a page has if it has zero inlinks, zero outlinks or zero visits (a never-accessed page).

Cron Jobs are used for scheduling tasks to run on the server. They are most commonly used for automating system maintenance or administration. However, they are also relevant to web application development. There are many situations when a web application may need certain tasks to run periodically.

7.3 URL Normalization or Standard Normalization (STN) of URLs

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent. Search engines employ URL normalization in order to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once[39].

illustration not visible in this excerpt

Fig. 7.6: URL Normalization Process

The standard URL normalization (STN) is one of the tasks performed by web crawler during the process of web crawling. There is a set of predefined activities to be done for converting URLs into canonical format. After the normalization, URLs which are syntactically identical are considered as equivalent, thus reducing the crawling of redundant web pages[35].

We use Standard Normalization on URLs to remove unnecessary parameters and to find syntactically identical URLs. Finding syntactically equivalent or identical URLs removes redundancy by not re-crawling redundant web pages; these duplicate URLs are removed when the URL signature comparison is made against the page URL signatures stored earlier in the database. URL normalization thus helps us find near-duplicate URLs and avoid crawling redundant web pages.

Following are some examples of Standard Normalization of URLs (a sketch of such a normalizer follows these examples):

- HTTP://WWW.EXAMPLE.COM is transformed into http://www.example.com.
- http://www.example.com:80 is transformed into http://www.example.com.
- http://www.example.com/index.html is transformed into http://www.example.com.
- http://208.77.188.166 is transformed into http://www.example.com (i.e. replacing the IP with the domain name).
- https://www.example.com is transformed into http://www.example.com.
- http://www.example.com/bar.html#section1 is transformed into http://www.example.com/bar.html.
- http://www.example.com/alice is transformed into http://www.example.com/alice/.
- http://www.example.com/display?lang=en&article=fred is transformed into http://www.example.com/display?article=fred&lang=en.
- http://www.example.com/display?id=123&fakefoo=fakebar is transformed into http://www.example.com/display?id=123 (i.e. removing unnecessary query variables).
- https://www.example.com/display? is transformed into http://www.example.com/display (i.e. removing the empty query).
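
A minimal PHP sketch of such a normalizer is given below. It covers only some of the rules listed above (lower-casing the scheme and host, dropping the default port and the fragment, and sorting query parameters); IP-to-domain replacement and removal of "unnecessary" query variables are site-specific and not attempted here.

```php
<?php
// Minimal URL normalizer covering a subset of the rules above.
function normalizeUrl(string $url): string
{
    $p = parse_url($url);
    if ($p === false || !isset($p['host'])) {
        return $url;                                   // leave unparseable URLs alone
    }
    $scheme = strtolower($p['scheme'] ?? 'http');
    $host   = strtolower($p['host']);
    $port   = isset($p['port']) && $p['port'] != 80 ? ':' . $p['port'] : '';
    $path   = $p['path'] ?? '/';

    $query = '';
    if (!empty($p['query'])) {
        parse_str($p['query'], $params);
        ksort($params);                                // canonical parameter order
        $query = '?' . http_build_query($params);
    }
    // The fragment (#section) is dropped entirely.
    return $scheme . '://' . $host . $port . $path . $query;
}

// Examples (echoing the list above):
echo normalizeUrl('HTTP://WWW.EXAMPLE.COM:80/') . PHP_EOL;                    // http://www.example.com/
echo normalizeUrl('http://www.example.com/bar.html#section1') . PHP_EOL;      // http://www.example.com/bar.html
echo normalizeUrl('http://www.example.com/display?lang=en&article=fred') . PHP_EOL;
// http://www.example.com/display?article=fred&lang=en
```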

7.4 Removing redundancy with MD5 hashing algorithm

Standard Normalization of URLs succeeds in avoiding the processing of duplicate pages when they are syntactically identical or equivalent, but it fails to do so for syntactically different URLs that lead to the crawling of similar web pages. For example, the websites in the following table have syntactically different URLs which lead to the crawling of similar web pages.

TABLE 7.1 SYNTACTICALLY DIFFERENT URLS

illustration not visible in this excerpt

This proves that considering only URL normalization to remove redundancy while crawling is not enough; other duplicate-removal techniques are needed as well.

This is where the hashing or signature generation technique comes in, which is very helpful for removing duplicates. To avoid crawling similar pages whose URLs are different, we compare the signatures generated by the MD5 hashing algorithm.

The MD5 message-digest algorithm is a widely used cryptographic hash function producing a 128-bit (16-byte) hash value, typically expressed in text format as a 32 digit hexadecimal number. MD5 has been utilized in a wide variety of cryptographic applications, and is also commonly used to verify data integrity[40].

illustration not visible in this excerpt

Fig. 7.7: Block Diagram of Removing Redundancy with Hash

In our approach, we first compare the URL signature of every page in the page object queue with the URL signatures stored earlier in the database, so that the same pages are not crawled again. If the URL signature exists in the database, we discard that page. Otherwise, we compare the body content signature of the same page object with the body content signatures stored earlier in the database; if the body content signature already exists in the database, we discard the page, otherwise we store the page object in a new queue, which is an optimized version of the earlier queue. In this way, with body text signature generation, similar or duplicate pages that are syntactically different are kept from being re-crawled, which works well in reducing the processing of duplicate pages.
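
The following PHP sketch shows this two-level signature check; in-memory sets stand in for the database lookups, and the sample pages are hypothetical.

```php
<?php
// Sketch of the two-level duplicate check: first the MD5 signature of the URL,
// then the MD5 signature of the body content. In-memory sets stand in for the
// signatures already stored in the crawler database.
$knownUrlHashes     = [];   // url_hash values already "in the database"
$knownContentHashes = [];   // content_hash values already "in the database"
$optimizedQueue     = [];   // pages that survive both checks

function extractBody(string $html): string
{
    // Keep only the <body> element, since the signature is taken over body content.
    return preg_match('#<body[^>]*>(.*?)</body>#is', $html, $m) ? $m[1] : $html;
}

$pages = [
    ['url' => 'http://www.example.com/a.html', 'content' => '<body>Hello</body>'],
    ['url' => 'http://www.example.com/a.html', 'content' => '<body>Hello</body>'],  // same URL
    ['url' => 'http://www.example.com/b.html', 'content' => '<body>Hello</body>'],  // same body
    ['url' => 'http://www.example.com/c.html', 'content' => '<body>Other</body>'],
];

foreach ($pages as $page) {
    $urlHash = md5($page['url']);
    if (isset($knownUrlHashes[$urlHash])) {
        continue;                                   // URL already crawled: discard
    }
    $contentHash = md5(extractBody($page['content']));
    if (isset($knownContentHashes[$contentHash])) {
        continue;                                   // same body under a different URL: discard
    }
    $knownUrlHashes[$urlHash]         = true;
    $knownContentHashes[$contentHash] = true;
    $optimizedQueue[]                 = $page;      // keep the page in the optimized queue
}

echo count($optimizedQueue) . " unique pages kept\n";   // 2
```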

7.5 How to calculate Domain Age or Age of Webpage / Website

Domain age is the age of a website on the Internet. To calculate domain age WHOIS Protocol is used.

WHOIS (pronounced as the phrase “who is”) is a query and response protocol that is widely used for querying databases that store the registered users or assignees of an Internet resource, such as a domain name, an IP address block, or an autonomous system, but is also used for a wider range of other information. The protocol stores and delivers database content in a human-readable format[41].

Using the WHOIS protocol, we send a query to the InterNIC server (http://www.internic.net/whois.html), which in return gives us the age of the domain. InterNIC is the internet governing body primarily responsible for Internet domain name registration services.

Our Domain Age Calculation Module gives us the age of website in no. of days (e.g. 2740 days).
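
A rough sketch of such a lookup is given below: it sends a raw WHOIS query over port 43 and parses the "Creation Date" line to compute the age in days. Registries report this field under different names, so the server name, field name and parsing here are illustrative assumptions rather than a complete WHOIS client.

```php
<?php
// Sketch of a domain-age lookup: send a raw WHOIS query (port 43) and parse the
// "Creation Date" line. Field names differ between registries, so this only
// illustrates the approach.
function domainAgeInDays(string $domain, string $server = 'whois.internic.net'): ?int
{
    $fp = @fsockopen($server, 43, $errno, $errstr, 10);
    if (!$fp) {
        return null;
    }
    fwrite($fp, $domain . "\r\n");
    $response = '';
    while (!feof($fp)) {
        $response .= fgets($fp, 1024);
    }
    fclose($fp);

    // gTLD registries typically report e.g. "Creation Date: 2007-01-15T00:00:00Z".
    if (!preg_match('/Creation Date:\s*(\S+)/i', $response, $m)) {
        return null;
    }
    $created = strtotime($m[1]);
    if ($created === false) {
        return null;
    }
    return (int) floor((time() - $created) / 86400);   // age in whole days
}

$age = domainAgeInDays('example.com');
echo $age === null ? "age unknown\n" : "domain age: $age days\n";
```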

We use the domain age concept in our proposed work because our approach uses the Weighted Page Rank algorithm and the Weighted Page Rank based on Visits of Links algorithm, which rely on the inlinks and outlinks of a web page to calculate its rank. Inlinks are realistic only for older domains/websites; for newly registered websites it is hardly possible to have inlinks, yet the crawler crawls both older and newer websites. So, before applying the WPR algorithm we find the age of the web page: if it is old enough (as per our domain age threshold criterion), the WPR algorithm is applied and the page is stored in the database depending on how its computed WPR rank compares with WPR_THRESHOLD. Otherwise, the relevancy of the new domain is calculated, and the page is stored in the database depending on how the computed relevancy compares with the relevancy threshold. The domain age threshold assumed in this research is 1 year (365 days). Older domains can also get a slight edge in search engine ranking.

7.6 How to calculate Relevancy of a new age webpage

We use a relevancy parameter to decide whether a new age page is worth keeping in the database or not. To decide relevancy, we use the approach of Prashant Dahiwale et al.[37] described in their relevancy prediction research paper.

To check relevancy, we need a search string to be searched within the page. In our approach, search string will be the last string after the host name in Url.

For Example:

- History will be the search string for the URL http://www.infopathankot.com/history.htm
- Dharamsala will be the search string for the URL http://www.himachaltouristguide.com/index.php/kangra/dharamsala
- Web Crawler will be the search string for the URL http://en.wikipedia.org/wiki/Web_crawler
- Wheat Oats Upma Recipe will be the search string for the URL http://www.chatpatirecipes.com/kids-recipes/wheat-oats-upma-recipe.php
- Sports Facilities will be the search string for the URL http://arni.in/sports_facilities.html

In cases where there is no string after the host name or domain name, the name before the TLD (Top Level Domain, like .com, .net, .org, .in etc.) will be the search string.

For Example:

- Pathankot Punjab will be the search string for the URL http://www.pathankotpunjab.com
- Best Programmer will be the search string for the URL http://www.bestprogrammer.com
- Easy Life will be the search string for the URL http://www.easylife.com
- Himachal Tourist Guide will be the search string for the URL http://www.himachaltouristguide.com
- Indian Pilgrims will be the search string for the URL http://www.indianpilgrims.com
- Hosting Acres will be the search string for the URL http://www.hostingacres.com
- Arni will be the search string for the URL http://www.arni.in

We create and use a data dictionary of words with whose help our system generates the search string Pathankot Punjab from the string PathankotPunjab; likewise, Best Programmer is generated from BestProgrammer using the data dictionary.
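
A small PHP sketch of this extraction is given below: it takes the last path segment of the URL, or, when the URL has no path, splits the domain label greedily using a word dictionary. The tiny dictionary here is an illustrative stand-in for the real one.

```php
<?php
// Sketch of search-string extraction: use the last path segment of the URL, or,
// when the URL has no path, split the domain label using a word dictionary.
$dictionary = ['pathankot', 'punjab', 'best', 'programmer', 'easy', 'life', 'web', 'crawler'];

function searchString(string $url, array $dictionary): string
{
    $p    = parse_url($url);
    $path = trim($p['path'] ?? '', '/');

    if ($path !== '') {
        $last = basename($path);
        $last = preg_replace('/\.[a-z0-9]+$/i', '', $last);     // drop .htm/.php/...
        return str_replace(['-', '_'], ' ', $last);             // "wheat-oats-upma-recipe" -> words
    }

    // No path: take the label before the TLD and split it with the dictionary.
    $host  = strtolower($p['host'] ?? '');
    $label = explode('.', preg_replace('/^www\./', '', $host))[0];
    $words = [];
    while ($label !== '') {
        $matched = false;
        foreach ($dictionary as $word) {
            if (strpos($label, $word) === 0) {                   // greedy prefix match
                $words[] = $word;
                $label   = substr($label, strlen($word));
                $matched = true;
                break;
            }
        }
        if (!$matched) {                                         // unknown remainder: keep as-is
            $words[] = $label;
            break;
        }
    }
    return implode(' ', $words);
}

echo searchString('http://www.infopathankot.com/history.htm', $dictionary) . PHP_EOL;  // history
echo searchString('http://www.pathankotpunjab.com', $dictionary) . PHP_EOL;            // pathankot punjab
echo searchString('http://en.wikipedia.org/wiki/Web_crawler', $dictionary) . PHP_EOL;  // Web crawler
```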

The web crawler works on the source code of the page: the downloader fetches the content of a page, and the parser uses the source code to analyze the tags and contents of the page so as to compute the page weight and the degree of relevancy of the page. The analysis of the source code is as follows:

Let the Total weight of the page be ‘t’ units

illustration not visible in this excerpt

Where T = 4 units, M = 3 units, H = 2 units, B = 1 units

These values are assumed by considering the weight that search engines (like Google, Bing etc.) give to these tags.
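
As an illustration, the PHP sketch below computes such a page weight under the assumption that T, M, H and B are awarded when the search string occurs in the title, meta, heading and body tags respectively; this interpretation and the sample HTML are assumptions, with the exact scheme following [37].

```php
<?php
// Sketch of the tag-weight relevancy score, under the assumption that T, M, H and B
// are awarded for occurrences of the search string in the <title>, meta tags,
// heading tags and <body> text respectively (interpretation of [37]).
const WEIGHTS = ['title' => 4, 'meta' => 3, 'heading' => 2, 'body' => 1];

function matchAll(string $pattern, string $html): array
{
    return preg_match_all($pattern, $html, $m) ? $m[1] : [];
}

function relevancyWeight(string $html, string $searchString): int
{
    $needle = strtolower($searchString);
    $parts  = [
        'title'   => implode(' ', matchAll('#<title[^>]*>(.*?)</title>#is', $html)),
        'meta'    => implode(' ', matchAll('#<meta[^>]+content=["\'](.*?)["\']#is', $html)),
        'heading' => implode(' ', matchAll('#<h[1-6][^>]*>(.*?)</h[1-6]>#is', $html)),
        'body'    => implode(' ', matchAll('#<body[^>]*>(.*?)</body>#is', $html)),
    ];
    $t = 0;
    foreach ($parts as $tag => $text) {
        if (substr_count(strtolower(strip_tags($text)), $needle) > 0) {
            $t += WEIGHTS[$tag];       // award the tag's weight if the string occurs there
        }
    }
    return $t;
}

$html = '<html><head><title>Web Development Services</title>'
      . '<meta name="description" content="custom web development"></head>'
      . '<body><h1>Web Development</h1><p>We build websites.</p></body></html>';

echo relevancyWeight($html, 'web development') . PHP_EOL;   // 4 + 3 + 2 + 1 = 10 > 3, so relevant
```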

Consider the source code associated with the web page of the URL: http://www.sodiztechnologies.com/web-development.php as given below:

illustration not visible in this excerpt

Search String for this page by our approach will be web development and relevancy score or weight of the page is:

illustration not visible in this excerpt

This is more than the threshold value 3 (i.e. t > 3), so we can assume the page is relevant and store it.

7.7 How to calculate Weighted Page Rank (WPR)

To calculate the Weighted Page Rank of a web page, the following formula suggested by Wenpu Xing et al.[24] is used:

$$WPR(u) = (1 - d) + d \sum_{v \in B(u)} WPR(v) \, W^{in}(v,u) \, W^{out}(v,u)$$

Where, following are notations used in formula:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- WPR(u) and WPR(v) are the rank scores of pages u and v respectively.
- Win(v,u) is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages (i.e. outlinks) of page v.
- Wout(v,u) is the weight of link (v,u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages (i.e. outlinks) of page v.

Also,
$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p} \qquad W^{out}(v,u) = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

Where Iu and Ip represent the number of inlinks of page u and page p, respectively and Ou and Op represent the number of outlinks of page u and page p, respectively. R(v) denotes the reference page list of page v.

This algorithm works by taking into account the importance of both the inlinks and outlinks of the web pages and distributes rank scores based on the popularity of the pages. The popularity derived from the number of inlinks and outlinks is recorded as Win(v,u) and Wout(v,u).

7.8 How to calculate Weighted Page Rank based on Visits of Links (WPRVOL)

WPRVOL stands for Weighted Page Rank based on Visits of Links. To calculate WPRVOL, the following formula proposed by Neelam Tyagi et al.[26] is used:

$$WPR_{VOL}(u) = (1 - d) + d \sum_{v \in B(u)} \frac{WPR_{VOL}(v) \, W^{in}(v,u) \, L_u}{TL(v)}$$

Following are the notations used in above formula:

- u and v represent the web pages.
- d is the damping factor. Its value is 0.85.
- B(u) is the set of pages that point to u.
- WPRVOL(u) and WPRVOL(v) are the rank scores of pages u and v respectively.
- Win(v,u) is the weight of link (v,u), calculated based on the number of inlinks of page u and the number of inlinks of all reference pages (i.e. outlinks) of page v.
- Lu is the number of visits of the link pointing to page u from v.
- TL(v) denotes total number of visits of all links present on v.

Also,
$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p}$$

Where Iu and Ip represent the number of inlinks of page u and page p, respectively. R(v) denotes the reference page list of page v.

This algorithm considers the user’s usage trends. The user’s browsing behaviour can be calculated by number of hits (visits) of links[27].
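
A short sketch of this calculation is given below: it derives Win(v,u) from a hypothetical link structure and combines it with hypothetical visit counts Lu and TL(v) as in the formula above. All data, the iteration count and the initial ranks are illustrative assumptions.

```php
<?php
// Sketch of the WPRVOL calculation: the link structure is derived from a
// hypothetical visits table, where $visits[$v][$u] is the number of recorded
// visits of the link on page v that points to page u.
$visits = [
    'A' => ['B' => 10, 'C' => 2],
    'B' => ['C' => 5],
    'C' => ['A' => 7],
    'D' => ['C' => 1],
];

// Derive outlinks, the page set and the inlink sets B(u) from the visits table.
$out = [];
foreach ($visits as $v => $links) {
    $out[$v] = array_keys($links);
}
$pages = array_unique(array_merge(array_keys($out), ...array_values($out)));
$in    = array_fill_keys($pages, []);
foreach ($out as $v => $links) {
    foreach ($links as $u) {
        $in[$u][] = $v;
    }
}

// Win(v,u) = I_u divided by the sum of I_p over the reference pages (outlinks) of v.
function wIn(string $v, string $u, array $in, array $out): float
{
    $den = 0;
    foreach ($out[$v] as $p) {
        $den += count($in[$p]);
    }
    return $den > 0 ? count($in[$u]) / $den : 0.0;
}

$d      = 0.85;
$wprVol = array_fill_keys($pages, 1.0);

for ($iter = 0; $iter < 50; $iter++) {
    $next = array_fill_keys($pages, 1 - $d);
    foreach ($visits as $v => $links) {
        $tl = array_sum($links);                                 // TL(v)
        foreach ($links as $u => $lu) {                          // L_u for the link v -> u
            $next[$u] += $d * $wprVol[$v] * wIn($v, $u, $in, $out) * $lu / $tl;
        }
    }
    $wprVol = $next;
}

print_r($wprVol);    // pages scoring below WPR_VOL_THRESHOLD (0.15) would be deleted
```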

7.9 Improved Searching with WPRVOL

Weighted Page Rank based on Visits of Links can also be used by the search engine module when preparing query results for the user. Since WPRVOL works on the basis of user browsing behaviour, this concept is very useful for displaying the most valuable pages at the top of the result list, which reduces the search space to a large extent. To help users navigate the result list, our search module uses the WPRVOL scores and displays the most visited or accessed pages at the top of the result list, making searching for users more refined in terms of quality. With the approach mentioned earlier, by not storing unimportant and irrelevant pages and by deleting never-accessed data, we save a lot of memory space in the database, which will eventually speed up the searches (queries) from the database. The execution time of the search engine is thereby reduced, providing fast search results to users.

Note: For the successful execution of our proposed crawler, we initially need adequate data in our crawler database so that the crawler can obtain inlinks for new web pages from that database. One method to fill the crawler database at the initial stage is to crawl a directory like dmoz.org; Dmoz is a multilingual open-content directory of World Wide Web links. Another method is to set our threshold condition to accept pages with a WPR equal to 0.15, rather than requiring it to be greater than 0.15.

CHAPTER 8 RESULTS AND DISCUSSIONS

The main objective of the experimental work reported in this thesis is to evaluate the crawling performance of a web crawler using the techniques described earlier: MD5 for signature generation, relevancy computation, the Weighted Page Rank algorithm and the Weighted Page Rank based on VOL algorithm. We implemented our work on a VPS server with machine name crawlnsearch.com and IP address 62.220.163.102, with a shared processor (Genuine Intel, Intel® Xeon® CPU E3-1230 V2 @ 3.30GHz, 4 CPU cores), 3 GB RAM, a 75 GB hard disk, a Linux operating system (CentOS release 6.5), a 3 TB data transfer limit, and a 100 Mbps port (link). We implemented our crawler using PHP 5.4.29, MySQL 5.5.37, AJAX and Apache Server 2.2.27, and tested our work on live websites.

8.1 Crawling Results for Seed http://www.vvguptaandco.com

1. Maximum Pages set to be crawled are 500.
2. Total Pages Fetched are 500.
3. No. of Unique Pages found are 106.
4. No. of Rejected Pages on the behalf of Standard Normalization of Urls are 15.
5. No. of Rejected Pages on the behalf of MD5 Hashing Algorithm are 13.
6. Total No. of Saved Pages on WPR Threshold Criteria is 25.

illustration not visible in this excerpt

Fig. 8.1: Fetching Pages for vvguptaandco.com

illustration not visible in this excerpt

Fig. 8.2: Calculating Domain age and relevancy for vvguptaandco.com

illustration not visible in this excerpt

Fig. 8.3: Saves Relevant Pages to Database for vvguptaandco.com

illustration not visible in this excerpt

Fig. 8.4: Saving Pages on basis of WPR Criteria for vvguptaandco.com

illustration not visible in this excerpt

Fig. 8.5: No. of unique Pages and md5 rejected pages for vvguptaandco.com

8.2 Crawling Results for Seed http://www.ovhconsulting.com

illustration not visible in this excerpt

1. Maximum Pages set to be crawled are 500.
2. Total Pages Fetched are 162.
3. No. of Unique Pages found are 70.
4. No. of Rejected Pages on the behalf of Standard Normalization of Urls are 10.
5. No. of Rejected Pages on the behalf of MD5 Hashing Algorithm are 12.
6. Total No. of Saved Pages on Relevancy basis is 6.
7. Total No. of Saved Pages on WPR Threshold Criteria is 50.

illustration not visible in this excerpt

Fig. 8.6: Fetching Pages for ovhconsulting.com

illustration not visible in this excerpt

Fig. 8.7: Calculating Domain age and relevancy for ovhconsulting.com

illustration not visible in this excerpt

Fig. 8.8: Saves Relevant Pages to Database for ovhconsulting.com

Fig. 8.9: Saving Pages on basis of WPR Criteria for ovhconsulting.com

illustration not visible in this excerpt

illustration not visible in this excerpt

Fig. 8.10: No. of unique Pages found for ovhconsulting.com

illustration not visible in this excerpt

Fig. 8.11: No. of STN rejected and MD5 rejected Pages for ovhconsulting.com

8.3 Crawling Results for Seed http://www.infopathankot.com

1. Maximum Pages set to be crawled are 500.
2. Total Pages Fetched are 500.
3. No. of Unique Pages found are 73.
4. No. of Rejected Pages on the behalf of Standard Normalization of Urls are 11.
5. No. of Rejected Pages on the behalf of MD5 Hashing Algorithm are 12.
6. Total No. of Saved Pages on Relevancy basis is 1.
7. Total No. of Saved Pages on WPR Threshold Criteria is 60.

illustration not visible in this excerpt

Fig. 8.12: Fetching Pages for infopathankot.com

illustration not visible in this excerpt

Fig. 8.13: Calculating Domain age, Relevancy and saving page on relevancy basis for infopathankot.com

Fig. 8.14: Calculating WPR of Pages for infopathankot.com

illustration not visible in this excerpt

Fig. 8.15: Saving pages on WPR basis for infopathankot.com

illustration not visible in this excerpt

Fig. 8.16: Unique pages for infopathankot.com

illustration not visible in this excerpt

Fig. 8.17: STN and MD5 Rejected Pages for infopathankot.com

8.4 Snapshot of saved Visits of Links

These are the visits of links of various web pages, collected from various resources and stored on our server, for those websites that have added our VOL-Analytics script to their web pages. The AJAX-based VOL-Analytics script sends the details of each web page, such as its own address, its caller address, the number of times the page was called from that particular resource, and its website ID, to our server, where they are tracked and stored in our database. A snapshot of some stored visits of links is shown below.
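
A rough PHP sketch of the server-side endpoint that such an AJAX script could post to is given below. The table name (vol_visits), the column names and the database credentials are illustrative assumptions for this sketch, not taken from the actual crawlnsearch.com implementation.

<?php
// Illustrative tracking endpoint for the VOL-Analytics script. The AJAX
// script posts the page address, its caller (referrer) address and the
// website ID; the endpoint keeps one counter per (website, page, caller).
// Table, column names and credentials are assumptions for this sketch.
$db = new mysqli('localhost', 'crawl_user', 'secret', 'crawlnsearch');

$website_id = isset($_POST['website_id']) ? (int) $_POST['website_id'] : 0;
$page_url   = isset($_POST['page_url'])   ? $_POST['page_url']         : '';
$caller_url = isset($_POST['caller_url']) ? $_POST['caller_url']       : '';

// Assumes a UNIQUE key on (website_id, page_url, caller_url) so that a
// repeated visit from the same resource only increments the counter.
$stmt = $db->prepare(
    'INSERT INTO vol_visits (website_id, page_url, caller_url, visits)
     VALUES (?, ?, ?, 1)
     ON DUPLICATE KEY UPDATE visits = visits + 1'
);
$stmt->bind_param('iss', $website_id, $page_url, $caller_url);
$stmt->execute();
$stmt->close();
$db->close();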

illustration not visible in this excerpt

Fig. 8.18: Visits of Links data on crawlnsearch.com server

8.5 Snapshot of WPRVOL Scores of pages

As per our proposed approach, after 45 days our server executes a cron job that calculates the WPRVOL score of the pages stored in the database (i.e. the crawled data). If the WPRVOL score of a page is less than the threshold (WPR_VOL_THRESHOLD), that page is deleted from the database. Pages with a WPRVOL score below WPR_VOL_THRESHOLD are pages that are never browsed (accessed), so by deleting such unused data we optimize the crawled data.

Cron jobs are used for scheduling tasks to run on the server. They are most commonly used for automating system maintenance or administration, but they are also relevant to web application development, since a web application may need certain tasks to run periodically.
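
As a concrete illustration, the clean-up step of such a cron job could look like the PHP sketch below. The crontab line, the table and column names (crawled_pages, wprvol_score) and the threshold value are assumptions made for the sketch; the WPRVOL scores themselves (Section 7.8) are assumed to have been computed beforehand and stored with each page.

<?php
// Sketch of the periodic clean-up described above, scheduled for example
// with a crontab entry such as:
//   0 3 * * * /usr/bin/php /var/www/crawlnsearch/wprvol_cleanup.php
// Table/column names and the threshold value are assumptions; the WPRVOL
// score of each page is assumed to be already computed and stored.
define('WPR_VOL_THRESHOLD', 0.15);        // assumed threshold value

$db = new mysqli('localhost', 'crawl_user', 'secret', 'crawlnsearch');

// Number of saved pages before the clean-up (488 in the run reported below).
$before = $db->query('SELECT COUNT(*) AS c FROM crawled_pages')->fetch_assoc();

// Delete every crawled page whose WPRVOL score fell below the threshold,
// i.e. pages that were never (or hardly ever) browsed.
$stmt = $db->prepare('DELETE FROM crawled_pages WHERE wprvol_score < ?');
$threshold = WPR_VOL_THRESHOLD;
$stmt->bind_param('d', $threshold);
$stmt->execute();
$deleted = $stmt->affected_rows;
$stmt->close();

echo 'Pages before clean-up: ' . $before['c']
   . ', deleted: ' . $deleted
   . ', left: ' . ($before['c'] - $deleted) . "\n";
$db->close();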

Before the execution of the VOL script, the number of saved pages in the database is 488. After the VOL script execution, 10 unbrowsed pages are deleted, leaving 478 pages in the database.

illustration not visible in this excerpt

Fig. 8.19: VOL enhancement started

illustration not visible in this excerpt

Fig. 8.20: WPRVOL Score Computation part 1

illustration not visible in this excerpt

Fig. 8.21: WPRVOL Score Computation part 2

illustration not visible in this excerpt

Fig. 8.22: Pages retained in database after WPRVOL Threshold check

illustration not visible in this excerpt

Fig. 8.23: Pages deleted in database after WPRVOL Threshold check

8.6 Snapshot of Reduced Execution time while searching

The snapshot below shows the reduced execution time of the search engine, which lets it deliver search results to users faster. With our proposed crawler architecture the search takes 0.0382871627808 seconds, compared with 0.0808229446411 seconds for a general search engine.
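
The timing above can be obtained with PHP's microtime(), as in the minimal sketch below; the query, table and keyword used here are placeholders, not the actual search implementation.

<?php
// Minimal sketch of how the search execution time is measured with
// microtime(); the query, table and keyword are placeholders only.
$db = new mysqli('localhost', 'crawl_user', 'secret', 'crawlnsearch');

$keyword = $db->real_escape_string('chartered accountant');

$start = microtime(true);                 // wall-clock time in seconds
$res = $db->query(
    "SELECT page_url FROM crawled_pages
     WHERE page_text LIKE '%$keyword%'
     LIMIT 10"
);
$elapsed = microtime(true) - $start;

echo 'Search returned ' . $res->num_rows . ' results in '
   . $elapsed . " seconds\n";
$db->close();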

illustration not visible in this excerpt

Fig. 8.24: Reduced Execution time for search

8.7 Comparison of Weighted Page Rank (WPR) with Page Rank (PR)

For the comparison of Page Rank (PR) and Weighted Page Rank (WPR) we used the seed infopathankot.com and obtained different numbers of saved pages. As the number of fetched pages grows, WPR saves fewer pages than PR, and, as shown earlier by Wenpu Xing et al. [24], WPR finds more high-quality pages than PR.
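
For reference, the comparison uses the standard definitions of PR and WPR from [22] and [24], recapped below; here d is the damping factor, B(u) the set of pages linking to u, R(v) the set of pages that v links to, and I_p and O_p the numbers of inlinks and outlinks of page p.

PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{O_v}

WPR(u) = (1 - d) + d \sum_{v \in B(u)} WPR(v)\, W^{in}_{(v,u)}\, W^{out}_{(v,u)},
\qquad W^{in}_{(v,u)} = \frac{I_u}{\sum_{p \in R(v)} I_p},
\qquad W^{out}_{(v,u)} = \frac{O_u}{\sum_{p \in R(v)} O_p}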

TABLE 8.1 SAVED PAGES RESULTS OF WPR AND PR

illustration not visible in this excerpt

Fig. 8.25: No. of fetched pages vs. no. of saved pages

CHAPTER 9 CONCLUSION AND FUTURE SCOPE

9.1 Conclusion

The Internet is one of the most accessible sources available today for searching and retrieving any sort of data from around the world. The structure of the World Wide Web is a graph, and the links given in a page can be used to open other web pages. Web crawlers are the programs that use this graph structure of the Web to move from page to page. In this thesis we have briefly discussed crawlers and algorithms that enhance the quality of the crawling process by crawling important and relevant pages. The crawler is an important module of a search engine, and its quality directly affects the searching quality of the search engine. Another approach has also been suggested in this thesis to optimize the crawled data by deleting never-browsed or least-browsed pages from it. Finally, an extended architecture has been proposed to make the crawling process effective.

9.2 Future Scope

For future work, we plan to add a dashboard facility for the webmasters who add our VOL-Analytics script to their websites. With the dashboard they will be able to see from which resources (such as search engines, blogs, forums and other websites) visitors are coming to their sites, how many times the same page is requested from another page or from the same resource, and from which location, browser and operating system a page is accessed. To further improve the quality of crawling, work can be done on finding better threshold values for the Weighted Page Rank and the Weighted Page Rank based on Visits of Links algorithms. Our proposed crawling architecture also needs a large amount of memory for its operation; reducing this requirement would be a significant enhancement of this work.

REFERENCES

[1] Internet World Stats survey report available at - << http://www.internetworldstats.com/stats.htm >>.

[2] Pew Research Center’s Internet and American Life Project Survey report available at - << http://www.pewinternet.org/2012/03/09/main-findings-11/ >>.

[3] Average Traffic a website receives from a Search Engine is available at - << http://moz.com/community/q/what-is-the-average-percentage-of-traffic-from-search-engines-that-a-website-receives >>.

[4] Google can index the content of Flash files available at - << https://support.google.com/webmasters/answer/72746?hl=en >>.

[5] Brian Pinkerton, “WebCrawler: Finding What People Want”, Doctor of Philosophy University of Washington, 2000.

[6] Size of World Wide Web is available at - << http://www.worldwidewebsize.com/ >>.

[7] Carlos Castillo, Mauricio Marin, Andrea Rodriguez and Ricardo Baeza-Yates, “Scheduling Algorithms for Web Crawling”, Proceedings of the Web Media LA-Web 2004, 0-7695-2237-8 © 2004 IEEE, pp. 10-17.

[8] Ravikiran Routhu, “Enrichment in Performance of Focussed Web Crawlers”, Master Thesis, is available at - << http://dspace.thapar.edu:8080/dspace/bitstream/10266/1261/3/1261.pdf >>.

[9] Fabrizio Silvestri, “High Performance Issues in Web Search Engines: Algorithms and Techniques”, Ph.D. Thesis TD 5/04, available at - << http://hpc.isti.cnr.it/_silvestr >>.

[10] General Structure of Search Engine is available at - << http://ieeexplore.ieee.org/iel5/2/34424/01642621.pdf >>.

[11] Working of General Search Engine image is available at - << http://aditichawlaa.blogspot.in/2011/05/search-engine.html >>.

[12] M.P.S.Bhatia, Divya Gupta, “Discussion on Web Crawlers of Search Engine”, Proceedings of 2nd National Conference on Challenges Opportunities in Information Technology (COIT), pp. 227-230, March 2008.

[13] Spider for AltaVista and Standard for Robot Exclusion (SRE) information is available at - << http://whatis.techtarget.com/definition/spider >>.

[14] Controlling what a spider sees on your site available at - << http://www.wpthemesplanet.com/2009/09/how-does-web-crawler-spider-work/ >>.

[15] Pushkar Jha, Riddhi Patel and Aditya K. Sinha, “Analysis of Searching Algorithms in Web Crawling”, International Journal for Scientific Research Development (IJRSD), ISSN: 2331-0613, vol. 1, issue 2, pp. 254-257, 2013.

[16] Components and Working of Web Crawler Image is available at - << http://www.milkaddict.com/web-crawlers-googlebot/ >>.

[17] Sandeep Sharma, “Web-Crawling Approaches in Search Engines”, Master Thesis, is available at - << http://dspace.thapar.edu:8080/dspace/bitstream/10266/620/3/T620.pdf >>.

[18] Junghoo Cho, Hector Garcia-Molina: “Parallel Crawlers”, 7-11 May 2002, Honolulu, Hawaii, USA is available at - << http://ilpubs.stanford.edu:8090/733/1/2002-9.pdf >>.

[19] Sachin Gupta, Sashi Tarun and Pankaj Sharma, “Controlling access of Bots and Spamming Bots”, International Journal of Computer and Electronics Research (IJCER), ISSN: 2278-5795, vol. 3, issue 2, pp. 87-92, April 2014.

[20] List of Published Crawler Architectures and Open Source Crawlers is available at - << http://www.wikipedia.org/wiki/web_crawler >>.

[21] Marc Najork and Janet L. Wiener, “Breadth-first search crawling yields high-quality pages”, WWW10 Proceedings, May 2-5, 2001, Hong Kong.

[22] S. Brin and L. Page, “The Anatomy of a Large Scale Hypertextual Web Search Engine”, Computer Networks and ISDN Systems, vol. 30, issue 1-7, pp. 107-117, 1998.

[23] Shweta Agarwal and Bharat Bhushan Agarwal, “An Improvement on Page Ranking Based on Visits of Links”, International Journal of Science and Research (IJSR), ISSN: 2319-7064, vol. 2, issue 6, pp. 265-268, June 2013.

[24] Wenpu Xing and Ali Ghorbani, “Weighted PageRank Algorithm”, Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR ’04), IEEE, 2004.

[25] Gyanendra Kumar, Neelam Duahn, and Sharma A. K., “Page Ranking Based on Number of Visits of Web Pages”, International Conference on Computer Communication Technology (ICCCT)-2011, 978-1-4577-1385-9.

[26] Neelam Tyagi and Simple Sharma, “Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page”, International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, vol. 2, issue 3, pp. 441-446, July 2012.

[27] Sachin Gupta and Pallvi Mahajan, “Improvement in Weighted Page Rank based on Visits of Links (VOL) Algorithm”, International Journal of Computer and Communications Engineering Research (IJCCER), ISSN: 2321-4198, vol. 2, issue 3, pp. 119-124, May 2014.

[28] Pavalam S. M., S. V. Kasmir Raja, Jawahar M., and Felix K. Akorli, “Web Crawler in Mobile Systems”, International Journal of Machine Learning and Computing, vol. 2, issue 4, pp. 531-534, August 2012.

[29] Yaduvir Singh, “Review Paper on Web Crawler”, International Journal of Advance Research In Science and Engineering (IJARSE), ISSN-2319-8354, vol. 1, issue 1, Feb 2012.

[30] S. Balaji and S. Sarumathi, “TOPCRAWL: Community Mining in Web search Engines with emphasize on Topical crawling”, International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME), 978-1-4673-1039-0/12 © 2012 IEEE, pp. 20-24.

[31] Dhiraj Khurana and Satish Kumar, “Web Crawler: A Review”, International Journal of Computer Science Management Studies (IJCSMS), ISSN: 2231-5268, vol. 12, issue 1, pp. 401-405, Jan 2012.

[32] Andri Mirzal, “Design and Implementation of a Simple Web Search Engine”, International Journal of Multimedia and Ubiquitous Engineering, vol. 7, issue 1, pp. 53-60, Jan 2012.

[33] Animesh Tripathy and Prashanta K Patra, “A Web Mining Architectural Model of Distributed Crawler for Internet Searches Using PageRank Algorithm”, Asia-Pacific Services Computing Conference, 978-0-7695-3473-2/08 © 2008 IEEE, pp. 513-518.

[34] Lay-Ki Soon, Yee-Ern Ku and Sang Ho Lee, “Web Crawler with URL Signature - A Performance Study”, 4th Conference on Data Mining and Optimization (DMO) 978-1-4673-2718-3/12 ©2012 IEEE, pp. 127-130.

[35] Farha R. Qureshi and Amer Ahmed Khan, “URL Signature with body text normalization in a web crawler”, International Journal of Societal Applications of Computer Science (IJSACS), ISSN: 2319-8443, vol. 2, issue 3, pp. 309-312, March 2013.

[36] Saurabh Pakhidde, Jaya Rajurkar and Prashant Dahiwale, “Content Relevance Prediction Algorithm in Web Crawlers to Enhance Web Search”, International Journal of Advanced Research in Computer Engineering Technology (IJARCET), ISSN: 2278-1323, vol. 3, issue 3, March 2014.

[37] Prashant Dahiwale, Pritam Bhowmik, Tejaswini Bhorkar and Shraddha Shahare, “Rank Crawler: A Web Crawler with Relevance Prediction Mechanism for True Web Analysis”, International Journal of Advance Foundation and Research in Computer (IJAFRC), ISSN: 2348-4853, vol. 1, issue 4, April 2014.

[38] Information on HTTP_Referer is available at - << http://en.wikipedia.org/wiki/HTTP_referer >>.

[39] Information on Url Normalization is available at - << http://en.wikipedia.org/wiki/Url_normalization >>.

[40] Information on MD5 Hashing Algorithm is available at - << http://en.wikipedia.org/wiki/MD5 >>.

[41] Introduction to WHOIS is available at - << http://en.wikipedia.org/wiki/Whois >>.

[42] Junghoo Cho, Hector Garcia-Molina and Lawrence Page, “Efficient Crawling Through URL Ordering”, Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.

[43] Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin and Andrea Rodriguez, “Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering”, Special interest tracks and posters of the 14th International Conference on World Wide Web, pp. 864-872, 2005.

[44] Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, “The PageRank Citation Ranking: Bringing Order to the Web”, pp. 1-17, January 1998, available at - << http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf >>.

End of excerpt from 92 pages

Details

Title
Enhancement in Web Crawler using Weighted Page Rank Algorithm based on VOL
Subtitle
Extended Architecture of Web Crawler
Course
M.Tech
Author
Sachin Gupta
Year
2014
Pages
92
Catalog Number
V276630
ISBN (eBook)
9783656697671
ISBN (Book)
9783656700043
File size
13605 KB
Language
English
Notes
Excellent
Keywords
Web Crawler, Weighted Page Rank based on Visits of links, Weighted Page Rank, Page Rank, Page Rank based on visit of links, Search Engine, Crawling, bot, Information Retrieval Engine, World Wide Web, Inlinks, Outlinks, Web informational retrieval, online search.
Cite this work
Sachin Gupta (Author), 2014, Enhancement in Web Crawler using Weighted Page Rank Algorithm based on VOL, Munich, GRIN Verlag, https://www.grin.com/document/276630
