Web Fetcher: A SMS Marketing Solution

Bachelor Thesis, 2013

77 Pages, Grade: A



1. Introduction
1.1. Text Mining
1.2. Information Extraction
1.3. SMS Marketing
1.4. Project description
1.5. Benefits that will come from solution
1.6. Limitations
1.7. Need for Solution
1.8. Outline

2. Basic Concepts
2.1. Web Extraction
2.2. Limitations

3. Literature Review
3.1. Information Extraction
3.2. Learning Extractors from Unlabeled Text using Relevant Databases
3.3. Information Extraction from the Web: Techniques and Applications
3.4. A Survey of Web Information Extraction Systems
3.5. Information Extraction A Survey

4. System Design
4.1. Proposed Model Design
4.2. Flow Chart

5. Implementation
5.1. System Development tool
5.2. System Requirements
5.3. Software Requirements
5.4. Hardware Requirements
5.5. System Description
5.6. Hybrid Approach
5.7. Libraries used in the software

6. Testing
6.1. Test case01
6.2. Test case02
6.3. Test case03
6.4. Test case04
6.5. Results

7. Conclusion & Future Work
7.1. Conclusion
7.2. Future Work

8. References


Table 1: Context Model of System

Table 2: Process Model of System

Table 3: Time Sequence of System

Table 4: Use case view of System

Table 5: Flow chart symbols and their description

Table 6: Shows the libraries used in the software

Table 7: Shows the Test Case1

Table 8: Shows the Test Case2

Table 9: Shows the Test Case3

Table 10: Shows the Test Case4

Table 11: Shows the error rate of the system in terms of time

Table 12: Shows the actual time required to extract phone numbers


Figure 1: Methodology used for extracting data

Figure 2: Context Model of System

Figure 3: Process Model of System

Figure 4: Time Sequence of System

Figure 5: Use case Diagram of System

Figure 6: Flow chart of Link Extraction

Figure 7: Flow chart of Webpage Phone Number Extraction

Figure 8: Flow chart of Webpage Phone Number Extraction of Selected Web Pages

Figure 9: Flow chart of Website Phone Number Extraction

Figure 10: Shows to enter url

Figure 11: Shows to extract links

Figure 12: Display all extracted links

Figure 13: Shows to enter url

Figure 14: Shows to extract phone numbers of webpage

Figure 15: Shows to select “Numbers of webpage”

Figure 16: Shows the phone numbers extracted from Punjab public service commission website

Figure 17: Shows to select “Numbers of website”

Figure 18: Shows to select “Numbers of webpage”

Figure 19: Shows to select “Get Phone Numbers”

Figure 20: Display all the phone numbers extracted from Punjab public service commission website

Figure 21: Shows to select “Customize phone numbers”

Figure 22: Shows to enter webpage links

Figure 23: Display extracted phone numbers from three webpages

Figure 24: Shows to save extracted phone numbers

Figure 25: Shows a dialog box to save phone numbers in a text file format

Figure 26: Shows the saved phone numbers in a text file

Figure 27: Shows the “save” option to save the extracted links

Figure 28: Shows a dialog box to save web links in a .csv format

Figure 29: Shows the saved web links in a .csv file

Figure 30: Shows the flow chart hybrid approach

Figure 31: Shows the link extracted from ultimate programming tutorial website

Figure 32: Shows the link extracted from UET Lahore website

Figure 33: Shows the link entered to extract webpage phone numbers

Figure 34: Shows the phone number extracted from hamariweb.com

Figure 35: Shows the phone number extracted from hamariweb.com website

Figure 36: Shows Dialog box to save Phone Numbers to external file

Figure 37: Show the created “phone” text file

Figure 38: Display all the saved phone numbers in a text file



1. Introduction

1.1. Text Mining

Text mining, also referred to as text data mining, is approximately equivalent to text analytics. In text mining, patterns are extracted from natural language text rather than from databases. Text analysis involves information retrieval, tagging, information extraction, pattern recognition and data mining techniques.

1.2. Information Extraction

Information extraction is the task of extracting structured information from unstructured or semi-structured machine-readable documents. It differs from information retrieval, which requires a machine to search a body of information for objects that match a search query. Information extraction provides a low-cost solution, flexibility in development and easy adaptation to new domains.

1.3. SMS Marketing

SMS marketing is an important need today. It uses interactive wireless media to provide customers with time- and location-sensitive, personalized information that promotes goods, services and ideas, thereby generating value for all stakeholders. Traditionally, SMS marketing was cumbersome because collecting numbers was time consuming and inefficient. For example, if a website visitor wants to collect phone numbers from a website, every page of the site has to be visited and read. This requires a lot of time and effort, and even then there is a 70-80% chance of making a mistake when writing down a number.

The main problems faced by most telecommunication companies are:

1.3.1. To find the targeted customers to whom advertisements should be sent.

1.3.2. To find the data or information of interest or relevance.

Nowadays, in the era of SMS advertising and marketing, marketing and advertising companies require a complete and cheap solution to improve their business. This project will provide a complete solution for their domain by searching websites for numbers and targeted customers.

In this project, the focus will be on determining the best possible information extraction technique, one that aims at providing a single uniform query interface to access multiple information sources. It will extract data from entirely unstructured free text written in natural language and present this data in a useful format.

1.4. Project Description

The motivation behind this project was to provide an efficient and cost effective solution for SMS marketing.

1.4.1. Project Objectives

This project proposes to investigate an SMS marketing solution, i.e. software that can search a target website for mobile numbers and fetch and save them, so that it becomes easy for a person to send bulk messages to the fetched numbers.

1.4.2. Problem Statement

Most marketing and advertising companies search for user information and numbers manually in order to proceed with their SMS marketing business, but the error rate of this approach is very high. In addition, the cost, time and effort required for searching for numbers and writing them down manually are also high. This project will provide an immediate solution to all these issues.

1.4.3. Scope

In this research, a data parser will be developed that allows advertisers to fetch and save the phone numbers from a website.

1.5. Benefits that will come from Solution

The project will provide the following benefits:

1.5.1. Will lessen the fetching time of numbers from the whole website.

1.5.2. Will lessen the percentage of error while writing down the numbers.

1.5.3. Will reduce the human effort.

1.6. Limitations

1.6.1. Accuracy rate of fetched data is 70-80%.

1.6.2. Cannot work without internet connection.

1.7. Need for Solution

Today all businesses want to achieve more and more with the least possible time, budget and resources. This project will provide an efficient solution and long-term benefits to its consumers.

1.8. Outline

1.8.1. Chapter 1 provides the brief introduction of the project.

1.8.2. Chapter 2 provides the basic concepts used in this project.

1.8.3. Chapter 3 reviews the related literature.

1.8.4. Chapter 4 provides the system design of the proposed system.

1.8.5. Chapter 5 gives the implementation.

1.8.6. Chapter 6 gives the testing of the system.

1.8.7. Chapter 7 consists of conclusion and the future work that can be done in this area.



2. Basic Concepts

Information Extraction (IE) systems analyze unrestricted text in order to extract information about pre-specified types of entities. IE differs from traditional methods in that it does not rely on search queries or keyword searching. Instead, the aim is to extract information about pre-specified types of entities.

2.1. Web Extraction

Web scraping is a technique used to find information on websites. A web scraper automatically and repeatedly extracts data from unstructured and semi-structured web pages. The four main steps of a web extraction system are:

illustration not visible in this excerpt

Figure 4: Methodology used for extracting data

2.1.1. Web Interaction

Internet applications can be categorized into two categories:

- Client applications that request information,
- And server applications that respond to information requests from clients.

The World Wide Web is an Internet client-server application, where people use browsers to access documents and other data stored on Web servers worldwide. The Web Fetcher builds the connection with the website by using the HttpWebRequest and HttpWebResponse classes. The .NET Framework uses specific classes to provide the three pieces of information required to access Internet resources through a request/response model:

- The Uri class, which contains the URI of the Internet resource which users are seeking.
- The HttpWebRequest class, which contains a request for the resource.
- And the HttpWebResponse class, which provides a container for the incoming response. [1]
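The request/response model described above can be illustrated with a minimal sketch. It is written in Python rather than .NET and is not the thesis implementation: urlparse, Request and urlopen merely play roles analogous to Uri, HttpWebRequest and HttpWebResponse. The function name fetch_html and the default timeout are assumptions made for illustration.

```python
import urllib.request
from urllib.parse import urlparse


def fetch_html(url, timeout=10.0):
    """Fetch a resource and return its body as text (hypothetical helper)."""
    # Role of the Uri class: parse and validate the resource identifier.
    parsed = urlparse(url)
    if not parsed.scheme:
        raise ValueError(f"URL has no scheme: {url!r}")

    # Role of HttpWebRequest: an object describing the request.
    request = urllib.request.Request(url)

    # Role of HttpWebResponse: a container for the incoming response.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as fetch_html("http://example.com/") would need an internet connection, which matches the limitation stated in section 1.6.2.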

2.1.2. Wrapper Generation:

A wrapper is a program that identifies the desired data on target pages, extracts the data and transforms it into a structured format in the following steps: [2]

Extract Web Pages from Websites:

The wrapper generator extracts the links of all web pages from the website by using a core method of the Document Object Model (a language-neutral interface that allows programs and scripts to dynamically access the content of a document), i.e. Document.GetElementsByTagName("name of tag"). The GetElementsByTagName() method returns all elements with the specified tag name. When the DocumentCompleted event occurs, the new document is fully loaded, which means its contents can be accessed in a WebBrowserDocumentCompletedEventHandler.

Extract Phone Numbers by Using Regular Expressions:

Phone numbers are extracted by using regular expressions. A regular expression is a special text string for describing a search pattern. The Regex.Match() method returns the first substring that matches a regular expression pattern in an input string [3]. The Success property of the returned Match object indicates whether a match was found. If a match is found, the returned Match object's Value property contains the substring from the input that matches the regular expression pattern [3]. If no match is found, its value is String.Empty. The NextMatch() method is then called repeatedly to find all remaining matches of the pattern in the input string.
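The two wrapper steps described above can be sketched together as follows. This is a Python analogue under stated assumptions, not the .NET code of the thesis: an HTMLParser subclass stands in for GetElementsByTagName("a"), and re.finditer stands in for the Match/NextMatch loop. The thesis does not state its regular expression, so PHONE_PATTERN (Pakistani-style numbers starting with 0 or +92) and the names LinkCollector, extract_links and find_phone_numbers are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a target
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in the HTML text, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# ASSUMPTION: illustrative pattern only; the thesis does not give its own.
# Matches e.g. "042-1234567" or "+92300 1234567".
PHONE_PATTERN = re.compile(r"(?<!\d)(?:\+92|0)\d{2,3}[-\s]?\d{7}(?!\d)")


def find_phone_numbers(text):
    """Return every non-overlapping pattern match, in order of appearance."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]
```

A stricter or looser pattern changes which strings count as phone numbers, so the pattern would need tuning for each target country and number format.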

2.1.3. Data Transformation:

After the data matching a particular pattern has been obtained, it is refined and filtered to remove letters and other non-numeric characters from the extracted data.
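A minimal sketch of such a filtering step is shown below. The thesis does not spell out its exact rules, so keeping a leading "+", dropping duplicates and requiring a minimum number of digits are all assumptions, as are the names clean_number and clean_numbers.

```python
import re


def clean_number(raw):
    """Strip everything except digits, keeping a leading '+' if present."""
    raw = raw.strip()
    prefix = "+" if raw.startswith("+") else ""
    return prefix + re.sub(r"\D", "", raw)


def clean_numbers(raw_numbers, min_digits=7):
    """Clean each candidate, drop too-short results and duplicates.

    ASSUMPTION: min_digits=7 and order-preserving de-duplication are
    illustrative choices, not rules taken from the thesis.
    """
    seen = set()
    cleaned = []
    for raw in raw_numbers:
        number = clean_number(raw)
        digit_count = len(number.lstrip("+"))
        if digit_count >= min_digits and number not in seen:
            seen.add(number)
            cleaned.append(number)
    return cleaned
```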

2.1.4. Delivering:

After the phone numbers have been extracted, they are saved to an external file.
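The delivery step can be sketched as follows, assuming (as the figure captions suggest) that phone numbers go to a plain text file and links to a .csv file. The one-number-per-line layout, the "link" header row and the function names are assumptions for illustration only.

```python
import csv
from pathlib import Path


def save_numbers(numbers, path):
    """Write one phone number per line to a text file and return its path."""
    path = Path(path)
    path.write_text("".join(f"{number}\n" for number in numbers), encoding="utf-8")
    return path


def save_links(links, path):
    """Write links to a CSV file with a single 'link' column (assumed layout)."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["link"])
        writer.writerows([link] for link in links)
    return path
```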

2.2. Limitations:

2.2.1. The use of scrapers is against the terms of use of some websites, but enforcement is not common. Some webmasters have defenses against web scraper bots: they can block an IP address or use tools that require visitors to verify that they are human.

2.2.2. Differences in the structure of web pages cause difficulty in information extraction.



3. Literature Review

Extracting data from the web involves information extraction techniques. The following sections review related techniques in detail.

3.1. Information Extraction (S.Sarawagi, 2008)

S. Sarawagi has laid out a survey of information extraction research spanning twenty years. The survey focuses on the nature of the extraction task, the nature of the unstructured source, the methods used for extraction, the input resources and the type of output produced. It outlines the information extraction field along these five dimensions and also explains rule-based methods, statistical methods, relationship extraction and the management of information extraction systems in detail. The nature of the extraction task depends on the type of application: enterprise, personal, scientific, or Web-oriented. Enterprise information extraction can be used for tracking news and classified advertisements. Personal information management systems can be used to extract users' personal data. Scientific information extraction can be applied to paper repositories such as PubMed to extract, for example, protein names and their interactions. Web-oriented information extraction can be used to build citation databases, opinion databases, community websites, comparison shopping and structured Web search. The extraction methods are divided into hand-coded methods, which require human experts to define rules for performing the extraction, and rule-based methods, which are driven by hard predicates. Statistical methods are more useful when the input sources are noisy. The survey concludes that, in spite of twenty years of research in the field of information extraction, accuracy remains a big problem.

3.2. Learning Extractors from Unlabeled Text using Relevant Databases (K.Bellare et al., 2007)

This research investigates new ways of taking maximum advantage of databases and text sources to learn extractors automatically. The focus of the research is to maximize the accuracy of the extracted information relative to a baseline model that relies only on the database for supervision. K. Bellare et al. present an extraction framework whose building block is the linear-chain conditional random field (CRF), an undirected graphical model. A conditional random field consists of a sequence of output variables linked to form a linear chain under first-order Markov assumptions. It is trained to maximize the conditional probability of the output variables given the input variables. The CRF model offers the flexibility to efficiently exploit complex overlapping features of the input. Several CRF models are studied in detail in order to increase the accuracy of the extracted data, for example the missing-label linear-chain CRF (M-CRF), the gold-standard extractor (GS-CRF), the database linear-chain CRF (DB-CRF) and the database plus missing-label linear-chain CRF (DB+M-CRF). The research finds that the M-CRF and DB+M-CRF models provide an improvement over the other baselines.
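For reference, the conditional distribution of a linear-chain CRF is usually written in the following standard textbook form. The notation is an assumption introduced here and is not taken from the thesis or from Bellare et al.: x is the input sequence, y the output label sequence of length T, f_k are feature functions with weights λ_k, and Z(x) is the normalizing constant.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{t=1}^{T}
    \exp\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Each factor depends only on the adjacent labels y_{t-1} and y_t, which is the first-order Markov assumption, while the feature functions may inspect the whole input x, which is what allows overlapping features.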

3.3. Information Extraction from the Web: Techniques and Applications (A.Yates, 2007)

A. Yates presents techniques and applications for information extraction from the web. In order to make information extraction useful, the quality of the extracted information should be improved. Accuracy, relevance and the sophistication of meaning are the factors that determine the quality of extracted information. The use of hand-labeled training samples has allowed Hidden Markov Models and Conditional Random Fields to extract information from complex sentences. Techniques like the RESOLVER clustering algorithm make the extracted information useful to many different applications. Function Filter, Coordination-Phrase Filter and the Property Weighted Extracted Shared Property Model are three extensions of RESOLVER which introduce novel techniques for improving the ability to distinguish between related and identical pairs. The statistical parser Woodward makes the extracted information useful for one particular language processing application.

3.4. A Survey of Web Information Extraction Systems (C.H.Chang et al., 2006)

C.H. Chang et al. [1] present the major Web data extraction approaches and compare them along three dimensions, i.e. the task domain, the degree of automation and the techniques used. The criteria of the first dimension explain why an information extraction system fails to handle some Web sites of specific structures. The criteria of the second dimension measure the degree of automation of information extraction systems. The criteria of the third dimension measure the performance of information extraction systems. The focus of the research is on semi-structured documents, which are commonly used for web data.

3.5. Information Extraction A Survey (K.Kaiser et al., 2005)

Information extraction is a text mining technique. Text mining is an approach to trace data and synthesize it into information in order to make it useful. Information extraction systems can be differentiated on the basis of the type of data they handle. There are three types of data: structured data, semi-structured data and unstructured data. The differences between the structures of documents on the web lead to two main problems in information extraction: “wrapper generation” and “wrapper maintenance”. Because the manual generation of extraction rules by knowledge engineers is burdensome, research has been directed towards automating the information extraction task. Two approaches can be applied: supervised learning and unsupervised learning. Both information extraction systems and wrappers can be created manually, semi-automatically, or automatically. Application areas for information extraction and wrapper generation systems are diverse. For example, product information pages from different online retailers are fetched, and the relevant information is extracted and presented to the user in a combined list.



4. System Design

The design of a system gives a view of that system. It shows the systematic flow, purpose and functionality of the system, gives a detailed understanding of it and helps to improve system performance.

A design has to exhibit the following qualities:

- It should be simple and easy to understand
- It must explain the working of the system completely and clearly
- It should cover user needs
- It should be adaptable, so that it can be molded into any shape
- It should be reliable

It also shows the particular method, procedure or set of procedures that has been followed in developing the project, particular to the branch of knowledge.

4.1. Proposed Model Design

The design of the research is based on the architectural model, time sequence and use case diagrams.

4.1.1. Architectural Model

It shows what lies outside the system boundaries, i.e. the system and its relations with other systems.

4.1.2. Process Model

It shows how the system is developed. It shows the functionality of the system in terms of business processes.

4.1.3. Time Sequence Diagram

A time sequence diagram shows the sequence of interactions that take place during a particular use case or use case instance.

4.1.4. Use Case Diagram

It shows the interaction of the user with the system.

Context Model

“Architectural Model”

illustration not visible in this excerpt

Table 4: Context Model of System

Architectural Model

illustration not visible in this excerpt

Figure 5: Context Model of System

“Process Model”

illustration not visible in this excerpt

Table 6: Time Sequence of system


Quote paper
Maria Khalid (Author), Huma Siddiquie (Author), 2013, Web Fetcher: A SMS Marketing Solution, Munich, GRIN Verlag, https://www.grin.com/document/275844

