Analysis of e-Commerce Market Trend using Text Mining

Research paper, 2017

5 pages, grade: 3.93

Aditi Agarwal¹, Mohit Supe¹, Raviteja Ayyagari¹


The chief objective of this paper is to study and understand current trends in the e-commerce market by analyzing a set of html files. We do this by applying text visualization and text mining techniques to the text read from a few html files in a local folder, and thereby analyzing the themes on which that text has been composed.

1. Introduction

News is abundantly available through many sources: e-newspapers read on the internet, printed copies of the same newspapers, and articles published online. All of these resources are text. When the task is to analyze numerical or categorical data, we are on safe ground and can work on the data directly. The situation here is different altogether.

We are dealing with large volumes of data presented as text, which certainly contains numbers but lacks the traditional rows and columns that make analysis easy. Instead, it consists of paragraphs of text, which raises the most complicated issues. Fortunately, the conventional machine learning and data mining methodologies that we generally use for standard data can be applied to text data as well (6).

So, why all this drama about text processing? Are we not happy reading an article just as we would read a Sidney Sheldon novel and enjoy the essence of it? Why have text processing and text mining suddenly become so prominent? Why is deep learning so important, and why is it said to hold the future in its hands? People are rarely satisfied with what they get and always look for what more they can get out of it. The same is true here: the reader has always wanted something more than just reading the novel.

- Collects the most frequent words and terminology from the text, visualizes them, and performs clustering.
- Identifies trends in the text, using statistical and machine learning techniques to analyze the behavior of the text.
- Makes use of Natural Language Processing (NLP) techniques, which are well suited to unstructured text data.
- Converts unstructured text data into numerical data, to which classical data mining techniques are then applied.

2. Related Work

Text mining is a form of data mining that enables organizations to derive important business-related insights from text such as articles, emails, tweets, Facebook posts, or job notifications on LinkedIn. Mining text data, i.e. unstructured data, with the help of Natural Language Processing (NLP), machine learning, or any statistical modeling technique is a herculean task, and NLP has its own shortcomings. NLP is therefore often the last option considered for text mining. Text data has irregularities that are attributed to its imperfect syntax, and it contains language-related ambiguities and multiple entries of observations (12).

The primary advantage of text analytics software is that it helps transform the words and phrases in text data into numerical data that can be mapped to structured data in a relational database and then analyzed with conventional data mining methods. By following an iterative approach, any company can use text analytics to build knowledge of content-specific observations such as sentiment, intensity, and humor, to name a few. However, text analytics is not widely implemented across industries, so the end results and the thoroughness of the analysis may differ from one industry to another.

Usually, for classification to succeed, i.e. for a classifier to label elements correctly, we must train it with pre-classified documents from every category, so that it can generalize the model learned from those documents and apply it to accurately classify documents it has not seen (2).

Much research on text data mining, or text mining for short, has come into the limelight. One study concludes that text mining is primarily responsible for organizing the inconsistent patterns found in human language, in other words natural language. Since scores of people interact with each other through text in one way or another, we do not have the luxury of dealing only with structured data. Hence we have text mining, which is by far the ideal methodology for such situations. Among related techniques, NLP is becoming one of the most active fields of research (9).

The primary objective of NLP is to understand how machines evaluate information from human natural language and to build models with surgical precision. It aims to convey a well-structured and meaningful message from data that is unstructured and not well organized. Another study states that text mining examines the sources from which data needs to be extracted, with the understanding that the extracted data can be used for specific tasks. It is quite possible for text mining to involve NLP in the model to make sure that human language is successfully evaluated and the unstructured data patterns are well organized. As technology advances, text mining will find more successful applications, which explains why people are opting for it.

Many subject matter experts have demonstrated techniques involving text data mining and its different methodologies. These techniques were successful and were widely acknowledged across the data mining community for their effectiveness in extracting meaningful information. The research detailed the circumstances in which each application of the technology is put to efficient use for a wide range of users. Scores of business-oriented firms are interested in applying text mining to data that mentions them on platforms such as social media, news articles, and journals. By clustering this information, companies can take business decisions that create a positive impact on their organization (17).

3. Data

The data, as discussed earlier, is a set of e-commerce related article files that are stored in htm or html format and read from a local folder. The web pages are articles collected from different sources such as The Economist, Business Computing World, and Insider Media.

We extracted the 19 web pages as htm/html files and stored them on a local drive. Our target is to extract the text from these html files and apply text processing techniques to it. This requires substantial data pre-processing, which is discussed in the following sections.

4. Methodology

We follow simple data mining techniques, which can be applied to text data as well. Before that, let us look at the data pre-processing step that parses the html files into text. We discuss each technique in detail below.

First, all the htm/html files are stored in a local folder. Using the xpathSApply and htmlParse functions in R, we create an htmltotext function and then use the lapply and sapply functions to convert these html files to text data and clean out the non-ASCII characters.
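The conversion step can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not the authors' R code; the names TextExtractor, html_to_text, and load_folder are hypothetical, and skipping script and style content is an assumption of this sketch rather than something the paper states.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string with non-ASCII characters dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return text.encode("ascii", errors="ignore").decode("ascii")


def load_folder(folder):
    """Map each .htm/.html file name in a folder to its extracted text."""
    paths = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in (".htm", ".html")
    )
    return {
        p.name: html_to_text(p.read_text(encoding="utf-8", errors="ignore"))
        for p in paths
    }
```

Calling load_folder on the folder of saved articles would give one text string per file, which plays the role of the input to the corpus in the next step.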

The next step is the creation of a corpus for text mining. By a corpus, we mean a collection of texts from different sources consolidated into one body of text. Once the corpus is ready, we clean the text by removing unwanted words and phrases, numbers, and any punctuation that is present.

We use removeWords together with stopwords in the tm_map function in R to remove the English stopwords, i.e. words like a, an, the, and, or, etc. These are the words of least importance to us, so we can remove them. Once the stopwords are removed, we remove the numbers, which can be done with the same tm_map function and removeNumbers, since we do not need numbers for text mining.
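The cleaning steps described above can be sketched as follows. This is a Python standard-library illustration, not the tm package itself; the function name clean_text and the very short stop-word list are assumptions made for the example (the English list in tm is much longer), and replacing punctuation with a space is a choice of this sketch.

```python
import re

# A tiny illustrative stop-word list; a real English list is much longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase the text, drop numbers and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and other symbols
    return [word for word in text.split() if word not in stopwords]
```

For example, clean_text("The 19 pages, and an Amazon store!") keeps only the tokens pages, amazon, and store.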

Once we have text data free of punctuation, stopwords, and numbers, we convert the text into a term-document matrix using the TermDocumentMatrix() function in R, which records each word and its frequency in each document. Table 1 shows the top 10 most frequently occurring terms across all the documents.
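To make the structure of such a matrix concrete, here is a small Python sketch that builds one from tokenized documents. The function name term_document_matrix and the file names in the usage note are hypothetical; the layout (terms as rows, documents as columns) follows the usual convention for a term-document matrix.

```python
from collections import Counter


def term_document_matrix(docs):
    """Build a term-document matrix from a dict of name -> list of tokens.

    Returns (terms, names, rows), where rows[i][j] is the number of times
    terms[i] occurs in the document names[j].
    """
    names = list(docs)
    counts = [Counter(docs[name]) for name in names]
    terms = sorted(set().union(*(c.keys() for c in counts)))
    rows = [[c[term] for c in counts] for term in terms]
    return terms, names, rows
```

With two made-up documents, {"a.html": ["amazon", "online", "amazon"], "b.html": ["online", "business"]}, the row for amazon is [2, 0] and the row for online is [1, 1].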

Once we have the term-document matrix, we perform hierarchical clustering and display the result as a dendrogram. In this clustering, the dissimilarity between objects is measured by the Euclidean distance (8).
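As a brief illustration of the dissimilarity measure, the sketch below computes Euclidean distances between rows of a term-document matrix, so that each term is compared by its counts across documents. The function names are hypothetical and the code is Python, not the R call used in the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(rows):
    """Pairwise Euclidean distances between all rows (one row per term)."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]
```

Two terms with identical counts in every document have distance 0, and the distance grows as their per-document counts diverge.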

We subset the data to the least frequent words, perform a cluster analysis of those infrequent terms, and check whether they can be clustered into groups and whether they can be part of the final analysis.

Table 1. Partial Term Document Matrix with the most frequent Terms

In Figure 1, the words in the clusters are the least frequent words, which appear fewer than 10 times across the 19 html files. The words giving and results are grouped together first, since they are more similar to each other than either is to the word powered. By the same reasoning, the group formed by giving and results is more similar to powered than to the word problem, so these are merged at the second level. We repeat the same process until all the data points, which in this case are words, have been clustered. Figure 1 shows the resulting dendrogram of the hierarchical clustering.

Figure 1. Dendrogram using the euclidean distance
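The merging process behind such a dendrogram can be sketched in a few lines of Python. This is an illustration only: the paper does not state which linkage rule was used, so complete linkage (the distance between two clusters is their largest pairwise distance) is an assumption here, and the function name agglomerate and the count vectors in the usage note are made up.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors):
    """Agglomerative clustering with complete linkage (an assumed choice).

    vectors maps a label to its count vector. Returns the merges in order,
    each as (cluster_a, cluster_b, distance), with clusters as label tuples.
    """
    clusters = [(label,) for label in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: largest distance between any two members.
                d = max(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

With invented counts such as giving = [1, 0, 1], results = [1, 0, 2], powered = [3, 0, 1], and problem = [0, 6, 0], the first merge joins giving and results, the second adds powered, and problem joins last, which mirrors the order described for Figure 1.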

The next step is to find the words with the highest frequency, prepare the data for a bar plot to analyze the primary keywords of all the documents, and then produce a wordcloud. To get the words with the highest frequency, we return to the corpus that we created by removing all the stopwords, numbers, etc. The term counts from that corpus are in the form of a matrix, so we convert them into a data frame that lists the words in decreasing order of frequency.
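A minimal Python sketch of this conversion sums each term's counts over all documents and sorts the totals in decreasing order. The function name frequency_table is hypothetical, and breaking ties alphabetically is a choice of this sketch, not something stated in the paper.

```python
def frequency_table(terms, rows, top_n=10):
    """Return the top_n (term, total count) pairs in decreasing order of count.

    rows[i] holds the per-document counts of terms[i]; ties are broken
    alphabetically so the result is deterministic.
    """
    totals = [(term, sum(row)) for term, row in zip(terms, rows)]
    totals.sort(key=lambda pair: (-pair[1], pair[0]))
    return totals[:top_n]
```

The resulting list of pairs is the kind of table that can be passed directly to a bar plot or a wordcloud routine.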

Table 2. Partial Table for the most frequent words

Figure 2. Bar Plot For the Top 10 Frequent Words

We design the bar plot by taking just the ten most frequent words out of all the words in the corpus. The frequent words can also be read from a tabular representation. Table 2 shows that amazon is the most frequent word, and the bar plot in Figure 2 confirms that amazon appears the most times across all the documents combined. The next most frequent word is business. In this way, we can visualize the behavior of text in a particular set of articles using a bar plot.
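For readers without a plotting library at hand, the idea of the bar plot can be approximated in plain text. The sketch below is Python, the function name text_bar_plot is hypothetical, and the counts in the usage note are invented for illustration rather than taken from the paper.

```python
def text_bar_plot(freqs, width=40):
    """Render (word, count) pairs as text bars scaled to the largest count."""
    if not freqs:
        return []
    peak = max(count for _, count in freqs)
    label_width = max(len(word) for word, _ in freqs)
    lines = []
    for word, count in freqs:
        # Each bar is proportional to the count, with at least one mark.
        bar = "#" * max(1, round(width * count / peak))
        lines.append(f"{word.ljust(label_width)} | {bar} {count}")
    return lines
```

For instance, text_bar_plot([("amazon", 4), ("online", 2)], width=8) yields one line with eight marks for amazon and one with four marks for online.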

We then turn to the wordcloud. A wordcloud is an image assembled from the words of a text, in which the size of each word reflects how frequently it appears. To produce the wordcloud in R, we use the wordcloud function. We impose the condition that a word must appear in the documents at least 15 times, so the minimum frequency for any word in the cloud is 15.
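The two ingredients of the wordcloud, the minimum-frequency filter and frequency-based sizing, can be sketched as below. This is Python and does not reproduce the R wordcloud function; the name wordcloud_sizes, the linear scaling, and the font-size range are assumptions of the example, and the counts in the usage note are invented.

```python
def wordcloud_sizes(freqs, min_freq=15, min_size=10, max_size=60):
    """Keep words with count >= min_freq and map counts linearly to font sizes."""
    kept = {word: count for word, count in freqs.items() if count >= min_freq}
    if not kept:
        return {}
    low, peak = min(kept.values()), max(kept.values())
    span = peak - low
    sizes = {}
    for word, count in kept.items():
        # When every kept word has the same count, give all of them max_size.
        scale = (count - low) / span if span else 1.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes
```

With made-up counts {"amazon": 45, "online": 30, "business": 15, "rare": 3}, the word rare is dropped by the threshold of 15 and the remaining words receive sizes 60, 35, and 10.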

Figure 3 shows that, across all the documents, amazon is the most frequent word, followed by business and online as the second and third most frequent words. This gives business users an idea of which keywords to focus on when targeting customers. Placing keywords in this way is effective, since customers will find the right set of keywords when looking for a specific article related to the e-commerce space.

Figure 3. Wordcloud with the most frequent words with a minimum frequency of 15

5. Future Work

Much deeper analysis is taking place in the field of text analysis. Neural networks are being applied to text, and there are methods such as pronoun processing, which supports an analysis of personality, and emotion taxonomies, which help detect the emotion in which a text has been written and support further analysis (16).

Just as we discuss big data, there is also big text, which is growing even faster than big data. We can therefore use deep learning techniques for text summarization, which help an organization produce summarized headlines that target customers according to the current market trend and draw them towards the article or the advertisement. This gives the organization a thorough idea of what exactly the customer is looking at.

We can also look at generating feedback systems, which find applications in different domains such as student feedback systems and appraisal management systems, in which the text fed in by different sources is analyzed according to the criteria for a specific person to generate apt feedback that can help in the growth of the individual.

6. Conclusion

The text analysis has made it evident that any article is driven chiefly by its keywords. It is therefore good practice to keep track of the most frequent words that appear in the articles and use them as keywords, so that customers understand what to focus on and the organization gets an idea of where the e-commerce market is headed.

In this analysis, which consisted of converting 19 e-commerce related html files into text files and then performing text analysis, we found that the most frequent keyword is amazon, which appears 241 times across all the articles. So, if an organization is to sustain itself in the market, it has to closely monitor the primary keywords on which its advertisements or promotional campaigns are based.


¹ Saint Peter’s University, Jersey City, NJ, USA. Correspondence to: Robert Finn.

words that have been used in the novel or the article(4).

That is where the concept of text mining comes from. It is all about making things simple and easy. Imagine a person who is interested in buying a book from the thriller genre but does not know whether the book he selected at random belongs to that genre. So he reads the premise of the book and understands which category it falls into. In doing so, he has effectively relied on text mining or text summarization of the book, which is a bound copy. But for an online article of close to 10 pages, how will one know the summary, the category, or the field the article relates to? That is why the concept of text mining comes into consideration (11).

So, how do we define text mining? According to the Text Mining Handbook (3), text mining is a new and exciting area of computer science research that tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing (NLP), information retrieval, and knowledge management.

Also, for an average company with a medium volume of business, the ratio of unstructured to structured data is around 80% to 20%. So most of the data for any business user is text, in the form of business proposals, proofs of concept, requests for proposals, etc. This is an important argument in favor of text mining's growing popularity in the modern technological world.

Sources of text data can range anywhere from the class homework of an undergraduate student to a research proposal for a new business idea by a company like Microsoft. There is data from Facebook posts, Twitter tweets, news articles, editorials,

Text mining, in brief, does the following:

- Removes all the unnecessary material, such as punctuation, stopwords, and words that carry no significance for the text.

1 S. Ananiadou, D. B. Kell, and J. I. Tsujii. Text mining and its potential applications in systems biology. Trends in Biotechnology, 24(12):571-579, 2006.

2 Vishwanath Bijalwan, Vinay Kumar, Pinki Kumari, and Jordan Pascual. KNN based machine learning approach for text and document mining. International Journal of Database Theory and Application, 7(1):61-70, 2014.

3 Robert Dale, Hermann Moisl, and H. L. Somers, editors. Handbook of natural language processing. Marcel Dekker, New York, NY, 2000.

4 Jiawei Han, Chi Wang, and Ahmed El-Kishky. Bringing structure to text: mining phrases, entities, topics, and hierarchies. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 1968-1968, New York, NY, 2014. ACM.

5 M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168-177. ACM, August 2004.

6 B. J. Jansen and M. Resnick. An examination of searcher’s perceptions of nonsponsored and sponsored links during ecommerce web searching. Journal of the Association for Information Science and Technology, 57(14):1949-1961, 2006.

7 Z. M. Jusoh and G. H. Ling. Factors influencing consumers’ attitude towards e-commerce purchases through online shopping. International Journal of Humanities and Social Science, 2(4):223-230, 2012.

8 Roman Klinger, Robert Pesch, Theo Mevissen, and Juliane Fluck. Text mining in full text articles - methodical and representation issues. Nature Precedings, 2009.

9 R. LaRose. On the negative effects of ecommerce: A sociocognitive exploration of unregulated online buying. Journal of Computer-Mediated Communication, 6(3), 2001.

10 Q. Mei and C. Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 198-207. ACM, August 2005.

11 W. W. Moe and P. S. Fader. Dynamic conversion behavior at e-commerce sites. Management Science, 50(3):326-335, 2004.

12 Francisco Villarroel Ordenes, Babis Theodoulidis, Jamie Burton, Thorsten Gruber, and Mohamed Zaki. Analyzing customer experience feedback using text mining: A linguistics-based approach. Journal of Service Research, 17(3):278-295, 2014.

13 K. B. Patel, J. A. Chauhan, and J. D. Patel. Web mining in e-commerce: Pattern discovery, issues and applications. International Journal of P2P Network Trends and Technology, 1(3):40-45, 2011.

14 A. H. Tan. Text mining: The state of the art and the challenges. In Proceedings of the PAKDD Workshop on Knowledge Discovery from Advanced Databases, 8:65-70, April 1999.

15 Chih-Hsuan Wei, Hung-Yu Kao, and Zhiyong Lu. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Research, 41(W1):518-522, 2013.

16 B. Xie, Q. Ding, H. Han, and D. Wu. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics, 29(5):638-644, 2013.

17 C. Zott, R. Amit, and J. Donlevy. Strategies for value creation in e-commerce: best practice in Europe. European Management Journal, 18(5):463-475, 2000.

Keywords: Analysis, e-Commerce, Market, Text Mining, Business
Cite this work: Raviteja Ayyagari (author), 2017, Analysis of e-Commerce Market Trend using Text Mining, Munich, GRIN Verlag

