Table of Contents
ABSTRACT
INTRODUCTION
LITERATURE REVIEW
SYSTEM ARCHITECTURE
CONCLUSION
REFERENCES
ABSTRACT
The volume of content on the internet is growing at a very fast pace, and with it the need to convert unstructured data into structured data. Thousands of data types are available on the web today, stored in many formats, including text documents, images, Excel spreadsheets and more. The desire to obtain more data with less effort has driven research in this field. The purpose of our project is to fetch data from different file formats such as .pdf, .txt, .doc and .xls. After the data is fetched from a file, it is transformed and exported into a database. No special format editor needs to be installed on the user's machine; the tool can access files in different formats irrespective of the editors they normally require.
INTRODUCTION
It is very difficult for human beings to manually prepare summaries of large documents or to extract their keywords and key phrases, so a tool is required to simplify these tasks. The tool described here targets English files only. Even very large files can be extracted easily thanks to parallel processing, a technique in which a very large file is divided into smaller parts that are processed in parallel, enabling faster execution. In this context, a file of more than 5 pages is considered large. After the file has been extracted, a repository of the data is created, on which many different functionalities can then be performed: sentence extraction, keyword extraction, metadata extraction, key phrase extraction, and summarization, which provides a short form of the entire extracted content. Extracting text from a file involves three basic steps: analyzing the unstructured data, locating specific pieces of text, and filling the database. Many APIs can support the text extraction process; Apache Tika is used here. The basic task of text extraction is to transform the data and store it in a database for further use.
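As a concrete illustration of the extraction step, the sketch below calls Apache Tika through the third-party `tika` Python bindings; the bindings, the helper name `extract` and the input file are assumptions made for illustration, and a Java runtime must be available because the bindings start a local Tika server.

```python
# Minimal sketch: pulling plain text and metadata out of an arbitrary file
# with Apache Tika, via the third-party `tika` Python bindings.
# A Java runtime is required; the bindings spawn a local Tika server.
from tika import parser

def extract(path):
    parsed = parser.from_file(path)            # handles .pdf, .doc, .xls, .txt, ...
    text = (parsed.get("content") or "").strip()
    metadata = parsed.get("metadata") or {}    # author, content type, dates, ...
    return text, metadata

if __name__ == "__main__":
    text, meta = extract("report.pdf")         # hypothetical input file
    print(meta.get("Content-Type"))
    print(text[:300])
```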
LITERATURE REVIEW
Naive-Bayes methods [1]: A classification function based on a naive Bayes classifier decides whether a sentence is suitable for extraction. A later variant also used features such as the presence of uppercase words and sentence length. The n top-ranked sentences extracted this way are then used to build the summary.
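A minimal sketch of this idea, assuming scikit-learn's Gaussian naive Bayes and hand-picked surface features (sentence length, uppercase-word count, position); the training data shown is purely hypothetical and would in practice come from documents paired with reference summaries.

```python
# Sketch of naive-Bayes sentence extraction [1]: score each sentence by simple
# surface features and keep the top-scoring sentences for the summary.
# The training data below is hypothetical.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def sentence_features(sentence, position):
    words = sentence.split()
    uppercase = sum(1 for w in words if w[:1].isupper())
    return [len(words), uppercase, position]

X_train = np.array([[28, 4, 0], [7, 1, 6], [31, 5, 1], [5, 0, 9]])  # toy features
y_train = np.array([1, 0, 1, 0])                                    # 1 = in summary

model = GaussianNB().fit(X_train, y_train)

sentences = ["Text extraction turns unstructured files into structured records.",
             "See the appendix."]
X = np.array([sentence_features(s, i) for i, s in enumerate(sentences)])
scores = model.predict_proba(X)[:, 1]
summary = [s for _, s in sorted(zip(scores, sentences), reverse=True)[:1]]
```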
K-means clustering [3]: The entire file is divided into sentences, which serve as points on a Cartesian plane. The frequency of each word is calculated as a term frequency, and from these term frequencies a score is computed for every sentence. The sentence scores uniquely represent the coordinates that are fed into the clustering algorithm; running k rounds of the algorithm on these coordinates generates k cluster centres, and each sentence is assigned to a cluster based on its score. The condensed form is created from the cluster that contains the most sentences, and the summary is generated by placing those sentences in the same order as they appear in the original file. This approach is reported to yield better results when compared with human-written summaries. (Abstractive summaries, in contrast, are created by rephrasing the information in the file.)
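A compact sketch of this pipeline, assuming scikit-learn's KMeans; the sentence-scoring rule (sum of the term frequencies of a sentence's words) and the cluster count (k = 3 by default here) are illustrative choices, not necessarily the exact scheme of [3].

```python
# Sketch of k-means extractive summarization [3]: score sentences by the term
# frequencies of their words, cluster the scores, and return the sentences of
# the largest cluster in their original order.
import re
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def summarize_kmeans(text, k=3):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    tf = Counter(re.findall(r"[a-z]+", text.lower()))     # term frequency
    scores = np.array([[sum(tf[w] for w in re.findall(r"[a-z]+", s.lower()))]
                       for s in sentences], dtype=float)  # one score per sentence
    k = min(k, len(sentences))
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
    biggest = Counter(labels).most_common(1)[0][0]        # largest cluster
    return [s for s, lab in zip(sentences, labels) if lab == biggest]
```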
Graph-theoretic approach [2]: After stop-word removal, the sentences are represented as nodes of an undirected graph, one node per sentence. Two sentences are connected by an edge if their similarity is above a threshold. This representation yields two results. First, nodes that are not connected to any other node correspond to distinct topics covered in the document. Second, it identifies sentences of greater significance: nodes with many connections share information with many other sentences and therefore have a higher preference for inclusion in the summary.
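A small sketch of this idea, assuming a bag-of-words representation, cosine similarity and a fixed threshold of 0.2 as the similarity measure; node degree stands in for significance.

```python
# Sketch of the graph-theoretic approach [2]: sentences are nodes, an edge joins
# two sentences whose cosine similarity exceeds a threshold, isolated nodes mark
# distinct topics, and high-degree nodes are preferred for the summary.
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_degree(text, threshold=0.2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = CountVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, 0.0)                      # ignore self-similarity
    degree = (sim > threshold).sum(axis=1)          # edges per sentence (node)
    isolated = [s for s, d in zip(sentences, degree) if d == 0]   # distinct topics
    ranked = [s for _, s in sorted(zip(degree, sentences), reverse=True)]
    return ranked, isolated
```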
Complex natural language analysis methods [1]: A lexical chain is a sequence of related words. Consider an example: "Geeta bought a Duster. She loves the car." Here "car" refers to the Duster, so the two words form a lexical chain; chains can also occur at the word-sequence level. Lexical chains are found in three steps: (1) select a set of candidate words; (2) for each candidate word, look for a chain based on relatedness among the chain's members; (3) if such a chain is found, insert the word into it and update the chain accordingly.
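A toy sketch of the chain-building loop; relatedness is reduced here to a tiny hand-written map, whereas real systems would consult a lexical resource such as WordNet.

```python
# Toy sketch of lexical-chain construction [1]: walk through candidate words,
# attach each word to an existing chain if it is related to a member of that
# chain, otherwise start a new chain. The relatedness map is illustrative only.
RELATED = {("duster", "car"), ("car", "vehicle")}

def related(a, b):
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def build_chains(candidate_words):
    chains = []
    for word in candidate_words:
        for chain in chains:
            if any(related(word, member) for member in chain):
                chain.append(word)                   # update the existing chain
                break
        else:
            chains.append([word])                    # start a new chain
    return chains

print(build_chains(["geeta", "duster", "car"]))
# -> [['geeta'], ['duster', 'car']]
```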
Position method [1]: This method considers the position of a sentence in the document. Text in a document usually follows a defined structure, and the more important sentences tend to occur at specific locations, for example in titles and introductions. However, because document structure varies from one document to another, position alone is not a suitable method.
SYSTEM ARCHITECTURE
FIGURE 1 (illustration not included in this excerpt)
The user browses to the file that needs to be extracted and presses the "upload" button to send it to the server. The size of the file is then checked: if it contains more than 5 pages, it is regarded as a large file and undergoes parallel processing; otherwise it is processed as a whole. The extracted content is segmented into sentences and a sentence count is created. Stop-word removal is applied when generating keywords and key phrases; stop words include special characters, punctuation marks, and words that occur many times but carry little meaning, such as "is", "like", "are" and "not". The links, images and metadata are also extracted from the file, and auto summarization is performed based on the keywords present in the sentences. A sketch of this pipeline is given below, followed by a description of the individual modules.
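The sketch makes several assumptions: the 5-page threshold is approximated by a character count, the chunk size is arbitrary, and only a small sample of stop words is listed.

```python
# Rough sketch of the upload pipeline: if the extracted text is "large", split
# it into chunks and process them in parallel; then segment into sentences,
# count them, and strip stop words for later keyword / key-phrase extraction.
import re
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

STOP_WORDS = {"is", "are", "like", "not", "the", "a", "an", "and", "of", "to"}
LARGE_TEXT_CHARS = 5 * 3000            # rough stand-in for "more than 5 pages"

def split_sentences(chunk):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]

def process(text):
    if len(text) > LARGE_TEXT_CHARS:   # large file: process chunks in parallel
        # note: fixed-size chunks may cut a sentence in two; a real splitter
        # would cut at page or paragraph boundaries instead
        chunks = [text[i:i + 3000] for i in range(0, len(text), 3000)]
        with ProcessPoolExecutor() as pool:
            parts = pool.map(split_sentences, chunks)
        sentences = [s for part in parts for s in part]
    else:                              # small file: process as a whole
        sentences = split_sentences(text)
    counts = Counter(sentences)        # sentence -> number of occurrences
    tokens = [w for w in re.findall(r"[a-z']+", text.lower())
              if w not in STOP_WORDS]
    return sentences, counts, tokens

if __name__ == "__main__":             # guard needed for ProcessPoolExecutor
    sents, counts, toks = process("First sentence. Second sentence. " * 2000)
    print(len(sents), counts.most_common(1))
```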
Home Page: On this page the user uploads the file and clicks the "upload" button.
Result Page: This page provides tabs for the different options: metadata, auto summary, keywords, key phrases and plain text. A button under the auto summary tab generates the auto summary when clicked.
Metadata: The attributes and values of the file's metadata are displayed here.
Plain Text: This tab displays all the sentences present in the file along with the number of times each sentence occurs, presented as a table of sentences and counts.
Links: This tab displays all the links present in the file; if the file contains no links, it remains blank.
Keywords: These are the important words present in the file. First, all stop words are removed from the content. A TF-IDF (term frequency and inverse document frequency) approach is then followed, which measures how frequently a term occurs in the file; the most heavily weighted terms are fetched as the keywords. A stemming algorithm is applied so that different forms of the same word are counted together.
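A condensed sketch of keyword selection, assuming scikit-learn's TfidfVectorizer, NLTK's Porter stemmer as the stemming algorithm, and each sentence treated as a "document" for the IDF part.

```python
# Sketch of keyword extraction: remove stop words, stem the remaining words,
# weight the stems with TF-IDF, and keep the highest-weighted stems as keywords.
import re
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def stem_tokenizer(text):
    words = re.findall(r"[a-z]{2,}", text.lower())
    return [stemmer.stem(w) for w in words if w not in ENGLISH_STOP_WORDS]

def top_keywords(sentences, n=10):
    vec = TfidfVectorizer(tokenizer=stem_tokenizer, token_pattern=None)
    tfidf = vec.fit_transform(sentences)                 # sentences x stems
    weights = np.asarray(tfidf.sum(axis=0)).ravel()      # total weight per stem
    terms = vec.get_feature_names_out()
    order = weights.argsort()[::-1][:n]
    return [terms[i] for i in order]
```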
Figure 2 (illustration not included in this excerpt)
Images: This tab displays the images present in the file; if no image is present, it displays nothing.
Key Phrases: These are keywords consisting of more than one word; they are extracted using the same approach as keywords.
Auto Summarization: The fetched data is segmented into sentences, and the keywords and key phrases are used to generate the summary: sentences are rated based on the keywords and key phrases they contain, and the highest-rated sentences are arranged in the summary accordingly.
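A brief sketch of this scoring scheme; the exact rating rule (substring matches, phrases counted double, length normalisation) and the five-sentence limit are assumptions made for illustration.

```python
# Sketch of auto summarization: rate each sentence by the keywords and key
# phrases it contains, keep the top-rated sentences, and emit them in their
# original order.
import re

def summarize(sentences, keywords, key_phrases, max_sentences=5):
    def score(sentence):
        lowered = sentence.lower()
        n_words = len(re.findall(r"[a-z']+", lowered)) or 1
        hits = sum(1 for k in keywords if k in lowered)        # keyword matches
        hits += sum(2 for p in key_phrases if p in lowered)    # phrases weigh more
        return hits / n_words                                  # length-normalised

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(top)]                 # original order
```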
CONCLUSION
The biggest challenge for text extraction and auto summarization is to extract data from different semi-structured sources, including databases and web pages, in the proper format, size and time. Another challenge is extracting files in different languages, such as Hindi, Urdu and French. The generated summary should be neither too short nor redundant.
REFERENCES
[1] Dipanjan Das and André F. T. Martins, "A Survey on Automatic Text Summarization", Language Technologies Institute, Carnegie Mellon University, November 21, 2007.
[2] Vishal Gupta and Gurpreet Singh Lehal, "A Survey of Text Summarization Extractive Techniques", Journal of Emerging Technologies in Web Intelligence, Volume 2, Number 3, August 2010.
Frequently asked questions
What is the purpose of the "Data Extraction from Unstructured Sources" project?
The project aims to automatically extract data from various file formats (pdf, txt, doc, xls, etc.), transform it, and store it in a database without requiring specific format editors. It addresses the challenge of converting unstructured data from the internet into a structured format, making it easier to access and analyze.
How does the system handle large files?
For files larger than 5 pages, the system uses parallel processing. This involves dividing the file into smaller parts and processing each part simultaneously, resulting in faster extraction.
What functionalities are performed after file extraction?
After extracting the data, the system supports functionalities like sentence extraction, keyword extraction, metadata extraction, key phrase extraction, and summarization.
Which API is used for text extraction?
Apache Tika is used for text extraction.
What are some of the methods used for text summarization, as mentioned in the literature review?
The literature review discusses methods like Naive-Bayes, K-means clustering, graph theoretic approaches, complex natural language analysis (lexical chains), and the position method.
How does the K-means clustering method work for text summarization?
The K-means clustering method divides the file into sentences, calculates sentence scores based on term frequency, and uses these scores as coordinates for clustering. The cluster with the most sentences is used to create a condensed summary, maintaining the original sentence order.
What is the graph theoretic approach to summarization?
This approach represents sentences as nodes in a graph, with edges connecting sentences that have similarity above a threshold. Nodes with more connections are considered more significant and are prioritized for inclusion in the summary.
What is a lexical chain?
A lexical chain is a sequence of related words or phrases in a text (e.g., "Geeta bought a Duster. She loves the car." – "car" refers to "Duster").
What happens when a user uploads a file to the system?
The user uploads the file, and the system checks its size. If it's a large file (more than 5 pages), parallel processing is applied. The extracted content is segmented into sentences, stop words are removed to generate keywords and key phrases, and metadata, links, and images are extracted.
What options are available on the Result Page?
The Result Page provides tabs for different options, including metadata, auto summary, keywords, key phrases, and plain text.
How are keywords and key phrases extracted?
Keywords and key phrases are extracted by removing stop words and using term frequency and inverse document frequency (TF-IDF) to identify the most frequently occurring and important terms.
How is auto-summarization performed?
Auto summarization involves segmenting the fetched data into sentences and rating sentences based on the presence of keywords and key phrases. The sentences are then arranged in the summary according to their rating.
What are some of the challenges in text extraction and auto summarization?
Challenges include extracting data from different semi-structured sources (databases, web pages) in the correct format, size, and time. Extracting text from different languages and ensuring that the text summarization is comprehensive without redundancy are also challenges.