This thesis proposes a general pipeline architecture for extracting one-on-one dialogues from many different IRC channels, extending the state-of-the-art work on the Ubuntu IRC channel. Furthermore, the thesis takes advantage of the pipeline's results and evaluates ESA on the different extracted dialogue datasets.
The power of an intelligent program to perform its task well depends primarily on the quantity and quality of the knowledge it has about that task. Advanced techniques and applications in Artificial Intelligence depend heavily on data, which is growing rapidly and is widely available on the web. However, for a computer to manipulate information, that information must be in a form it can process easily. That is, large amounts of unstructured data need to be collected and post-processed in order to derive structured information from them. Recent advances in data-driven dialogue systems made use of the published Ubuntu IRC channel conversations to extract one-on-one dialogues for use in deep learning methods; a best-response task performed by a dialogue system can make use of a model trained on such dialogues. In addition, Natural Language Processing techniques such as semantic analysis have made remarkable progress. Wikipedia-Based Explicit Semantic Analysis (ESA) is one example, improving text interpretation in the presence of both polysemy and synonymy.
Table of Contents
Chapter 1: Introduction
1.1 Motivation: Why is the Topic so Important?
1.2 Thesis Overview
1.3 The Problem and Contribution
1.4 Outline of The Thesis
Chapter 2: Background
2.1 Dialogue Systems
2.1.1 Introduction
2.1.2 Data-Driven vs Other Design Approaches
2.2 McGill Ubuntu Dialogue Corpus
Chapter 3: Methods and Techniques
3.1 Natural Language Processing (NLP)
3.1.1 Introduction
3.1.2 Wikipedia-Based Explicit Semantic Analysis (ESA)
3.2 Deep Learning
3.2.1 Why Deep Learning?
3.2.2 Deep Neural Networks: Definitions and Basics
3.2.3 RNN and LSTM Networks
Chapter 4: Data Collection: Six IRC Channels
4.1 Ubuntu
4.2 Lisp
4.3 Perl6
4.4 Koha
4.5 ScummVM
4.6 MediaWiki
Chapter 5: General Pipeline for Dialogue Extraction: IRC-VPP
5.1 Pipeline Architecture
5.2 Components and Configurations
5.2.1 IRC Channel Crawler
5.2.2 Raw IRC Cleaner
5.2.3 Dialogue Extraction
5.3 Post-Processing Algorithms
5.3.1 Message Extraction
5.3.2 Recipient Identification
5.3.3 Dialogue Extraction and Hole-Filling
5.3.4 Relevant Messages Concatenation
5.4 Annotating IRC-VPP Dialogues Datasets
Chapter 6: Experiments and Evaluation
6.1 Pre-Training Datasets Statistics
6.2 IRC-VPP Software vs McGill Software
6.3 RNN/LSTM/ESA Results
Chapter 7: Conclusion and Future Work
Research Objectives and Themes
This thesis aims to maximize the utility of unstructured, domain-specific online conversations for training intelligent dialogue systems. The primary research goal is to design and implement a versatile pipeline, referred to as IRC-VPP, which automates the collection, cleaning, and extraction of one-on-one dialogue data from various Internet Relay Chat (IRC) channels. Furthermore, the work evaluates the integration of this pipeline with deep learning and semantic analysis techniques to improve dialogue response accuracy.
- Development of a universal, flexible pipeline (IRC-VPP) for cross-domain IRC data extraction.
- Implementation of post-processing algorithms to handle unstructured conversation logs and recipient identification.
- Evaluation of deep learning architectures (RNN and LSTM) on multi-domain conversational datasets.
- Integration of Wikipedia-Based Explicit Semantic Analysis (ESA) with neural networks to enhance semantic interpretation.
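The three pipeline stages listed above (crawling, cleaning, dialogue extraction) driven by per-channel configuration can be sketched as follows. This is a minimal illustration, not the thesis' IRC-VPP code: the configuration fields, the log format, and all names are assumptions chosen to show how one pipeline can adapt to differently formatted channels.

```python
import re
from dataclasses import dataclass

@dataclass
class ChannelConfig:
    """Hypothetical per-channel settings that make the pipeline reusable."""
    name: str
    line_pattern: str       # regex with 'time', 'user', 'text' groups
    system_users: tuple     # bot/service nicks to drop during cleaning

def clean(raw_lines, cfg):
    """Raw IRC Cleaner stage: parse log lines, drop bot/system messages."""
    pattern = re.compile(cfg.line_pattern)
    for line in raw_lines:
        m = pattern.match(line)
        if m and m.group("user") not in cfg.system_users:
            yield m.group("time"), m.group("user"), m.group("text")

def extract_dialogues(messages):
    """Dialogue Extraction stage: keep messages with an explicit recipient."""
    dialogues = []
    for time, user, text in messages:
        m = re.match(r"(\w+)[:,] (.*)", text)
        if m:
            dialogues.append((user, m.group(1), m.group(2)))
    return dialogues

cfg = ChannelConfig(
    name="ubuntu",
    line_pattern=r"\[(?P<time>[\d:]+)\] <(?P<user>\S+)> (?P<text>.*)",
    system_users=("ubottu",),
)
raw = [
    "[12:01] <alice> bob: have you tried rebooting?",
    "[12:02] <ubottu> Factoid lookup ...",
    "[12:03] <bob> alice: yes, same error.",
]
print(extract_dialogues(clean(raw, cfg)))
```

Adapting the pipeline to another channel then amounts to supplying a different `ChannelConfig` rather than rewriting the extraction logic.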
Excerpt from the Book
3.1.2 Wikipedia-Based Explicit Semantic Analysis (ESA)
In general, work on computing the semantics of natural language has shown that humans interpret a text and recognize its relatedness automatically, in the background, without noticing that they do so. That is because humans have common-sense knowledge of the world; machines do not. It has become clear that in order to process natural language, computers require access to large amounts of common-sense and domain-specific knowledge [6]. However, previous work on semantic relatedness was either based on purely statistical techniques that did not make use of background knowledge [7] or depended on lexical resources containing limited knowledge [3]. ESA was originally introduced by Evgeniy Gabrilovich and Shaul Markovitch [8] to give machines the ability to semantically analyze and interpret textual data using explicitly defined common-sense knowledge from the Wikipedia knowledge base, where each Wikipedia article describes a concept.
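The core mechanics of ESA can be shown with a toy example: each text is mapped to a weighted vector of Wikipedia concepts, and relatedness is the cosine similarity of those vectors. The tiny hand-made term-to-concept index below stands in for the TF-IDF weights that real ESA derives from millions of Wikipedia articles; its entries and weights are illustrative assumptions.

```python
import math

# term -> {concept: weight}; a stand-in for the Wikipedia-derived inverted index
INDEX = {
    "bank":  {"Bank (finance)": 0.9, "River bank": 0.7},
    "money": {"Bank (finance)": 0.8, "Currency": 0.9},
    "river": {"River bank": 0.9, "River": 0.95},
    "water": {"River": 0.8, "Water": 0.9},
}

def interpret(text):
    """Map a text to its weighted concept vector (sum of its terms' vectors)."""
    vec = {}
    for term in text.lower().split():
        for concept, w in INDEX.get(term, {}).items():
            vec[concept] = vec.get(concept, 0.0) + w
    return vec

def relatedness(a, b):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = interpret(a), interpret(b)
    dot = sum(va[c] * vb.get(c, 0.0) for c in va)
    norm = math.sqrt(sum(w * w for w in va.values())) \
         * math.sqrt(sum(w * w for w in vb.values()))
    return dot / norm if norm else 0.0

# The ambiguous term "bank" contributes to both senses; surrounding context
# shifts the vector toward one sense, which is how ESA handles polysemy.
print(relatedness("bank money", "bank river"))  # partial concept overlap
print(relatedness("money", "water"))            # disjoint concepts
```

Synonymy is handled the same way: two different surface terms that point to the same concepts produce overlapping vectors even though they share no words.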
Summary of Chapters
Chapter 1: Introduction: Discusses the importance of data-driven dialogue systems and introduces the need for automated, domain-specific data collection and post-processing.
Chapter 2: Background: Provides an overview of dialogue systems, their evolution, and the specific state-of-the-art work by McGill University on the Ubuntu IRC channel.
Chapter 3: Methods and Techniques: Details the natural language processing techniques, specifically Wikipedia-based ESA, and the deep learning architectures (RNN/LSTM) used in the thesis.
Chapter 4: Data Collection: Six IRC Channels: Demonstrates the data collection phase across various IRC channels, detailing their unique HTML structures and data formats.
Chapter 5: General Pipeline for Dialogue Extraction: IRC-VPP: Explains the design and functionality of the IRC-VPP software, including its components, configuration parameters, and post-processing algorithms.
Chapter 6: Experiments and Evaluation: Presents and discusses the statistical results of the extracted datasets and evaluates the performance of the deep learning models.
Chapter 7: Conclusion and Future Work: Concludes the thesis by summarizing the success of the generalized pipeline and suggests potential future research directions.
Keywords
Dialogue Systems, Deep Learning, Natural Language Processing, IRC-VPP, Explicit Semantic Analysis, ESA, Data Collection, Recurrent Neural Networks, RNN, Long Short Term Memory, LSTM, Post-Processing, Ubuntu, IRC, Information Extraction
Frequently Asked Questions
What is the primary focus of this thesis?
The thesis focuses on automating the process of collecting and post-processing unstructured human-human conversational data from various Internet Relay Chat (IRC) channels to facilitate the development of domain-specific data-driven dialogue systems.
What are the main thematic fields covered?
The work spans several fields including Natural Language Processing (NLP), Deep Learning (specifically RNN and LSTM networks), data pipeline architecture design, and semantic analysis using Wikipedia-based ESA.
What is the core objective of the research?
The objective is to generalize the pipeline architecture for dialogue extraction so that it is not limited to a single domain (like Ubuntu), but can instead adapt to different IRC channel formats and provide structured data for neural network training.
Which scientific methods were employed?
The research employs automatic crawling techniques, novel post-processing heuristics for log cleaning and dialogue extraction, and evaluates machine learning performance using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) models.
What is discussed in the main body of the work?
The main body covers the theoretical background of dialogue systems, detailed methodology for data collection and extraction (the IRC-VPP pipeline), and comprehensive experimental evaluations of deep learning models on multiple datasets.
How would you characterize this work through keywords?
Key terms include Data-Driven Dialogue Systems, IRC-VPP, Deep Learning, Explicit Semantic Analysis (ESA), and automated data pipeline generation.
What is the purpose of the Hole-Filling algorithm mentioned in the pipeline?
The Hole-Filling algorithm is designed to capture conversational segments where users do not explicitly address recipients, which are common in IRC logs, thus significantly increasing the volume of usable training data.
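One plausible form of such a heuristic can be sketched as follows. This is an assumed reconstruction, not the thesis' actual algorithm: once a dialogue is opened by an explicitly addressed message, later unaddressed messages from either participant are attributed to that dialogue, filling the "holes" left by missing recipient mentions.

```python
import re

def fill_holes(messages):
    """messages: list of (sender, text); returns (sender, recipient, text)."""
    last_partner = {}  # nick -> inferred current dialogue partner
    dialogue = []
    for sender, text in messages:
        m = re.match(r"(\w+)[:,] (.*)", text)
        if m:                                   # explicitly addressed message
            recipient, body = m.group(1), m.group(2)
            last_partner[sender] = recipient
            last_partner[recipient] = sender    # the partner will likely reply
        elif sender in last_partner:            # hole: reuse inferred partner
            recipient, body = last_partner[sender], text
        else:
            continue                            # no dialogue context, drop
        dialogue.append((sender, recipient, body))
    return dialogue

msgs = [
    ("alice", "bob: the build fails on trusty"),
    ("bob", "which compiler version?"),   # unaddressed reply (a hole)
    ("alice", "gcc 4.8"),                 # unaddressed follow-up (a hole)
    ("carol", "anyone around?"),          # no prior context, dropped
]
print(fill_holes(msgs))
```

Without hole-filling, only the first message above would survive extraction, which illustrates why such a step can substantially increase the yield of usable dialogue turns.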
Why is ESA combined with deep learning in this thesis?
ESA is integrated with neural network models to improve semantic interpretation and relatedness estimation, aiming to boost the accuracy of "best response" selection compared to using deep learning methods in isolation.
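One simple way such an integration could work is to interpolate the neural model's score with an ESA relatedness score between the context and each candidate response. The interpolation weight `alpha` and the toy stand-in scorers below are assumptions for illustration, not the thesis' exact formulation.

```python
def best_response(context, candidates, nn_score, esa_relatedness, alpha=0.7):
    """Rank candidates by a convex combination of neural and ESA scores."""
    def combined(candidate):
        return alpha * nn_score(context, candidate) \
             + (1 - alpha) * esa_relatedness(context, candidate)
    return max(candidates, key=combined)

# Toy stand-in scorers: the neural ranker slightly prefers one candidate,
# but ESA's semantic relatedness tips the combined score the other way.
nn = lambda ctx, c: {"reboot the machine": 0.6, "try apt-get update": 0.5}[c]
esa = lambda ctx, c: {"reboot the machine": 0.1, "try apt-get update": 0.9}[c]
print(best_response("package index is stale",
                    ["reboot the machine", "try apt-get update"], nn, esa))
```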
- Cite this work
- Ahmed Abouzeid (Author), 2017, General Pipeline Architecture for Domain-Specific Dialogue Extraction from different IRC Channels, München, GRIN Verlag, https://www.grin.com/document/365283