Part One - What is NLP -
NLP tasks before 1960
NLP tasks after 1960
Part Two - The societal impact factors of NLP -
Underexposure and its negative impact on balanced data
Part Three - Major challenges for online research ethics -
What is mathematics about? Simply put, it is about uniting separate parts and combining them into something new. To understand the whole, it is important to understand the meaning of each part and so create further meaning. Where is the difference to language? To properly understand a language, it is important to split it into smaller units; understanding the meaning of each unit is necessary and helpful for understanding the language as a whole. The mathematical formula is 1 + 1 = language.
At first this may seem confusing, but ultimately it is logical. Language consists of smaller units which, once brought together, build up into a new system, a whole language. To understand a language it is therefore indispensable to understand logical connections, but it is not necessary to be a maths genius. The statement 'Linguistics is the mathematics of philology' holds true in either direction. Today's most striking field of linguistics, Natural Language Processing (NLP), combines the ability to think logically with the ability to analyse language in an encompassing manner. The following story is presented to draw attention to the questions: How do we use NLP? Is it really necessary? And hopefully, by the end of the story, the reader will answer these questions with 'Yes, indeed.'
Tom is a student who needs to write a term paper. He is supposed to write about maths and language, but he has no clue what to write about. So Tom opens the Google website and types the relevant keywords for his search, 'maths & language', into the search field. Suddenly, the presented results are exactly what he was searching for. What Tom does not know yet is that the results appear in the perfect order because of an application of Natural Language Processing, which is about retrieving the information relevant to the user's query and displaying it. Tom is happy and starts his research.
Now he is about to write the term paper. The writing program checks Tom's spelling and even suggests other useful words. Tom knows about that useful tool and uses it consistently. What he does not know is that he writes so easily because of an application of NLP.
Tom opens Google because he is searching for an article. He types in 'Natural La…' and the words predicted for his search show up. It is Natural Language Processing: NLP facilitated his search. He finds the one and only striking article for his term paper, but it is about 300 pages long and the due date is tomorrow. What can he do? He uses the QuickSummary™ tool, which creates an automatic summarisation of the article with the help, obviously, of an NLP application. Tom is so happy about the tool that he texts his best friend via the WhatsApp Messenger. He writes 'I am so happy' and the app suggests an emoji which, based on Sentiment Analysis, is the perfect emoji, so Tom adds it. Sentiment Analysis is also an application of NLP.
Tom works on his laptop when his best friend calls him. Tom really needs to finish the term paper and does not answer the call. To put his best friend off he says, 'Siri, write Marc and tell him that I will call him back later.' Siri says 'Okay Tom', writes the message and sends it to his friend. Once again, an application of NLP, called Speech Recognition. After two hours of continuously typing in every word, Tom needs a break and double-clicks the fn button on his laptop. He starts to dictate the text for his paper, and with the help of speech-to-text, an NLP application, he can rest while still working on his task. Even though Tom did not even know about Natural Language Processing, it helped him to simplify and accelerate his work, and he was still able to rest a few minutes while his work was done for him.
The aim of this paper is to give a brief introduction to what Natural Language Processing (NLP) is and, further, to define several challenges NLP has to face due to bias in online data. These challenges, which concern the field of technology as much as they influence its social impact, form a frame for the overarching field of ethical challenges in online data, which will be discussed in this paper. Not only existing challenges but also future solutions will be a subject of discussion.
Part One - What is NLP -
What is NLP? Natural Language Processing is the interface of Computer Science, Computational Linguistics and Artificial Intelligence. Simply put, the goal of NLP is to construct computer systems capable of analysing natural languages.
The subfields of NLP reach from Machine Translation (MT) to Question Answering (QA) to Conversational Systems, i.e. phone-based customer service, Siri, Alexa.
NLP is also closely related to information retrieval.
In 1950, Natural Language Processing started as a merger of artificial intelligence and linguistics. In the beginning, there was a clear distinction between NLP and Text Information Retrieval (IR), which uses ”highly scalable statistic-based techniques to index and search large volumes of text efficiently.”1 In the following years NLP and IR converged, and the often failed word-for-word translation improved. Today the famous translation of ’the spirit is willing, but the flesh is weak’ into ’the vodka is agreeable, but the meat is spoiled’1 is luckily a relic of NLP's toddler stage.
In 1956, Chomsky raised attention to ’context-free grammar’2 (CFG), of which the Backus-Naur Form (BNF), used to represent programming-language syntax, is a restrictive variant. A BNF specification can be illustrated as a set of derivation rules that validate program code syntactically. Restrictive grammars also build the basis of the regular expressions which are used to specify text-search patterns. In the 1970s, so-called lexical-analyser generators (lexer generators) and parser generators utilised such grammars. The task of a lexer is to transform text into tokens; a parser validates a token sequence. The aim of those generators is to simplify the implementation of programming languages. The generators’ input consists of regular expressions and BNF specifications; while generating code, they build lookup tables which determine lexing/parsing decisions. A brief explanation of how a programming language is processed follows to ensure an adequate understanding. Programming languages are designed with a restrictive CFG variant, an LALR(1) grammar, which means that the grammar follows the rules of a Look-Ahead parser with Left-to-right processing and a Rightmost derivation3 (bottom-up). The LALR(1) parser scans the text from left to right and operates bottom-up, which means that it builds compound constructs from simpler ones. To make a parsing decision it uses a look-ahead of a single token.
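The division of labour between lexer and parser, and the idea of deciding with a single token of look-ahead, can be illustrated with a small hand-written sketch. The toy grammar and token set below are assumptions chosen for illustration; a real LALR(1) parser would be table-driven and produced by a generator rather than written by hand.

```python
import re

# Hypothetical toy grammar in BNF:
#   <expr> ::= <term> { ("+" | "-") <term> }
#   <term> ::= NUMBER
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(text):
    """Lexer: transform raw text into a sequence of (kind, value) tokens."""
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUMBER", int(number)))
        elif op in "+-":
            tokens.append(("OP", op))
        else:
            raise SyntaxError(f"unexpected character {op!r}")
    return tokens

def parse(tokens):
    """Parser with a single token of look-ahead: validates the token
    sequence against the grammar and evaluates it bottom-up, building
    compound constructs from simpler ones."""
    pos = 0

    def lookahead():
        return tokens[pos] if pos < len(tokens) else ("EOF", None)

    def term():
        nonlocal pos
        kind, value = lookahead()
        if kind != "NUMBER":
            raise SyntaxError("expected a number")
        pos += 1
        return value

    result = term()
    while lookahead()[0] == "OP":  # one-token look-ahead decides the next rule
        op = tokens[pos][1]
        pos += 1
        result = result + term() if op == "+" else result - term()
    if lookahead()[0] != "EOF":
        raise SyntaxError("trailing input")
    return result
```

For example, `parse(lex("1 + 2 - 3"))` validates the token sequence against the grammar and evaluates it to 0.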
The vastly large size of natural language, its unrestricted nature and its ambiguity urged researchers to rethink previous approaches. This resulted, in the year 1980, in what Klein summarised as the main points of new programming rules ensuring a fundamental reorientation towards statistical NLP4. Those main points were, firstly, the replacement of deep analysis with simple and robust approximations. Secondly, more rigorous evaluation was needed. Thirdly, machine-learning methods which used probabilities came into play. And lastly, large text corpora were used to train machine-learning algorithms. This vast amount of data improved the results in practice because the more data is used, the more representative the result is.
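The shift towards probabilities estimated from large corpora can be sketched with a minimal bigram model: instead of deep analysis, the model simply counts how often one word follows another and turns the counts into relative frequencies. The two-sentence corpus below is an illustrative assumption; in practice such models are trained on millions of sentences, which is exactly why more data yields more representative estimates.

```python
from collections import Counter, defaultdict

def train_bigrams(corpus_sentences):
    """Count word-pair occurrences: a simple, robust approximation
    in place of deep linguistic analysis."""
    bigram_counts = defaultdict(Counter)
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split()  # <s> marks sentence start
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
    return bigram_counts

def probability(bigram_counts, prev, word):
    """Relative-frequency estimate of P(word | previous word)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

model = train_bigrams(["the spirit is willing", "the flesh is weak"])
```

Here `probability(model, "is", "weak")` is 0.5, because in this tiny corpus "is" is followed by "willing" and "weak" equally often; with more training text the estimate would sharpen.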
NLP tasks before 1960
Lower-level NLP tasks employ six main functions. In the following, each function will be named together with possible complications, according to Spyns5.
To begin with, the detection of sentence boundaries is a fundamental task which becomes complicated if abbreviations and titles are included in the text. Listed items or templated utterances complicate the process additionally.
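The abbreviation problem can be made concrete with a minimal sketch: a naive splitter would cut after every full stop, so "Dr." would wrongly end a sentence. The small abbreviation list below is an illustrative assumption, not an exhaustive resource.

```python
# Known abbreviations whose trailing full stop is NOT a sentence boundary
# (illustrative assumption -- real systems use much larger lists or models).
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split text at '.', '!' or '?' unless the token ending in '.'
    is a known abbreviation or title."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without end punctuation
        sentences.append(" ".join(current))
    return sentences
```

For instance, `split_sentences("Dr. Smith arrived. He was late.")` correctly yields two sentences instead of three, because "Dr." is recognised as a title.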
The second task for NLP is called tokenization, which describes the task of segmenting text into tokens. Difficulties appear during their identification within a sentence when token boundaries such as hyphens or forward slashes are included.
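The ambiguity of token boundaries can be shown with a short sketch: the pattern below keeps hyphenated words together but splits on slashes and punctuation. This is one possible policy, chosen here for illustration; another tokenizer could just as defensibly split hyphenated compounds into their parts.

```python
import re

# One tokenization policy (an assumption, not the only correct one):
# keep hyphenated compounds as single tokens, split everything else
# at whitespace and punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(sentence):
    """Segment a sentence into tokens."""
    return TOKEN_PATTERN.findall(sentence)
```

Thus `tokenize("A state-of-the-art and/or robust tokenizer.")` keeps "state-of-the-art" as one token but splits "and/or" into three, showing how a single policy decision changes the token stream downstream tasks receive.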
Thirdly, ’POS-tagging’ is a low-level task, which means part-of-speech assignment to individual words. Often, homographs and gerunds, such as verbs which end in ’-ing’ but are also used as nouns, impede the task massively. Fourthly, it is sometimes necessary to decompose compound words. This process is described as morphological decomposition and employs lemmatisation as a sub-task. The tasks of spell-checking as well as indexing and searching call for morphological decomposition. Fifthly, shallow parsing, also called chunking, needs to be counted among the lower-level tasks of NLP. To shallow-parse means to identify phrases from constituent part-of-speech tagged tokens.
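Shallow parsing can be sketched with a minimal chunker over already POS-tagged tokens. The simplified tag set and the noun-phrase pattern (an optional determiner, any adjectives, at least one noun) are assumptions for illustration; the example also shows the gerund problem, since whether "running" is tagged ADJ or VERB changes the resulting chunk.

```python
def chunk_noun_phrases(tagged_tokens):
    """Shallow parsing (chunking): group part-of-speech tagged tokens
    into flat noun-phrase chunks -- any run of DET/ADJ/NOUN tags that
    contains at least one NOUN. Simplified tag set assumed."""
    phrases, current = [], []

    def flush():
        # Emit the current run only if it actually contains a noun.
        if any(tag == "NOUN" for _, tag in current):
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in ("DET", "ADJ", "NOUN"):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases
```

Given `[("the","DET"), ("running","ADJ"), ("water","NOUN"), ("flows","VERB")]` the chunker yields the noun phrase "the running water"; had the tagger labelled "running" as a VERB, the chunk would shrink to "water" alone.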
Lastly, the problem of segmenting text into meaningful groups, called problem-specific segmentation, should be listed as a lower-level task NLP needs to face as well.
NLP has to face numerous higher-level tasks which will not be explained in detail but should be listed here as well, according to Haas6. The identification of spelling/grammatical errors, as well as the identification of specific words or phrases, called Named Entity Recognition (NER), cause problems classified as high-level tasks for NLP. Further, identifying whether a named entity is present or absent, called negation and uncertainty identification, as well as the extraction of relationships, can be counted as problematic in vastly large text corpora. Lastly, Information Extraction (IE) can cause problems because the task is not only to spot the relevant information but also to transform it into structured forms. To sum up, several problematic tasks appear when the goal is to analyse text corpora adequately and to generalise answers.
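Negation and uncertainty identification can be illustrated with a toy rule: an entity mention counts as negated if a negation cue appears shortly before it. The cue list and the three-token window are illustrative assumptions in the spirit of rule-based systems such as NegEx, not a faithful reimplementation of any of them.

```python
# Illustrative cue list (an assumption, far from complete).
NEGATION_CUES = {"no", "not", "without", "denies"}

def negated_entities(tokens, entities):
    """Mark an entity as negated/absent if a negation cue occurs
    within a window of three tokens before it (window size assumed)."""
    negated = set()
    for i, token in enumerate(tokens):
        if token.lower() in entities:
            window = tokens[max(0, i - 3):i]
            if any(w.lower() in NEGATION_CUES for w in window):
                negated.add(token.lower())
    return negated
```

On the tokens of "The patient denies any chest pain" with the entity set `{"pain"}`, the sketch marks "pain" as negated, i.e. the named entity is identified as absent rather than present.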
NLP tasks after 1960
When NLP was in its toddler stage, massive amounts of anonymous corpora were used to meet the target, which was to enrich the linguistic system. Nowadays, the outcome of NLP has a direct effect on individuals, as you may have noticed at the beginning of this term paper. NLP no longer employs only data fed to the algorithm manually by researchers and collected from various studies. Today, in the 2000s, the century of social media, data is also collected from social network platforms such as Facebook or Twitter. The seemingly voluntary provision of user data on those platforms helps researchers to broaden the corpora. This is why a look at the wider social impact of NLP may be necessary.
Back then, when Facebook had not even been invented, NLP research did not directly involve human subjects. Today, the situation has changed dramatically due to the rise of social media, where a direct connection between research and human subjects can be pointed out. In fact, the human subject seems to be the one who feeds the system with data. Data science now affects people in real time.
A well-known study in 2008 pointed out what has since become NLP’s daily business when it announced ”a new public dataset based on manipulations and embellishments of a public social network site, Facebook“7. Internet researchers increasingly recognised that collecting digital trace data creates challenges for the further development of ethical codes.
It therefore seems necessary to reconsider the application of ethical considerations to research. The higher goal should always and only be to balance the potential value of conducting research against preventing the exploitation of human subjects.8 This is where the subject of NLP and language comes into account. Language should be a proxy for human behaviour, not a weapon. Human beings use language to express themselves and to communicate with each other, and most of them do this on the internet by using NLP applications. Language can therefore be seen as a powerful instrument. There seems to be an underlying network which includes individuals as well as society. The binding principle is the social impact of NLP or, in other words, the NLP applications individuals use to communicate. Three main relationships bind the network’s participants, which can be labelled as failing to recognise group membership, implying the wrong group membership, and overexposure8. In the following, those three relationship-binding principles will be explained to illustrate the social impact of Natural Language Processing in detail.
Part Two - The societal impact factors of NLP -
Any data set carries demographic information about its users, the so-called demographic bias. For further reading, the term ’overfitting’ needs to be explained more precisely. ”Overfitting refers to a model that models the training data too well. Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.“10 Related to this is the assumption that human nature is so universal that findings about one group can be applied to another group as well. The demographic bias carried in the data suggests that the omnipresent universal demographic information can be taken for granted. The typical demographic information is based on standard attributions of Western, educated, industrialised, rich and democratic research participants (WEIRD11). For NLP, this phenomenon means that a model implicitly assumes all language to be identical to the one it was trained on. This can cause serious consequences such as exclusion or demographic misinterpretation. The falsified data lead to wrong assumptions, which represents an ethical problem in itself. If the universality and objectivity of scientific knowledge12 is threatened in this way, society’s increasing demand for validated methods as a basis for making important decisions falls short. Imagine an NLP application whose model is trained solely on standard language technology, a language spoken mainly by white Americans. What if, for instance, a user of Latino descent wants to use the application, but it fails to meet his needs in language? The hidden ambiguity creates demographic differences instead of wiping them out. The supposedly user-friendly technology would then determine by its algorithm which users are able to benefit from it. This is not fair, and therefore a user-friendly technology needs to be designed by considering all types of demographic differences.
A lack of awareness, as previously shown, can cause the exclusion of people and should be avoided by creating awareness of this mechanism in NLP research and development. Therefore, the over-represented group in the training data needs to be downsampled and potential counter-measures to demographic bias should be applied.
Automatic inference of user attributes can be helpful and has the power to be of practical benefit if it assumes the right things. Wrong assumptions can cause trouble. While the wrong gender in a birthday-benefit-code email makes us smile and is forgotten a second later, an error in predicting our sexual orientation or religious views can cause far more trouble. To ensure fewer errors, the target variable can be changed: the user should rather get no answer than a wrong one.
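The idea of preferring no answer over a wrong one can be sketched as a classifier that abstains below a confidence threshold. The 0.9 threshold and the attribute labels are illustrative assumptions; the point is only the changed target variable, where "unknown" becomes a legitimate output.

```python
def predict_or_abstain(class_probabilities, threshold=0.9):
    """Return the most probable user attribute only when the model is
    sufficiently confident; otherwise return None ('no answer').
    The 0.9 threshold is an illustrative assumption."""
    label, prob = max(class_probabilities.items(), key=lambda kv: kv[1])
    return label if prob >= threshold else None
```

A hypothetical sensitive inference like `predict_or_abstain({"view_a": 0.55, "view_b": 0.45})` yields no answer instead of a risky guess, while a clear-cut case such as `predict_or_abstain({"spam": 0.97, "ham": 0.03})` still returns a prediction.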
Overexposure is the only relationship which originates from research design rather than from the algorithm. Usually, the effect of overexposure appears when mainstream attention to a specific topic increases. Such overexposure may lead to availability heuristics13. This phenomenon can be explained by the term recall: if a certain event can be recalled because the participant experienced it negatively, or if the participants have vast knowledge about specific things, the inference arises that these events must be more important than other events. The heuristics become ethically charged when negative emotions, violence, or bad behaviour are more strongly associated with specific ethnicities. Imagine a biased model in which red-haired women were still damned for every misery that happens around their house. That utterly outdated stereotypical picture of red-haired women as ’the bad witch’ would automatically brand people with this hair colour as more likely to commit a crime, more likely to neglect their children, more likely not to pay the rent on time. Everything suggested would have to be categorised as cruelly generalised and would be based solely on the assumptions the biased data consist of. Therefore, it needs to be ensured that the model is trained in an ethical way, without any exclusions or generalisations.
Underexposure and its negative impact on balanced data
As previously stated, NLP mainly focusses on Indo-European data rather than on smaller entities or languages. The focus on this main data group leads to an imbalance in the available amounts of labelled data. Most NLP tools are geared toward English, which has created an underexposure to typological variety. The morphology and syntax of English are not as complex as those of many other languages, and because English is one of the most widely spoken languages in the world, it opens up the biggest market for NLP tools.
To sum up, NLP techniques can reinforce prescriptive linguistic norms when non-standard language is treated as degraded. Further, they can be seen as a useful tool for detecting fake news, but also for generating it.14 What should be considered is the fact that Natural Language Processing can enable morally questionable practices. At the same time, it has the power to raise awareness, and it can lead the discourse on these issues in an informed manner.
- Quote paper
- Szahel Kumke (Author), 2019, Societal Impact Factors and Major Challenges for Natural Language Processing, Munich, GRIN Verlag, https://www.grin.com/document/1112090