This thesis comparatively investigates factors for customer satisfaction in voice commerce and e-commerce to assess the emphasis customers place on these factors in both channels. Voice commerce is a newly evolving electronic commerce channel in which customers use their voice to communicate with dedicated systems on smart speakers, mobile phones or other devices in order to find and order products.
This thesis identifies customer satisfaction predictors that potentially differ between the two channels: convenience and transaction process efficiency are based on previous research on chatbot and digital assistant expectations. In the area of recommendations, recommendation personalization (the degree of personalization of product recommendations) is identified from previous research. The construct of recommendation complexity has been created, which is the degree of detail and amount of information with which recommendations are presented. Differences in this domain of human-computer interaction are explained by media richness theory, an application of neuro-ergonomics.
Data was collected through a survey conducted on the crowdsourcing platform Amazon MTurk. The sample consisted of 178 US consumers who had purchased goods using both e-commerce and voice commerce. Structural equation modeling (SEM) and multiple regression analysis were used for statistical hypothesis testing. Two SEM models were created, one each for voice commerce and e-commerce, and the models were compared to investigate the comparative hypotheses.
This research enables product managers to recognize which factors of customer satisfaction in voice commerce differ from those in e-commerce. While developing their voice commerce strategy and system design, managers should emphasize convenience factors such as ease of use and ease of understanding, as well as an efficient transaction process.
Contents
1 Introduction
1.1 Background
1.2 Outline and Scope
2 Literature Review and Hypotheses Development
2.1 Overview
2.2 Voice Commerce
2.3 Brief History of the Conversational Interface
2.4 Differences in Electronic Commerce Channels
2.4.1 Comparison Framework
2.4.2 Technology Dimension
2.4.3 Value Dimension
2.4.4 Media Richness
2.5 Voice Unimodality
2.5.1 Speech Input
2.5.2 Speech Output
2.6 Customer Satisfaction Factors
2.6.1 Recommendation Complexity
2.6.2 Recommendation Personalization
2.6.3 Convenience
2.6.4 Transaction Process Efficiency
3 Research Methodology
3.1 Study Design and Measurement Scales
3.2 Data Collection
3.2.1 Acquisition using MTurk
3.2.2 Screening Survey
3.2.3 Main Survey
3.2.4 Quality Control
3.3 Respondent Characteristics
3.3.1 Age and Gender
3.3.2 IT Affinity
3.3.3 Usage Frequency and Technologies
4 Data Analysis
4.1 Sample Size Considerations
4.2 Descriptive Statistics
4.3 Reliability
4.4 Principal Component Analysis
4.5 Structural Equation Modeling
4.5.1 Measurement Models
4.5.2 Structural Models
4.5.3 Empirical Findings
4.6 Moderation
4.7 Multiple Regression Analysis
5 Considerations and Recommendations
5.1 Discussion
5.2 Implications for Practice
5.3 Implications for Research
5.4 Limitations and Design Issues
5.5 Directions for Future Research
A Survey Items
List of Figures
2.1 Hierarchy of selected commerce channels
2.2 Typical spoken dialog turn
2.3 Typical voice commerce process
2.4 Channel comparison model
2.5 Research model (e-commerce)
2.6 Research model (voice commerce)
4.1 Structural model (e-commerce)
4.2 Structural model (voice commerce)
List of Tables
2.1 Comparison of selected commerce channels attributes
2.2 Proposed hypotheses
3.1 Screening results (N=1896)
3.2 Sample demographics and ordinal scales (N=178)
3.3 Devices and technologies used for voice commerce
4.1 Descriptive statistics (N=178)
4.2 Evidence of reliability
4.3 Rotated component matrix
4.4 KMO and Bartlett tests
4.5 Factor loadings
4.6 Factor correlation matrix
4.7 Measures for model fit evaluation
4.8 Basic hypotheses results
4.9 Post-hoc power analysis
4.10 Comparative hypotheses results
4.11 Moderation analysis
4.12 Multiple regression analysis
A.1 Demographic information items
A.2 Model items
Abstract
Purpose - Voice commerce is a newly evolving electronic commerce channel in which customers use their voice to communicate with dedicated systems on smart speakers, mobile phones or other devices in order to find and order products. This thesis comparatively investigates factors for customer satisfaction in voice commerce and e-commerce to assess the emphasis customers place on these factors in both channels.
Originality - To my knowledge after exhaustive literature investigation, this is the first study to scientifically analyze customer satisfaction factors in voice commerce as well as the first study to compare voice commerce and e-commerce.
Design/methodology/approach - I identified customer satisfaction predictors that potentially differ between the two channels: convenience and transaction process efficiency are based on previous research on chatbot and digital assistant expectations. In the area of recommendations, I identified recommendation personalization (the degree of personalization of product recommendations) from previous research. I also created the construct of recommendation complexity, which is the degree of detail and amount of information with which recommendations are presented. Differences in this domain of human-computer interaction are explained by media richness theory, an application of neuro-ergonomics. I collected data through a survey conducted on the crowdsourcing platform Amazon MTurk. The sample consisted of 178 US consumers who had purchased goods using both e-commerce and voice commerce. I used structural equation modeling (SEM) as well as multiple regression analysis for statistical hypothesis testing. I created two SEM models, one each for voice commerce and e-commerce, and compared the models to investigate the comparative hypotheses.
Findings - Customers have higher expectations in convenience for voice commerce than they have for e-commerce. Transaction process efficiency significantly influences satisfaction in voice commerce, but not in e-commerce.
Practical implications - This research enables product managers to recognize which factors of customer satisfaction in voice commerce differ from those in e-commerce. While developing their voice commerce strategy and system design, managers should emphasize convenience factors such as ease of use and ease of understanding, as well as an efficient transaction process.
Keywords: Voice Commerce, E-Commerce, Chatbots, Recommender Systems, Customer Satisfaction, Media Richness
Chapter 1
Introduction
1.1 Background
Since their introduction in 2014, the use of intelligent virtual assistants based on smart speakers like Amazon Alexa, Apple HomePod, Microsoft Cortana and Google Home has been increasing (Broder, 2018). Moar (2017) estimates that there are currently 450 million voice assistant devices in the US, expected to reach 870 million by 2020. These systems make it possible to conduct a “zero-click” purchase in business to consumer (B2C) commerce scenarios. Communicating with the assistant using only their voice, customers can formulate search queries and confirm purchase actions without the need to use common visual or typing interfaces. E-commerce experts label this scenario "voice commerce" and expect it to be one of the most important innovations to shape the next years of e-commerce development (McTear, 2017; Luger and Sellen, 2016; Chopra and Chivukula, 2017). These systems involve natural language processing (NLP), intent recognition, speech synthesis, recommender systems and artificial intelligence (AI) technologies (Brandtzaeg, 2017; Luger and Sellen, 2016).
There has been a long-standing research interest in customer satisfaction and loyalty factors for e-commerce applications (Yang, 2015; Colla and Lapoule, 2012; Srinivasan et al., 2002; Fuentes-Blasco et al., 2010). Similar research exists on e-commerce via mobile devices (m-commerce) (Ngai and Gunasekaran, 2007; San Martín et al., 2012; Sohn et al., 2017; Wang and Liao, 2007; Li and Yeh, 2010; Lin and Wang, 2006; Marinkovic and Kalinic, 2017), on differences between m-commerce and e-commerce and their respective customer satisfaction factors (CSF) (Cao et al., 2015; Maity and Dass, 2014; Choi et al., 2008), on commerce using conversational text-based interfaces (AbuShawar and Atwell, 2016; Peng et al., 2016; Hostler et al., 2005; Ben Mimoun et al., 2017; Mahmood et al., 2014) and on recommender systems (Li and Karahanna, 2015; Mahmood et al., 2014; Liang et al., 2006). Specific research on e-commerce in a human-to-AI voice-based scenario is, however, sparse. Research on customer satisfaction factors in voice commerce is entirely missing from current literature, as is research on possible differences in customer satisfaction factors between e-commerce and voice commerce. Like m-commerce in comparison to e-commerce, voice commerce is subject to special restrictions and presents different opportunities and a different value proposition to customers. Therefore, it is likely to present satisfaction factors different from those of classic e-commerce and m-commerce, as well as a different emphasis on shared factors.
To make fact-based decisions for successful voice commerce software design and implementation, managers need to know which factors influence customer satisfaction. Wang and Liao (2007) argue that "user satisfaction, system use and perceived usefulness have been the most widely used surrogate constructs of system success". While many CSF for e-commerce applications are known, it is difficult to ascertain factors for voice commerce from current literature. Therefore, the question addressed in this thesis is: How do factors influencing customer satisfaction in B2C voice commerce differ from those in e-commerce?
1.2 Outline and Scope
To identify customer satisfaction factors which are likely to differ between e-commerce and voice commerce, I first conduct a literature review on differences between these channels. I will then review research for customer satisfaction factors related to the differences identified. Based on this review, I will develop a customer satisfaction model for these factors, consisting of eight directional and four comparative hypotheses. Following this, I will create and describe a research design and methodology to quantitatively and empirically validate this model for both e-commerce and voice commerce. This will involve a survey among voice commerce and e-commerce users. I will then discuss the reliability of the data collected, analyze the data using structural equation modeling followed by a regression analysis and present findings. I used IBM SPSS 21.0 and AMOS 21.0 for all statistical analyses. The thesis concludes with considerations and recommendations, discussing theoretical and practical implications for management. Finally, I present limitations and give directions for future research opportunities.
This thesis focuses explicitly on those factors that differ between e-commerce and voice commerce. It will analyze factors for customer satisfaction from the customer’s perspective. Therefore, the thesis is not concerned with adoption factors or barriers, but with post-adoption satisfaction. Technology-wise, it will not dive into details about natural language processing and speech synthesis, intent recognition, dialog design, or human interaction with AI. The thesis will also focus on the central search-and-purchase scenario. Voice assistants in a service context, as for example investigated by Peng et al. (2016) or Chakrabarti and Luger (2015), or other activities surrounding commercial applications are not part of my analysis. Factors related to voice commerce usage in business to business commerce, as opposed to business to consumer, are not investigated. This thesis will also largely ignore social and psychological effects of human-AI interaction as for example covered by Qiu and Benbasat (2009) or Komiak and Benbasat (2006). Interaction quality and dialog management metrics are also not part of the thesis.
Chapter 2
Literature Review and Hypotheses Development
2.1 Overview
In this section, I elaborate on terms and constructs used in the two research models and give a brief overview of the history of conversational user interfaces. I will then investigate research on differences between e-commerce channels and specifically investigate research on the voice medium in human-computer interaction scenarios. Finally, I develop hypotheses for CSF that I expect to differ between both channels. I will evaluate research on chatbots for CSF and also draw on research on mobile commerce and e-commerce, then apply knowledge of human voice processing and capacity.
I conducted a preceding literature analysis using several scientific search engines. Initially, I analyzed only those journals released after 2007 and included in the VHB-JOURQUAL3 rating (VHB, 2015). Since the research domain is relatively young, additional papers about this topic may appear in conference proceedings instead of journals. In a second step, I added journals and proceedings published with impact factor scores close to or larger than 1.0. Papers were drawn from the domains of electronic commerce, human-computer interaction, psychology, neuroscience, recommender systems and NLP. Because of the scarcity of studies on voice commerce and chatbot commerce, I will draw heavily from two recent studies in nearby research areas: on general use of chatbots (Zamora, 2017) and virtual assistants (Luger and Sellen, 2016). On the one hand, I investigated these qualitative studies to find value propositions and expectations of conversational systems. On the other hand, I investigated the effects and limitations of speech output based on working memory and media richness theories. Finally, I prepare a model for empirical verification. The resulting research model does not strive for exhaustiveness, but aims to identify selected satisfaction predictors that potentially differ between voice commerce and e-commerce. With respect to the proposed model, twelve hypotheses are examined. I highlight hypothesis statements using "H" followed by a number.
2.2 Voice Commerce
Electronic commerce (e-commerce) describes commerce conducted over electronic media. For example, Kwon and Sadeh (2004) define e-commerce as "the use of the internet to facilitate, execute, and process business transactions". However, the public and researchers use the term mainly for electronic commerce conducted via computers and laptops, as opposed to mobile devices (Kwon and Sadeh, 2004), although these devices also use the internet. Researchers label the latter scenario mobile commerce or m-commerce (Cao et al., 2015; Ngai and Gunasekaran, 2007), defined as "a subset of all e-commerce transactions" (Kwon and Sadeh, 2004).
As another subset, conversational commerce utilizes NLP in electronic commerce (Ben Mimoun et al., 2017; Agrawal et al., 2017). Such interfaces can be either text-messaging or voice recognition systems (Mahmood et al., 2014). Commercial chatbots are one form of conversational commerce (Shawar and Atwell, 2005; Hill et al., 2015). The actual interaction is text-based: both human and machine generate written text to convey information (AbuShawar and Atwell, 2016). Some commercial chatbots can also display product images and other visual information (Horzyk et al., 2009). Animated or embodied agents (sometimes also called avatars) are conversational systems that provide a visual representation of the virtual agent in addition to a text or speech interface (Ben Mimoun et al., 2017; Burgoon et al., 2016; Qiu and Benbasat, 2009). These agents can also be employed in commerce scenarios (Ben Mimoun and Poncin, 2015). Luger and Sellen (2016) use the term conversational agent for an "emergent form of dialogue system that is becoming increasingly embedded in personal technologies and devices". Preece et al. (2017) and Miner et al. (2016) use the term to describe "computer programs designed to respond to users in natural language, thereby mimicking conversations between people". Callaghan et al. (2018) use the term virtual assistants to refer to conversational agents embedded in smart speakers, and mention "Apple Siri, Amazon Echo (Alexa), Google Assistant and Microsoft Cortana" as examples.
Galanxhi and Nah (2004) define voice commerce as electronic commerce involving "computerized voice technologies: speech recognition, voice identification, and text-to-speech". We can conclude that voice commerce is a subset of conversational commerce (see figure 2.1), that it is differentiated by using voice as a medium and that it involves the use of a virtual conversational agent. These agents, in the form of digital assistants, can be used on smart speakers, smartphones and other mobile or immobile devices, even smart TVs (Moar, 2017; Lee and Choi, 2017), but could also be stand-alone applications. Potentially, voice commerce could be combined with visual representation for improved system usability and user experience (Vassallo et al., 2010; Luger and Sellen, 2016). It could also be combined with embodied representations, as Luger and Sellen (2016) suggest. Interestingly, they explain that "all users described seeking visual confirmation of complex tasks". Currently, however, voice commerce is assumed to be mostly used on digital assistants in smart speakers, where the system provides no visuals. Therefore, this thesis will only concern itself with voice commerce in the narrow sense, i.e. systems purely based on voice and without any visual representation. I will also investigate voice commerce usage through my survey.
[Figure not included in this excerpt]
Figure 2.1: Hierarchy of selected commerce channels
To give an example, consider the following purchase process via a virtual assistant: We assume that the customer wants to order a bottle of port wine using a voice commerce system. The smart speaker device is activated when the user utters the following sentence: "Computer, please order me a bottle of ten years old port wine by Pinto". This represents the search input. The smart speaker is initially activated by the wake-word "Computer" (Callaghan et al., 2018), records the voice input and derives the intent to order a bottle of wine. It uses a recommendation engine to search for "bottle of ten years old port wine by Pinto". If the search returns any results, it generates a voice recommendation: "I have found a bottle of Pinto port wine 10, would you like to buy it?". If the user says "yes", the system orders the product without further input. Address and payment data have potentially been saved previously in the user’s account.
In contrast to simple voice input based on dialogue trees, these systems can understand intents in many different forms: "Computer, buy ten years old port wine from Pinto" or "Computer, I need some ten-year Pinto port wine". Additionally, the system could ask the user about certain product attributes such as price, volume, color etc., depending on product type, subsequently refining its recommendations. Figure 2.2 shows the technical process of each dialog turn. The wake-to-purchase process is shown in figure 2.3.
[Figure not included in this excerpt]
Figure 2.2: Typical spoken dialog turn adapted from Yang et al. (2012)
[Figure not included in this excerpt]
Figure 2.3: Typical voice commerce process
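The port wine example can be sketched as a single, heavily simplified dialog turn. The following Python sketch is illustrative only: the one-item catalog, the keyword pattern and all function names are hypothetical, and the pattern matching is a toy stand-in for the NLP-based intent recognition and recommendation engine that real systems use.

```python
import re
from dataclasses import dataclass
from typing import Optional

WAKE_WORD = "computer"  # wake word taken from the example in the text

# Hypothetical one-item catalog standing in for a real recommendation engine.
CATALOG = ["Pinto port wine 10"]


@dataclass
class Intent:
    action: str  # e.g. "order"
    query: str   # free-text product description


def parse_utterance(utterance: str) -> Optional[Intent]:
    """Return an order intent if the utterance starts with the wake word."""
    text = utterance.strip().lower()
    if not text.startswith(WAKE_WORD):
        return None  # the device stays idle without the wake word
    rest = text[len(WAKE_WORD):].lstrip(" ,")
    # Toy keyword pattern (assumption); not real intent recognition.
    match = re.search(r"\b(order|buy|need)\b(?: me)?(?: some)?\s+(.*)", rest)
    if match is None:
        return None
    return Intent(action="order", query=match.group(2).rstrip(". "))


def recommend(intent: Intent) -> Optional[str]:
    """Pick the first catalog item sharing a keyword with the query."""
    words = set(intent.query.split())
    for item in CATALOG:
        if words & set(item.lower().split()):
            return item
    return None


def dialog_turn(utterance: str) -> str:
    """One system turn: wake word check, intent, spoken recommendation."""
    intent = parse_utterance(utterance)
    if intent is None:
        return ""
    item = recommend(intent)
    if item is None:
        return "Sorry, I could not find a matching product."
    return f"I have found a bottle of {item}, would you like to buy it?"
```

Under these assumptions, all three phrasings from the text ("please order me ...", "buy ...", "I need some ...") lead to the same order intent, which is the property that distinguishes such systems from fixed dialogue trees.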
2.3 Brief History of the Conversational Interface
The conversational interface as such has a longstanding history (Luger and Sellen, 2016). An older term for voice-enabled systems is spoken dialog system, "a computer system which supports human-computer conversations in a restricted domain" on a turn by turn basis (Yang et al., 2012; McTear, 2017). The first spoken dialog systems and chatbots were created in the 1960s (McTear, 2017); a famous example is the chatbot ALICE from 1995 (Shawar and Atwell, 2005). However, these older implementations were "extremely brittle", would crash in most cases of unexpected input, worked only for limited purposes and used specialized hardware and platforms (McTear, 2017). Advances in natural language processing as well as in machine learning (in the field of AI) have enabled more mature devices to enter the mainstream (Sarikaya, 2017; Agrawal et al., 2017; Bastianelli et al., 2017; McTear, 2017). They allow for naturally varied input, more flexible conversations and a high word-recognition performance (Hirschberg and Manning, 2015; Callaghan et al., 2018). Additionally, McTear (2017) points out that nowadays customers are used to conversational interfaces in the form of text messengers. Likewise, powerful and mobile hardware is much more common. Gašić et al. (2017) confirm that "the emergence of virtual personal assistants is generating increasing interest in research in speech understanding and spoken interactions with machines". Technical problems with semantics, context, and knowledge, however, still persist (Hirschberg and Manning, 2015).
2.4 Differences in Electronic Commerce Channels
2.4.1 Comparison Framework
Based on existing literature, this chapter will analyze differences in the following commerce channels: e-commerce, m-commerce, chatbots and voice commerce. Certainly, mobile commerce, voice commerce and e-commerce are not comparable in all their characteristics. However, all aim to sell goods, need to present products, feature human-to-machine interaction using different interfaces, rely on internet communication, take time (i.e. transaction costs) and can be more or less complicated or easy. For a channel comparison framework, I draw on a study by Cao et al. (2015), who investigated differences between m-commerce and e-commerce and the effects on consumer behavior. They group differences in five customer-centric dimensions (compare figure 2.4), which they verified empirically. I adjusted the concept of convenience to mobility, because the respective section in the study by Cao et al. (2015) is mainly concerned with convenience through mobility. The mobility concept also exists in many other comparative e-commerce studies (Huang et al., 2016; Tsalgatidou and Pitoura, 2001; Choi et al., 2008; Maity and Dass, 2014; Wu and Hisa, 2008). I also draw on media richness theory as used by several researchers in an e-commerce context (Chen et al., 2009; Maity and Dass, 2014; Brunelle and Lapierre, 2008) and extend the comparison framework by this attribute. Table 2.1 shows a comparison of selected attributes in commerce channels. End user device has been generalized to "interface".
[Figure not included in this excerpt]
Figure 2.4: Channel comparison model adapted from Cao et al. (2015)
2.4.2 Technology Dimension
The technology dimension focuses on user interface, device and communication network (bandwidth). Studies by Lu and Yu-Jen Su (2009) and Kim et al. (2009) also identify the user interface as well as the bandwidth (Choi et al., 2008) as important differences.
Cao et al. (2015) state that in e-commerce, typical "end-user devices are personal computers with large screen, rich audio and video, standard key board and sufficient power supply". In the case of m-commerce, the user interface has a "small screen, incomplete text input keyboard and limited power supply". Chatbot commerce differs from e-commerce and m-commerce mainly in the use of a conversational interface. Common to all chatbots is the use of a textual interface (Krämer et al., 2009), e.g. through a chat (Brandtzaeg, 2017). Voice commerce is the only commerce channel customers can use without mechanical interaction with a device. When used via current digital assistant systems, it does not provide the customer with any form of visual representation (Luger and Sellen, 2016). In commerce, this affects the product recommendation overview and detailed inspection, as well as visualization of the payment and delivery process. The interface also distinguishes voice commerce from chatbots: commerce chatbots can indeed provide visual product recommendations, images and descriptions (Zamora, 2017).
[Table not included in this excerpt]
Table 2.1: Comparison of selected commerce channels attributes
Communication networks in e-commerce are typically broadband with high transmission speed, whereas in m-commerce the "communication network has the limited bandwidth and lower transmission speed", resulting in lower communication network performance (Cao et al., 2015). For chatbots, transmission speed of course depends on whether they are used on mobile devices or personal computers. The same is the case for digital assistants and for voice commerce (Brandtzaeg, 2017; Luger and Sellen, 2016).
2.4.3 Value Dimension
The value dimension focuses on convenience in terms of mobility, personalization and risk. Cao et al. (2015) contend that m-commerce features unique attributes of ubiquity, mobility, and localization, which have no equivalent in web-based e-commerce (Cao et al., 2015; Tsalgatidou and Pitoura, 2001). These factors lead to "temporal and spatial convenience" (Cao et al., 2015). E-commerce depends on a stationary internet connection and is restricted to a fixed location (Cao et al., 2015). To participate in voice commerce, most customers currently use virtual assistants on smart speaker devices (Moar, 2017). While some smart speakers can be carried around in theory, most are intended to be kept at home and are too heavy to carry conveniently. In a study on virtual assistants on smartphones, Chopra and Chivukula (2017) found that users felt "exposed and awkward" speaking to a digital assistant outside their home. This could mean that users mainly interact with digital assistants at fixed locations, predominantly from home. It could also mean that the device is usually connected to a local Wi-Fi network with high bandwidth. Chatbots, on the other hand, can be used by customers on the move or at home (Chopra and Chivukula, 2017).
Personalization in m-commerce is fueled "from the relationship between the mobile device and the user", also called personal identity (Cao et al., 2015). Additionally, location-awareness sets m-commerce apart and improves personalization. Liao et al. (2005) also identify personalization, and Kim and Xu (2007) mention instant connectivity, as attributes that distinguish m-commerce from e-commerce. Mobility, localization and personalization were also identified by Wu and Hisa (2008). Therefore, Cao et al. (2015) attribute a greater personalization proposition to m-commerce. Chopra and Chivukula (2017), Zamora (2017) and Chai et al. (2001) report that personalization is a prevalent and important attribute of chatbot systems as well. Through ongoing user interaction, chatbots can create user profiles very efficiently. The same potentially applies to conversational agents, although current systems still need to improve in this area (Luger and Sellen, 2016). For voice commerce, I assume that the personalization proposition is similar to that of chatbots: preference elicitation through dialogue interaction could be handled very effectively and efficiently.
When participating in electronic commerce, customers are subject to financial risk (potential monetary loss owing to fraud) and privacy risk (disclosure of private information). Cao et al. (2015) argue that "the ubiquity, mobility, personal identity and localization natures of mobile environment greatly increase the possibilities of the individual data exposure". Especially with female customers, perceived risk is found to have a negative effect on channel perception (Cao et al., 2015). Zamora (2017) mentions that customers also perceive social and financial risks when dealing with chatbots. For conversational agents, Luger and Sellen (2016) report that most users do not trust the system with "complex or socially sensitive tasks", indicating privacy and financial risks. For voice commerce systems, users could perceive social and monetary risks accordingly. However, the lack of research on this topic may indicate a research gap.
2.4.4 Media Richness
Maity and Dass (2014) have applied media richness theory to e-commerce and m-commerce. Media richness "is a set of objective characteristics such as feedback (cues) and communication capability, language variety, and personal focus, which determine a channel’s ability to communicate richness of information" (Maity and Dass, 2014). In a shorter definition, it is the "ability of a medium to carry information" (Chen et al., 2009). Lower media richness in m-commerce is mainly based on the limitations of the small mobile interface (small screen, inability to show complex information) and attention constraints: "These differences limit the extent of communication, feedback and personal focus capabilities that are possible on m-commerce" (Maity and Dass, 2014). So far, media richness theory has not been applied to voice commerce or chatbots. For similar textual interfaces such as chat messaging, richness has been found to be lower than for visual and auditory interfaces (Otondo et al., 2008). Assuming a commercial chatbot with a textual interface and the ability to show product images and short descriptions, such as described by Horzyk et al. (2009), its media richness could range between that of m-commerce and e-commerce, again depending on whether it is used on a personal computer or a mobile device. I will investigate unique aspects and media richness of the voice medium in the following chapter.
2.5 Voice Unimodality
2.5.1 Speech Input
The exclusive use of the voice interface is the most prevalent differentiator for voice commerce. I therefore investigated research concerning human information processing for visual and textual information compared to audio and speech. As a first indicator, Krämer et al. (2009) found that multimodal virtual agents with text and voice functions were rated less positively than text-only or voice-only agent interfaces, with the text interface being perceived as the most efficient and usable. In terms of benefit and speed, users rated the text interface higher than speech. According to Suh (1999), based on McGrath and Hollingshead (1993), the voice interface ranks between text and video systems in media richness. In the voice commerce scenario, however, it is crucial to distinguish between voice input and voice output (from the perspective of the device), and to investigate both directions separately.
Various researchers deem speech input for computer applications to be very efficient. Ruan et al. (2018) found that English speech input is almost three times faster than touchscreen keyboard text input. A study by Rebman Jr. et al. (2003) found that via speech input, users "were able to generate more than twice as much text in the same amount of time". In a qualitative study specifically on chatbot conversation, English-speaking users subjectively rated speech modality as more efficient than typing, while Indian-speaking users rated it only slightly more efficient (Zamora, 2017).
Users also mentioned that "speaking to a chatbot is best when the user is multi-tasking, hands or eyes are occupied, or while they are moving and unable to be stationary" (Zamora, 2017). Accordingly, Saliba (2001) found that an audio interface facilitates multi-tasking, especially if the other task is of a visual nature. Fan et al. (2005) label speech interfaces in mobile devices as facilitators of ubiquity, i.e. "the ability to retrieve information and conduct transactions from virtually any location on a real-time basis". However, voice commerce is mostly used at home (see section 2.4). Voice commerce ubiquity would therefore be limited to the home environment or, more precisely, to the actual range of the device.
On the other hand, "typing to a chatbot is best when the activity is complex, includes a confirmation step, or requires logic", as these situations require a form of control that users can better achieve via textual input (Zamora, 2017). The same study also reports that chatbot "examples were often related to routine daily tasks that require little trust or human logic and result in low consequences if failures occur", confirmed by Luger and Sellen (2016) for conversational agents: "Having failed at more complex tasks, the CA was often relegated to performing very basic tasks such as setting reminders". This is explained by the limited capability and intelligence of current systems (Luger and Sellen, 2016). Users also mentioned that speech input was not appropriate for "sensitive topics such as financing or social media content" (Zamora, 2017). Luger and Sellen (2016) found that "the majority (of users) were unlikely to engage in conversational-style interactions when in public". In conclusion, speech input seems to be more efficient than textual input, especially in low-complexity situations, and users tend to prefer its use in non-public environments.
2.5.2 Speech Output
Visual and textual user interfaces support human memory and lower cognitive load (Schmutz et al., 2010; Felfernig and Gula, 2006). If a device conveys information to the human user using speech only, the user needs to keep all relevant information in memory. Neuroscience postulates the existence of human short-term and working memory. Short-term memory refers only to the short-term storage of information, whereas working memory allows for the manipulation of stored information (Baddeley, 2010; Diamond, 2013). Early research on human short-term memory indicated that it is severely limited: Miller (1956) initially determined that the average human could only remember about five to nine information items, a capacity later found to depend on other factors such as modality and complexity (Wolters et al., 2009) as well as on the recipient’s age (Towse et al., 2000). Experiments also show that the number of information items held decreases significantly as the amount of information per item increases (Miller, 1956; Commarford et al., 2008; Cowan, 2001; Alvarez and Cavanagh, 2004). Working memory and its limitations are already important considerations in the design of e-commerce applications (Chattaraman et al., 2011; Lee and Benbasat, 2003). Research by Hong et al. (2004) on effective product presentation in e-commerce found that "specifically, the image-text presentation mode and the list information format were found to outperform the text-only presentation mode and the array information format respectively in terms of shorter information search time, better recall of brand names and product images, and more positive attitudes towards the screen design and using the website". Schmutz et al. (2010) confirm this result. Wolters et al. (2009) argue that users can better process long lists visually: "Since all options can be scanned as often as the user wishes, they do not need to be remembered", especially if the options are easy to scan, recognize, and digest (Zaphiris et al., 2007). Using eye tracking, Schmutz et al. (2010) observed how users investigate several products in a visual product listing: they compare descriptions, attributes and prices by rapidly jumping between positions on the screen. Facilitating web site design was found to have "positive effects on users’ perceived cognitive load" (Schmutz et al., 2010). Auditory information is particularly problematic if users have to "reason about the options" (Wolters et al., 2009). Sharit et al. (2003) argue that the number of auditory options should be limited. One study found the optimal number of options in spoken dialog systems to be four (Wolters et al., 2009). Both Möller et al. (2007) and Wolters et al. (2009) designate perceived cognitive load/cognitive demand as a negative predictor of user satisfaction. Overload of working memory, in turn, leads to increased rates of errors and forgotten information (Saito and Miyake, 2004).
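To illustrate what limiting the number of auditory options could look like in practice, the following Python sketch splits a result list into small batches that are read out one turn at a time. It is an illustration only and not a system described in the cited studies: the function names `chunk_options` and `spoken_prompt`, the sample product names, and the wording of the prompt are hypothetical, and the default batch size of four merely follows the figure reported by Wolters et al. (2009).

```python
from typing import List


def chunk_options(options: List[str], max_per_turn: int = 4) -> List[List[str]]:
    """Split a list of options into batches for successive spoken turns."""
    if max_per_turn < 1:
        raise ValueError("max_per_turn must be at least 1")
    return [options[i:i + max_per_turn] for i in range(0, len(options), max_per_turn)]


def spoken_prompt(batch: List[str], has_more: bool) -> str:
    """Render one batch as a single sentence that a speech synthesizer could read."""
    listed = ", ".join(batch)
    suffix = " Say 'more' to hear further options." if has_more else ""
    return f"I found: {listed}.{suffix}"


# Hypothetical result list with six products (assumed data, not from the thesis).
products = ["camera A", "camera B", "camera C", "camera D", "camera E", "camera F"]
batches = chunk_options(products)
prompts = [spoken_prompt(b, i < len(batches) - 1) for i, b in enumerate(batches)]
```

With six products, the sketch produces two turns: one with four options and an invitation to continue, and one with the remaining two. A visual listing would not need this batching, since users can rescan all options at will.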
While some research suggests that humans react faster to auditory stimuli (Shelton and Kumar, 2010), various studies show that humans in fact process visual information faster and more efficiently than auditory information, especially information of high complexity (Cohen et al., 2009; Brady et al., 2008; Saults and Cowan, 2007). Rayner et al. (2009) state that generally "listening can approach but probably not reach" the speed of reading. Cohen et al. (2009) found that "auditory recognition memory is inferior to visual recognition memory". In a multimedia learning scenario, "replacing visual text with spoken text resulted in lower retention and transfer scores" (Tabbers et al., 2004). Chattaraman et al. (2011), referring to Jin (2009), point out that "empirically, textual modality has been found to be more effective than auditory modality in communicating greater source expertise, informational value of advertising messages, and social presence". Humans also comprehend synthesized speech more slowly than natural speech due to additional cognitive processing costs caused by "poorer clarity, odd intonation and rhythm", an effect moderated by age and hearing ability (Paris et al., 2000; Reynolds and Givens, 2001; Jones et al., 2007). The addition of visual cues would enhance speech comprehension due to "strong redundancies between visual and auditory properties of speech and the brain’s sensitivity to this crossmodal correspondence" (Jaekl et al., 2015; Molholm et al., 2002; Lee and Benbasat, 2003). Maity and Dass (2014) also argue that "a mobile channel with audio/video capabilities is richer than a mobile channel with text-only capabilities".
These results suggest that auditory information representation has lower media richness than a combination of visual and textual representation (i.e. e-commerce), but higher richness than text alone. According to Suh (1999), a decrease in media richness leads to an increase in human cognitive costs. Maity and Dass (2014) found that consumers prefer channels with medium media richness (e.g. e-commerce) for more complex decision-making tasks and channels with low media richness, such as m-commerce, for low-complexity tasks. This indicates that customers choose the sales channel, to some extent, based on the perceived complexity of the individual transaction.
We can conclude that the voice interface provides a more efficient input of information (which is subject to some restrictions), but a less efficient output of information when compared to visual commerce channels, especially in high-complexity scenarios. The strengths of increased input efficiency and multi-tasking opportunities are therefore likely to be limited by the complexity of the transaction. Accompanying voice input with visual output may offset the disadvantages of speech output, but may also limit multi-tasking possibilities.
2.6 Customer Satisfaction Factors
2.6.1 Recommendation Complexity
In e-commerce practice, recommender systems are "tools to aid decision-making that analyze a customer’s previous online behavior and recommend products to meet their preferences" (Guo et al., 2018). Li and Karahanna (2015) similarly state that recommendation systems are broadly referred to as a "web-based technology that explicitly or implicitly collects a consumer’s preferences and recommends tailored e-vendors’ products or services accordingly". For example, systems can employ personalization via collaborative or content-based filtering or use preference elicitation frameworks (Christakopoulou et al., 2016). According to Wang and Benbasat (2008), recommendation agents are either tools "to facilitate users’ decision making by providing advice on what to buy based on user-specified needs and preferences" or, in a more limited conception, avatars that use animation and human voice to present recommendations (Li and Karahanna, 2015). Recommendation agents act more like salespeople: the "product-advising function originally performed by salespeople is being increasingly taken over by software-based product recommendation agents" (Qiu and Benbasat, 2009).
Voice commerce and chatbots use conversational recommender systems, "a form of recommender system that can refine user preference through conversational mechanism" (Baizal et al., 2017), focusing on search and personalization activities (Mahmood et al., 2014). Conversational recommender systems converse with users to learn their preferences and incorporate feedback from users (Christakopoulou et al., 2016; Llorente and Guerrero, 2012). Qiu and Benbasat (2009) present a system which uses an anthropomorphic interface combined with voice input.
As Hostler et al. (2012) show in their case study, e-commerce systems classically show recommendations on the side or bottom of the website, corresponding either to the product the customer currently views or to the items in the shopping cart. Recommendation agents, on the other hand, can start without any contextual information by asking the user a number of questions, calculating preferences, and then searching, filtering and presenting recommendations. Their intention is to provide assistance and support for customer decision making (Wang and Benbasat, 2005). In their study, Wang and Benbasat (2005) present an example of such a system with a text-based interface: users answer questions about digital cameras (how they intend to use them, how many pictures they take, how far the subjects are from the lens, etc.), and the system then presents selected results that fit the user’s preferences. Christakopoulou et al. (2016) present such a system in a conversational form; a different example is the eBay ShopBot (eBay, 2018), a commercial chatbot built on Facebook Messenger to search and filter eBay offers.
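The question-driven flow of such a recommendation agent (ask questions, derive preferences, then filter and present results) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and not the system of Wang and Benbasat (2005) or any other cited work: the `Camera` class, its attributes, the two questions, the thresholds, and the price-based ranking are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Camera:
    name: str
    zoom: int      # optical zoom factor (hypothetical attribute)
    price: float   # price in USD (hypothetical attribute)


def elicit_preferences(answers: Dict[str, str]) -> Dict[str, float]:
    """Translate questionnaire answers into simple numeric requirements."""
    prefs = {"min_zoom": 1, "max_price": float("inf")}
    if answers.get("subject_distance") == "far":
        prefs["min_zoom"] = 10       # assumed threshold for distant subjects
    if answers.get("budget") == "low":
        prefs["max_price"] = 300.0   # assumed budget ceiling
    return prefs


def recommend(catalog: List[Camera], prefs: Dict[str, float], top_n: int = 3) -> List[Camera]:
    """Drop products that violate a requirement, then rank the rest by price."""
    matches = [
        c for c in catalog
        if c.zoom >= prefs["min_zoom"] and c.price <= prefs["max_price"]
    ]
    return sorted(matches, key=lambda c: c.price)[:top_n]


# Hypothetical catalog and answers (assumed data, not from the thesis).
catalog = [
    Camera("Compact", zoom=3, price=150.0),
    Camera("Bridge", zoom=12, price=280.0),
    Camera("Superzoom", zoom=30, price=450.0),
]
answers = {"subject_distance": "far", "budget": "low"}
result = recommend(catalog, elicit_preferences(answers))
```

Unlike the sidebar recommendations described by Hostler et al. (2012), this flow needs no browsing context: the answers alone determine the result set. A conversational variant would ask the questions one at a time and refine the preferences after each answer.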
[...]