Peculiarities in Test Management of Voice User Interfaces (VUI)

Test management of voice user interfaces poses several challenges for software testing. Testing voice user interfaces differs from testing graphical user interfaces, mainly because of the variety of possible inputs in spoken or written language. It is also not obvious how to measure the accuracy of a voice user interface; to address this, requirements and acceptance criteria must be determined from the outset. In addition, it is difficult to find test utterances that cover the breadth of verbal possibilities. Finally, it must be clarified how tests over this variety of utterances can be automated in a way that adds value and remains risk-oriented. This article describes these points and outlines possible solutions.

Software testing and the requirements of voice user interfaces

Tools for quality assurance in agile software development

Requirements in agile development teams are described by:

  • Definition of Ready – conditions for starting the task
  • Definition of Done – conditions for completing the task
  • User Stories – content of the task
  • Scenario(s) – points to consider in the task
  • Acceptance Criteria – conditions for acceptance of the task

The team describes the software requirements from the customer's point of view and records them in user stories. Acceptance criteria and the Definition of Done are noted at the same time and become part of the requirements.

Test execution on voice user interfaces

Voice user interfaces (VUI) differ in how input is provided. The speed of the speech input affects the accuracy of the speech recognition, as do volume, articulation, and pitch. These factors decide the success or failure of the input and thus of the application.

Although spelling mistakes also occur when entering text, in my experience they are perceived less strongly than recognition errors. Spelling mistakes tend to be attributed to the users, whereas the application is blamed when spoken language is translated into text insufficiently.

Finding test utterances for voice user interfaces

Including language proficiency and language style in the test data

The influencing variables mentioned above also affect the compilation of test data. Test data sets should represent all influencing variables as representatively as possible. The challenge is that the distribution of the influencing variables in the target group is generally not known. Since the weighting of the parameters differs between the test population and the population in the field, deviations in recognition accuracy are to be expected. The following influencing factors must be taken into account as mandatory parameters when selecting acoustic test data:

  • Gender
  • Regional origin
  • Age
  • Educational qualification

Observing these criteria enables as many sections of the population as possible to use the voice application. Some of these parameters can be set before the test data is collected, while others can only be determined once the recordings are available. If the desired distribution of the parameters is not met, additional recordings are required until the assumed distribution of the influencing factors is actually present in the test corpus.
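
As an illustration, such a distribution check can be sketched in a few lines. The factor names, target shares, and tolerance below are illustrative assumptions, not values from a real project:

```python
# Sketch: check whether a recorded test corpus matches the assumed
# distribution of influencing factors in the target group.
from collections import Counter

TARGET_SHARES = {  # assumed target-group distribution (illustrative)
    "gender": {"female": 0.5, "male": 0.5},
    "region": {"north": 0.3, "south": 0.4, "east": 0.3},
}
TOLERANCE = 0.05  # acceptable deviation per value

def missing_recordings(corpus: list[dict], factor: str) -> dict[str, float]:
    """Return values of a factor whose share deviates beyond the tolerance."""
    counts = Counter(entry[factor] for entry in corpus)
    total = sum(counts.values())
    deviations = {}
    for value, target in TARGET_SHARES[factor].items():
        actual = counts.get(value, 0) / total
        if abs(actual - target) > TOLERANCE:
            deviations[value] = target - actual  # > 0: more recordings needed
    return deviations

corpus = [{"gender": "female", "region": "north"},
          {"gender": "male", "region": "south"},
          {"gender": "male", "region": "south"}]
print(missing_recordings(corpus, "gender"))  # e.g. {'female': 0.17, 'male': -0.17}
```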

Approaches to generating test utterances

The implementation of software tests is only as good as the impartiality of the people involved. The goal of software testing is to demonstrate software quality; hence the principle "tests create trust". Further parameters must therefore be taken into account when generating test data so that the bias of the test team is not reflected in the test utterances.

For complete coverage of these variables, the influencing variables under consideration (question syntax, articulation variants, and the number of language styles) are multiplied together. The variants can be obtained in the laboratory, through free field studies, or through field studies with specified application scenarios.
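
A minimal sketch of this multiplication, with assumed variant lists, shows how quickly the variant space grows:

```python
# Sketch: the full variant space is the product of the influencing
# variables; the concrete variants below are illustrative assumptions.
from itertools import product

question_syntaxes = ["wh-question", "yes/no question", "imperative"]
articulations = ["standard", "dialect", "fast", "quiet"]
language_styles = ["formal", "colloquial"]

variants = list(product(question_syntaxes, articulations, language_styles))
print(len(variants))  # 3 * 4 * 2 = 24 combinations to cover
for syntax, articulation, style in variants[:3]:
    print(syntax, articulation, style)
```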

No matter how the test utterances were obtained, follow-up work is required. The recordings must be listened to and transcribed correctly, the metadata for each test utterance (such as language, age, and pronunciation) must be noted, and the recognition results desired for the intended use must be recorded. This effort is multiplied for each additional language to be covered.
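
One possible record structure for such an annotated test utterance might look as follows; the field names are assumptions, not a fixed schema:

```python
# Sketch: one annotated test utterance with the metadata listed above.
from dataclasses import dataclass

@dataclass
class TestUtterance:
    audio_file: str        # path to the recording
    transcript: str        # verbatim written form of what was said
    language: str          # e.g. "en-US"
    speaker_age: int
    pronunciation: str     # e.g. "standard", "dialect"
    expected_intent: str   # recognition result desired for the use case

sample = TestUtterance(
    audio_file="recordings/0001.wav",
    transcript="turn on the light in the kitchen",
    language="en-US",
    speaker_age=42,
    pronunciation="standard",
    expected_intent="lights_on",
)
```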

Selecting a test methodology for the speech recognizer

Static testing techniques

For the software tests, it must be clarified whether static or dynamic test techniques are to be used.

Static testing techniques focus on formal or informal code reviews and on checking the documentation and related documents. There is no difference here to the software testing of graphical user interfaces.

Dynamic testing techniques

In dynamic testing techniques, the device under test (DUT) is executed; again, there is no difference between graphical user interfaces and voice user interfaces. Speech recognition can be tested at different levels of integration.

First, speech recognizers can be tested interactively with individual utterances. This is possible during the development phase, as a unit test, and also in later test phases. During execution, this often leads to a more empirical, exploratory approach to testing.

Next, there is execution with written inputs. Text files with different formulations are used to check the recognition models. The recognition results then show which hypotheses the language model produced. In this way, discrepancies at the syntactic level can be identified even more precisely and checked regularly. Because the text files exist as a corpus, these tests can detect changes in quality over time. And because written language is faster to process than spoken language, the tests run more quickly and cheaply than with audio recordings. It must be clarified, however, whether the speech recognizer used can handle written input at all.
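
A minimal sketch of such a text-corpus run is shown below. The function recognize_text() is a hypothetical stand-in for whatever text interface the recognizer actually offers:

```python
# Sketch: running a text corpus against the recognition model and
# comparing the hypotheses with the expected intents.
def recognize_text(utterance: str) -> str:
    """Hypothetical call to the recognizer's text input; returns the intent hypothesis."""
    raise NotImplementedError  # replace with the recognizer's actual API

def run_text_corpus(corpus_file: str) -> float:
    """Each line: '<utterance>\t<expected_intent>'. Returns the hit rate."""
    hits = total = 0
    with open(corpus_file, encoding="utf-8") as f:
        for line in f:
            utterance, expected = line.rstrip("\n").split("\t")
            hypothesis = recognize_text(utterance)
            total += 1
            if hypothesis == expected:
                hits += 1
            else:
                print(f"MISMATCH: {utterance!r} -> {hypothesis!r}, expected {expected!r}")
    return hits / total
```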

Finally, voice recordings can be played back. In this way, larger collections of audio recordings can be transferred to an instance of the speech recognizer while the recognition results are logged. This automated execution is suitable for release tests and regression tests to demonstrate the quality achieved.
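
Sketched below is one possible batch run over a directory of recordings, with every hypothesis logged for later comparison; recognize_audio() is again a hypothetical stand-in for the recognizer's API:

```python
# Sketch: batch playback of audio recordings into a recognizer instance,
# logging every hypothesis so release and regression runs can be compared.
import csv
import pathlib

def recognize_audio(wav_path: pathlib.Path) -> str:
    """Hypothetical call that sends one recording to the recognizer."""
    raise NotImplementedError  # replace with the recognizer's actual API

def run_audio_regression(audio_dir: str, log_file: str) -> None:
    with open(log_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file", "hypothesis"])
        for wav in sorted(pathlib.Path(audio_dir).glob("*.wav")):
            writer.writerow([wav.name, recognize_audio(wav)])
```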

These test techniques can be set up either with the end device or with the API of the speech recognizer.

Non-functional tests

Non-functional tests check the load behavior of the speech recognition and the speech output at different integration levels.

They also cover the ability to detect ambiguous expressions, at least within the chosen domain, and to respond to them appropriately.

Features of software ergonomics in particular must be taken into account in the non-functional tests. Self-descriptiveness, assistance, conformity with user expectations, and error tolerance are particularly in focus.

Tests of the robustness of the speech recognition against background noise and unusual articulation are to be planned regularly.

Regulatory framework conditions for the storage location of application artifacts must be checked.

Testing of voice user interfaces

Reproducibility in the test execution

The reproducibility of the system behavior is of immense importance for test execution, and voice user interfaces and graphical user interfaces differ enormously in this respect. When testing speech recognition, it is therefore best to test with speech recordings: when recordings are used, the triggering event is identical in every repetition of the test execution, which is the only way to increase the chance of reproducing the system behavior.

If it is not possible to feed in voice recordings, at least the text of the test utterances must be documented in the test case, and the same people must be used to carry out the test. Much of the variance in the linguistic features is then lost, however.

Automation of test execution

If the speech recognizer used offers the possibility of processing text files or audio recordings in bulk, this should be used. The construction of a test corpus that maps the linguistic diversity is an essential prerequisite for this. Maintaining the test corpus, however, increases the overall test effort. The corpus must be maintained in such a way that its individual revisions can be mapped to the software status of the application under test.
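
A minimal manifest, for example, can tie a corpus revision to the software version it was built for; the format and field names below are assumptions:

```python
# Sketch: record which corpus revision belongs to which software status.
import json

manifest = {
    "corpus_revision": "2024-03-r7",
    "app_version": "3.2.1",  # software status the corpus targets
    "languages": ["de-DE", "en-US"],
    "utterance_count": 1842,
}
with open("corpus_manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2)
```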

As in all cases of automation, it must be considered whether the effort for configuring, implementing, and evaluating the automation is less than that of manual execution. Automated tests are used to prove that software quality has been maintained, while manual tests are usually the method of choice for new functions and for rapid commissioning. And as in any test, merely repeating the same tests will eventually stop uncovering new bugs.

Frequency of testing

Each testing phase covers the re-testing of bug fixes, the testing of new features, and the testing of existing features. This results in the following test scope for each software delivery:

  • confirmation testing (re-testing of bug fixes)
  • functional testing
  • regression testing

Test automation is to be provided for the regression tests. To avoid test fatigue, the regression tests must also be renewed and modified regularly.

Test evaluation for voice user interfaces

The reporting for test execution on voice user interfaces differs only slightly from the usual test reporting: expected and actual behavior are compared. One challenge lies in evaluating the hypotheses returned by the speech recognizer. The recognition accuracy of the sentences in the test corpus must be reported with respect to the intended meaning of each utterance, which is why the intention of each utterance must already be described when the test corpus is created.

The evaluation of the speech recognition results is based on statistical information: for a specific test utterance from the test corpus, there is a certain probability that this sentence will be interpreted as expected. This differs from graphical user interfaces, where there is only pass or fail. The probability of correct recognition should therefore be indicated for each test parameter used.
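
As a sketch, such a per-parameter evaluation can be derived from the logged test results; the result records below are illustrative:

```python
# Sketch: recognition accuracy broken down per test parameter, so the
# report can state the probability of correct recognition per group.
from collections import defaultdict

results = [  # one entry per executed test utterance (illustrative)
    {"gender": "female", "region": "north", "correct": True},
    {"gender": "female", "region": "south", "correct": False},
    {"gender": "male",   "region": "north", "correct": True},
]

def accuracy_by(parameter: str) -> dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[parameter]] += 1
        hits[r[parameter]] += r["correct"]
    return {value: hits[value] / totals[value] for value in totals}

print(accuracy_by("gender"))  # e.g. {'female': 0.5, 'male': 1.0}
```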

If the test was not carried out on the end device, a qualified description is also required of how the test results obtained under laboratory conditions transfer to the conditions on the end device.

Testing process for voice user interfaces (summary)

The voice user interface testing process differs from graphical user interface (GUI) testing in a number of ways. This is caused by the statistical distribution of the test results, which are less deterministic than those of GUI tests.

The different influences on the language must be taken into account before and during test data collection so that the distribution of the influencing variables in the test data corresponds to the distribution assumed for the target system. Test data can be collected in laboratory, free, or guided field studies.

Voice recordings should preferably be used during the test so that the test results can be reproduced.

Metrics that describe the probability of correct recognition, measured against the expected results per test parameter, are to be used for the test evaluation.

This is the second blog post in a series in which we describe our many years of experience with Azimo's Android app testing.   The principles, goals, achievements, and most of the principles also apply to our iOS application. QA engineers within the team Let's go back to , the first blog post of the series.   The problem wasn't with missing unit tests, but rather the QA team being unavailable when we needed them.   Why did we create hundreds of unit tests, instead of hiring a dedicated QA engineering? Quality Assurance at Azimo: The evolution of apps Our journey, our goals, and our motivations Even if this were possible, it wouldn't be enough to keep them or her busy all the time.   Our app was only released once per month.   This was not only due to limited access to the QA team, but also because of the poor quality of our code and bugs, and the lack of process automation. This was changed by unit tests.   They improved the quality of our codebase and...