Flippy Cards!

Posted 5. February 2018 by INLYRA - 9 min read

Our first voice-driven game for Alexa.

Flippy Cards is a game of pairs for Alexa.

Though voice-driven, the concept is easy to visualize. Items are placed behind colored cards for you to remember. The cards are rearranged, and you try to recall which item was matched with each card color.

Of course, the colors are spoken along with their match, so you probably do need to visualize.

Technical Description

Flippy is our first game for Alexa. There were a few key challenges for this project:

  • Alexa voice model development
  • Testing
  • Generating and storing game cards
  • Teaching the User the game
  • Tracking game state and User interactions
  • Scoring

We'll touch on these challenges and how we approached them.

Voice Model

The voice model represents the primary UI for our application. We like to think of VUI development in terms of an MVI (Model View Intent) pattern:

Intent observes the User, Model observes the Intent, View observes the Model, the User observes the View.

The distinction may seem arbitrary, but it becomes more significant for nuanced interactions with the User. As we'll touch on later, the decisions about routing User interactions to "controller" interfaces are more dynamic than MVC suggests, where modern variants insist on a one-to-one mapping between view action and controller code (or presenter code). This is not how we reason about VUI-driven application design. There are several big reasons for this, summarized in two main points:

  • Natural language interactions by their very nature exhibit conceptual overlap (quick example: the word "yes" may indicate acceptance, refusal, inquiry, sarcasm, cynicism, anger, and so much more!)
  • Vendor machinations w.r.t. ML/NLP decisions are hidden from us, yet all inputs are driven largely by those decisions

Whenever we process a User utterance, we must re-establish the context for their interaction and reproduce the corresponding state of our application. The catch is that the entire surface area of our voice model is like the rendered DOM for an SPA route. The User may continue the previous interaction, or spontaneously begin another interaction, at any level of the command structure. To put this into perspective: if the User is engaged in a game, with the next utterance they may wish to end the game, begin another game, modify the parameters of the current game, seek help, or terminate the skill posthaste. Or simply continue with the game.

This level of interactivity is expected, and represents natural control flow for any modern web app. But for a voice UI, attempting parity can easily achieve parody. For example, when VUIs rely heavily upon confirmation to establish the context of User intent, the results are less than ideal. Such an application prompts the User to confirm that every interpreted action is the intended one. Interactions with these UIs are wearing and tedious, and rapidly become altogether unpleasant.
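
To make the per-request flow concrete, here is a minimal sketch in Python over the raw Alexa request JSON. The helper names and the "phase" attribute are our own illustrative inventions, not the actual Flippy source; the point is simply that context is rebuilt before any meaning is assigned to the utterance.

```python
# Illustrative sketch: each utterance arrives as an isolated request, so we
# rebuild the interaction context first and only then decide what it means.

def default_state():
    return {"phase": "MENU", "score": 0}

def load_persisted_state(user_id):
    # Stand-in for a real datastore lookup; see the State section below.
    return default_state()

def route(event, state):
    # Stand-in for routing on (state, intent); also covered in the State section.
    request = event["request"]
    intent = request.get("intent", {}).get("name", request["type"])
    return {"speech": f"handling {intent} while in phase {state['phase']}"}

def lambda_handler(event, _context):
    session = event.get("session", {})
    if session.get("new", True):
        # New session: rebuild long-lived context from persistent storage.
        state = load_persisted_state(session["user"]["userId"])
    else:
        # Ongoing session: pick up where the last exchange left off.
        state = (session.get("attributes") or {}).get("state") or default_state()
    # The whole voice model is "live": the User may answer, change options,
    # ask for help, start over, or quit, regardless of what we prompted for.
    return route(event, state)
```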

We can build more natural interactions with the User. While this has been discussed exhaustively elsewhere, here is what we think we can add:

  • Carefully distinguish the linguistic elements of command utterances.
  • Build training modes into your application to establish lingua franca with the User.
  • Use dialog to refine intent.
  • Simplify.

We'd like to be clear that we are not recommending restrictive utterances and exotic keywords. Rather, we think you should establish lexical distance between command utterances and dialog: for example, avoid overloading terms in your model, and avoid homophones, loanwords, and so forth.
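
One cheap way to keep that distance honest is to lint the sample utterances themselves. The sketch below uses toy utterances and an arbitrary 0.8 threshold, not Flippy's real model, and flags phrases from different intents that sit suspiciously close together:

```python
import difflib
from itertools import combinations

# Toy lint pass over sample utterances: flag command phrases from different
# intents that are lexically close. Utterances and threshold are illustrative.
SAMPLES = {
    "QuizIntent": ["start a game", "begin the quiz"],
    "OptionsIntent": ["change the game", "update my options"],
    "DemoIntent": ["show me a demo", "give me an example"],
}

def confusable_pairs(samples, threshold=0.8):
    flat = [(intent, u) for intent, utterances in samples.items() for u in utterances]
    for (i1, u1), (i2, u2) in combinations(flat, 2):
        if i1 != i2 and difflib.SequenceMatcher(None, u1, u2).ratio() >= threshold:
            yield (i1, u1), (i2, u2)

for a, b in confusable_pairs(SAMPLES):
    print("possibly confusable:", a, b)
```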

That said, we'll dig into the intents defined for the Flippy voice model. First, the Alexa built-in intents we're incorporating:

Intent - Purpose
AMAZON.CancelIntent - Cancel the current operation
AMAZON.FallbackIntent - Catch-all
AMAZON.HelpIntent - Help with the current operation
AMAZON.NextIntent - Next
AMAZON.PreviousIntent - Previous
AMAZON.RepeatIntent - Repeat the last prompt
AMAZON.StartOverIntent - Top menu
AMAZON.StopIntent - Stop the application

AMAZON.FallbackIntent possibly deserves additional explanation: its primary purpose is to catch unhandled requests and prompt the User in a helpful way. The rest is largely self-explanatory, we hope you'll agree.
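
As a rough illustration of that "catch and prompt helpfully" role, a fallback handler might look like the following sketch over the raw request/response JSON; the prompt text and the "phase" session attribute are our own illustrative choices:

```python
def handle_fallback(event):
    # Reuse whatever context we have to make the prompt helpful rather than a
    # generic "I didn't get that." The "phase" attribute is our own session
    # bookkeeping, not something Alexa provides.
    attributes = event.get("session", {}).get("attributes") or {}
    hints = {
        "MENU": "You can say 'start a game', 'demo', or 'options'.",
        "QUIZ": "Tell me what was behind a card, for example: 'the blue card was the tiger'.",
    }
    hint = hints.get(attributes.get("phase", "MENU"), hints["MENU"])
    return {
        "version": "1.0",
        "sessionAttributes": attributes,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": "Sorry, I didn't catch that. " + hint},
            "reprompt": {"outputSpeech": {"type": "PlainText", "text": hint}},
            "shouldEndSession": False,
        },
    }
```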

Looking at our custom intents:

Intent - Purpose
AnswerIntent - Quiz phase interactions
DemoIntent - Training modes
OptionsIntent - Game parameters and options
QuizIntent - Game initiation

Some of the intents are named, shall we say, hopefully, and this can be misleading. Here is another way to think about the voice interactions:

Name - Intention
AnswerIntent - Declaration, retort
DemoIntent - Request or command ("give me," "may I have"), keyword-driven ("demo," "example")
OptionsIntent - Request or demand a change ("update," "modify")
QuizIntent - Request or demand commencement ("go," "begin")

The slots that act as NLP markers for these intents often overlap. Moreover, an intent may capture User utterances that exceed its intended scope. For example, AnswerIntent will capture the User making declarations, "yes," "no," "whatever," or answering a quiz question, "it's x!" The scope of these utterances extends beyond the quiz-response phase of the skill, and thus the application must distinguish the correct course of action using the session context and reconstituted state.
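
In code, that disambiguation can be as plain as branching on the reconstituted phase before deciding what an AnswerIntent payload means. The slot name, phase values, and keyword lists below are hypothetical:

```python
# Hypothetical sketch: the same AnswerIntent payload is read differently
# depending on the phase reconstructed for this request.
def interpret_answer(intent, phase):
    raw = (intent["slots"]["Answer"].get("value") or "").lower()
    if phase == "QUIZ":
        return ("quiz_response", raw)    # hand off to the scoring pipeline
    if raw in ("yes", "yeah", "sure", "ok"):
        return ("affirm", raw)           # plain declaration
    if raw in ("no", "nope", "whatever"):
        return ("decline", raw)
    return ("unclassified", raw)         # let the fallback prompt take over
```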

Testing

Since we apply BDD, unit testing is part and parcel of every commit. Good assertion libraries and coverage tools help immensely, but we won't go into that here. The primary challenge for testing this application was that the ASK SDK does not (currently) accommodate offline integration or end-to-end tests. Simple tools, such as those for generating synthetic Alexa request body JSON that is accurate and full-featured, were not available, so we had to create them. Simulating more complex interactions, such as Alexa dialog delegation or progressive response, also required custom tools, though at the time of this writing there have been some improvements in this regard. Scripted interactions with the ASK API (dev console) using the CLI were immensely helpful in building our test suite.
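
For example, a request factory along these lines covers most unit tests. It is trimmed to the fields our handlers read; a real envelope carries more (the context object, device data), and the identifiers here are placeholders:

```python
import uuid
from datetime import datetime, timezone

def make_intent_request(intent_name, slots=None, attributes=None, new_session=False):
    """Build a minimal synthetic Alexa IntentRequest envelope for unit tests."""
    return {
        "version": "1.0",
        "session": {
            "new": new_session,
            "sessionId": "amzn1.echo-api.session." + str(uuid.uuid4()),
            "application": {"applicationId": "amzn1.ask.skill.test"},
            "user": {"userId": "amzn1.ask.account.TEST"},
            "attributes": attributes or {},
        },
        "request": {
            "type": "IntentRequest",
            "requestId": "amzn1.echo-api.request." + str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "locale": "en-US",
            "intent": {
                "name": intent_name,
                "confirmationStatus": "NONE",
                "slots": {
                    name: {"name": name, "value": value, "confirmationStatus": "NONE"}
                    for name, value in (slots or {}).items()
                },
            },
        },
    }

# e.g. event = make_intent_request("AnswerIntent", {"Answer": "the blue card was the tiger"})
```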

The serverless approach contributes its own set of challenges in this regard. While we can deploy into staging and sandbox accounts, going further to fully "mock the wire" was important for our process. Docker, localstack, and similar tools made this possible.
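
Here is a sketch of what "mocking the wire" looks like for the persistence layer, assuming localstack's default edge endpoint and a hypothetical table name:

```python
import boto3

# Point boto3 at a local DynamoDB (e.g. localstack) so integration tests never
# touch AWS. Endpoint, credentials, and table name are assumptions.
def local_dynamodb(endpoint="http://localhost:4566"):
    return boto3.resource(
        "dynamodb",
        endpoint_url=endpoint,
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )

def ensure_game_table(dynamodb, name="FlippyGameState"):
    return dynamodb.create_table(
        TableName=name,
        KeySchema=[{"AttributeName": "userId", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "userId", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
```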

Generating the Game

We wanted to provide an element of dynamic content for the game, such that the same card content would not be returned ad nauseam. To do this, we generated a dictionary of color words and used a dictionary of English nouns for card content. The nouns still needed to be categorized into suitable groups like "animal," "constellation," "food," and so forth. We followed a standard text classification paradigm (cleanse data inputs, stem, label, train via gradient descent) to establish predictors. In a future article, we'll discuss the approaches we explored to produce an effective probabilistic model, specifically looking at the bag-of-words model and convolutional neural networks.
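
For a flavor of that pipeline, here is a compressed sketch using scikit-learn. The training rows are toy stand-ins and the suffix-stripping "stemmer" is a placeholder for a real one (e.g. Porter), so treat it as shape rather than substance:

```python
# Toy "cleanse, stem, label, gradient descent" pipeline for noun categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def crude_stem(word):
    # Placeholder stemmer: strip a few plural suffixes.
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

nouns = ["tiger", "lobster", "orion", "cassiopeia", "taco", "noodle", "monkey", "burrito"]
labels = ["animal", "animal", "constellation", "constellation", "food", "food", "animal", "food"]

model = make_pipeline(
    CountVectorizer(preprocessor=str.lower, tokenizer=lambda s: [crude_stem(s)], token_pattern=None),
    SGDClassifier(loss="log_loss", max_iter=1000),  # logistic regression fit via SGD
)
model.fit(nouns, labels)
print(model.predict(["lobsters"]))  # -> likely "animal"
```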

Using our categorized nouns and selecting a randomized color for each pairing, we generate the "game deck" for play. Certain unexpected outcomes were found immediately, such as humorous, offensive, and confusing combinations of colors and nouns in the generated games. You're curious, so here is a hint (a bashfully obscured reference?):

Blue - Brown Bear
Pink - Monkey
Black - Taco

This presents an additional engineering challenge, one that might be suitable for deep-learning solutions. We've seen good examples using LSTM, but have left this as an exercise for future work. In the interim, we have used blacklisting and scaled back the corpus used for game generation.
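
Here is roughly what that interim approach looks like; the word lists and blacklist entries are illustrative stand-ins for the generated dictionaries and the flagged combinations described above:

```python
import random

# Sketch of deck generation: pair each chosen noun with a distinct color and
# re-draw whenever a blacklisted combination appears. Data is illustrative.
COLORS = ["red", "blue", "green", "pink", "black", "orange"]
NOUNS_BY_CATEGORY = {
    "animal": ["tiger", "lobster", "monkey", "bear"],
    "food": ["taco", "noodle", "burrito", "pretzel"],
}
BLACKLIST = {("pink", "monkey"), ("black", "taco")}

def generate_deck(category, pairs=4, rng=random):
    while True:
        nouns = rng.sample(NOUNS_BY_CATEGORY[category], pairs)
        colors = rng.sample(COLORS, pairs)
        deck = list(zip(colors, nouns))
        if not any(pair in BLACKLIST for pair in deck):
            return deck

print(generate_deck("animal"))
```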

Tutorial

Flippy provides a "demo" state that teaches the User game interactions. When this mode is invoked, the User will be guided through a simplified version of the game where canned responses are suggested. Since this mode is interactive, all of the supported intents and utterances are available. Effectively this results in a simplified "shadow" skill that mirrors the full-featured one. This leads us back to the importance of "states" within the application.

State

The application is "stateless": the entire context and state of the game must be reconstructed for every request. This is readily done using persistent data storage and the User and session identifiers that Alexa services maintain for all requests and continuous interactions. We also use Alexa session attributes, but limit their scope.
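
A minimal sketch of that reconstruction, using DynamoDB as an example store; the table name and item layout are assumptions for illustration, not Flippy's actual schema:

```python
import boto3

# Illustrative persistence layer: game state is keyed on the Alexa userId,
# which the platform keeps stable across sessions for a given skill user.
TABLE = boto3.resource("dynamodb").Table("FlippyGameState")

def load_state(event):
    user_id = event["session"]["user"]["userId"]
    item = TABLE.get_item(Key={"userId": user_id}).get("Item")
    return item["state"] if item else {"phase": "MENU", "score": 0}

def save_state(event, state):
    user_id = event["session"]["user"]["userId"]
    TABLE.put_item(Item={"userId": user_id, "state": state})
```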

Reconstructing state is important for our application, as the invocation of an intent has different significance in each state. For example, AnswerIntent invokes different business logic during the Demo, Training, and Quiz phases. This is one of the areas where voice-driven applications differ substantially from traditional ones; the significance of context and inference in language cannot be overstated. As a result, the MVC pattern and its immediate derivatives can become unwieldy. You may counter that MVI is punting on the interface that connects User voice input and application business logic. We would say that this is still best called "controller" code, but controller routing is determined by a step function whose input is our application's finite state machine and the intent request.

Coming back to our ongoing metaphorical comparison with web-first applications, this allows us to model the views associated with intents in much the same way an SPA models page views, but also to include meta-controls like those available from the browser and referential links within a page. The use of "back," "forward," or "refresh" is not exclusive to any web app, but these controls are "relative" to the page views within the app. Alexa does not provide an equivalent concept for these meta-controls, so we have to provide them independently. On the other hand, we are given the AMAZON.xyz stubs, and there are command phrases that AVS will execute without invoking the active skill; you might argue these features are metaphorically equivalent. The big difference is that the skill developer must provide the entirety of the implementation for these stubs. To beat a dead horse, so to speak, consider AMAZON.RepeatIntent. In the browser-based model, this class of functionality would be provided by the platform or framework, while with AVS we must implement it from scratch, and if we want to use the stubbed intent, we must do so within the very restrictive parameters of the provided interface, which offers no dialog, no slot definitions, and no intent exchange.
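
To make the step-function idea concrete: the routing layer can be little more than a lookup keyed on the pair (application state, intent), with meta-controls such as AMAZON.RepeatIntent appearing as ordinary entries that we implement ourselves. The controller names and phases below are hypothetical:

```python
# Sketch of (state, intent) routing; controllers and phases are illustrative.
def answer_during_quiz(event, state):
    return {"speech": "scoring your answer"}

def answer_during_demo(event, state):
    return {"speech": "in the demo, that would count as an answer"}

def repeat_prompt(event, state):
    # "Repeat" is relative to where the User is, so we keep the last prompt
    # in our own state; the AMAZON.RepeatIntent stub provides nothing else.
    return {"speech": state.get("last_prompt", "There is nothing to repeat yet.")}

def fallback(event, state):
    return {"speech": "Sorry, I didn't catch that."}

ROUTES = {
    ("QUIZ", "AnswerIntent"): answer_during_quiz,
    ("DEMO", "AnswerIntent"): answer_during_demo,
    ("QUIZ", "AMAZON.RepeatIntent"): repeat_prompt,
    ("DEMO", "AMAZON.RepeatIntent"): repeat_prompt,
    # ... one entry per (phase, intent) pair the skill supports
}

def route(event, state):
    intent = event["request"]["intent"]["name"]
    controller = ROUTES.get((state["phase"], intent), fallback)
    return controller(event, state)
```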

Scoring

Finally, we come to scoring the User on the accuracy of their answer. To put this challenge into perspective, consider that the User utterance is often inaccurately interpreted. If the card is "red rock lobster," the input may occasionally amount to "bread rack mobsters." If we score this input verbatim, the User will receive no points, and they will become dissatisfied with the skill. To judge the User response, we explore approximate string matching as well as morphological analysis. NLP techniques can improve scoring considerably; let's look at fuzzy matching, Levenshtein distance, Jaro-Winkler distance, stemming, and metaphone comparison.

Some quick notes to begin. Scoring is weighted, with the subject noun of the expected answer carrying the most weight. Success is determined by an accumulator that must reach a given threshold. The scoring algorithm proceeds as follows (comparisons are case-insensitive; a sketch follows the list):

  1. Substring analysis: does the User response contain the expected string? The inverse is also tested.
  2. Until success or completion, apply the following comparisons recursively to unique strings derived from the singularized input and from the stemmed roots of the input:
  • Jaro-Winkler similarity: evaluate prefix-biased edit distance
  • Metaphone comparison: test the phonetic similarity between the User response and the expected answer
  • Levenshtein distance: use when symbol counts are relatively small
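
Below is a compressed sketch of that accumulator, assuming the jellyfish library for the primitives (its levenshtein_distance, jaro_winkler_similarity, and metaphone helpers) and a naive singularizer in place of full stemming; the weights and cut-offs are placeholders rather than Flippy's tuned values:

```python
import jellyfish

# Weighted fuzzy scorer: award partial credit for substring containment,
# prefix-biased similarity (Jaro-Winkler), phonetic agreement (metaphone),
# and small edit distance, then compare the accumulated score to a threshold.
# Weights and cut-offs are illustrative placeholders.

def singularize(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def score_answer(response, expected, threshold=1.0):
    response, expected = response.lower().strip(), expected.lower().strip()
    score = 0.0

    # 1. Substring analysis, in both directions.
    if expected in response or response in expected:
        score += 1.0

    # 2. Token-level comparisons on singularized words, weighting the final
    #    (subject) noun of the expected answer most heavily.
    exp_tokens = [singularize(t) for t in expected.split()]
    resp_tokens = [singularize(t) for t in response.split()]
    for i, exp in enumerate(exp_tokens):
        weight = 1.0 if i == len(exp_tokens) - 1 else 0.5
        best = 0.0
        for resp in resp_tokens:
            if jellyfish.jaro_winkler_similarity(exp, resp) >= 0.85:
                best = max(best, 0.6)
            if jellyfish.metaphone(exp) == jellyfish.metaphone(resp):
                best = max(best, 0.5)
            if max(len(exp), len(resp)) <= 8 and jellyfish.levenshtein_distance(exp, resp) <= 2:
                best = max(best, 0.4)
        score += weight * best
    return score >= threshold

print(score_answer("bread rack mobsters", "red rock lobster"))  # should pass
print(score_answer("red rum monster", "red rock lobster"))      # should fail
```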

The result is that "bread rack mobsters" will score as a successful match to "red rock lobster," but "red rum monster" will fail. Of course, several tunable thresholds are in play, and we can adjust these at runtime to produce better results as exposure grows. This is another area where AI can continue to improve the User experience.