Focus on research: Emer Gilmartin, Adapt

Date: 2020-11-02

Emer Gilmartin is a PhD student at Adapt, the Science Foundation Ireland research centre for digital content. In this interview she talks about how AI is producing the next generation of language tutors.

Tell us about your academic journey to date.

I started in mechanical engineering for reasons currently unknown to myself, then ran away to Spain and taught English, where I became interested in the teaching of languages. In the 2000s I worked in language provision for migrants at Integrate Ireland, a campus company at Trinity, where we provided tuition, curricula, courses and teacher training for adult and child refugees. During this time, I went back to study linguistics and speech and language technology. This was before the advent of Siri, etc. That experience pulled me back into the language learning area with Ketong Su, who is a fantastic developer. We started to toy around with building language learning applications based on spoken dialogue technology. Last year, we started a spin-out of the Adapt centre, Below Horizon, where we work on language learning applications for the Chinese market. This year we pivoted into the Irish market, where we are developing materials and applications for primary and secondary schools. We also work with refugees, migrants and asylum seekers on a non-profit basis.

Language researchers in Ireland often talk about the problems of accents and intonation. How do you tackle this problem?

In Ireland, we have a range of accents which bear little resemblance to the British and American accents used in most language applications. Then there is the material. The average migrant does not need to know how to book a hotel room. What they really need to know is how to work through an encounter with social services or a job interview, or how to chat to their neighbour.

My academic life is about ‘casual conversation’: what we do when we’re not trying to do anything practical. The core of the problem is that if you are talking in a classic task-based interaction, say ordering a pizza, once you’ve got the information on size, toppings and so on, it’s ‘job done’ and there’s a very obvious outcome. In casual talk, where you want to form a relationship between the user and the system, the ‘how’ becomes vital. Imagine talking about football with your neighbour: I have zero information on it, but I should be able to carry on a conversation, and that alone is building a social bond between me and my neighbour.

We have very good outcomes with machine learning (e.g. chatbots), but it is becoming difficult to get the kind of data we need to successfully generate and understand non-task conversations. I think it’s something the big companies are going to have to address. We’re never going to be able to build convincing social talk using billions of public forum exchanges as models.

Being genuine is central to getting accurate voice and conversational data. Do people mind being recorded?

They do and they don’t. I recently got my first new television for years and it’s interesting in that the Alexa on it is ‘push to talk’; it’s not always on. People are self-conscious when they push a button, whereas when it’s always on there’s more of a trend towards becoming ‘ourselves’ in the presence of the tech. This is great from the perspective of building more usable technology, but on the other hand the security concerns are huge.

There is an element of personalisation with Below Horizon’s AI tutors.

Personalisation is one of the things that does not have to be bolted on to dialogue. It becomes personalised. Knowing a person’s name, adopting a dialogue rhythm, the history of dialogues you’ve had before with a specific person: all of this will inform future interactions to fit better with your interlocutor. Personalisation is effectively built into good conversational user interface design. That’s where the science really becomes an art: writing and scripting dialogue, choosing the right data to make a conversation feel like there is another person there, like there is co-presence.

Beyond vocabulary what other cues are you looking for in dialogue?

A number of features and factors influence what we are giving out in a dialogue. You can start by working on recognition of affect: for example, if I talk in a monotone, a machine can pick it up and detect whether there is some sort of depression involved. People tend to use particular registers, and systems can be built to meet those, so you can make the interaction more complex.

Other things we give away in our conversations are our attitudes, our emotions, our tone of voice. In customer service there is interest in working out when someone is getting frustrated enough that you should hand them over to a human agent. That’s where it really does become an art. A good conversation interface designer is almost like a script writer; they will create the dialogue framework to give that feeling of co-presence, and that comes down to the number of words in a phrase, the pitch curve of a phrase and things like pausing. It’s interesting in that a lot of the computer science work for a long time was based on well-formed formulae, which is a perfected view of language. You shouldn’t hesitate, you shouldn’t have long informative phrases. When you run a long phrase through a synthesiser, humans stop breathing because we tune our breathing to each other. When you look at data from true interpersonal conversations you see a lot of repetition, a lot of pausing, a lot of what may seem like hesitation. These have all been proven to get messages across and form that bond, yet they are the very things your English teacher would have told you not to do in an essay. Understanding what makes a conversation sound human is vital in the design stages, and this has been missing quite a bit. It’s not really personalisation, because dialogue is inherently personal and inherently personalised.