Focus on research: Emer Gilmartin, Adapt
2 November 2020
Emer Gilmartin is a PhD student at Adapt, the Science Foundation Ireland research centre for digital content, and CEO of Below Horizon. In this
interview she talks about how AI needs to become social.
Tell us about your academic journey to date
I started in mechanical engineering for reasons currently unknown to myself, then ran away to Spain and taught English, where I became interested in language learning. In the 2000s I worked at a campus company at Trinity, Integrate Ireland, where we provided tuition, materials, and teacher training for language support for adult and child refugees. During this time, I went back to study linguistics and speech and language technology, focussing on casual or social conversation – the language we use when we’re just hanging out. I also started to toy around with learning applications based on spoken dialogue technology with Ketong Su, a brilliant developer.
Last year, we started a spin out of the Adapt Centre, Below Horizon, initially creating language learning applications for the Chinese market. This year we pivoted into the Irish market, where we are developing materials and applications for primary and secondary school students. We also help provide free online language learning for migrants in Ireland on a non-profit basis through ListenHere.
Language researchers in Ireland often talk about the problems of accents and intonation. How do you tackle this problem for language learning?
In Ireland, we have a range of accents which differ greatly from the British and American accents used in most language applications. Then there is the material. The average migrant does not need to know how to book a hotel room. What they really need to know is how to work through an encounter with social services or a job interview or how to chat to their neighbour. For our non-profit work, we crowdsource audio and video from people in Ireland – and use it to build activities on relevant topics. Our synthesis is Irish accented, and we can tune our pronunciation tutor to suit.
There is an element of personalisation with Below Horizon’s AI tutors.
Personalisation is one of the things that does not have to be bolted on to dialogue. Knowing a person’s name, adopting a dialogue rhythm, remembering the history of dialogues you’ve had before with a specific person: all of this informs future interactions so they fit better with your interlocutor.
Personalisation is effectively built into good conversation user interface design. That’s where the science really becomes an art – writing and scripting dialogue, choosing the right data to make a conversation feel like there is another person there, like there is co-presence. It’s not really personalisation because dialogue is inherently personal and inherently personalised.
You mention casual conversation. How does this impact spoken dialogue technology?
My academic life is all about modelling ‘casual conversation’ – what we do when we’re not trying to do anything practical. The core of the problem is that in a classic task-based or transactional interaction – say ordering a pizza – once you’ve exchanged the information on size, toppings and price it’s ‘job done’ and there’s a very obvious outcome. In casual talk you want to form a social relationship between the user and the system, so the ‘how’ becomes vital. If you imagine a brief chat with your neighbour where they tell you about a football match, you may have no information about, or even interest in, football, but you will still carry on the conversation to be social. These conversations are not transactional – they build the social ‘glue’ between people, and systems need to be able to converse socially, particularly in domains where we want to build a relationship of trust with users – for example, healthcare and education.
Being genuine is central to getting accurate voice and conversational data.
Do people mind being recorded?
They do and they don’t. I recently got my first new television in years, and it’s interesting that the Alexa on it is ‘push to talk’ – it’s not always on. People are self-conscious when they push a button; when it’s always on, there’s more of a trend towards becoming ‘ourselves’ in the presence of the tech. This is great from the perspective of building more usable technology, but on the other hand the security concerns are huge.
Beyond the words themselves, what other cues are you looking for in dialogue?
A number of features and factors influence what we give away in a dialogue. For example, if I talk in a monotone, some systems can pick that up and infer whether there is some sort of depression involved. Other things we give away in our conversations are our attitudes, our emotions and our tone of voice. In customer service there is longstanding interest in working out when someone is getting frustrated enough that you should hand them over to a human agent. There are also factors which signal whether we’re in a formal situation or a more casual conversation.
A good conversation interface designer is almost like a script writer: they will create the dialogue framework to give that feeling of co-presence, and that comes down to the number of words in a phrase, the pitch curve of a phrase and things like pausing. It’s interesting that a lot of computational linguistics work was for a long time based on well-formed written text – a perfected view of language, with long phrases and no hesitations. In reality, conversation is built up of short phrases – when we can’t read back or rewind, we’re relying on limited working memory. Conversation is tuned to our affordances.

This is not just about memory: when you run a long phrase through a synthesiser, human listeners stop breathing, because we tune our breathing to each other and the machine doesn’t breathe. When you look at data from true interpersonal conversations you see a lot of repetition, a lot of pausing, a lot of what may seem like hesitation. These have all been proven to get messages across and form that vital interactive bond, but they are often things your English teacher would have told you not to do in an essay. Understanding what makes a conversation sound human is vital in the design stages, and this has been missing quite a bit.
The main problem for machine learning approaches is the type of data used. We have very good outcomes on simple conversation with machine learning – e.g. chatbots – but it is becoming difficult to get the kind of data we need to successfully generate and understand true casual conversation. I think it’s something the big companies are going to have to address. We’re never going to be able to model convincing social or indeed task-based talk just using billions of words of public forum exchanges as data. Twitter and Reddit data are readily available, but they don’t really reflect how people speak to each other, and it all comes down to GIGO – garbage in, garbage out: you can’t model real conversation without real conversational data.