Focus on research: Dr Naomi Harte, Adapt

18 October 2016

Dr Naomi Harte is Associate Professor in Digital Media Systems, Electronic & Electrical Engineering, at the Adapt Centre for digital content research, based in Trinity College Dublin. She has published over 60 peer-reviewed papers in her specialist areas. For the past two years, Dr Harte has been involved in a major collaboration with Google on Chrome and YouTube, leading to multiple patent applications and publications. Here she discusses her interest in speech recognition technology and similarities between academia and working in a start-up.

Having come from an electronic engineering background, where did the interest in speech recognition come from?
In third and fourth year in college I studied a subject called ‘digital signal processing’ (DSP). I learned all about how you can manipulate information-bearing signals using clever mathematical techniques. These techniques are used by engineers in a broad range of applications from wireless communications to video processing, and also in speech analysis.

The idea of merging these analysis techniques with computer-based speech recognition – a task we perform so easily as humans – was very attractive and a big challenge at that time. So when I got the opportunity to do a PhD in speech recognition at Queen's University Belfast, I had no hesitation.
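To give a flavour of what those DSP techniques look like, here is a minimal sketch of one of the simplest: a moving-average filter that smooths noise out of a sampled signal. The signal, sample rate and filter length are invented purely for illustration.

```python
import numpy as np

fs = 8000                                   # sample rate in Hz (assumed)
t = np.arange(0, 0.05, 1 / fs)              # 50 ms of samples
clean = np.sin(2 * np.pi * 200 * t)         # a 200 Hz tone
noisy = clean + 0.3 * np.random.randn(t.size)

taps = 8
kernel = np.ones(taps) / taps               # moving-average impulse response
smoothed = np.convolve(noisy, kernel, mode="same")

# The filter attenuates the high-frequency noise while largely
# preserving the information-bearing tone underneath it.
print(f"noise power before: {np.mean((noisy - clean) ** 2):.3f}")
print(f"noise power after:  {np.mean((smoothed - clean) ** 2):.3f}")
```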

Some of your more recent projects look at ‘audio-visual fusion’. Could you tell us what that is and how you’ve been working with it?
Modern speech recognition technology like Siri, Cortana or Echo works by analysing the audio, or sound, from what you say. This technology has come a long way in the last five years, and works really well in quiet conditions. Have you ever tried using it in a noisy, crowded room, though? Performance will fall apart because of the noise.

As humans, in similar conditions we start to rely more and more on visual cues to compensate for the poor audio: we unconsciously lip-read to help us understand what a person is saying. In my work at Adapt, I have been exploring what visual information a computer could take from a person's face to learn to lip-read. The question is, how can you fuse this visual information with the audio information you already have?

This area of audio-visual fusion is a significant challenge but has the potential to increase the usability and reliability of speech recognition.
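One common baseline for this kind of fusion is 'late fusion': run separate audio and lip-reading recognisers, then combine their scores, trusting the audio less as conditions get noisier. The sketch below uses invented models, a hypothetical reliability weight and toy probabilities purely for illustration; it is a general technique, not a description of the Adapt system.

```python
import numpy as np

def late_fusion(audio_log_probs, visual_log_probs, audio_reliability):
    """Combine word scores from an audio model and a lip-reading model.

    audio_reliability is a weight in [0, 1]: close to 1 in quiet rooms,
    close to 0 in noisy ones (it could, for example, be estimated from
    the signal-to-noise ratio of the recording).
    """
    w = audio_reliability
    # A weighted sum of log-probabilities is a weighted product of
    # probabilities - a standard way to combine two sources of evidence.
    fused = w * audio_log_probs + (1.0 - w) * visual_log_probs
    return int(np.argmax(fused))  # index of the most likely word

# Toy scores for three candidate words: in a noisy room the audio
# model is unsure, but the lip shapes clearly favour word 1.
audio_scores = np.log([0.40, 0.35, 0.25])
visual_scores = np.log([0.10, 0.80, 0.10])

print(late_fusion(audio_scores, visual_scores, audio_reliability=0.2))  # -> 1
```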

Between voice, gesture and swipe we’re using more ways to interact with devices than ever before. What do you think is the next step?
The next step, as I see it, is to make devices that don’t just process the words that you say, but have the ability to truly understand what you mean by them. By this I mean devices that can sense your emotional state when you are speaking or figure out just how engaged you are in a conversation.

What we are working towards in this field is the integration of speech understanding with speech synthesis, to create a realistic voice that can speak back and genuinely hold a human-like conversation. This move towards natural interaction is a major step in delivering technology like digital assistants that people would actually find useful in their homes.

Having worked with Google on Chrome and YouTube, how does the process of working with large multinationals differ from working with start-ups? Do you miss being your own boss?
Having worked in my own start-up and collaborated with companies like Google since returning to academia, I have to say that being an academic is the closest you can come to being your own boss, with all the positives and negatives that brings with it.

Working with large multinationals is very motivating, and you get quite a thrill seeing your ideas eventually incorporated into their technology. They also, of course, have the security to pursue research ideas that may take several years to become useful in products.

Start-ups have more pressing timelines and can't expose themselves to as much risk. Academic freedom – the core idea that, as academics, we have the freedom to pursue research to advance knowledge in the wider sense – makes me feel I am still my own boss too. It's important to strike a balance between research that will clearly be useful in the next decade and research that may underpin new developments 50 or 100 years from now, in ways we just don't yet know.

Dr Naomi Harte will give a talk titled 'Understanding Speech: Don't Just Listen!' at the Adapt Centre's Intelligent Systems Showcase in Croke Park on 25 October. The event is aimed at industries looking to achieve more through digital transformation. Register online at https://adaptshowcase.eventbrite.ie.
