Speaker Recognition: From Fundamentals To Practice

Abstract: My voice tells who I am. No two individuals sound identical, because their vocal tract shapes and other parts of their voice production organs differ. With speaker verification technology, we extract speaker traits from speech samples to establish a speaker’s identity. Voice is a combination of physical and behavioral biometric characteristics. The physical features of an individual’s voice are based on the shape and size of the vocal tract, mouth, nasal cavities, and lips that are involved in producing speech sounds. The behavioral aspects, which include the use of a particular accent, intonation style, pronunciation pattern, choice of vocabulary and so on, are associated more with the words or lexical content of the spoken utterances. This tutorial aims to equip participants with the fundamentals and techniques used in speaker recognition systems, from the most basic to the state of the art.
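To make the verification idea concrete, here is a minimal sketch, not the tutorial's own method: a system extracts a fixed-dimensional embedding from each recording, then accepts or rejects an identity claim by comparing the enrollment and test embeddings with a similarity score. The random vectors below are hypothetical stand-ins for embeddings produced by a trained extractor.

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings; higher = more similar."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

def verify(enroll: np.ndarray, test: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the claimed identity if the score clears a decision threshold."""
    return cosine_score(enroll, test) >= threshold

# Toy embeddings standing in for the output of a trained extractor.
rng = np.random.default_rng(0)
enroll = rng.standard_normal(256)                # enrollment recording
same = enroll + 0.1 * rng.standard_normal(256)   # same speaker, new session
other = rng.standard_normal(256)                 # a different speaker
```

In practice the threshold is tuned on held-out trials to balance false acceptances against false rejections; the embeddings themselves come from models covered in the tutorial.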

Kong Aik Lee

Presenter: Kong Aik Lee, Senior Principal Researcher, Biometrics Research Laboratories, NEC Corp., Japan. He received his B.E. (First Class Honours) in Electrical Engineering from Universiti Teknologi Malaysia (UTM), Malaysia in 1999 and his Ph.D. from Nanyang Technological University (NTU), Singapore in 2005. From 2006 to 2018, he was a research scientist at the Human Language Technology Department, Institute for Infocomm Research (I2R), A*STAR, Singapore. His current research interests include speaker recognition and characterization, multilingual recognition and identification, speech analysis and processing, machine learning, and digital signal processing. He also serves as an Editorial Board Member for Elsevier Computer Speech and Language, and Associate Editor for IEEE/ACM Transactions on Audio, Speech and Language Processing. He is a senior member of IEEE.

Speech-Centered Mobile Health Monitoring

Abstract: Human health is complex, variable, and changes in health are difficult to predict. Despite these challenges, healthcare providers and researchers are keenly interested in accurately anticipating health changes so that care can be appropriately allocated. However, accuracy is hampered by infrequent measurement and interpersonal variability. Measurements are costly for both patients and providers and can often only be obtained infrequently, while interpersonal variability alters the relationship between measurement and outcome. Even in the best cases, measurement practices may obfuscate critical behavioral patterns. Recent advances in paralinguistic recognition offer low-cost solutions that bring measurement outside of clinical environments, providing insights into how behaviors expressed in daily life are associated with health and wellness. In this tutorial, I will describe speech-centered paralinguistic detection approaches, highlighting human emotion expression, and link these approaches to health monitoring systems. I will provide an overview of health domains that are currently being explored, highlighting similarities across symptomatology. Next, I will provide an overview of current methods in paralinguistic detection and the accuracy of these approaches. I will then provide an overview of existing in-the-wild sensing strategies across a number of health and wellness domains. Finally, I will discuss open challenges and questions.

Emily Mower Provost

Presenter: Emily Mower Provost, Assistant Professor in Computer Science and Engineering, University of Michigan, US. She received her B.S. in Electrical Engineering (summa cum laude and with thesis honors) from Tufts University, Boston, MA in 2004 and her M.S. and Ph.D. in Electrical Engineering from the University of Southern California (USC), Los Angeles, CA in 2007 and 2010, respectively. She is a member of Tau-Beta-Pi, Eta-Kappa-Nu, IEEE, and ISCA. She has been awarded a National Science Foundation CAREER Award (2017), a National Science Foundation Graduate Research Fellowship (2004-2007), the Herbert Kunzel Engineering Fellowship from USC (2007-2008, 2010-2011), the Intel Research Fellowship (2008-2010), the Achievement Rewards For College Scientists (ARCS) Award (2009-2010), and the Oscar Stern Award for Depression Research (2015). She is a co-author of the paper “Say Cheese vs. Smile: Reducing Speech-Related Variability for Facial Emotion Recognition,” winner of Best Student Paper at ACM Multimedia 2014, a co-author of the winner of the Classifier Sub-Challenge event at the Interspeech 2009 emotion challenge, and a co-author of an honorable mention paper at ICMI 2016. Her research interests are in human-centered speech and video processing, multimodal interface design, and speech-based assistive technology. The goals of her research are motivated by the complexities of human emotion generation and perception.

Generalized Additive Modelling To Analyze Time-Series Data

Abstract: In the speech sciences, many datasets are encountered which deal with dynamic data collected over time. Examples include diphthongal formant trajectories and articulator trajectories observed using electromagnetic articulography. Traditional approaches for analyzing this type of data generally aggregate data over a certain timespan, or only include measurements at a fixed time point (e.g., formant measurements at the midpoint of a vowel). In this tutorial, I will introduce generalized additive modeling, a non-linear regression method which does not require aggregation or the pre-selection of a fixed time point. Instead, the method is able to identify general patterns over dynamically varying data, while simultaneously accounting for (non-linear) subject and item-related variability. An advantage of this approach is that patterns may be discovered which are hidden when data is aggregated or when a single time point is selected. A corresponding disadvantage is that these analyses are generally more time consuming and complex. This tutorial aims to overcome this disadvantage by providing two lectures and associated hands-on lab sessions on generalized additive modeling applied to articulography data (to illustrate one-dimensional non-linear patterns), and ERP data (to illustrate non-linear interactions of two variables).
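The core idea behind a generalized additive model is to replace linear terms with smooth functions fitted under a wiggliness penalty. Below is a minimal numpy sketch of that idea for a single smooth over time, using a truncated-power cubic spline basis and penalized least squares; it is an illustration only, with simulated trajectory data, and real analyses of the kind discussed in this tutorial would typically use a dedicated package (e.g., mgcv in R).

```python
import numpy as np

def spline_basis(t: np.ndarray, knots: np.ndarray) -> np.ndarray:
    """Cubic truncated-power basis: [1, t, t^2, t^3, (t - k)^3_+ per knot]."""
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.clip(t - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_smooth(t: np.ndarray, y: np.ndarray, knots: np.ndarray, lam: float = 1.0):
    """Penalized least squares: minimize ||y - Bc||^2 + lam * sum of squared
    knot coefficients (the wiggliness penalty); returns fitted values and c."""
    B = spline_basis(t, knots)
    pen = np.zeros(B.shape[1])
    pen[4:] = 1.0  # penalize only the truncated-power (knot) terms
    c = np.linalg.solve(B.T @ B + lam * np.diag(pen), B.T @ y)
    return B @ c, c

# Simulated formant-like trajectory: a smooth curve observed with noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(t.size)
knots = np.linspace(0.1, 0.9, 8)
yhat, coef = fit_smooth(t, y, knots, lam=0.1)
```

The penalty weight lam controls the smoothness of the recovered curve; full GAM software chooses it automatically (e.g., by cross-validation or REML) and extends the same construction to per-subject and per-item smooths.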

Martijn Wieling

Presenter: Martijn Wieling, Associate Professor, University of Groningen, Netherlands. Martijn Wieling (1981) obtained his PhD in 2012 cum laude at the University of Groningen in the Netherlands with his dissertation “A Quantitative Approach to Social and Geographical Dialect Variation” (promotors: Prof. John Nerbonne and Prof. Harald Baayen). After spending a year at the University of Tübingen (funded by a 1-year k€60 Dutch Science Foundation grant), where he gained experience using electromagnetic articulography, in 2013 he returned to Groningen with a 4-year k€250 personal Dutch Science Foundation grant to investigate second language acquisition using electromagnetic articulography. In 2015, he was promoted to a tenured position as Assistant Professor. Also in 2015, he became a member (since 2018: vice-chair) of the Dutch Young Academy of the Royal Netherlands Academy of Arts and Sciences, a group of 50 top young scientists from the Netherlands from all disciplines. In 2016, Wieling won the European Young Research Award, in part for the international dimension of his work. Wieling has over 70 peer-reviewed publications, with frequent publications in journals such as Journal of Phonetics, Language, Computer Speech and Language, and JASA. Particularly of note is a recently accepted 50-page tutorial on generalized additive modeling in Journal of Phonetics with Wieling as sole author (http://www.martijnwieling.nl/files/GAM-tutorial-Wieling.pdf). Wieling has frequently taught invited courses on generalized additive modeling (e.g., at Cambridge, McGill, Oldenburg, Pisa, Toulouse) and has (co-)taught courses at the LSA Institute (2015, Chicago) and ESSLLI (2018, Sofia, Bulgaria). His courses are generally well received: the LSA course was graded 9.4 (on a scale from 1 to 10).

Forensic Transcription: How The Law Gets It Wrong, And How ASSTA Expertise Should Be Involved In Setting Things Right

Abstract: Covert recordings are used as forensic evidence in many Australian criminal trials. Due to poor recording conditions, they are often indistinct, even unintelligible, to those lacking prior knowledge of their content. Following a 1987 High Court ruling, transcripts created by detectives (deemed ‘ad hoc experts’) are admitted as ‘assistance’ to juries in making out what is said in indistinct covert recordings. Because this ruling, and related practices, have been developed without consultation of the linguistic sciences, they incorporate a number of anomalies, only recently uncovered by forensic phonetics. Numerous cases of actual and potential injustice have been identified. The question now is how to create a fairer and more scientifically valid process for evaluating indistinct forensic recordings. This tutorial presents an overview of the problems, then moves on to consider what is needed for a viable solution. Emphasis is placed on the need for effective collaboration between phonetic science and the law: while phonetics offers much relevant knowledge, the forensic context creates a range of issues that have not yet received sufficient scientific attention. The tutorial ends with a discussion of what new research projects are needed.

Helen Fraser

Presenter: Helen Fraser, Adjunct Associate Professor, University of New England, Australia. Helen studied linguistics and phonetics at Macquarie University and the University of Edinburgh, then taught phonetics and related topics for many years at the University of New England. She has been involved in forensic casework since 1993, and, following experience in a troubling case in 2000, has pioneered the research field of forensic transcription and uncovered many cases of actual and potential injustice. More recently she has focused on bringing about reform of current legal practice, so as to ensure our courts reach reliable interpretations of indistinct covert recordings used as evidence in criminal trials. Please find more background at forensictranscription.com.au.