describes speech production as a 2-stage process: a sound source is modified by the vocal tract [vt], acting as a filter, to produce distinct speech sounds
systems that vibrate when they are stimulated [e.g. a pendulum]; the freq depends on the length of the string, so the longer the string the slower the movement
the reinforcement or prolongation of sound by reflection from a surface or by the synchronous vibration of a neighbouring object
preferred freq responses of that system [resonant freqs] + because this is diff for every sound the vt is referred to as a filter
taken through the air-filled column to the lips and any freq components near these peaks [resonant freqs] get boosted + pass through the system well
resonances of vt + spaced uniformly in freq when the vocal tract has a constant width along its length [case for the Schwa vowel] + are numbered [from low to high freq] F1, F2, F3 etc
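a quick worked example [not from the notes, but the standard quarter-wave approximation for a uniform tube closed at the glottis + open at the lips]:

F_n = (2n - 1) c / (4L)

with speed of sound c ≈ 350 m/s and an assumed vt length L ≈ 0.175 m, this gives F1 ≈ 500 Hz, F2 ≈ 1500 Hz, F3 ≈ 2500 Hz - uniformly spaced, as for the schwa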
changes position of formant freq
separate control of the 2 parts of the system: source [pitch: which is the buzzing from glottal source] and filter [vowel type]
we have independent control of the 2 parts of the system; we can adjust the rate of vocal-fold vibration to change voice pitch + shape of vt to change vowel identity
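a minimal sketch of the source-filter idea in code [an illustration, not the lecture's method; the sample rate, pitch, formant freqs + bandwidths are all assumed values]:

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sample rate (Hz), assumed
f0 = 120                         # voice pitch (Hz), assumed
formants = [500, 1500, 2500]     # schwa-like formant freqs (Hz)
bandwidths = [80, 100, 120]      # assumed formant bandwidths (Hz)

# source: a crude glottal "buzz" - an impulse train at the pitch f0
source = np.zeros(fs)            # 1 second of signal
source[::fs // f0] = 1.0

# filter: cascade of 2nd-order resonators, one per formant
signal = source
for fc, bw in zip(formants, bandwidths):
    r = np.exp(-np.pi * bw / fs)               # pole radius from bandwidth
    theta = 2 * np.pi * fc / fs                # pole angle from centre freq
    a = [1.0, -2 * r * np.cos(theta), r ** 2]  # resonator denominator
    b = [1.0 - r]                              # rough gain normalisation
    signal = lfilter(b, a, signal)

# changing f0 changes the pitch; changing `formants` changes the vowel -
# the independent source/filter control described above
```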
all consonants involve narrowing or constriction of the vt [somewhere along its length] + many consonants involve rapid/abrupt changes in vt shape [can't sing a consonant]
point of max constriction along the vocal tract [there are many places where that can happen]
if the constriction is narrow enough to produce air turbulence, a hiss-like sound is produced + modified by the shape of the vt
if airflow is completely blocked & abruptly released, a brief burst of noise is produced + modified by shape of vt
bilabial [ba] + labiodental [fa] + dental + alveolar [da] + retroflex + palato-alveolar + palatal [sa] + velar [ga]
place of articulation + manner of articulation + voicing
refers to the diff in acoustic characteristics, such as speaking rate, intensity, & affect, that exist between diff speakers, which can pose challenges for technologies like Automatic Speech Recognition
measured from the glottis to the lips; a crucial factor in speech production + the place where sound produced at the larynx (or syrinx in birds) is filtered and shaped into speech
varies considerably between men + women: the longer the speaker's vt, the lower the freqs of the formants produced when articulating a particular vowel + the acoustic properties of consonants are also affected
ratios of the formant freqs to one another [hence why we listeners can understand men, women + children talking about the same thing]
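a rough illustration using the same uniform-tube approximation as above [the lengths are assumed, not from the notes]: since F_n = (2n - 1) c / (4L), every formant scales as 1/L. For L ≈ 17.5 cm the formants are ≈ 500, 1500, 2500 Hz; for a shorter vt of L ≈ 14 cm they are ≈ 625, 1875, 3125 Hz. The absolute freqs shift up, but the ratios F2/F1 = 3 and F3/F1 = 5 are unchanged - which is what the listener can rely on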
does not complete its movements towards idealized positions for vowel articulation [e.g. less movement from the centre of the mouth on the front-back + high-low dimensions]
makes all vowels acoustically more similar to the neutral [Schwa] vowel 'uh' + it reflects inertia of the articulators + becomes more marked as the rate of speech increases
particularly important for consonant production
the articulation of 2 or more speech sounds together so that one influences the other
articulations of neighbouring phonemes interact: the vowel in 'kee' is produced with a front tongue position + spread lips whilst the vowel in 'koo' is produced with a back tongue position + rounded lips [the initial stop is produced diff in anticipation of the following vowel]
in parallel, one segment of speech signal may carry info about more than one phoneme at once [non-linear process]
visual representation of speech produced by a sound spectrograph - in free-flowing speech the freq spectrum is almost continuously changing
x axis = time + y axis = freq, and how much energy there is at any particular freq at a given time is shown by how dark the trace is [so dark is where energy is concentrated]
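a minimal sketch of computing + displaying a spectrogram in Python [illustrative only; the file name and analysis parameters are assumptions]:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("utterance.wav")    # hypothetical mono recording
x = x.astype(float)

f, t, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)

# dark regions = concentrations of energy; formants appear as dark bands
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), cmap="gray_r", shading="auto")
plt.xlabel("time (s)")
plt.ylabel("freq (Hz)")
plt.show()
```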
recordings of natural utterances
modified natural utterances
synthesize artificial speech-like stimuli
you get lots of people to speak various bits of text and you record them then look for common features that seem reliably to be present when diff people speak particular phonemes or syllables in particular contexts
take natural recordings but this time deliberately modify them in some way - so you deliberately distort or eliminate some of the features present in speech, then measure the impact that has on the intelligibility of the speech [to listeners - essentially, if you eliminate a critical part of speech it isn't intelligible]
most widely used + most flexible - deliberately synthesize speech-like stimuli [advantage - you can choose to simulate a particular subset of features, then establish whether that is capable of supporting intelligibility] again we use listeners to measure effectiveness
refers to the clarity with which speech can be understood by a listener - a measure of how well a person's speech can be perceived + interpreted, based on the ability of the listener to recognize individual words, sounds + sentences
most widely used method of assessing intelligibility - you have a panel of listeners and you play them a set of carefully crafted recordings that contain a list of items [such as sentences, words, or nonsense syllables]
in an articulation test - the % of items correctly perceived
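a toy illustration of the calculation [the word list + responses here are invented, purely to show the arithmetic]:

```python
def articulation_score(presented, responses):
    """percent of list items the listener reported correctly"""
    correct = sum(p == r for p, r in zip(presented, responses))
    return 100.0 * correct / len(presented)

presented = ["beet", "bit", "bet", "bat", "but"]
responses = ["beet", "bit", "bet", "pat", "but"]   # one error
print(articulation_score(presented, responses))    # -> 80.0
```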
we have roughly 40 speech sounds in English - these are represented in these lists in roughly the same proportion as they occur in the English language
the scores that people get vary with the type of list you give them - highest scores for sentence lists, middle scores for word lists and lowest for nonsense words
a sentence can be understood fully even if not every word would be perceived correctly when presented in isolation [the context of the sentence also provides info about what's being said]
normal convo can be carried out without too much difficulty in conditions that would give 50% articulation scores on typical word lists [if conditions only let people get 50% of isolated words right, you can still have a decent convo in those conditions]
noise with a flat spectrum - it's got roughly equal energy across the whole range of audible freqs + is a hiss-like sound
20 dB - has no effect on intelligibility
0 dB - gives borderline intelligibility
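a minimal sketch of mixing speech with white noise at a chosen speech-to-noise ratio [the file name + target SNR are assumptions]:

```python
import numpy as np
from scipy.io import wavfile

fs, speech = wavfile.read("utterance.wav")   # hypothetical mono recording
speech = speech.astype(float)

target_snr_db = 0.0                          # 0 dB ~ borderline intelligibility
noise = np.random.randn(len(speech))         # white noise: flat spectrum, hiss-like

# scale the noise so that 10*log10(P_speech / P_noise) equals the target SNR
p_speech = np.mean(speech ** 2)
p_noise = np.mean(noise ** 2)
noise *= np.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))

mixture = speech + noise
```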
speech is intelligible even when the noise is more intense than the speech - if the noise comes from a diff direction, or is interrupted, or fluctuates [periodic gaps in it] [so we're robust to the noise]
no particular freq region is essential for speech recognition
transmission systems [telephone systems] don't do a perfect job of reproducing all the content of the original signal - they only pass some of the freqs and the rest are filtered out
systems that transmit all freqs above a specific freq; freqs below get cut out [low pass does the opposite]
3.2 kHz wide
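a rough sketch of telephone-style band-pass filtering with scipy [the exact cutoffs are assumptions, chosen only to give a band roughly this wide]:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

fs, x = wavfile.read("utterance.wav")        # hypothetical mono recording
x = x.astype(float)

low, high = 300.0, 3500.0                    # Hz; assumed passband edges
sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
band_limited = sosfiltfilt(sos, x)           # energy outside the band is removed
```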
if you overload an amplifier [e.g. at a venue] - it chops off the peaks and troughs of the waveform - produces highly unnatural but still intelligible speech [has to be really severe for you to not understand at all]
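a tiny sketch of peak clipping [the clipping threshold is an arbitrary, assumed value]:

```python
import numpy as np

def peak_clip(x, fraction=0.1):
    """chop off waveform peaks + troughs above a fraction of the max amplitude"""
    limit = fraction * np.max(np.abs(x))
    return np.clip(x, -limit, limit)

# even severe clipping leaves speech largely intelligible, as described above
```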
speech is often intelligible even when the acoustic cues are highly degraded because the speech wave contains multiple cues to its message - so any one distortion destroys some cues but others remain that can allow speech recognition in adverse listening conditions
an acoustic 'cartoon' of normal speech - you start with a spectrogram of a real utterance, track each formant and replace it with a pure-tone 'whistle' that follows the trace of that formant + changes only in freq & level
most people when hearing this unnatural sounding version of speech can begin to understand these stimuli only after learning that they are degraded speech
all the acoustic complexity of real speech - although synthesised speech can sound very unnatural
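a minimal sine-wave-speech-style sketch: each formant track is replaced by a pure tone that follows its freq + level [the tracks below are invented placeholders; real ones would be measured from a spectrogram]:

```python
import numpy as np

fs = 16000
n = fs                                   # 1 second of signal

# hypothetical formant tracks (Hz): a slow F1/F2 glide
f1 = np.linspace(700, 300, n)
f2 = np.linspace(1100, 2200, n)

def tone(freq_track, level):
    # phase of a tone with time-varying freq = 2*pi * cumulative sum of freq / fs
    return level * np.sin(2 * np.pi * np.cumsum(freq_track) / fs)

sine_wave_speech = tone(f1, 1.0) + tone(f2, 0.5)
```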
experiments using synthetic speech have shown that the 1st three formants [esp F1 + F2] are the most important for vowel recognition
inversely proportional to vowel height [the higher the tongue position, the lower the 1st formant freq]
proportional to vowel frontedness [front vowels have a high F2 whilst back vowels have a low F2]
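illustrative ballpark F1/F2 values [approximate adult-male textbook figures, not measurements from this lecture] showing both relationships:

```python
vowels = {
    "ee (high, front)": (300, 2300),   # low F1 (high vowel), high F2 (front vowel)
    "ah (low, back)":   (700, 1100),   # high F1 (low vowel), lower F2
    "oo (high, back)":  (300, 900),    # low F1 (high vowel), low F2 (back vowel)
}
for vowel, (f1, f2) in vowels.items():
    print(f"{vowel}: F1 ≈ {f1} Hz, F2 ≈ {f2} Hz")
```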
cross-speaker variation + undershoot
must be normalized by the perceiver to take account of differences in vocal tract length between men, women + children
when a person doesn't move their tongue to the ideal position in the mouth during conversational speech [so F1 + F2 don't move far enough to make it easy for the listener to distinguish diff vowels from each other]
started with a hand-drawn spectrogram and then changed that into synthetic speech-like sounds that corresponded to it
they varied the plosive burst + then diff individual vowels followed
so what is heard depends on the freq of the plosive burst + the freqs of the vowel formants following
whenever you're producing the current speech sound [phoneme] you're already preparing for the next one [so they interact]
if you have a rapid change in freq at the beginning of the 2nd formant (an F2 transition) that can generate the percept of a stop consonant-vowel syllable [e.g. hat vs hot]
as much as phonemes might seem like beads on a string, one after another, real-life speech involves complex interactions between neighbouring things we're articulating
relating acoustic features at several diff points in time as well as at diff points in the freq spectrum
the McGurk effect - made both audio and visual recordings of bi-syllables, deliberately mismatched them and then played them at the same time
the McGurk effect shows how important visual cues are in speech perception: although the syllable was acoustically the same each time, seeing the articulatory movements could alter what was heard
expectation can influence speech intelligibility - the rules of language constrain the possible identities of the speech signal far more than most people realize [i.e. Christmas always occurs in the month of.....]
most languages tend to restrict the possible combinations of phonemes - 'ngees' is an impossible construction in the English language + when you see 'sh...p' you'll assume 'sheep' not 'shoop'
meaning [semantics] + context [speaker identity / subject of convo = allows you to infer what words might come next]