Invented by Ozlem Kalinli-Akbacak, Sony Interactive Entertainment Inc

The market for speech recognition systems that use machine learning to classify phone posterior context and estimate boundaries is experiencing significant growth and innovation. This technology has the potential to change the way we interact with our devices and the world around us.

Speech recognition systems have been around for several decades, but recent advances in machine learning and artificial intelligence have propelled the technology to new heights. Traditional speech recognition systems relied on rule-based algorithms that matched speech patterns to predefined templates. These systems often struggled with accuracy and were limited in their ability to adapt to different contexts and accents.

The introduction of machine learning techniques has transformed speech recognition by allowing systems to learn and improve over time. By training on vast amounts of data, a system can now recognize and understand speech patterns with remarkable accuracy. This has opened up a range of applications such as virtual assistants, transcription services, and voice-controlled devices.

One particular area of focus within the speech recognition market is the classification of phone posterior context and the estimation of boundaries. The posterior context refers to the sounds that occur before and after a particular phoneme or speech sound; by accurately classifying it, a speech recognition system can better understand the intended meaning and context of the speech. Estimating boundaries, the points where one word or phrase ends and another begins, is another crucial aspect of speech recognition and is essential for transcription services, voice-controlled devices, and any application that relies on understanding spoken language.

The market for speech recognition systems that incorporate machine learning to classify phone posterior context and estimate boundaries is growing rapidly. Companies are investing heavily in research and development to improve the accuracy and performance of these systems, and advances in hardware, such as powerful processors and cloud computing, have made it possible to deploy them at scale.

The potential applications for this technology are broad. Virtual assistants such as Siri, Alexa, and Google Assistant already use speech recognition to understand and respond to user commands, and transcription services benefit from accurate and efficient speech-to-text conversion, which is in high demand. Industries such as healthcare, customer service, and automotive are also exploring the integration of speech recognition into their products and services. In healthcare, speech recognition can be used to transcribe doctor-patient conversations, enabling more accurate and efficient medical documentation. In customer service, it can enhance call center operations by automating certain tasks and improving the accuracy of voice-based interactions. In the automotive industry, it can enable hands-free operation of in-car entertainment and navigation systems.

As the market for speech recognition systems that use machine learning to classify phone posterior context and estimate boundaries continues to grow, the technology is expected to become even more accurate, efficient, and versatile.
The demand for voice-controlled devices and services is on the rise, and companies that can provide reliable and high-performing speech recognition systems will have a competitive advantage. In conclusion, the market for speech recognition systems that incorporate machine learning to classify the phone’s posterior context and estimate boundaries is experiencing significant growth and innovation. This technology has the potential to revolutionize various industries and improve the way we interact with our devices and the world around us. As advancements continue to be made, we can expect even more accurate and efficient speech recognition systems in the future.

The Sony Interactive Entertainment Inc invention works as follows

A speech-recognition system includes a boundary classifier and a phone classifier. The phone classifier generates combined boundary posteriors using a combination of auditory attention features, phone posteriors, and a machine-learning algorithm that classifies phone posterior context. The combined boundary posteriors are used by the boundary classifier to estimate boundaries in speech in the audio signal.

Background for "A speech recognition system that uses machine learning to classify the phone's posterior context and estimate boundaries from combined boundary posteriors".

Segmenting continuous speech into phonetic segments can be beneficial for many applications, including speech analysis, automatic speech recognition (ASR), and speech synthesis. Manually determining phonetic transcriptions and segmentations, for example, requires expert knowledge and is expensive and time-consuming for large databases. Many automatic segmentation and labeling methods have been proposed to solve this problem.

Proposed methods include the following:

[1] S. Dusan and L. Rabiner, "On the relationship between maximum spectral shift positions and phone boundaries," in Proc., 2006 (hereinafter "Reference [1]").

[2] Qiao, N. Shimomura, and N. Minematsu, "Unsupervised optimal phoneme segmentation: Objectives, algorithm and comparisons," in Proc., 2008 (hereinafter "Reference [2]").

[3] F. Brugnara, D. Falavigna, and M. Omologo, "Automatic segmentation and labeling of speech based on hidden Markov models," Speech Communication, vol. 12, no. 4, pp. 357-370, 1993 (hereinafter "Reference [3]").

[4] A. Sethy, S. S. Narayanan, and J. A. Y. Chen, "Refined speech segmentation using concatenative speech synthesis," in Proc. ICSLP, 2002 (hereinafter "Reference [4]").

These proposed methods correspond to references [1], [2], [3], [4], and [5] cited in the paper entitled "Automatic Phoneme Segmentation Using Auditory Attention Features".

A first group of proposed segmentation methods requires transcriptions, which are not always available. When the transcription is unavailable, a phoneme recognizer can be used for segmentation; however, HMMs, which are designed to identify the phone sequence correctly, cannot accurately place phone boundaries (see Reference [4]). Another group of methods does not require prior knowledge of the transcription or of acoustic phoneme models, but their performance is usually limited.

It is within this context that the present disclosure arises.

Although this detailed description includes many specific details for purposes of illustration, any person of ordinary skill in the art will appreciate that many variations and alterations of the details are possible. The exemplary embodiments are described without loss of generality to, and without imposing limitations upon, the claimed invention.

Introduction

Boundary-detection methods using auditory attention features have been proposed. Phoneme posteriors can be combined with auditory attention features to further improve boundary detection accuracy. Phoneme posteriors are obtained by training a model (for example, a deep neural network) that estimates phoneme class posterior scores given acoustic features (MFCCs, mel filterbank energies, etc.). This information is very helpful for boundary detection, so it is suggested that boundary detection performance can be improved by combining auditory attention features with phoneme posteriors. This can be done using the phoneme posteriors of the current frame; context information from neighboring frames can also help improve performance.

In the present disclosure a new segmentation method is proposed that combines phone posteriors with auditory attention features. The algorithm is accurate and does not require transcription.

A related patent application, the entire contents of which are incorporated by reference herein, describes how phoneme posteriors can be combined with auditory attention features to improve boundary detection accuracy. Phoneme posteriors can be obtained by training a model that estimates phoneme class posterior scores given acoustic features (MFCCs, mel filterbank energies, etc.). Around a boundary, the phoneme classification accuracy of such models drops because the posteriors become harder to distinguish, whereas in the middle of a phoneme segment there is a clear winner. This information is very helpful for boundary detection. It is therefore proposed here that boundary detection performance can be improved by combining auditory attention features with phoneme posteriors. This can be done using the phoneme posteriors of a frame; the context information of neighboring frames can also help improve performance.
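To make the phone-posterior step concrete, the sketch below computes per-frame phoneme-class posteriors with a small feed-forward network whose weights are assumed to have been trained offline on labeled acoustic features. The ReLU hidden layers, the +/-5 frame context window, and the helper names are illustrative assumptions, not the exact configuration used in the disclosure.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phone_posteriors(mfcc, weights, context=5):
    """Per-frame phone-class posteriors from a trained feed-forward network.

    mfcc    : (frames, n_mfcc) acoustic feature vectors
    weights : list of (W, b) pairs for each layer, trained offline (assumed)
    context : +/- frames stacked around the current frame, so the network
              sees a 2*context+1 frame window (an assumed, common choice)
    returns : (frames, n_phones) posterior scores, one row per frame
    """
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    windows = np.stack([padded[t:t + 2 * context + 1].ravel()
                        for t in range(len(mfcc))])      # stacked context windows
    h = windows
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)                   # ReLU hidden layers
    W_out, b_out = weights[-1]
    return softmax(h @ W_out + b_out)                    # phone posteriors per frame
```

Near a phoneme boundary, the rows returned by such a network tend to be flatter (no dominant class), which is exactly the cue the boundary detector can exploit.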

Discussion

In certain aspects of the disclosure, a signal corresponding to recorded audio can be analyzed to determine boundaries such as phoneme boundaries. Boundary detection can be achieved by extracting auditory attention features and phoneme posteriors and combining them to detect boundaries within the signal. The discussion below first describes how auditory attention features are extracted, then describes phone posterior extraction, and finally discusses two approaches for combining phoneme posteriors with auditory attention features for boundary detection.
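As a rough illustration of one possible combination strategy, the sketch below concatenates, for each frame, an auditory-attention gist vector with the phone posteriors of the current frame and a few neighboring frames, producing the input a boundary classifier would score. The feature-level concatenation, the +/-2 frame context, and the function names are assumptions for illustration and may not match either of the two approaches discussed in the disclosure.

```python
import numpy as np

def combined_boundary_features(attention_gist, phone_post, context=2):
    """Feature-level combination of the two cues.

    attention_gist : (frames, d) auditory-attention gist vectors, one per frame
    phone_post     : (frames, n_phones) phoneme posteriors, one row per frame
    context        : +/- neighboring frames whose posteriors are appended, so
                     boundary-related ambiguity in the posteriors is visible
    returns        : (frames, d + (2*context+1)*n_phones) combined features
    """
    frames = len(attention_gist)
    padded = np.pad(phone_post, ((context, context), (0, 0)), mode='edge')
    stacked = np.stack([padded[t:t + 2 * context + 1].ravel()
                        for t in range(frames)])        # posteriors with context
    return np.hstack([attention_gist, stacked])

# A boundary classifier (e.g. another neural network) would then map each
# combined feature row to a boundary/no-boundary posterior; peaks in that
# score over time give the estimated phoneme boundaries.
```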

In the disclosure presented here, a novel phoneme segmentation method is proposed that utilizes auditory attention cues. The motivation for the proposed method, which is not limited to any particular theory of operation, is as follows. In a speech spectrum, edges and discontinuities are usually visible around phoneme boundaries, especially around vowels, which have high formant energy. For example, in the spectrum of a speech segment transcribed as "his captain was" (an example also shown in the paper "Automatic Phoneme Segmentation Using Auditory Attention Features"), the approximate phoneme boundaries can be seen along with the spectrum, such as the boundaries around the vowels ih, ae, and ix. It is therefore believed that the auditory spectrum can be used to detect the edges and discontinuities that mark phoneme boundaries, much as one can visually locate phoneme segments or boundaries in speech.

Auditory Attention Features

Auditory attention cues are extracted in a way inspired by the stages of processing in the human auditory system. The sound spectrum is filtered using 2D spectro-temporal filters modeled on the stages of the central auditory system and then converted into low-level auditory features. The auditory attention model differs from previous work in the literature because it analyzes the 2D spectrum like an image to detect edges and local temporal or spectral discontinuities, which correspond to boundaries in the speech.

The auditory attention model treats the spectrum as if it were a visual image. Contrast features are extracted at multiple scales from the spectrum using 2D spectro-temporal receptive filters. The extracted features can be tuned to different locally oriented edges: for example, frequency-contrast features can be tuned to local horizontally oriented edges, which are useful for detecting formants and formant changes. Low-level auditory gist features can then be extracted, and a neural network can be used to discover the relevant oriented edges and learn the mapping between gist features and phoneme boundaries.
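A minimal sketch of oriented spectro-temporal filtering is shown below, using Gabor-like kernels as a stand-in for the receptive filters described above; the kernel size, wavelength, and the four chosen orientations are illustrative assumptions, not values taken from the disclosure.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_kernel(theta, size=9, wavelength=4.0, sigma=2.0):
    """2D Gabor-like kernel tuned to edges at orientation `theta` (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    kernel = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return kernel - kernel.mean()      # zero mean, so flat regions give no response

def oriented_contrast(spectrum, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter a 2D auditory spectrum (rows = frequency bands, columns = frames)
    with oriented kernels. With this parametrization, theta=0 varies along time
    and responds to onsets/offsets (temporal contrast), while theta=pi/2 varies
    along frequency and responds to horizontal structure such as formant bands
    (frequency contrast)."""
    return {theta: np.abs(convolve2d(spectrum, gabor_kernel(theta), mode='same'))
            for theta in thetas}
```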

The following steps can be taken to extract auditory attention cues from a speech input signal. First, the spectrum is computed using either an early auditory model or a fast Fourier transform (FFT). Multi-scale features are then extracted, modeled on the central auditory system. Next, center-surround differences are computed by comparing finer and coarser scales. Auditory gist features are then calculated by dividing each feature map into an m-by-n grid and computing the mean of each sub-region. Finally, the dimensionality and redundancy of the gist features are reduced using, for example, principal component analysis (PCA) or the discrete cosine transform (DCT), producing the final features, referred to herein as the auditory gist.
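Assuming Gaussian smoothing at two scales as a stand-in for the finer-versus-coarser comparison, mean pooling over a grid for the gist, and a DCT for the final reduction, the last three steps above might be sketched as follows; the grid size, smoothing scales, and number of retained coefficients are arbitrary illustrative choices.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.ndimage import gaussian_filter

def center_surround(feature_map, center_sigma=1.0, surround_sigma=4.0):
    """Center-surround contrast: the difference between a finer ('center') and
    a coarser ('surround') smoothing of the same feature map, standing in for
    the finer-vs-coarser scale comparison described above."""
    center = gaussian_filter(feature_map, center_sigma)
    surround = gaussian_filter(feature_map, surround_sigma)
    return np.abs(center - surround)

def auditory_gist(feature_map, m=4, n=8, n_coeffs=16):
    """Auditory 'gist': the mean of each cell of an m-by-n grid laid over the
    feature map, followed by a DCT that keeps only low-order coefficients to
    reduce dimensionality and redundancy (PCA would be an alternative)."""
    rows = np.array_split(feature_map, m, axis=0)
    grid = np.array([[cell.mean() for cell in np.array_split(row, n, axis=1)]
                     for row in rows])                # (m, n) grid of cell means
    coeffs = dct(grid.ravel(), norm='ortho')          # decorrelate the grid means
    return coeffs[:n_coeffs]                          # final gist features
```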

A block diagram of the attention model and a flow diagram for feature extraction are described in patent application Ser. No. 13/078,866 and shown in FIG. 1A. According to aspects of this disclosure, FIG. 1A shows a method that uses auditory attention cues for syllable/vowel/phone boundary detection in speech. The auditory attention model is biologically inspired and mimics the processing stages of the human auditory system; it is designed to determine where and when sound signals are likely to attract human attention.

First, an input window of sound 101 is received. This input window may be captured using a microphone that converts the acoustic waveform within the window into an electrical signal. The input window of sound 101 can be any segment of speech and can contain, by way of example and without limitation, a single word, syllable, or sentence.

The input window of sound 101 is passed through a series of processing stages 103 that convert the window into an auditory spectrum. These stages may be based on the early stages of an auditory system, such as the human auditory system. The processing stages 103 can, for example, consist of cochlear filtering, inner hair cell, and lateral inhibition stages, which mimic the auditory system's processing from the basilar membrane to the cochlear nucleus. The cochlear filtering can be implemented by a bank of 128 overlapping constant-Q asymmetric band-pass filters with center frequencies uniformly distributed on a logarithmic frequency scale. These filters can be implemented with electronic hardware configured for the filtering task, or with a general-purpose computer running software that performs the filter functions. For analysis, 20 ms frames of audio with a 10 ms shift can be used, so that each frame of audio is represented by a 128-dimensional vector.
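A simplified software stand-in for this front end is sketched below: the audio is analyzed in 20 ms frames with a 10 ms shift via an FFT and pooled by a bank of 128 band-pass filters with log-spaced center frequencies, yielding one 128-dimensional spectral vector per frame. The triangular filter shape (rather than the constant-Q asymmetric cochlear filters) and the chosen frequency range are simplifying assumptions.

```python
import numpy as np
from scipy.signal import stft

def auditory_spectrum(x, sr, n_bands=128, fmin=80.0, fmax=None):
    """Approximate 128-band auditory spectrum: 20 ms frames, 10 ms shift,
    triangular band-pass filters with log-spaced center frequencies
    (a simplified stand-in for the constant-Q asymmetric cochlear filters)."""
    fmax = fmax or sr / 2
    nperseg = int(0.020 * sr)                 # 20 ms analysis window
    noverlap = nperseg - int(0.010 * sr)      # 10 ms frame shift
    f, t, Z = stft(x, fs=sr, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(Z) ** 2                    # (freq_bins, frames)

    # Log-spaced band edges: one center per band plus the two outer edges.
    edges = np.geomspace(fmin, fmax, n_bands + 2)
    fb = np.zeros((n_bands, len(f)))
    for b in range(n_bands):
        lo, ctr, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (f - lo) / (ctr - lo)
        falling = (hi - f) / (hi - ctr)
        fb[b] = np.clip(np.minimum(rising, falling), 0.0, None)

    return fb @ power                         # (128, frames): one vector per frame
```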

The central auditory system can be simulated by analyzing the auditory spectrum and extracting multi-scale features 117, as indicated at 107. Auditory attention can be captured by, or voluntarily directed to, a variety of acoustic characteristics such as intensity (or energy), frequency, temporal characteristics, pitch, and timbre (referred to as "orientation" here). These features can be implemented to mimic the receptive fields in the primary auditory cortex.
