Portuguese Unisyn Lexicon (LUPo): An Accent-Independent Pronunciation Lexicon for Portuguese

Participating Institutions

Instituto de Linguística Teórica e Computacional (ILTEC)
Universidade Federal do Rio de Janeiro (UFRJ)
Universidade Federal de São Paulo (UFSP)
Fundação para a Ciência e a Tecnologia (FCT)

Project Duration

3 years (March 2010 through February 2013)

Research Coordinator and Primary Investigator

Simone Ashby (ILTEC)

Research Partners

Sílvia Brandão (Universidade Federal do Rio de Janeiro)
João Antônio de Morais (Universidade Federal do Rio de Janeiro)
Mário Eduardo Viaro (Universidade Federal de São Paulo)
Margarita Correia (ILTEC)
Maria Helena Mateus (ILTEC)
José Pedro Ferreira (ILTEC)

Consultants

Susan Fitt (formerly of the University of Edinburgh)
Maarten Janssen (Universitat Pompeu Fabra, Barcelona)

Research assistants 

Successful speech technologies require the ability to account for variation in the speech signal. Most text-to-speech (TTS) systems are built using data from a single accent, usually what is considered to be the standard accent, or dialect, for a given language. While users of these technologies represent an ever-widening speaker base, developing separate lexicons to account for regional pronunciation variants is extremely costly. Semi-automatic approaches for exploiting regularities between graphemes and phones have yielded good results. However, such systems rarely extend to multiple accents and make limited or no use of morphology. Moreover, these projects typically occur in isolation and are governed by private sector interests that prohibit the sharing of data and tools.

The Portuguese Unisyn Lexicon project (LUPo) is dedicated to delivering an accent-independent lexicon and rule system for generating accent-specific pronunciations in Portuguese. In consultation with Susan Fitt, author and developer of the Unisyn Lexicon for English, we will reformulate the methodologies she originally employed in order to adapt this largely successful paradigm to Portuguese. We will also take advantage of the MorDebe database's relational structure and rich lexicographic content to minimize confusability and create a more integrated and well-informed system. Our model will capitalize on having direct access to mappings of European and Brazilian Portuguese spelling variants, part-of-speech information, etymological relationships, and a morphological parser.

The end product will be a set of open-source tools for generating accent-specific output for individual lexical entries, along with the ability to produce transcriptions for multi-word texts. Pronunciation models will be included for the European and Brazilian Portuguese standards, plus eight or more actual spoken accents representing the continents of Africa, Asia, Europe, and South America. All deliverables, including cross-dialectal data, phonetic transcriptions, the master lexicon, allophonic rules, and tools, will be documented and made freely available to the research community and general public via the Portal da Língua Portuguesa knowledge base. The ‘Portal’ currently receives 4000-4500 hits from unique users per day and is increasingly regarded as a standard resource for inquiries about the Portuguese language. Including LUPo will greatly enhance the Portal as a pan-Lusophone resource: it will be the first online resource to provide high-quality phonetic transcription data for regional variants of Portuguese, and the only one of its kind to offer richly detailed and varied phonetic output for a large number of Portuguese accents.

Research partners for this project include specialists from Brazil and Europe, representing the fields of phonetics, computational linguistics, lexicography, and Portuguese phonology, morphology, dialectology, and sociolinguistics.

The LUPo project will produce an accent-independent pronunciation lexicon for Portuguese, along with tools for generating accent-specific output for lexical entries and multi-word texts. The proposed tools will feature an interactive mode in which the post-lexical rules used to derive accent-specific transforms are displayed in the output. Users will have the option of accessing the open-source lexicon and tools either as a standalone application or via the Portal da Língua Portuguesa online knowledge base. The ‘Portal’ module will be accessible as part of the page view for each lexical entry, wherein the user can select a desired accent to view the corresponding transcription for a given word. Online and offline users will also have access to a tool for inputting a fixed amount of text, selecting a desired accent, and generating multi-word transcribed output for that accent, while also having the option to show the rules where they apply. The complete software package will contain documentation in Portuguese and English and be subject to regular updates as improvements are made to the lexicon and tools, and new pronunciation models are introduced. While the aims of the current project will be achieved over a span of three years, ILTEC is committed to ensuring that LUPo continues to evolve as a software application and scholarly database.
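As a rough illustration of the interactive mode described above, the following Python sketch applies an ordered list of post-lexical rules to a master-lexicon form and records which rules fired, so that the trace can be shown next to the output. Everything in it is an assumption made for illustration: the key symbols `{S}` and `+`, the rule names, the accent labels, and the function names are hypothetical and are not taken from LUPo or Unisyn.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str            # label displayed in the rule trace
    pattern: str         # regular expression over key symbols
    replacement: str     # accent-specific output string
    accents: frozenset   # accent labels the rule applies to


# Hypothetical toy rules; real rules would be far richer and carefully ordered.
RULES = [
    Rule("coda-S-palatal", r"\{S\}", "S", frozenset({"lisbon"})),
    Rule("coda-S-alveolar", r"\{S\}", "s", frozenset({"sao_paulo"})),
    Rule("strip-morph-boundary", r"\+", "", frozenset({"lisbon", "sao_paulo"})),
]


def transcribe(master_form, accent, rules=RULES):
    """Return (accent-specific form, names of the rules that changed the form)."""
    form = master_form
    trace = []
    for rule in rules:
        if accent not in rule.accents:
            continue
        new_form = re.sub(rule.pattern, rule.replacement, form)
        if new_form != form:
            trace.append(rule.name)
            form = new_form
    return form, trace


def transcribe_text(master_forms, accent):
    """Transcribe a sequence of master-lexicon forms (a multi-word text)."""
    return [transcribe(form, accent) for form in master_forms]


print(transcribe("pa{S}ta", "lisbon"))
# ('paSta', ['coda-S-palatal'])
print(transcribe("kaz+a{S}", "sao_paulo"))
# ('kazas', ['coda-S-alveolar', 'strip-morph-boundary'])
```

Returning the trace alongside the transcription keeps the "show rules" option a pure display decision: the same call serves both the plain and the interactive view.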

In keeping with these development initiatives, our objectives for the project are as follows:

Create an accent-independent ‘master’ lexicon for Portuguese using key symbols (i.e. an extended set of X-SAMPA-based typographical symbols that account for morphological boundaries and other key phenomena, used for encoding lexical entries and processing conversions).
Use a knowledge-driven approach to create a system of post-lexical morpho-phonological rules for processing conversions from the master lexicon to accent-specific targets.
Develop tools for automatically generating accent-specific output for individual lexical entries and multi-word texts.
Establish a regional accent hierarchy for Portuguese (including the fields COUNTRY, REGION, TOWN, and PERSON) for specifying which rules apply to one or more accents, along with the ability to create default inheritances for all the sub-nodes of a large geographic area, and a system for overriding these values.
Create pronunciation models for: standard Brazilian Portuguese (BP) and European Portuguese (EP); the Lisbon accent and at least one additional EP accent; the two major BP accents, as they are actually spoken in Rio de Janeiro and São Paulo, plus one or more other accents specific to Brazil; and three or more accents from the continents of Africa and Asia. In short, we aim to describe as many Portuguese accents as we have the time and resources to adequately model.
Enhance MorDebe by introducing richly detailed and varied phonetic content, and open up the Portal da Língua Portuguesa knowledge base to a much wider audience, while reinforcing connections across the pan-Lusophone community.
Provide the research community and general public with a freely available electronic data standard for: testing the results of different speech processing systems, conducting empirical analyses across multiple Portuguese accents, and facilitating L2 studies of Portuguese.
Facilitate the entry of lesser-documented or undocumented regional variants into the digital domain.
Establish the basis for a subsequent project aimed at developing a TTS module for inclusion in the Portal and as a freely available, open-source standalone application.
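
The regional accent hierarchy named in the objectives above (COUNTRY, REGION, TOWN, PERSON, with default inheritance for sub-nodes and a way to override inherited values) could be modelled along the following lines. This is only a minimal sketch under stated assumptions: the class name, the method name, the node names, and the rule name are all hypothetical, and the project's actual data model may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AccentNode:
    name: str
    level: str                               # "COUNTRY", "REGION", "TOWN" or "PERSON"
    parent: Optional["AccentNode"] = None
    # Rule settings stated explicitly at this node; anything else is inherited.
    overrides: Dict[str, bool] = field(default_factory=dict)

    def rule_applies(self, rule_name: str, default: bool = False) -> bool:
        """Walk up the hierarchy and return the nearest explicit setting."""
        node = self
        while node is not None:
            if rule_name in node.overrides:
                return node.overrides[rule_name]
            node = node.parent
        return default


# Hypothetical nodes: the country sets a default, one town overrides it.
country = AccentNode("country_a", "COUNTRY", overrides={"coda-S-palatal": True})
region = AccentNode("region_a", "REGION", parent=country)
town = AccentNode("town_a", "TOWN", parent=region, overrides={"coda-S-palatal": False})
person = AccentNode("speaker_1", "PERSON", parent=town)

print(region.rule_applies("coda-S-palatal"))  # True  (inherited from the country)
print(town.rule_applies("coda-S-palatal"))    # False (overridden at the town)
print(person.rule_applies("coda-S-palatal"))  # False (inherited from the town)
```

Storing only explicit settings at each node means a rule declared once for a large geographic area reaches every sub-node automatically, while a single local entry is enough to override it.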