Computational Linguistics, or Language Technology, is an interdisciplinary field dealing with the computational modeling of natural language. Research is driven both by the theoretical goal of understanding human language processing and by practical applications involving natural language processing, such as systems for automatic translation, information retrieval and human-computer dialogue.
The Computational Linguistics group at Uppsala University has a strongly empirical orientation, emphasizing multilingual systems, especially machine translation, and systems for grammatical analysis of text, in particular dependency-based parsing. Another focus area is digital humanities, with projects on handwritten text recognition, historical text processing, and historical ciphers. The group has been involved in the development of a number of tools and resources, such as MaltParser (data-driven dependency parser), Uplug (toolbox for parallel corpus alignment), Swedish Treebank (syntactically annotated corpus), and OPUS (multilingual parallel corpus).
The group's projects, tools, and resources are listed below.
SWE-CLARIN is the Swedish branch of CLARIN, the Common Language Resources and Technology Infrastructure, which provides easy and sustainable access for scholars in the humanities and social sciences to digital language data (in written, spoken, video or multimodal form), and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located.
Speaking to One’s Superiors: Petitions as cultural heritage and sources of knowledge. The purpose of this project is to enhance accessibility to and knowledge of a historical source – petitions – used relatively little in Sweden, and to use this source to answer questions about people’s ways of supporting themselves and claiming rights in the past.
Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective.
Diskursorienterad statistisk maskinöversättning (Discourse-Oriented Statistical Machine Translation) is a project whose aim is to develop translation models that use discourse-wide contextual information to achieve fluent and coherent translations across sentence boundaries.
Efficient Algorithms for Natural Language Processing Beyond Sentence Boundaries is a sub-project of eSSENCE - The e-Science Collaboration. The goal of the project is to develop efficient algorithms to allow the integration of wide contextual information for improved quality in language technology applications, such as machine translation.
From Close Reading to Distant Reading (Swedish only) is a project that develops methods for textual analysis of the literary cultural heritage.
PARSEME is an EU COST Action devoted to the role of multiword expressions in parsing. It gathers interdisciplinary experts from 29 countries, representing 28 languages and 6 dialects from 9 language families. It covers a number of different parsing frameworks, both grammar-based and data-driven, and a number of different language technology applications, such as machine translation and information retrieval.
q2b (From Quill to Bytes) sets out to develop automatic tools for data mining, transcription (OCR, handwritten text recognition), and further linguistic analysis applied to historical manuscripts, based on methods from image analysis and computational linguistics. (Hosted by the Department of Information Technology.)
Resources and Tools
Software and linguistic data (different types of corpora) are indispensable for research, application development, and teaching in Computational Linguistics and Language Technology. Below are several freely available resources developed by researchers in the department (in collaboration with researchers at other institutions):
- MaltParser is a system for data-driven dependency parsing.
- Swedish Treebank is a Swedish corpus with syntactic annotation; OPUS is a multilingual parallel corpus; and Universal Dependencies is a multilingual corpus with syntactic annotation.
- Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora.
- Riksdagsord is a lexicon of word usage in the Swedish parliament.
- Swegram is a tool for automatic linguistic analysis of Swedish texts.