BME HLT | Digital language description

Digital language description

Eötvös Loránd University, 2014/2015 spring

Since contemporary language technology is almost exclusively based on machine learning techniques that require significant amounts of training data, modern language documentation focuses overwhelmingly on the collection and annotation of corpora, both written and spoken. In many cases, the mainstays of traditional fieldwork, namely systematic elicitation and the field linguist actually learning the language, are simply not feasible. In this course your mission, should you choose to accept it, is to produce large corpora for languages such as Amharic or Guarani, about which you presumably know very little. You will have no access to native or near-native speakers, translators, or even linguists familiar with the language, but you are free to use any resource you can identify, very much including traditional descriptive grammars of the language, dictionaries, encyclopedias, Wikipedias, etc. (except that you cannot type, scan, or otherwise make significant amounts of copyrighted material part of your corpus). You will learn how to (i) identify and use standard online resources; (ii) build basic vocabulary; (iii) reliably identify online texts written in your target language; (iv) roll your own basic tools such as tokenizers, stemmers, and taggers; (v) crawl the web for relevant monolingual data and parallel texts; (vi) automatically extract rough English word senses from parallel corpora and even build a crude machine translator. Particular attention will be paid to the often very complex dialectal and sociolectal variation we find in many of these languages, and to the often still emerging literary standards governing spelling variation. The course grade will be based on the quality of the tools and the size and quality of the language resources you build.
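To give a flavor of what "rolling your own" basic tools in (ii) and (iv) might look like, here is a minimal sketch of a script-agnostic tokenizer and a frequency-ranked word list in Python. It is an illustration only, not the course's prescribed method. The names `tokenize` and `frequency_list` are hypothetical, and the rules are assumptions that would need checking against the target language: a token is taken to be a run of letters or digits in any script, each optionally followed by combining marks (so that a Guarani g with a combining tilde stays in one piece), with word-internal apostrophes and hyphens kept (so that a Guarani form such as ka'a is not split at the glottal stop).

```python
import re
from collections import Counter

# One "unit" is a letter or digit in any script, followed by any number of
# combining diacritics (U+0300 to U+036F). This is an assumed rule.
UNIT = r"[^\W_][\u0300-\u036f]*"

# A token is one or more units; an apostrophe (straight or curly) or a hyphen
# is kept only when it sits between two runs of units.
TOKEN_RE = re.compile(rf"(?:{UNIT})+(?:['\u2019\-](?:{UNIT})+)*")


def tokenize(text):
    """Return the lowercased tokens of `text`, dropping punctuation."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def frequency_list(lines):
    """Count tokens over an iterable of lines, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts.most_common()


if __name__ == "__main__":
    sample = ["Ka'a he'\u1ebd, ka'a.", "ka'a"]
    for word, count in frequency_list(sample):
        print(count, word)
```

Because the pattern relies on Unicode character classes rather than a fixed alphabet, it also separates Ethiopic text at the traditional word separator (U+1361), which is punctuation rather than a letter. Real corpora will still need language-specific decisions, for example how to treat spelling variants, that a sketch like this does not make.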