Corpora are often used in language teaching and learning. They provide information about how a language works and make it possible to compute the relative frequency of different features. Exploring corpora can help students observe subtle differences in usage and make comparisons between languages. Corpora are also used to investigate cultural habits expressed through language. Note, however, that a corpus cannot tell us whether something is possible or not, only whether it is frequent or not. Corpora are also used in translation. Comparable corpora make it possible to compare the use of apparent equivalents. Parallel corpora let us see how words and phrases have been translated in the past. General corpora can be used to establish norms of frequency and usage. It can therefore be claimed that corpora are beneficial for translation studies. Let’s find out why!
What is a corpus?
A corpus is defined in terms of both its form and its purpose. The word corpus is used to describe a collection of examples of language gathered for linguistic study. It can also describe collections of texts stored and accessed electronically (Hunston 2002). The planning and design of a corpus depend on the linguistic purpose it is meant to serve. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively.
What can a corpus do?
Corpus access software is used to reorganize the stored information so that observations of different kinds can be made. It is not the corpus itself that gives new information about language; it is the software that gives new perspectives on what is already familiar. Software packages process the data to show frequency, phraseology and collocation.
Corpus processing allows words to be compared by means of frequency lists. Grammar words (such as the, of, and) are more frequent than lexical words, which explains why they are found at the top of the list. Frequency lists can be useful for identifying differences between corpora. However, raw frequencies can be compared only if the corpora are comparable, i.e. if their lengths are approximately the same.
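As an illustration of a frequency list, here is a minimal Python sketch. The tokenizer is deliberately naive, and the function names (tokenize, frequency_list, per_thousand) are invented for this example rather than taken from any corpus package. The per_thousand helper is an addition not discussed in the source: it shows one common way of expressing counts relative to corpus size when two corpora differ in length.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text, top=None):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokenize(text)).most_common(top)


def per_thousand(count, corpus_size):
    """Express a raw count as occurrences per 1,000 tokens."""
    return 1000 * count / corpus_size


sample = "The cat sat on the mat. The dog sat on the log."
for word, count in frequency_list(sample, top=3):
    size = len(tokenize(sample))
    print(word, count, round(per_thousand(count, size), 1))
```

Even in this tiny sample the grammar word "the" comes out at the top of the list.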
The most common way to access a corpus is through a concordancing program. Concordance lines bring together instances of use of a word or phrase, so that regularities in use can be observed. Concordances also help us understand how particular nouns or adjectives are used.
Collocation is the tendency of words to co-occur. The collocates of a given word are those words which often occur in conjunction with it. Collocation can refer to pairs of lexical items, or to the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
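To make the idea of concordance lines concrete, the following Python sketch produces a simple keyword-in-context display. It is only an approximation of what a concordancing program does; the function name concordance and the width parameter are assumptions made for this example.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in a column."""
    # Collapse line breaks and repeated spaces so each hit stays on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts at the same column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


text = "She shed tears. The snake shed its skin."
for line in concordance(text, "shed", width=10):
    print(line)
```

Aligning the keyword in a central column is what lets the eye pick out recurring words to its left and right.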
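A first approximation to finding collocates is simply to count the words that appear within a few tokens of a given node word. The Python sketch below does this; the function name collocates and the span parameter are assumptions made for this example, and real tools would normally go on to apply a statistical test to the counts.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


tokens = "she shed tears while he shed light on the matter".split()
print(collocates(tokens, "shed", span=1).most_common())
```

With span=1 only the immediate neighbours are counted; widening the span brings in looser associations.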
Types of corpora
A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose: specialized corpus, general corpus, comparable corpora, parallel corpora, learner corpus, historical or diachronic corpus, and monitor corpus.
Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. You may also restrict the corpus to a time frame, a social setting or a given topic.
General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.
Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportions of text types (e.g. newspaper texts, essays, novels, conversations). They can be used by translators and learners to identify differences and equivalences in each language.
Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.
Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, frequency and type of mistakes, etc.
Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time.
Monitor corpus: a corpus used to track current changes in a language. It rapidly increases in size, since it is added to annually, monthly or daily. The proportions of text types have to remain constant, so that each year is comparable with every other.
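As a small illustration of how a parallel corpus can be consulted, the Python sketch below searches sentence-aligned English and Italian pairs for a word and returns each hit with its translation. The three sentence pairs are invented toy data, not extracts from a real corpus, and the names ALIGNED and find_translations are assumptions made for this example.

```python
import re

# Invented toy data: (English, Italian) sentences aligned one-to-one.
ALIGNED = [
    ("She shed tears at the news.", "Ha versato lacrime alla notizia."),
    ("The report sheds light on the problem.", "Il rapporto fa luce sul problema."),
    ("The snake shed its skin.", "Il serpente ha cambiato pelle."),
]


def find_translations(pairs, word):
    """Return the aligned pairs whose source side contains `word`.

    The match is a crude prefix match, so "shed" also finds "sheds".
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\w*", re.IGNORECASE)
    return [(source, target) for source, target in pairs if pattern.search(source)]


for source, target in find_translations(ALIGNED, "shed"):
    print(source, "=>", target)
```

Reading the hits side by side shows how one source word can correspond to several different target expressions.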
The use of corpora is not limited to identifying, quantifying and analyzing keywords. Concordance lines offer many instances of use of a word or phrase, so that the user can observe regularities in use across several examples of the same word or phrase in its natural context.
Calculating collocation means measuring the statistical tendency of words to co-occur. Collocations can also highlight metaphorical uses. A good example is the set of collocates of the word shed: light, tears, blood, pounds, confidence, hair, skin, labor. In all of these combinations shed is a verb, but its Italian equivalent may vary depending on the collocate.
A lemma groups together the different forms of a word. Usually, word-forms are considered to belong to the same lemma when they belong to the same word-class (verb, noun, adjective, etc.). For instance, work, works, worked and working are all forms of the verb lemma work.
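The source does not name a particular statistic, so the following Python sketch uses one widely used option as an assumption: a simplified mutual information (MI) score, computed as log2(O × N / (f(node) × f(collocate))), where O is the number of co-occurrences within the span and N is the corpus size in tokens. Tools differ in the exact formula (for example, in whether the span size is factored in), and the function name mi_score is invented for this example.

```python
import math
from collections import Counter


def mi_score(tokens, node, collocate, span=4):
    """Simplified MI score: log2(observed * N / (f(node) * f(collocate)))."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = 0
    for i, token in enumerate(tokens):
        if token == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed += window.count(collocate)
    if observed == 0:
        # The pair never co-occurs within the span.
        return float("-inf")
    return math.log2(observed * n / (freq[node] * freq[collocate]))


tokens = "he shed tears and the the the the".split()
print(mi_score(tokens, "shed", "tears", span=1))
print(mi_score(tokens, "shed", "the", span=4))
```

In this toy sample the rare word "tears" scores higher than the very frequent "the", which reflects how MI favours pairs that co-occur more often than their individual frequencies would predict.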
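The grouping of word-forms under a lemma can be sketched in Python with a small lookup table keyed by word-form and word-class. The table below is a hypothetical, hand-made fragment for the forms of work only; real lemmatizers rely on much larger dictionaries or rules, and the names LEMMAS and group_by_lemma are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical hand-made lookup: (word-form, word-class) -> lemma.
LEMMAS = {
    ("work", "VERB"): "work",
    ("works", "VERB"): "work",
    ("worked", "VERB"): "work",
    ("working", "VERB"): "work",
    ("work", "NOUN"): "work",
    ("works", "NOUN"): "work",
}


def group_by_lemma(tagged_tokens):
    """Group word-forms under (lemma, word-class); unknown forms keep their own form."""
    groups = defaultdict(list)
    for form, word_class in tagged_tokens:
        lemma = LEMMAS.get((form.lower(), word_class), form.lower())
        groups[(lemma, word_class)].append(form)
    return dict(groups)


tagged = [("worked", "VERB"), ("works", "NOUN"), ("working", "VERB")]
print(group_by_lemma(tagged))
```

Because the word-class is part of the key, the noun works and the verb forms worked and working end up under two separate lemmas.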
Tagging usually refers to the addition of a code to each word in a corpus, to indicate the part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb.
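The following Python sketch shows how tags make it possible to separate the noun work from the verb. It assumes a corpus already tagged in a simple word_TAG format; both that format and the tag names (NOUN, VERB, etc.) are illustrative assumptions, not the conventions of any particular tagger.

```python
from collections import Counter


def parse_tagged(text):
    """Split well-formed 'word_TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("_", 1)) for token in text.split()]


def count_by_tag(pairs, word):
    """Count how often `word` occurs with each part-of-speech tag."""
    return Counter(tag for form, tag in pairs if form.lower() == word)


tagged = "Her_DET work_NOUN is_VERB hard_ADJ but_CONJ they_PRON work_VERB well_ADV"
print(count_by_tag(parse_tagged(tagged), "work"))
```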
Corpus parsing is the analysis of a text into its constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus.
Just like tagging, parsing can be done automatically, though the output is not very accurate. Manual editing is often necessary.
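To give an idea of what parsed output makes possible, the Python sketch below represents one parsed sentence as a nested structure and counts its constituents. The sentence, the labels (CLAUSE, NG for nominal group, VG for verbal group) and the function name count_constituents are all assumptions made for this example; actual parsers use their own label sets and formats.

```python
from collections import Counter

# A hand-built parse: (label, child, child, ...); leaves are plain words.
tree = (
    "CLAUSE",
    ("NG", "the", "students"),
    ("VG", "are", "reading"),
    ("NG", "a", "corpus"),
)


def count_constituents(node, counts=None):
    """Walk the tree and count how often each constituent label occurs."""
    if counts is None:
        counts = Counter()
    if isinstance(node, tuple):
        counts[node[0]] += 1
        for child in node[1:]:
            count_constituents(child, counts)
    return counts


print(count_constituents(tree))
```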
Sources: Hunston, S. (2002), Corpora in Applied Linguistics; www.uniroma3.it