
Historical Linguistics Blog

Chinese/typology/grammaticalization/dialect

Wednesday, 03 November 2004

French Language Blog Archives: http://french.about.com/b/archives.htm

Posted by: zyc at 07:27 | link | comments |

Danqing Liu (Shanghai Teacher’s University / City University of Hong Kong)
Identical Topics and Topic Prominent Languages
AG8, Wednesday 17.30-18.00

This paper, based on Liu & Xu’s (1998) description, is a further exploration of the identical topic (previously referred to as copying topic) in Chinese, including Shanghainese. An identical topic (IT) is a main (sentence-initial) topic or a (post-subject) subtopic which is fully or partially identical to a corresponding element (CE) in the following part of the clause. IT exists in Old Chinese, modern Mandarin and all Chinese dialects, as well as in Tibeto-Burman languages. So it is a common phenomenon in Sino-Tibetan languages.

IT is often followed by a pause and/or a topic marker. CE can be an argument or adverbial of the predicate, the predicate itself, or a predicate in a lower-level clause. IT can be equal to CE in size, or larger or smaller than it. So we can hardly say that IT copies CE or vice versa. IT is often semantically empty. Physically IT does add an element to the clause, but in most cases it doesn’t change the argument structure at all. Nor does it bring in any semantic content for the clause. In translation IT can be deleted entirely without semantic loss. IT prefers generic or nonreferential NPs, mostly in the form of bare nouns, and unbounded VPs. Proper names and definite NPs can serve as IT only if they are contrastive topics or can be interpreted as generic or nonreferential. The constraints on definite NPs and proper names are related to the semantic emptiness of IT. Basically the use of IT is pragmatically motivated. The IT-CE structure is a device to indicate focus, especially to emphasize the predicate.

CE is usually a focus or a head with a focused complement or modifier. Since the pseudo-cleft structure doesn’t apply to the predicate, the IT-CE structure serves as a useful device for focusing a predicate in Chinese, especially in Shanghainese. The focusing function seems to arise from the IT-CE structure as a whole. Which constituent serves as IT is not important. For instance, a sentence with IT consisting of V could be semantically and pragmatically equal to a sentence with IT consisting of VO or O.

A classification of topics is proposed. There are four types: gap topic (argument topic / modifier topic / island topic), non-gap topic, sentential topic (conditional topic) and identical topic.

Among these, IT seems to be the most unlikely to have counterparts in subject-prominent languages. One can easily add a semantically empty topic to a Chinese clause because in topic-prominent languages there are syntactic positions available for topics at various syntactic levels. In subject-prominent languages the topic position is a highly marked one, which is limited to left-dislocation. In topic-prominent languages the topic is always a ready position. One can choose one or more of the arguments of the predicate to fill the topic position(s). If there are no arguments suitable to function as topics, we add an element from outside the argument structure to fill it. This is the non-gap topic. When there is neither an argument nor an external element suitable for the position, we simply add an element identical to an existing one to fill the position. Thus IT might be a better candidate for a sufficient condition to identify a topic-prominent language than the so-called "Chinese-style topic", namely, the non-gap topic.

 

Posted by: zyc at 07:26 | link | comments |

Sunday, 31 October 2004


Studying language typology by means of corpora

Lumme Erilt

University of Tartu

Department of English

lumme.erilt@helsinki.fi


Introduction


In the present paper I am going to discuss some aspects of a study in progress, namely that of my Master's thesis 'A quantitative-typological study of Old English' that I am currently writing for the English Department at the University of Tartu. To begin with, I shall refer to the theoretical framework for the study and try to explain why I became interested in diachronic typology. Secondly, I shall describe the primary source of the study, the Helsinki Corpus, and more specifically the Old English part of it. Then I shall discuss some problems that I encountered in preparing the texts in the corpus for the computation of word frequencies. Finally I shall point to the possible analysis of the data for the benefit of linguistic typology.


Theoretical framework

Borrowing a thought from Lyons (Lyons 1981:10), the aim of linguistics is to describe language competence as opposed to language performance, i.e. not all the actual sentences of a language, but all the potential ones as determined by general rules.

It seems to me that the same terminological opposition might be applied to the description of two theories of language universals that arose in the mid 1960s, namely Chomskyan Generative Grammar (Chomsky 1965) and Greenbergian Linguistic Typology (Greenberg 1966). Generative Grammar is interested in language competence in a fairly straightforward manner: relying on introspection and the speaker's intuition, it is subjective and not empirical, in the sense that it does not build on large amounts of heterogeneous data but rather on a few languages or standard dialects. Language Typology is first and foremost interested in language performance, and is based on the empirical description and comparison of a great number of languages. This theory assumes that only by generalising over large amounts of variable data is it possible to get an idea of language competence and language universals. Recent developments in corpus and computational linguistics have made the systematic analysis of large bodies of text samples easier and faster, thus enabling not only intra-lingual analysis but also inter-lingual or cross-linguistic comparison.

One of the assumptions in universalist research is the belief that all the languages of the world, ancient or modern, are intrinsically similar and exhibit a similar complexity of system that, although different on the surface, goes back to a similar deep structure. Following this principle we can compare on a par Modern High German with Modern English, and Modern High German with Old or Middle High German, as all of these are considered independent stages of languages which reveal the same basic underlying structure.


This theoretical framework forms the background of my study of Old English. Old English, also known as Anglo-Saxon, is a term used to refer to the language spoken in the British Isles from the 5th to the 12th century by three Germanic tribes: Angles, Saxons and Jutes. Although the language was not entirely homogeneous, its dialectal differences were not so great as to disturb mutual understanding.





Sources


As a primary source I have used the Old English part of the Helsinki Corpus of English Texts (Kytö 1991). This corpus contains texts from the Old, Middle and Early Modern English periods and early texts from the Scottish and American varieties of English. Recently, the Middle English part of the corpus was syntactically parsed and is now available under the title Penn-Helsinki Parsed Corpus. The morphological and syntactic analysis of the Old English part is currently being carried out in Amsterdam and York and will result, some time in the future, in the Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus. It must be stressed, though, that tagging and parsing diachronic corpora is very laborious and time-consuming, mainly because of the lack of standard spelling, and so most of the work is done practically manually.

The Old English period of the Helsinki Corpus consists of four sub-periods:

OE1 (up to year 850);

OE2 (yrs. 850-950);

OE3 (yrs. 950-1050);

OE4 (yrs. 1050-1150).

At the initial stage of the study I intended to use all the texts from all four periods, but I ended up with the periods OE2 and OE4, with 92,050 words and 67,380 words respectively. Period OE3 was discarded because it comprised a great deal of poetry with significantly different syntax and lexical content, and period OE1 because of its shortness (only 2,190 words), though a pilot study was made on it. The time gap between the periods OE2 and OE4 provided a suitable diachronic perspective and could reveal typological change, if there was any.

The texts from the two sub-periods were kept apart, while no distinction whatsoever was made between different text types and categories. One of the reasons for this decision was that, with the aim of gaining a representative picture of Old English, the compilers of the corpus have tried to include a proportionate representation of different text genres, i.e. they have tried to diminish the proportion of religious and historical texts that prevail among the texts preserved from those times. The other reason for not distinguishing between the text types was a somewhat naive hope of getting an objective picture of the language as a whole. On the basis of these texts, frequency lists were made up. At this point some important problems needed to be solved.

De-coding

Firstly, the texts in the Helsinki Corpus are preceded by 24 reference codes in COCOA format which give the textual parameters of texts or text groups (Kytö 1991:42). The codes are given in angle brackets. These codes had to be removed in order to get pure frequency lists, i.e. lists not containing these reference codes.

Secondly, inside the texts, so-called 'text level' codes had been added by the compilers of the corpus (Kytö 1991:28 ff.). These codes were designed to mark text in a foreign language, runes, emendations, editor's comments, compilers' comments, fonts other than basic fonts (e.g. italics) and headings. Foreign-language text mostly included parallel text or translation from Latin. These codes and the comments inside them were likewise excluded for the present study. The question of excluding emendations, though, is somewhat problematic, because these might have included essential vocabulary items.


For those purposes a command in a UNIX shell script was devised. This command also removed punctuation marks and converted capital letters to lower case, so that words like

dryhten

dryhten,

dryhten.

Dryhten

Dryhten,

all meaning 'the Lord' were from now onwards considered as one and the same word by the computer.
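The de-coding step described above can be sketched roughly as follows. This is an illustration in Python rather than the original shell command, which is not reproduced in the paper. It assumes that every code to be removed sits inside angle brackets (the text states this only for the reference codes; the text-level codes may use other delimiters), and the function name `decode_line` and the sample line are hypothetical.

```python
import re
import string

# Assumption: every reference code and text-level code is enclosed in angle
# brackets, e.g. "<R 1.1>"; the real code syntax is documented in the corpus
# manual and may need a richer pattern.
CODE_PATTERN = re.compile(r"<[^<>]*>")

# ASCII punctuation to strip. "+" and "&" are kept because the corpus uses
# them to encode characters (+t, +d, +a) and the sign for 'and'.
PUNCTUATION = "".join(ch for ch in string.punctuation if ch not in "+&")
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def decode_line(line):
    """Remove bracketed codes and punctuation, lower-case, split into tokens."""
    without_codes = CODE_PATTERN.sub(" ", line)
    cleaned = without_codes.translate(STRIP_TABLE).lower()
    return cleaned.split()


# Hypothetical sample line, not taken from the corpus.
sample = "<R 1.1> Dryhten, dryhten. dryhten"
print(decode_line(sample))
```

With this sketch all the spellings of dryhten listed above end up as the same token, which is the effect the original command was meant to achieve.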

Normalisation

In addition to the extra-linguistic problems mentioned so far, intra-linguistic irregularities appeared, mainly caused by the lack of a spelling standard in Old English and by the various dialects. It is well known that most of the manuscripts preserved from the Old English period are in the West-Saxon dialect (as is visible, for example, in the Toronto Corpus, which contains all the existing Old English texts; v. Healey & Venezky 1980). Yet one of the chief aims of the compilers of the Helsinki Corpus has been to include as many texts from different Old English dialects as possible and thus to get a more objective picture of the dialectal variation in Anglo-Saxon England. For the purpose of the present study, though, dialectal forms had to be standardised to a single form, and spelling irregularities, arising from various scribal traditions and in many cases from a lack of tradition, had to be normalised.

"In essence the process of normalisation," as Raymond Hickey (1994:169) argues, "consists of replacing variants of a grammatical form by a single form by external consensus, e.g. as the latter is the input to a later standard form [...]". Hickey also warns of "almost ideological dislike of normalisation, particularly on the part of medieval scholars" (ibid.), but in spite of that, he acknowledges, there are some obvious advantages. Normalisation in this particular case enabled me to obtain frequency information about distinct word forms rather than about every possible spelling. So, for example, words like

wifmann

wifmonn

wifman

wifmon

meaning 'woman', had to be considered as one and the same word, or,

swæ

suæ

swe

sue

sua

swa

all meaning 'so', had to be standardised to one form, swa.

Normalisation in this case did not mean modernisation (e.g. to Modern English), but rather standardisation. The standard or 'norm' was taken to be the West-Saxon dialect of English. In practice the normalisation procedure mostly relied on the forms given in A concise Anglo-Saxon dictionary (Clark Hall 1960) or, if a form was not included there, on An Anglo-Saxon dictionary (Bosworth & Toller 1898). In some rare cases, such as paradigms of pronouns, the forms were normalised after Old English grammar (Campbell 1959). As all these sources have adopted the West-Saxon standard, the normalisation-standardisation process meant in practice a "translation" of the texts into West-Saxon.

Besides these, the following normalisations should be mentioned:

þ and ð (coded in Helsinki Corpus as +t and +d respectively) were both changed to þ

+a in Helsinki Corpus was turned to the runic æ



ae was treated as æ

k was changed to c

& was changed to and

(Roman) numerals were written out as full words (ordinal numbers)

homonyms were as a rule not separated into discrete meanings

The normalisation was in its essence semi-automatic or computer-assisted because, for example, changing automatically all words of the shape mon would have changed also words like cumon or monandæg. It was often important to check the context of each word separately to find out the 'type' to which a variant should be normalised.
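The automatic part of such a procedure might look like the sketch below. It is only an illustration under stated assumptions: the character rules and the target form swa are the ones listed above, whereas the targets "wifmann" and "mann" and the name `normalise_token` are chosen here for the example and are not given in the paper. The manual checking of context that the study relied on is not modelled.

```python
# Character-level replacements named in the text, applied to lower-cased tokens.
CHAR_RULES = [
    ("+t", "þ"),  # corpus code for þ
    ("+d", "þ"),  # corpus code for ð, merged with þ
    ("+a", "æ"),  # corpus code for æ
    ("ae", "æ"),
    ("k", "c"),
]

# Whole-token variant table. Assumption: "wifmann" and "mann" are used as
# target forms for illustration only; "swa" is the target named in the text.
VARIANTS = {
    "wifmonn": "wifmann",
    "wifman": "wifmann",
    "wifmon": "wifmann",
    "mon": "mann",
    "swæ": "swa",
    "suæ": "swa",
    "swe": "swa",
    "sue": "swa",
    "sua": "swa",
    "&": "and",
}


def normalise_token(token):
    """Apply the character rules, then look the whole token up in the table."""
    for old, new in CHAR_RULES:
        token = token.replace(old, new)
    # The lookup is by whole token, so "cumon" or "monandæg" are never
    # touched by the entry for "mon".
    return VARIANTS.get(token, token)


for word in ["wifmon", "suæ", "mon", "cumon", "monandæg", "+t+at"]:
    print(word, "->", normalise_token(word))
```

Matching on whole tokens rather than on substrings is what keeps cumon and monandæg intact here; ambiguous forms would still need to be checked by hand in context.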

Due to these normalisations and changes the total length of the texts changed as well. Instead of the original 92,050 tokens in OE2, the number of words was now 91,044 (the big difference between the two figures is due to the large proportion of parallel Latin text that was excluded from the Vespasian Psalter), and for OE4 the figure changed from the previous 67,380 to 67,206 tokens.

On the basis of those modified texts, frequency lists based on the word forms were created. It is to be hoped that the morphological annotation prepared for the Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English (see Pintzuk & Taylor, forthcoming) will make a similar analysis possible on frequencies of lemmas as well.
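A word-form frequency list of this kind can be sketched in a few lines. The function name `frequency_list` and the toy tokens are hypothetical; the tokens are not corpus data.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word form, count) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy tokens for illustration only.
tokens = ["and", "swa", "dryhten", "and", "swa", "and"]
for form, count in frequency_list(tokens):
    print(form, count)
```

The list counts normalised word forms, not lemmas, so inflected forms of one word still appear as separate entries.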

Analysis

The frequency lists of Old English were made for two purposes, both contributing to the study of linguistic typology and universals.

First, I was interested in relative frequencies as possible indicators of morphological type. This line of thought follows the studies made by Tuldava (Tuldava 1977, 1995) and Bektaev (Bektaev 1978) and is based on the idea that the morphological type of a language is expressed by several quantitative parameters of the type/token frequency contrast and the amount of text they cover. Thus those parameters should also help to determine the morphological type of a particular language. The diachronic aspect of the study lay in detecting possible typological change from an earlier period of the language (OE2) to a later one (OE4).
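Two simple parameters of this general kind are the type/token ratio and the share of running text covered by the most frequent types. The sketch below computes both from a frequency table. Whether these are the exact parameters used by Tuldava or Bektaev is an assumption, and the function names and toy counts are hypothetical.

```python
def type_token_ratio(freqs):
    """Number of distinct word forms divided by the number of running words."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("empty frequency table")
    return len(freqs) / total


def coverage(freqs, k):
    """Share of all tokens accounted for by the k most frequent word forms."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("empty frequency table")
    top_counts = sorted(freqs.values(), reverse=True)[:k]
    return sum(top_counts) / total


# Toy counts for illustration only, not corpus data.
freqs = {"and": 3, "swa": 2, "dryhten": 1}
print(type_token_ratio(freqs))
print(coverage(freqs, 1))
print(coverage(freqs, 2))
```

Computing the same figures separately for OE2 and OE4 would give one way of comparing the two sub-periods, although the raw type/token ratio also depends on text length, so samples of equal size would be needed for a fair comparison.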

My second aim was to apply the relative frequencies to the theory of markedness and to study markedness hierarchies by contrasting the relative frequencies of members of some linguistic categories. This idea is based on the observation that grammatically and semantically unmarked words appear more often than their marked counterparts (see e.g. Croft 1990, Halliday 1991).

The discussion of the analysis has to be left for another time.

References

Bektaev, Kladibai. 1978. Statistiko-informatsionnaja tipologija tjurskogo teksta. Alma-Ata: Nauka.

Bosworth, J. & T. N. Toller. 1898. An Anglo-Saxon dictionary. Oxford: Oxford University Press.

Campbell, A. 1959. Old English grammar. Oxford: Oxford University Press.

Chomsky, Noam. 1965. Aspects of the theory of syntax. Cambridge, Mass.: MIT Press.

Clark Hall, J. R. 1960. A concise Anglo-Saxon dictionary, 4th ed. with a supplement by Herbert D. Meritt. Cambridge: Cambridge University Press.

Croft, William. 1990. Typology and universals. Cambridge: Cambridge University Press.

Greenberg, Joseph H. 1966. Some universals of grammar with particular reference to the order of meaningful elements. Universals of language, ed. J. H. Greenberg, 2nd ed. Cambridge, Mass.: MIT Press.

Halliday, M. A. K. 1991. Corpus studies and probabilistic grammar. English corpus linguistics: Studies in honour of Jan Svartvik, eds. Karin Aijmer & Bengt Altenberg. London, New York: Longman, 30-43.

Healey, Antonette di Paolo & Richard L. Venezky. 1980. A microfiche concordance to Old English: The list of texts and index of editions. (Publications of the Dictionary of Old English 1.) Toronto: The Pontifical Institute of Mediaeval Studies.

Hickey, Raymond. 1994. Applications of software in the compilation of corpora. Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, eds. Merja Kytö, Matti Rissanen & Susan Wright. Amsterdam, Atlanta, GA: Rodopi, 165-186.

Kytö, Merja (comp.) 1991. Manual to the diachronic part of the Helsinki Corpus of English texts: Coding conventions and lists of source texts. (2nd ed. 1993.) Helsinki: Department of English, University of Helsinki.

Lyons, John. 1981. Language and linguistics. Cambridge: Cambridge University Press.

Pintzuk, Susan & Ann Taylor. Forthcoming. Annotating the Helsinki Corpus: The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English and the Penn-Helsinki Parsed Corpus of Middle English.

Tuldava, Juhan. 1977. Sagedussõnastik leksikostatistilise uurimise objektina. TRÜ Toimetised, vihik 413. Tartu: Tartu Riiklik Ülikool, 141-171.

Posted by: zyc at 12:30 | link | comments |

 
