Corpus: a collection of texts held electronically and often annotated;
the types of annotation depend on the intended application.
Tagset: a predefined set of annotation labels.
The Brown Corpus, 1963-4: a million-word
collection of samples from 500 written texts from different genres
(newspaper, novels, non-fiction, academic, etc.), with a
tagset of 87 POS tags.
http://www.hit.uib.no/icame/brown/bcm.html
A. PRESS: REPORTAGE (44 texts)
B. PRESS: EDITORIAL (27 texts)
C. PRESS: REVIEWS (17 texts)
D. RELIGION (17 texts)
E. SKILLS AND HOBBIES (36 texts)
F. POPULAR LORE (48 texts)
G. BELLES-LETTRES (75 texts)
H. MISCELLANEOUS: GOVERNMENT & HOUSE ORGANS (30 texts)
J. LEARNED (80 texts)
K. FICTION: GENERAL (29 texts)
L. FICTION: MYSTERY (24 texts)
M. FICTION: SCIENCE (6 texts)
N. FICTION: ADVENTURE (29 texts)
P. FICTION: ROMANCE (29 texts)
R. HUMOR (9 texts)
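As a quick sanity check, the category counts listed above can be summed. The sketch below is only illustrative (the names brown_categories, press and fiction are chosen here, not taken from the corpus documentation); it confirms that the counts add up to the 500 texts of the corpus.

```python
# Number of texts per Brown Corpus category, copied from the list above.
# The dictionary name and the groupings below are illustrative choices.
brown_categories = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36,
    "F": 48, "G": 75, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total_texts = sum(brown_categories.values())

# Group the three PRESS categories and the five FICTION categories.
press = sum(brown_categories[c] for c in "ABC")
fiction = sum(brown_categories[c] for c in "KLMNP")

print(total_texts, press, fiction)
```

Grouping by the first letter of the category code works because every text identifier in the corpus (such as A01) begins with that letter.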
Example:
A01 0010 The Fulton County Grand Jury said Friday an investigation
A01 0020 of Atlanta's recent primary election produced "no evidence" that
A01 0030 any irregularities took place. The jury further said in term-end
A01 0040 presentments that the City Executive Committee, which had over-all
A01 0050 charge of the election, "deserves the praise and thanks of the
A01 0060 City of Atlanta" for the manner in which the election was conducted.
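Each line of the example carries a text identifier (A01: category A, text 01) and a line number before the text itself. A minimal sketch of reading that layout is given below; it assumes the three fields are separated by whitespace exactly as in the sample, and the function name parse_brown_line is hypothetical.

```python
import re

# Assumed layout, based only on the sample above:
#   <category letter + 2 digits> <4-digit line number> <text>
LINE_RE = re.compile(r"^([A-R]\d{2})\s+(\d{4})\s+(.*)$")

def parse_brown_line(line):
    """Split one line of the sample into its fields, or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    text_id, line_no, text = match.groups()
    return {
        "category": text_id[0],   # e.g. "A" for PRESS: REPORTAGE
        "text_id": text_id,       # e.g. "A01"
        "line": int(line_no),     # e.g. 10
        "text": text,
    }

record = parse_brown_line(
    "A01 0010 The Fulton County Grand Jury said Friday an investigation"
)
print(record["category"], record["line"])
```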
The British National Corpus, 1991-1994: 100
million words of
British English from a range of sources, written and spoken.
http://www.hcu.ox.ac.uk/BNC/
The Corpus is designed to represent as wide a range of modern British English as possible. The written part (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age, region and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins.
The corpus comprises 4,124 texts, of which 863 are transcribed from spoken conversations or monologues. Each text is segmented into orthographic sentence units, within which each word is automatically assigned a word class (part of speech) code.
The BNC has two tagsets, C5 (61 tags) and C7 (146 tags);
SLP Appendix C gives both in full. Both were
developed at UCREL (the University Centre for Computer Corpus
Research on Language, University of
Lancaster):
http://www.comp.lancs.ac.uk/computing/research/ucrel/
The Treebank Corpus, early 1990s
http://www.cis.upenn.edu/~treebank/
The Penn Treebank Project annotates naturally-occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information - a bank of linguistic trees. We also annotate text with part-of-speech tags, and for the Switchboard corpus of telephone conversations, dysfluency annotation.
Wall Street Journal | The Brown Corpus | Switchboard | ATIS
The Treebank uses a 45-tag tagset; this is the one you
used in the POS tagging practical.
The Switchboard Corpus, early 1990s
http://www.isip.msstate.edu/projects/switchboard/
Telephone conversations between
strangers: 2430 conversations averaging 6 minutes each of spoken
language, giving 2.4 million wordform tokens and about 20,000 wordform
types. Compared to written language, spoken language has a smaller
vocabulary and many more syntactic fragments.
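The distinction between wordform tokens (running words) and wordform types (distinct words) can be made concrete with a small count. This is a rough sketch only: it assumes whitespace tokenisation and lower-casing, which is simpler than whatever procedure produced the Switchboard figures, and the function name is hypothetical.

```python
from collections import Counter

def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whitespace-separated and case is ignored.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    return len(tokens), len(counts)

# A short stretch of the Switchboard sample, with punctuation removed.
sample = ("okay okay well what do you think about the idea of uh "
          "kids having to do public service work")

n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)
```

Here "okay" and "do" each occur twice, so the sample has two fewer types than tokens.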
Here is a sample annotated in the Treebank with POS tags and syntactic bracketing.
SpeakerB1/SYM ./. Okay/UH ./.
SpeakerA2/SYM ./. Okay/UH ./.
SpeakerB3/SYM ./. Well/UH what/WP do/VBP you/PRP think/VB about/IN
the/DT idea/NN of/IN ,/, uh/UH ,/, kids/NNS having/VBG to/TO do/VB
public/JJ service/NN work/NN for/IN a/DT year/NN ?/. Do/VBP
you/PRP think/VBP it/PRP 's/BES a/DT ,/,
( (CODE SpeakerB1 .))
( (INTJ Okay
.
E_S))
( (CODE SpeakerA2 .))
( (INTJ Okay
.
E_S))
( (CODE SpeakerB3 .))
( (SBARQ (INTJ Well)
(WHNP-1 what)
(SQ do
(NP-SBJ you)
(VP think
(NP *T*-1)
(PP about
(NP (NP the idea)
(PP of
,
(INTJ uh)
,
(S-NOM (NP-SBJ-2 kids)
(VP having
(S (NP-SBJ *-2)
(VP to
(VP do
(NP public service work))))
(PP-TMP for
(NP a year)))))))))
?
E_S))
( (SQ Do
(NP-SBJ you)
(VP think
(SBAR 0
(S (NP-SBJ it)
(VP 's
(NP-PRD-UNF a)))))
,
N_S))
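The POS-tagged part of the sample writes each token as word/TAG. A minimal sketch of reading that format follows; it splits on the last slash so that tokens such as ./. and ,/, are handled, and the function name parse_tagged is hypothetical.

```python
from collections import Counter

def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash: "./." becomes (".", ".").
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

line = "SpeakerB3/SYM ./. Well/UH what/WP do/VBP you/PRP think/VB about/IN"
pairs = parse_tagged(line)

# Count how often each tag occurs in this line.
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[:3])
```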
And here is a sample tagged with the DAMSL "Shallow Discourse Function Annotation":
http://stripe.colorado.edu/~jurafsky/manual.august1.html
FILENAME: 4360_1599_1589
^h A.1 utt1: {F Uh, } let's see. /
% A.1 utt2: How [ about, + {F uh, } let's see, about ] ten years ago, /
qo A.1 utt3: {F uh, } what do you think was different ten years ago from now? /
sv B.2 utt1: {D Well, } I would say as, far as social changes go,
{F uh, } I think families were more together. /
sv B.2 utt2: [ They, + they ] did more things together. /
b @A.3 utt1: Uh-huh <>. /
sv B.4 utt1: {F Uh, } they ate dinner at the table together. /
sv B.4 utt2: {F Uh, } the parents usually took out [ time, + {F uh, }
{D you know, } more time ] than they do now to come with the children
and just spend the day doing a family activity. /
b A.5 utt1: Uh-huh. /
sv B.6 utt1: {F Uh, } although I'm not a mother, [ I, + I ] still think that,
{F uh, } a lot has changed since ten years ago. /
qo B.6 utt2: {F Uh, } what # do you # --
% A.7 utt1: # We, # -/
+ B.8 utt1: -- think about that? /
sv A.9 utt1: {D Well, } {F uh, } {D actually } ten years from today seems rather short. /
b B.10 utt1: Yeah. /
sv A.11 utt1: {F Uh, } {C but } I do agree that, {F uh, } generally [ it's, + society ]
has sort of, {F uh, } let's see, rushed everything ahead. /
b B.12 utt1: Uh-huh. /
h A.13 utt1: {C And, } {F uh, } I don't know, /
sv A.13 utt2: it [ leaves, + leaves ] a lot of time out for family and things like that.
^h hold (often but not always after a question) ('let me think';
question in response to a question)
% indeterminate, interrupted, or contains just a floor holder (see manual)
qo open ended question
sv viewpoint, from personal opinions to proposed general facts
(listener could have basis to dispute)
b default agreement or continuer (uh-huh, right, yeah)
+ continued from previous by same speaker
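Each utterance in the sample starts with a discourse tag, then the speaker and turn number (A.1), then the utterance number (utt1). The sketch below pulls these fields apart. It is based only on the lines shown above, not on the annotation manual: the regular expression, the function name, and the decision to skip the optional @ before the speaker are all assumptions.

```python
import re

# Assumed layout: <tag> [@]<speaker>.<turn> utt<n>: <text>
UTT_RE = re.compile(r"^(\S+)\s+@?([AB])\.(\d+)\s+utt(\d+):\s*(.*)$")

def parse_damsl_line(line):
    """Parse the first line of an utterance; return None for
    continuation lines, headers, and anything else that does not match."""
    match = UTT_RE.match(line.strip())
    if match is None:
        return None
    tag, speaker, turn, utt, text = match.groups()
    return {
        "tag": tag,
        "speaker": speaker,
        "turn": int(turn),
        "utt": int(utt),
        "text": text,
    }

first = parse_damsl_line(
    "qo A.1 utt3: {F uh, } what do you think was different "
    "ten years ago from now? /"
)
backchannel = parse_damsl_line("b @A.3 utt1: Uh-huh <>. /")
continuation = parse_damsl_line(
    "{F uh, } I think families were more together. /"
)
print(first["tag"], first["speaker"], first["turn"], first["utt"])
```

Utterances that wrap onto a second line would need the unmatched lines appended to the preceding record; that step is left out here.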
The Linguistic Data Consortium
http://www.ldc.upenn.edu/
The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.
The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation.
LDC Catalog by Type and Source
| lexicon | speech | text |
Corpora are first divided into major categories
according to the type of data they contain, and then
are further broken down into minor categories based on
the source of the data.
lexicon
| microphone | pronunciation | varied | various |
speech
| broadcast | broadcast speech | cellular telephone |
microphone | mobile-radio | telephone | varied |
text
| broadcast | cellular telephone | conversation |
microphone | newswire | parallel | telephone | varied |
Looking at the list, one can see the move towards specialist corpora, and increasing coverage of languages other than English.