The Growth in Grammar corpus

Introduction

The Growth in Grammar Corpus is a collection of texts written by children at schools in England as part of their regular school work. This page describes the process of text collection, transcription and annotation and summarizes the contents of the corpus. 

The full corpus can be accessed at gigcorpus.com. Registration is required; for access details, contact Phil Durrant (p.l.durrant@exeter.ac.uk).

Corpus procedures

Collecting the corpus

Our research team contacted schools from across the country, briefing them on the project and inviting them to participate. All writing was obtained subject to the students’ voluntary informed consent, with additional consent obtained from the head teacher, the relevant subject teachers, and the students’ legal guardians.

Teachers collected texts from participating students and either photocopied these texts and mailed them to us or invited us into their schools to make photocopies ourselves.

Transcription

All of the texts were received in handwritten form, so we employed a small team of transcribers to type them up. Transcribers received two days of training and worked closely with a member of the core project team to deal with issues that arose during the process.

Transcription proceeded in two phases. In the first phase, each transcriber was assigned a set of photocopies to type up in accordance with our transcription conventions. They were also asked to make two types of change to the original texts: 1) replace any proper names which might compromise participants’ or institutions’ anonymity with anonymisation markers; 2) where a word was misspelled, contained erroneous capitalization, or was abbreviated, insert a tag recording both the original form and a ‘correction’ giving the correct spelling, the correct capitalization, or the expanded form of the abbreviation.

In the second phase, each transcriber was assigned texts which had originally been transcribed by someone else. They both reviewed the original transcription for accuracy and added annotations related to punctuation and grammar.

In each phase, transcribers followed a manual which set out our transcription conventions and the principles to be followed during the process. The manual for phase one can be found here and the manual for phase two here.

Linguistic annotation

The conventions set out above describe the ‘basic’ version of the corpus. For the purposes of analysis, further versions were created incorporating different types of additional linguistic information.

Part-of-speech-tagged corpus

We used the CLAWS tagger to automatically add information about the part of speech of each word in the corpus. To achieve more accurate classifications, misspelled words were corrected and unclear/illegible material was removed prior to tagging. Material appearing inside tables was also removed.
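As a rough illustration of this kind of clean-up before tagging, the Python sketch below replaces each tagged word with its corrected form and drops unclear material and tables. The markup it handles (`<corr orig="...">`, `<unclear>`, `<table>`) and the function name `prepare_for_tagging` are hypothetical, invented for this example; they are not the tags defined in the project's transcription manuals, and this is not the project's own script.

```python
import re

# Hypothetical markup, NOT the corpus's actual conventions:
#   <corr orig="thier">their</corr>  corrected form, with the original recorded
#   <unclear>...</unclear>           unclear or illegible material
#   <table>...</table>               material appearing inside a table
CORRECTION = re.compile(r'<corr orig="[^"]*">(.*?)</corr>', re.DOTALL)
UNCLEAR = re.compile(r"<unclear>.*?</unclear>", re.DOTALL)
TABLE = re.compile(r"<table>.*?</table>", re.DOTALL)


def prepare_for_tagging(text: str) -> str:
    """Return a cleaned copy of a transcribed text, ready for a tagger."""
    text = TABLE.sub(" ", text)          # drop table material
    text = UNCLEAR.sub(" ", text)        # drop unclear/illegible material
    text = CORRECTION.sub(r"\1", text)   # keep only the corrected form
    return " ".join(text.split())        # normalise whitespace


sample = (
    'I went to <corr orig="thier">their</corr> house '
    "<unclear>xx</unclear> yesterday."
)
cleaned = prepare_for_tagging(sample)
print(cleaned)
```

Keeping the original form in an attribute, as assumed here, would let the basic version of a text retain the child's spelling while the version passed to the tagger uses the correction.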

Syntactically-tagged corpus

The corpus was tagged with syntactic information in two ways. First, the entire corpus was tagged for part of speech and grammatical relations using the Stanford CoreNLP suite of tools (as with the part-of-speech tagging, misspellings were corrected and unclear/illegible material and tables were removed prior to parsing).

Second, a subset of the corpus was manually tagged by a team of trained annotators. This analysis focused specifically on tagging syntactic elements within noun phrases and subordinate clauses. Procedures and conventions used in this process are described in full here. The hand-parsed version of the corpus is available upon request. Please contact Phil Durrant (p.l.durrant@exeter.ac.uk) for more information.

Corpus contents

The Growth in Grammar corpus comprises nearly 3,000 texts, written by 983 children in 24 different schools. See here for quantitative summaries of the corpus contents. See here for metadata describing the full contents of the corpus in detail.

Our primary points of data collection were years 2, 6, 9 and 11. We were also sent some texts from year 4, which are included as a supplement to the main corpus.