A conversation analytic transcription of a spoken subcorpus of the BNC

View the Project on GitHub saulalbert/CABNC

CABNC transcription project


The CABNC corpus is an open-licensed, detailed conversation analytic re-transcription of naturalistic conversations from a subcorpus of the British National Corpus, amounting to around 4.2 million words in 1436 separate conversations.

The project aims to produce transcripts usable for both computational and detailed qualitative analysis. If you are a CA transcriptionist and you use the data, please make sure you re-submit your updated transcripts to help improve the corpus over time.

Accessing and using CABNC data

Using transcripts

To edit transcripts in CLAN, place both the transcript .cha file and the audio .wav file in the same directory. Check the CLAN manual for details of how to use the CLAN editor.
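Before opening a batch of transcripts, it can help to confirm that each one has its audio alongside it. The following is a minimal sketch, not part of the CABNC project or of CLAN. It assumes that each .wav file shares the base name of its .cha transcript, which may not hold for every file, and the function name `missing_audio` is hypothetical.

```python
from pathlib import Path


def missing_audio(directory):
    """Return the names of .cha transcripts in `directory` with no same-named .wav file."""
    folder = Path(directory)
    missing = []
    for transcript in sorted(folder.glob("*.cha")):
        # Assumption: the audio file shares the transcript's base name,
        # e.g. example.cha is paired with example.wav.
        if not transcript.with_suffix(".wav").exists():
            missing.append(transcript.name)
    return missing


if __name__ == "__main__":
    # Check the current working directory and report unpaired transcripts.
    for name in missing_audio("."):
        print(f"No matching .wav file for {name}")
```

Run from the directory that holds the transcripts, this prints one line per transcript that lacks a matching audio file, and prints nothing when every transcript is paired.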

Contributing transcripts

Transcriptions are made using Jeffersonian CA transcription conventions, and the CHAT-CA file format and transcription symbols provided by the CLAN transcription system.

Guidelines for transcribers are currently being devised to help with standardisation. They adhere as closely as possible to current standards in CA without sacrificing machine readability.

To use or contribute to these transcripts:

  1. download and install CLAN,
  2. download the corresponding audio file from the Audio BNC site,
  3. improve existing transcripts with CLAN, then submit them to the CABNC project for inclusion.

Underlying BNC Data and Usage Rights

Accessing original BNC data

The data on which this project builds is available here:

If you want to perform complex searches on BNC data:

Subcorpus Data Selection Rationale

The Audio BNC contains about 7.5 million words of recorded speech, all of it already roughly transcribed. The audio recordings are of sufficient quality for automated phonetic transcription, and full Praat TextGrid files aligning audio to transcriptions are available for the entire corpus. There are also comprehensive wordclass and part-of-speech tag annotations. Within the overall BNC corpus, this project focuses on a subcorpus of more naturalistic conversations from informal contexts. These include 152 rough transcripts of audio files, labelled by the original BNC transcribers with the following tags:

These are conversations around water-coolers, in corridors, at bus stops, in homes and so on, and as such are most useful for analysing natural talk-in-interaction. There are 4,228,314 words in this subcorpus.
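For a rough sense of how word counts can be taken from a transcript, the sketch below tallies word-like tokens per speaker in CHAT-formatted text. It is an illustration only and will not reproduce the figure above. It assumes that speaker lines begin with `*` followed by a speaker code and a colon, that continuation lines begin with a tab, and that time-alignment bullets are wrapped in the control character `\x15`. The function name `count_words_by_speaker` and the tokenisation rule are hypothetical choices, not the method used to count the BNC.

```python
import re
from collections import Counter


def count_words_by_speaker(chat_text):
    """Roughly count alphabetic tokens on each speaker tier of a CHAT transcript."""
    counts = Counter()
    speaker = None
    for line in chat_text.split("\n"):
        if line.startswith("*"):
            # Main tier, e.g. "*PS1:\thello there"
            speaker, _, utterance = line[1:].partition(":")
        elif line.startswith("\t") and speaker is not None:
            # Continuation of the current speaker's utterance
            utterance = line
        else:
            # Header (@), dependent tier (%), or blank line: not counted
            speaker = None
            continue
        # Assumption: time-alignment bullets look like \x15start_end\x15
        utterance = re.sub(r"\x15.*?\x15", " ", utterance)
        # Crude tokenisation: runs of letters, allowing internal apostrophes
        counts[speaker] += len(re.findall(r"[A-Za-z][A-Za-z']*", utterance))
    return counts
```

Calling `count_words_by_speaker(open("example.cha", encoding="utf-8").read())` on a hypothetical file `example.cha` returns a `Counter` keyed by speaker code. Pauses, overlap markers, and other CA symbols are ignored by the tokenisation, so the totals are only approximate.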

Rights and Usage Information