Revising Stream Transcripts to WEBVTT with R
Posted: Saturday, May 1, 2021, 05:00 PM
During our project, funded in part by the National Endowment for the Humanities, Dr. Ryan Bessett, Dr. Ana Carvalho, and I (Dr. Katherine Christoffersen) tested several technologically aided transcription methods. We eventually found that Stream auto-generated transcripts were preferable in terms of accuracy, speed, and ease of use. While Stream generated transcripts with timestamps, it did not produce the precise WEBVTT format required for time-aligned, clickable transcripts on the CoBiVa website.
During Summer 2020, Mr. Bart Rossman at the University of Arizona created a sample script for Atom that revised the transcripts through several steps. This was a step in the right direction, but the multiple steps left more room for error. We also encountered some bugs working in Atom. Bart suggested that we look into R and work with Ms. Jessica Draper on an R script that would allow for a one-step process that could revise all the transcripts in a given file.
During Fall 2020, Ms. Jessica Draper created an initial script for the revision of Stream auto-generated transcripts to WEBVTT format. Since Stream does not identify and tag different speakers, this part needs to be done manually. Thus, we created a two-step process. First, the script makes an initial revision of the transcript. Then, after students insert speaker codes to identify when someone speaks (and who that speaker is), a second script revises the transcript again.
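To give a sense of what the two steps involve, here is a minimal sketch in R. It is not the project's actual script. The cue data, the function names (to_timestamp, to_webvtt, tag_speakers), and the "CODE:" speaker-code style are all assumptions made for illustration. The sketch only shows the general idea: write timestamps in the WEBVTT layout, then turn manually inserted speaker codes into WEBVTT voice tags.

```r
# Hypothetical cue data. The real scripts read cues from a Stream transcript;
# this data frame and its column names are assumptions for illustration.
cues <- data.frame(
  start = c(0.0, 2.5),
  end   = c(2.5, 6.04),
  text  = c("INT: Buenos dias.", "P01: Good morning, how are you?"),
  stringsAsFactors = FALSE
)

# Format a number of seconds as a WEBVTT timestamp, HH:MM:SS.mmm
to_timestamp <- function(seconds) {
  ms <- round(seconds * 1000)
  sprintf("%02d:%02d:%02d.%03d",
          as.integer(ms %/% 3600000),
          as.integer((ms %/% 60000) %% 60),
          as.integer((ms %/% 1000) %% 60),
          as.integer(ms %% 1000))
}

# Step 1 (sketch): build the lines of a WEBVTT file from the cues.
to_webvtt <- function(cues) {
  timing <- paste(to_timestamp(cues$start), "-->", to_timestamp(cues$end))
  # Each cue is a timing line, a text line, and a blank separator line.
  body <- as.vector(rbind(timing, cues$text, ""))
  c("WEBVTT", "", body)
}

# Step 2 (sketch): turn a leading speaker code such as "INT:" into a
# WEBVTT voice tag. The "CODE:" convention is hypothetical.
tag_speakers <- function(text) {
  sub("^([A-Za-z0-9]+):[[:space:]]*", "<v \\1>", text)
}

cues$text <- tag_speakers(cues$text)
writeLines(to_webvtt(cues))
```

Keeping the two steps as separate functions mirrors the workflow described above: the first can run on the raw transcript, and the second only after the speaker codes have been added by hand.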
During Spring 2021, Ms. Jessica Draper worked with us and two research assistants, Ms. Isabella Calafate de Barros (University of Arizona, UofA) and Ms. Mayte Vega Mudy (University of Texas Rio Grande Valley, UTRGV), to test the script on a total of 30 transcripts from students in experiential learning, internship-style classes at both campuses, taught by Dr. Katherine Christoffersen (UTRGV) and Dr. Ana Carvalho (UofA). This allowed us to debug both the script and the instructions. The script worked very well, with one exception: it would not run if the students had not inserted the speaker codes correctly or if there was some other formatting problem in a given file.
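One way to catch this kind of problem before running a conversion is to check the speaker codes first. The sketch below is only an illustration and is not part of the project's scripts. The codes ("INT", "P01"), the "CODE:" style, the assumption that every line begins with a code, and the function name find_bad_speaker_lines are all hypothetical.

```r
# Hypothetical speaker codes. The project's real codes are not given here.
allowed_codes <- c("INT", "P01")

# Return the positions of lines that do not begin with "CODE:" for one of
# the allowed codes. Assumes every line should start with a speaker code.
find_bad_speaker_lines <- function(text, codes) {
  pattern <- paste0("^(", paste(codes, collapse = "|"), "):")
  which(!grepl(pattern, text))
}

# Example transcript lines; the second one is missing its colon.
transcript_lines <- c(
  "INT: Where did you grow up?",
  "P01 I grew up near the river.",
  "P01: We moved when I was ten."
)

bad <- find_bad_speaker_lines(transcript_lines, allowed_codes)
if (length(bad) > 0) {
  message("Check speaker codes on line(s): ", paste(bad, collapse = ", "))
}
```

A check like this points students to the specific lines to fix, rather than leaving them with a script that simply fails.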
We have provided the R script files and instructions below:
R Script to Change Stream Transcripts to WEBVTT Format (Step 1)
R Script to Change Stream Transcripts to WEBVTT Format (Step 2)
Instructions on Using R Script to Change Stream Transcripts to WEBVTT Format (Step 1 & 2)
You can also find this project listed on GitHub here.