Research Note: YouTube autocaptions and studying social media sites that don't want to be studied

We're building a pipeline to study YouTube video content. Can we rely on YouTube's captioning system to provide transcripts?


Ryan McGrady

October 26, 2022

For years, Media Cloud has provided researchers with tools to analyze and process large amounts of digital news content. Other sites like Brandwatch and CrowdTangle allow for similar kinds of research on specific social media platforms like Twitter and Facebook, but there’s no clear equivalent for YouTube. Those of us working on the International Hate Observatory want to better understand YouTube content, and would like to eventually have a tool to process YouTube data like we already do for news. 

But there’s a big, obvious challenge: video is a lot harder for computers to make sense of than text. When it comes to a subject like hate speech, there is a great deal more research focused on platforms like Twitter than on YouTube in part because it’s just so much easier to process short blocks of written language. So how do we evaluate video? There are at least four components of videos we could study: metadata, spoken language, other audio features, and visuals. 

The YouTube API makes it relatively easy to pull video metadata like the title, description, view count, tags, duration, number of comments, and default language. Those comprise an important component of our research, and there are many studies based on the material accessible through the API. However, there are limitations to metadata. Some of those limitations are due to the nature of metadata – it’s data about the video rather than the content of the video itself. On YouTube, some of the metadata is based on user inputs, too, creating the possibility of missing or even misleading information about the video.
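As a rough sketch of what that retrieval looks like, the snippet below calls the Data API's videos.list endpoint and flattens the fields mentioned above. The API key and video IDs are placeholders; the field names follow the v3 API's documented response shape (statistics counts arrive as strings, durations as ISO 8601).

```python
"""Sketch: pull video metadata from the YouTube Data API v3.

Assumes a valid API key; videos.list accepts up to 50 IDs per request.
"""
import json
import urllib.parse
import urllib.request

VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def fetch_metadata(video_ids, api_key):
    """Fetch raw metadata for a batch of video IDs (network call)."""
    params = urllib.parse.urlencode({
        "part": "snippet,statistics,contentDetails",
        "id": ",".join(video_ids),
        "key": api_key,
    })
    with urllib.request.urlopen(f"{VIDEOS_URL}?{params}") as resp:
        return json.load(resp)


def extract_metadata(api_response):
    """Flatten the fields we care about from a videos.list response."""
    rows = []
    for item in api_response.get("items", []):
        snippet = item.get("snippet", {})
        stats = item.get("statistics", {})
        details = item.get("contentDetails", {})
        rows.append({
            "id": item.get("id"),
            "title": snippet.get("title"),
            "tags": snippet.get("tags", []),              # user-supplied, may be absent
            "default_language": snippet.get("defaultLanguage"),
            "duration": details.get("duration"),          # ISO 8601, e.g. "PT5M33S"
            "view_count": int(stats.get("viewCount", 0)),
            "comment_count": int(stats.get("commentCount", 0)),
        })
    return rows
```

Note that fields like tags and default language are creator-supplied, which is exactly the missing-or-misleading metadata problem described above.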

Visual and non-verbal audio data are still a bit further down the road for us, although there are a lot of good ideas out there. For example, Gianluca Stringhini of Boston University is working on some approaches to tracking image-based misinformation which hold promise for YouTube. 

For now, we want to tackle what is said in YouTube videos, and thus we need methods of transcription.

While there are analyses of YouTube comments (like Döring and Mohseni’s study of gendered hate speech in comments) and researchers have frequently studied the content of videos by watching and/or manually transcribing them, such labor-intensive work practically requires smaller sample sizes. We want the flexibility to work on a larger scale.

YouTube does have transcripts for some videos available as captions, which researchers have begun to use more often. Van der Vegt, et al. (2021), for example, pulled captions for a set of right-wing and progressive channels in order to study the effect of the Unite the Right rally on the language and social identity of those YouTubers. Some of the most interesting examples of studies that use YouTube transcripts are from medicine, perhaps spurred on by the well-publicized spread of COVID-19 misinformation on YouTube. Herbert, et al., analyzed the complexity and quality of 100 videos about pelvic organ prolapse in a 2021 paper, a small but focused data set. Ginossar, et al. (2022) included topic modeling of 1,280 transcripts about vaccines. A few years earlier, Yiannakoulias, et al. worked with about 200 vaccine-related transcripts. They selected popular videos from the set, which is a common strategy both because researchers typically want to understand what viewers actually watch rather than what merely exists, and because popular videos often have better transcripts (due to cleaner production, higher quality equipment or recording locations, or greater awareness of features like captions).

YouTube rolled out video captions in 2006. They were quickly developed to allow translation, and automatically generated captions followed in 2009. Those autocaptions were initially only available for select channels, but slowly expanded. As of February 2012, there were 1.6 million videos with manual captions and 135 million with autocaptions. In 2017, YouTube reported that there were more than a billion autocaptioned videos, and that it had improved accuracy by 50% in English (although a 50% improvement on an undisclosed baseline does not tell us much).

Autocaptions are now available in 13 languages: Dutch, English, French, German, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Turkish, and Vietnamese. Creators can add captions they created from scratch, or edit the autocaptions. When retrieving captions via the API, user-generated captions are distinguished from those which were automatically generated. YouTube previously offered a feature where viewers could add manual captions to videos that did not have them, but it was removed in September 2020 due to abuse and underuse.
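For illustration, here is a minimal sketch of that distinction using the captions.list endpoint, which labels autogenerated tracks with a trackKind of "ASR". Listing tracks works with an ordinary API key, but downloading the caption text itself (captions.download) requires OAuth authorization from the video's owner, so a sketch like this only tells you what is available.

```python
"""Sketch: separate autogenerated from manual caption tracks
via the YouTube Data API v3 captions.list endpoint."""
import json
import urllib.parse
import urllib.request

CAPTIONS_URL = "https://www.googleapis.com/youtube/v3/captions"


def list_caption_tracks(video_id, api_key):
    """List caption tracks for one video (network call, needs an API key)."""
    params = urllib.parse.urlencode(
        {"part": "snippet", "videoId": video_id, "key": api_key}
    )
    with urllib.request.urlopen(f"{CAPTIONS_URL}?{params}") as resp:
        return json.load(resp)


def split_tracks(captions_response):
    """Split tracks into autogenerated ("ASR") and user-supplied, by language."""
    auto, manual = [], []
    for item in captions_response.get("items", []):
        snippet = item.get("snippet", {})
        kind = snippet.get("trackKind", "").lower()  # compare case-insensitively
        (auto if kind == "asr" else manual).append(snippet.get("language"))
    return auto, manual
```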

Automatically generated captions in general, including YouTube’s, don’t have the best reputation. Yes, for most purposes they’re better than no captions at all, but in the deaf community, YouTube’s autocaption system became not-so-affectionately known as “autocraptions.” They do not include any punctuation, do not differentiate between speakers or clips, and have no way to describe visuals. They are timestamped, but do not indicate pauses. Sometimes verbal pauses are transcribed as words. Whereas captions for television broadcasts in the US are done manually and professionally and held to a high standard by the FCC via the Americans With Disabilities Act, those standards do not apply to video hosting websites. Automated captions are decidedly insufficient to study, say, group dynamics in a video of multiple people talking at once, unless manually edited by the researcher. They also struggle with accuracy when transcribing non-native speakers of a language, deaf speakers, women, and native English speakers from Scotland. (Scottish speakers’ frustration with speech recognition technology is fairly well documented in popular culture.)

Are they good enough for our purposes in studying harmful speech on YouTube? Consumer Reports recently published the results of a study it conducted with researchers at Northeastern University and Pomona College, testing autocaptioning software on several platforms. In addition to confirming that there were more errors in transcripts of non-native English speakers (and finding that women were transcribed more accurately than men), their comparison revealed YouTube’s autocaptions to produce the fewest errors of the bunch (5 per 100 words). 95% accuracy is very high for autocaptioning, but we have to consider that the sample for this study was ideal material for autocaptioning: TED talks, which are well produced, with rehearsed presentations, good separation of speaker and audience, and high quality equipment. Popular YouTubers may have the resources and knowledge to create something of that quality, but we want to be able to sample more broadly, and thus have more research to do on the limitations of YouTube captions in amateur video.
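“Errors per 100 words” is essentially word error rate (WER): the number of substitutions, insertions, and deletions needed to turn the hypothesis transcript into the reference, divided by the reference length. A minimal sketch, assuming simple whitespace tokenization:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as word-level Levenshtein distance. 0.05 ~ 5 errors per 100 words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

Real evaluations usually normalize case and punctuation first, and the choice of normalization can noticeably move the score — one reason published accuracy figures are hard to compare across studies.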

YouTube autocaptions, and autocaptions in general, have gotten much better in recent years and should be useful for a range of projects. It will be important to understand their limitations, however, and relying on them will require some extra consideration at the methodology design stage.

Last year, when I did some preliminary research on the quality and availability of YouTube transcripts for our purposes, it did not take long to realize we could not rely on them. More important than any quality issues was that they were still missing from many of the videos we were looking at. We want to be able to perform and support research that includes unpopular videos, after all. My colleague, Kevin Zheng, developed a pipeline using Vosk, an open source toolkit which is able to identify and transcribe more than 20 languages.

Then something surprising happened: I was doing a spot check of the quality of the Vosk transcripts, comparing them to the video content and then to autogenerated captions on YouTube. The YouTube transcripts were almost uniformly more accurate, but more importantly nearly all of the transcripts I wanted… were available! Had we invested time and money in building a tool that is already mostly obsolete? We’ll need to dig into a range of YouTube datasets before coming to a conclusion. Most likely we will retain the Vosk pipeline as a backup for when captions are unavailable.
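For reference, the core of a Vosk transcription loop looks roughly like the sketch below (this is an illustration, not our actual pipeline; the model directory path is a placeholder, and the vosk package is imported lazily so the JSON-merging helper works without it). Vosk emits one JSON result per recognized chunk, which can be joined into a single transcript:

```python
def transcribe_wav(path, model_dir, chunk_frames=4000):
    """Transcribe a mono 16-bit PCM WAV file with Vosk.

    Requires the `vosk` package and a downloaded model directory,
    so the import happens here rather than at module level.
    Returns a list of per-chunk result dicts, each with a "text" key.
    """
    import json
    import wave
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(chunk_frames)
            if not data:
                break
            if rec.AcceptWaveform(data):           # end of an utterance
                results.append(json.loads(rec.Result()))
        results.append(json.loads(rec.FinalResult()))  # flush the tail
    return results


def merge_results(results):
    """Join per-chunk Vosk results into one transcript string."""
    return " ".join(r["text"] for r in results if r.get("text"))
```

Like YouTube's autocaptions, the output has no punctuation or speaker labels, so the downstream processing problems are similar either way.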

Sometimes when you’re studying social media, surprising things happen that force you to reconsider your subject and/or methods. Scientists studying COVID-19 misinformation on YouTube in the first months of the pandemic had to rethink their approach and conclusions when YouTube first announced a special COVID-19 policy in May 2020, disallowing videos that contradicted the medical consensus at the time. They would have had to recalibrate again when YouTube updated the policy in September 2021 to explicitly prohibit vaccine misinformation. 

In the years after the 2016 election, there was a great deal of research about the ways YouTube’s recommendation algorithms pushed people towards more radical political content, but the company’s changes to those algorithms in 2019 seem to have addressed a big part of the problem (see Chen, et al., 2021; Munger & Phillips, 2022). Two studies done months or weeks apart could yield very different results for the same research questions. 

Did the research influence YouTube’s decisions to implement these changes, or did the company detect, evaluate, and restrict the spread of harmful content on its own? Probably some combination, but these are the uncertainties that come with studying platforms owned by companies that don’t always appear to be enthusiastic subjects of study. When an entity’s motivation is profit rather than fostering a healthy society, after all, there is little incentive to explore the relationship between profitable services and the ways they may harm people and publics. It is worth pressing YouTube to give researchers more access to what’s under the hood, especially as platforms become more like infrastructure, but for now we will have to continually adapt and catch up. We do not know exactly what YouTube changed about its algorithms in 2019, or what motivated it to change its policy on COVID-19 misinformation, but we do know that those changes helped in some ways. 

So how much better are YouTube’s autocaptions now than when we started this project? What exactly changed? Was the difference between my evaluation then and now due just to inadequate sampling on my part, or were there other changes inaccessible to me? There are a few traces of information available, but we ultimately don’t know. What we do know is YouTube’s autocaptions are good enough for us to use in some work on hate speech, and perhaps to start building tools for other people to use, even if they remain a poor substitute for manually produced, high-quality captions/transcripts.

We are making our first foray into studies using large numbers of transcripts, drawing on videos from a set of far-right channels (some popular, some not) to build an anti-immigrant speech classifier. We want to be able to feed a whole bunch of transcripts into a program and ask how likely it is that they contain a particular type of harmful speech. To do that, we first have to annotate the transcripts we have, then train a machine learning model to detect such content elsewhere. 
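As an illustration of that annotate-then-train workflow (not our actual model — the toolkit, features, and toy labels below are all assumptions), a bag-of-words classifier over annotated transcript segments might look like:

```python
# Sketch: train a text classifier on annotated transcript segments,
# then score new transcripts for the probability of the target label.
# scikit-learn and all data below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder annotations: 1 = contains the targeted harmful speech, 0 = does not.
segments = [
    "placeholder segment annotated as containing the target speech",
    "another placeholder segment with the target speech",
    "a third flagged placeholder segment with the target speech",
    "neutral placeholder segment about gardening",
    "neutral placeholder segment about cooking",
    "another neutral placeholder segment about travel",
]
labels = [1, 1, 1, 0, 0, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram and bigram features
    LogisticRegression(),
)
clf.fit(segments, labels)

# Score an unseen transcript: probability it contains the target category.
score = clf.predict_proba(["new placeholder transcript to score"])[0][1]
```

A real version would need far more annotated data and careful validation, which is exactly why the annotation stage comes first.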

Yonatan Lupu, et al. published an impressively expansive study last year in which they did something similar with a number of other, text-based platforms. In order to better understand the relationship between offline events and online hate speech, the researchers developed several types of hate speech classifiers and applied them to a range of moderated and unmoderated sites. We are doing something similar with YouTube, which we believe to be the most important and understudied platform of all. To contend with some of the challenges presented by the medium, we are using a sample comprising mostly native English-speaking men from the United States, working with segments long enough for annotators to properly understand context where ambiguities arise, and paying attention to specific words or phrases which may yield too many errors. I look forward to a future blog post with more information about this process, which is now underway in collaboration with a brilliant team from UMass Amherst’s Center for Data Science.

Hero image from How to Edit YouTube Auto Captions Jan 2020 by Online Network of Educators (CC BY 3.0)