TV Pilots are a treasure trove of data

A few months ago, I did a little weekend project of looking at TV comedy pilot scripts. For those unfamiliar with the concept, when a television show is being developed a network will order a pilot episode as a test to see if it will pick it up for a full season. As a result, the idea may be reworked and elements changed to “make it work” for that network.

Part 1: Fetching and Normalizing the data

To start, I scraped about 450 television pilots from Here was my first challenge, some of these were just text files (awesome) but others were PDFs. In order to extract the text, I turned to Tesseract. Below is the script I used to extract the text:

for f in $(find . -name '*.pdf'); do
  PAGES=`pdfinfo $f | grep Pages: | awk '{print $2}' | tail -n 1`
  if [ ! -f textfiles/$parsedfilename.txt ]
    #some text was parsable just using pdftotext
    pdftotext -layout $f - > textfiles/$parsedfilename.txt
    echo "File $parsedfilename does not exists"
    for i in `seq 1 $PAGES`; do
      # converts the file to an image
      convert -density 500 -depth 8 $f\[$(($i - 1 ))\] images/page$i.png
      # tesseract parses the image for text and puts it into a file
      tesseract images/page$i.png stdout >> parsed/$parsedfilename.txt

This got most of the scripts in a format that could be queried. Here’s a sample of the very funny 30 Rock pilot – note the different character names –

The studio's homebase set. Workman are polishing a big
sign that reads, "Friday Night Bits with Jenna DeCarlo.
"Pull back through the picture window to where KENNETH a
bright and chirpy (Clay Aiken type) NBC page is giving a
tour. He stands next to-a life-size standee of impish
comedian Jenna DeCarlo. '

Part 2: Apache Spark analysis

Now that the data was machine readable, the best first course of action was to query the text files for data that I thought might be interesting.  Apache Spark is a great tool for loading up datasets like this so I went into the Spark shell and ran some different experiments. Here is some of the code I used to get to these numbers:

//loads both folders of the 450 comedy scripts into the RDD
var parsedFiles = sc.textFile("./tvscript/parsed,./tvscript/textfiles")
//outputs the count of the phrase "20s"
parsedFiles.filter(line => line.contains("20s")).count()

Exterior vs interior scenes

Screenplays are unique because of the way they are formatted, they announce whether a scene is interior or exterior at the beginning of the scene with either INT. or EXT. so I started there.


Screen Shot 2016-10-09 at 1.46.01 PM

My take: It is significantly cheaper to shoot indoors than outdoors, this might be a self-selection by writers to make sure their show gets picked up.


Age of characters

When announcing a character in a screenplay you usually give a short description which includes their age usually by decade, for example from The Grinder script “STEWART SANDERSON (30’s) drives with his family”.

Screen Shot 2016-10-09 at 1.48.06 PM


My Take: No surprises here, television is geared towards 24-54 and they want to show a good distribution of those people on TV.

Part 3: Sentiment Analysis

That was a fun experiment, but it was time to go further. In looking at the data, I realized I could do a sentiment analysis of block of text in an episode and see if there were any patterns that appeared. I created a new scala project focused on using Stanford’s natural language processing library and based on work done here. Each block of text was taken and analyzed then put into a MongoDB store with a structure that looks like this

  "sentiment" : 1,
  "textFile" : "Black-ish 1x01 - Pilot",
  "line" : " DIANE\n She’s weird, so feel free to say no.",
  "weight" : 263

Here the “sentiment” is a scale from 1-5 with 1 being most negative, “line” is the actual block of text and “weight” is what order it occurs in the episode, so this was the 263rd thing said in the episode. With the data in place, I built a small node server that could display a chart for the scripts I parsed. Here are some screenshots of the results


Screen Shot 2016-06-05 at 11.08.25 AM Screen Shot 2016-06-05 at 11.07.40 AM

Pretty neat right? Well the completely interactive version is located at where you can look at the 100+ scripts I did sentiment analysis for.

Leave a Reply