A few months ago, I did a little weekend project looking at TV comedy pilot scripts. For those unfamiliar with the concept, when a television show is being developed, a network will order a pilot episode as a test to see whether it will pick the show up for a full season. As a result, the idea may be reworked and elements changed to “make it work” for that network.
Part 1: Fetching and Normalizing the data
To start, I scraped about 450 television pilots from https://sites.google.com/site/tvwriting/us-drama/pilot-scripts. Here was my first challenge: some of these were just text files (awesome), but others were PDFs. To extract the text from the PDFs, I turned to Tesseract. Below is the script I used:
#!/bin/bash
for f in $(find . -name '*.pdf'); do
  # strip the ./scripts/comedy/ prefix so the output file matches the script name
  filename=${f//\.\/scripts\/comedy\//}
  parsedfilename=${filename%.pdf}
  PAGES=$(pdfinfo "$f" | grep Pages: | awk '{print $2}' | tail -n 1)
  # some text was parsable just using pdftotext
  pdftotext -layout "$f" - > "textfiles/$parsedfilename.txt"
  # if pdftotext produced an empty file, fall back to OCR with Tesseract
  if [ ! -s "textfiles/$parsedfilename.txt" ]; then
    echo "File $parsedfilename has no embedded text, running OCR"
    for i in $(seq 1 "$PAGES"); do
      # converts the page to an image
      convert -density 500 -depth 8 "$f[$((i - 1))]" "images/page$i.png"
      # tesseract parses the image for text and appends it to a file
      tesseract "images/page$i.png" stdout >> "parsed/$parsedfilename.txt"
    done
  fi
done
This got most of the scripts into a format that could be queried. Here’s a sample from the very funny 30 Rock pilot (note the different character names):
ACT ONE INT. NBC STUDIOS, NEW YORK e DAY The studio's homebase set. Workman are polishing a big sign that reads, "Friday Night Bits with Jenna DeCarlo. "Pull back through the picture window to where KENNETH a bright and chirpy (Clay Aiken type) NBC page is giving a tour. He stands next to-a life-size standee of impish comedian Jenna DeCarlo. '
Part 2: Apache Spark analysis
Now that the data was machine readable, the best first course of action was to query the text files for data that I thought might be interesting. Apache Spark is a great tool for loading up datasets like this, so I went into the Spark shell and ran some different experiments. Here is some of the code I used to get these numbers:
// loads both folders of the 450 comedy scripts into an RDD
val parsedFiles = sc.textFile("./tvscript/parsed,./tvscript/textfiles")

// counts the lines that contain the phrase "20s"
parsedFiles.filter(line => line.contains("20s")).count()
Exterior vs interior scenes
Screenplays are unique in the way they are formatted: they announce whether a scene is interior or exterior at the beginning of the scene with either INT. or EXT., so I started there.
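In Spark this boils down to a couple of filter-and-count queries over the same parsedFiles RDD loaded above; the snippet below is a rough sketch of that query rather than the exact code I ran.

// count lines that mark a scene as interior vs. exterior
val intScenes = parsedFiles.filter(line => line.contains("INT.")).count()
val extScenes = parsedFiles.filter(line => line.contains("EXT.")).count()
println(s"Interior: $intScenes, exterior: $extScenes")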
My take: It is significantly cheaper to shoot indoors than outdoors, so this might be self-selection by writers to make sure their show gets picked up.
Age of characters
When introducing a character in a screenplay, you usually give a short description that includes their age, typically by decade; for example, from The Grinder script: “STEWART SANDERSON (30’s) drives with his family”.
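Those decade markers can be tallied with the same kind of contains() query used for “20s” earlier. The snippet below is a sketch; the list of decades and the handling of the “30’s” spelling are my own choices.

// tally how often each age decade shows up across all the scripts
val decades = Seq("20s", "30s", "40s", "50s", "60s")
decades.foreach { decade =>
  // scripts write ages both as "30s" and "30's", so check both spellings
  val count = parsedFiles
    .filter(line => line.contains(decade) || line.contains(decade.dropRight(1) + "'s"))
    .count()
  println(s"$decade: $count")
}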
My take: No surprises here: television is geared towards the 24-54 demographic, and networks want to show a good distribution of those people on TV.
Part 3: Sentiment Analysis
That was a fun experiment, but it was time to go further. Looking at the data, I realized I could run a sentiment analysis on each block of text in an episode and see if any patterns appeared. I created a new Scala project focused on using Stanford’s natural language processing library and based on work done here. Each block of text was analyzed and then put into a MongoDB store with a structure that looks like this (a sketch of the scoring step itself follows the sample document):
{ "sentiment" : 1, "textFile" : "Black-ish 1x01 - Pilot", "line" : " DIANE\n She’s weird, so feel free to say no.", "weight" : 263 }
In the stored document, “sentiment” is a scale from 1 to 5, with 1 being the most negative; “line” is the actual block of text; and “weight” is the order in which it occurs in the episode, so this was the 263rd thing said in the episode. With the data in place, I built a small Node server that could display a chart for each script I parsed. Here are some screenshots of the results:
Pretty neat, right? The fully interactive version is located at https://script-sentiment.herokuapp.com/, where you can look at the 100+ scripts I ran sentiment analysis on.
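For the curious, the chart data for a single script boils down to a query like the one below. The actual server is Node, and the database and collection names here are my guesses; this is just a sketch of the query using the MongoDB Scala driver.

import org.mongodb.scala._
import org.mongodb.scala.model.Filters.equal
import org.mongodb.scala.model.Sorts.ascending
import scala.concurrent.Await
import scala.concurrent.duration._

val client = MongoClient("mongodb://localhost:27017")
// database and collection names are hypothetical
val sentiments = client.getDatabase("scripts").getCollection("sentiments")

// pull every scored block for one episode, in the order it was said
val docs = Await.result(
  sentiments
    .find(equal("textFile", "Black-ish 1x01 - Pilot"))
    .sort(ascending("weight"))
    .toFuture(),
  30.seconds)

// each (weight, sentiment) pair becomes one point on the chart
docs.foreach { d =>
  println(s"${d("weight").asInt32().getValue} -> ${d("sentiment").asInt32().getValue}")
}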