🎬 Long form wins. There are twice as many movies as TV shows on Netflix.

🌏 Culture drives content. In India, movies outnumber TV shows, whilst in South Korea TV shows outnumber movies.

🎭 Comedic documentaries? Why not. There is a high correlation between independent and international drama movies, but an untapped market for comedic documentaries.


<aside> 💡 Background: This project was completed as part of a hackathon and forms my capstone project for the Cambridge Spark Data Analytics L4 Bootcamp.

For this project, I used Python, Pandas, Matplotlib and Seaborn on Jupyter.

This page includes my code snippets for exploring, cleaning and analysing the dataset.

</aside>

The Problem Statement


An emerging startup wants to compete with Netflix by offering its own content streaming service. To design a product with a clear USP, it needs insight into what Netflix, one of the industry giants, is doing well and where it could improve.

The Dataset


The dataset contains 8,807 unique records covering both TV shows and movies available on Netflix.

There are 12 columns in total, including information on title, director, cast, country, date added, release year, content rating, duration and genre.

<aside> ℹ️ Data source: Kaggle, 2019

</aside>

Exploring the data


To understand what the dataset contains, I ran a series of standard Pandas inspection calls such as shape, head, count, info and describe, plus checks for missing values, unique values and duplicate rows.

# First, I import the various libraries I will be using for this analysis
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assign the CSV file into a Pandas Dataframe
netflix_df = pd.read_csv('netflix_titles.csv')

# Number of rows and columns
netflix_df.shape

# Quick overview of the first 5 rows
netflix_df.head()

# Count the number of non-null values in each column of the DataFrame
netflix_df.count()

# Display the column data types, non-null counts and memory usage
netflix_df.info()

# Count the number of missing (null) values in each column
netflix_df.isnull().sum()

# Count the number of unique values in each column
netflix_df.nunique()

# Count the number of duplicated rows
netflix_df.duplicated().sum()
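
As a rough sketch, the separate checks above could also be gathered into a single per-column table, which makes it easier to spot columns that need cleaning. The rows in sample_df are made up for illustration and stand in for netflix_df, and the column labels in summary (missing, missing_pct, unique) are hypothetical names rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in rows; the real analysis would use netflix_df instead
sample_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["A. Director", None, "B. Director", None],
    "release_year": [2019, 2020, 2019, 2021],
})

# One row per column: data type, missing count, missing share and distinct values
summary = pd.DataFrame({
    "dtype": sample_df.dtypes.astype(str),
    "missing": sample_df.isnull().sum(),
    "missing_pct": (sample_df.isnull().mean() * 100).round(1),
    "unique": sample_df.nunique(),
})
print(summary)

# describe() adds summary statistics for the numeric columns
print(sample_df.describe())
```

Swapping sample_df for netflix_df should give the same overview for the full dataset, assuming the CSV has loaded as shown earlier.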