
I completed this project as part of Udacity's Nanodegree program. The task was to choose a dataset and apply data wrangling and exploratory data analysis (EDA) techniques to it. I chose The Movie Dataset, which contains information about 10,000 movies, short films and TV series from the last 50+ years, collected from The Movie Database (TMDb), including user ratings, revenue, runtime and budget. You can find the dataset at the link given at the end.

As mentioned above, this project consists of two main parts: Data Wrangling and Exploratory Data Analysis. I'll briefly cover both here, and you can find the detailed version at the repo link below.

Data Wrangling:

I performed several steps to clean the dataset and transform it into a usable format. The following steps were taken in the data wrangling process:

  • Dropping columns with a high percentage of null values.
  • Dropping rows with null values where the percentage of nulls was low.
  • Replacing zero values with the mean where the number of zero values was large.
  • Dropping rows with zero values where the number of zero values was low.
  • Dropping duplicates.
  • Dropping redundant columns.
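
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration on toy data, not the project's actual code: the column names, the thresholds, the `wrangle` helper, and the choice to use the mean of the non-zero values are all assumptions made for the example.

```python
import pandas as pd


def wrangle(df, zero_cols, redundant, null_threshold=0.5, zero_threshold=0.3):
    """Hypothetical helper applying the six cleaning steps to a copy of df.

    The thresholds are illustrative assumptions, not the project's values.
    """
    df = df.copy()

    # 1. Drop columns whose share of nulls exceeds the threshold.
    null_share = df.isna().mean()
    df = df.drop(columns=null_share[null_share > null_threshold].index)

    # 2. Drop the remaining rows that still contain nulls.
    df = df.dropna()

    # 3 and 4. Handle zeros column by column.
    for col in zero_cols:
        if col not in df.columns:
            continue
        is_zero = df[col] == 0
        zero_share = is_zero.mean()
        if zero_share > zero_threshold:
            # Many zeros: replace them with the mean of the non-zero values
            # (an assumption; the post only says "mean").
            nonzero_mean = df.loc[~is_zero, col].mean()
            df[col] = df[col].astype(float).mask(is_zero, nonzero_mean)
        elif zero_share > 0:
            # Few zeros: drop the affected rows.
            df = df[~is_zero].copy()

    # 5. Drop exact duplicate rows.
    df = df.drop_duplicates()

    # 6. Drop columns judged redundant for the analysis.
    df = df.drop(columns=list(redundant), errors="ignore")
    return df.reset_index(drop=True)


# Toy data with assumed column names.
movies = pd.DataFrame({
    "imdb_id": ["tt1", "tt2", "tt3", "tt3", "tt4", "tt5"],
    "title": ["A", "B", "C", "C", "D", "E"],
    "homepage": [None, None, None, None, "example.com", None],
    "genres": ["Drama", "Comedy", "Action", "Action", None, "Drama"],
    "budget": [0, 0, 100, 100, 300, 400],
    "runtime": [90, 0, 120, 120, 100, 95],
})

clean = wrangle(movies, zero_cols=["budget", "runtime"], redundant=["imdb_id"])
print(clean)
```

Which columns count as "mostly null" or "mostly zero" depends on the thresholds, so in practice they would be chosen after inspecting the null and zero shares of each column.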

Exploratory Data Analysis:

In this phase I explored the dataset, asked questions about its characteristics, and answered them with visualizations. The following questions were asked and analyzed:

  • How have popularity and vote count increased over the last few decades?
  • Which decade was the most successful monetarily?
  • What are the most popular genres and keywords of the last decade?
  • Which director and actor/actress have been the most successful?
  • What is the ratio of movies to TV series in the dataset, and how can they be differentiated?
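
As a rough illustration of how the decade and genre questions above could be approached in pandas, here is a small sketch on toy data. The column names, the `|` separator inside `genres`, and the use of total revenue as the measure of monetary success are assumptions for the example, not details taken from the project.

```python
import pandas as pd

# Toy data; column names and the "|" genre separator are assumptions.
movies = pd.DataFrame({
    "release_year": [1985, 1994, 1999, 2005, 2012, 2014],
    "popularity": [1.0, 2.0, 4.0, 3.0, 6.0, 8.0],
    "vote_count": [100, 200, 400, 300, 900, 1100],
    "revenue": [10, 50, 70, 40, 200, 300],
    "genres": ["Drama", "Drama|Comedy", "Action", "Comedy", "Action|Drama", "Action"],
})

# Bucket each release year into its decade (1994 -> 1990).
movies["decade"] = movies["release_year"] // 10 * 10

# Popularity, vote count and revenue per decade.
by_decade = movies.groupby("decade").agg(
    mean_popularity=("popularity", "mean"),
    mean_vote_count=("vote_count", "mean"),
    total_revenue=("revenue", "sum"),
)

# Decade with the highest total revenue.
top_decade = by_decade["total_revenue"].idxmax()

# Genre frequencies in the most recent decade: split the multi-genre
# strings, put one genre per row, then count.
last_decade = movies[movies["decade"] == movies["decade"].max()]
genre_counts = last_decade["genres"].str.split("|").explode().value_counts()

print(by_decade)
print(top_decade)
print(genre_counts)
```

The `by_decade` table can be passed straight to a plotting call such as `by_decade["mean_popularity"].plot()` to visualize the trend.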

Conclusion:

All the data wrangling steps were performed after detailed analysis, and decisions were made after weighing all the feasible options. In the EDA phase, each question was analyzed and insights were gathered through visual plots and statistical analysis. You can check the repo link below to find the code along with the dataset and report.

GitHub Repo