Contact at or 8097636691
Responsive Ads Here

Thursday, 8 February 2018

Predicting Movie Success Based on IMDB Data

Predicting Movie Success Based on IMDB Data

American film studios collectively produce several hundred movies every year, making the United States the third most prolific producer of films in the world. The budget of these movies is of the order of hundreds of millions of dollars, making their box office success absolutely essential for the survival of the industry. Knowing which movies are likely to succeed and which are likely to fail before the release could benefit the production houses greatly as it will enable them to focus their advertising campaigns which itself cost millions of dollars, accordingly. And it could also help them to know when it is most appropriate to release a movie by looking at the overall market. So the prediction of movie success is of great importance to the industry. Machine learning algorithms are widely used to make predictions such as growth in the stock market, demand for products, nature of tumors etc. This paper presents a detailed study of Logistic Regression, SVM Regression and Linear Regression on IMDB data to predict movie box office. Predicting Movie Success Based on IMDB Data
Dataset Collection :- The initial dataset to be used will be collected from IMDB. Predicting Movie Success Based on IMDB Data will consist of movies that were released from 2000 to 2012. Among these movies, we only selected the ones that were released in the United States and are in English, in the anticipation that we would be able to make more accurate predictions on these movies given that their reviews would also be in English. We removed movies which don’t have any information about Box office details. We got data regarding 1050 films.
A. Data preprocessing:- The data we obtained are highly susceptible to noisy, missing and inconsistent data due to the huge size and their likely origin from multiple, heterogeneous sources. We mainly used IMDb and Rotten Tomatoes and Wikipedia. The main problem with datasets was missing fields. To overcome this missing field problem we adopted a method which uses a measure of central tendency for the attribute. We used both mean and median as central tendency. Then removed duplicate items.
B. Data Integration and Transformation:- Data obtained from three different resources IMDB, Wikipedia and Rotten Tomatoes were then integrated into one database.In this step integrated data are transformed or consolidated so that the regression process may be more efficient and easier. Dataset is mixed with both nominal and numeric attributes, but for a regression process, we need all attributes to be numerical. We used a measure of central tendency of Box office revenue to convert corresponding nominal attributes to numerical.

No comments:

Post a Comment