Both the data and the evaluation metric, root mean squared log error (RMSLE), come from the Kaggle Bluebook for Bulldozers competition.
The goal is to predict the future sale price of a bulldozer based on its characteristics, given the previous sale prices and specifications of similar bulldozers.
The dataset contains historical sales data for bulldozers, including attributes such as model type, size, sale date and more.
There are 3 datasets:
- Train.csv - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including `SalePrice`, which is the target variable).
- Valid.csv - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as Train.csv).
- Test.csv - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the `SalePrice` attribute, as this is what we'll be trying to predict).
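As a rough illustration of how one of these files might be loaded, here is a minimal sketch using pandas. The helper name `load_sales` is hypothetical, and the date column name `saledate` is an assumption (check the data dictionary for the actual name). The in-memory rows are made up so the snippet runs without the real files.

```python
import io

import pandas as pd


def load_sales(csv_source, date_column="saledate"):
    """Read a bulldozer sales CSV and parse the sale date column as datetimes.

    `date_column` is an assumed column name, not confirmed by the text above.
    """
    # low_memory=False reads the file in one pass, which avoids
    # mixed-dtype guesses on wide files with many sparse columns.
    return pd.read_csv(csv_source, low_memory=False, parse_dates=[date_column])


# Tiny made-up stand-in for Train.csv (illustrative values only).
sample_csv = io.StringIO(
    "SalesID,SalePrice,saledate\n"
    "1,66000,2006-11-16\n"
    "2,57000,2004-03-26\n"
)
train_sample = load_sales(sample_csv)

# Sorting by date keeps the time ordering explicit, which matters
# because the validation and test sets are later in time.
train_sample = train_sample.sort_values("saledate").reset_index(drop=True)
```

With the real data, the same call would presumably look like `load_sales("Train.csv")`, with the path adjusted to wherever the files were downloaded.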
For this problem, Kaggle has set the evaluation metric to root mean squared log error (RMSLE). As with many regression metrics, the goal is to get this value as low as possible.
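To make the metric concrete, here is a minimal pure-Python sketch of RMSLE using the common definition: the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`. The function name `rmsle` is our own, and the sketch assumes all prices are non-negative.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error between two equal-length sequences.

    Assumes every value is non-negative (sale prices are).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    # log1p(x) computes log(1 + x), so a value of 0 is still valid.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Perfect predictions give an error of zero.
perfect_score = rmsle([10000, 25000], [10000, 25000])
```

Because the error is taken on the log scale, the metric penalises relative differences rather than absolute ones: being off by $1,000 on a $10,000 machine costs more than being off by $1,000 on a $100,000 machine. In practice, the same value can likely be obtained by taking the square root of scikit-learn's `mean_squared_log_error`.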
For this dataset, Kaggle provides a data dictionary that describes what each attribute means. You can download this file directly from the Kaggle competition page.