Alaxo Joy

Training, validation, and test data sets in Machine Learning


We split the training data into two disjoint subsets. One subset is used to learn the model's parameters. The other is the validation set, which is used to evaluate the model during the training process and to estimate how well it generalizes to new, unseen data.


The subset of data used to learn the parameters is still typically called the training set. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. The test set is kept separate from both and is reserved for the final evaluation of the chosen model.
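
As a minimal sketch of the 80/20 split described above, the following uses only the Python standard library. The function name `train_validation_split`, the `seed` argument, and the placeholder data are illustrative assumptions, not part of any particular library.

```python
import random

def train_validation_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into two disjoint lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(samples)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = int(round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]
    training = shuffled[n_validation:]
    return training, validation

# Placeholder data: 100 integer "samples" standing in for real examples.
data = list(range(100))
train_set, val_set = train_validation_split(data)
print(len(train_set), len(val_set))
```

Shuffling before splitting matters when the data is stored in some meaningful order (for example, sorted by label); for time-ordered data a chronological split is usually more appropriate than a random one.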

 

Figure: Training vs Test data

The validation set plays a crucial role in the machine learning process for several reasons. First, it is used to evaluate the model's performance, showing how well the model has learned the underlying patterns in the data. By comparing the model's predictions on the validation set with the actual outcomes, we can gauge its effectiveness and its ability to generalize.


Moreover, the validation data is instrumental in hyperparameter tuning. Hyperparameters, such as the learning rate and regularization parameters, greatly affect a model's performance. By training the model with different hyperparameter values and comparing the results on the validation set, we can choose the values that give the best validation performance.
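
The tuning loop can be sketched as follows. This is an illustrative toy under stated assumptions: a one-dimensional ridge regression without intercept, synthetic data generated in the snippet itself, and a hand-picked list of candidate regularization strengths (`candidate_lams`). The parameters are always fitted on the training set; the validation set is used only to compare candidates.

```python
import random

def fit_ridge_slope(points, lam):
    """Closed-form slope w minimizing sum((y - w*x)**2) + lam * w**2."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def mean_squared_error(points, w):
    """Average squared error of the prediction w*x on `points`."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
points = []
for _ in range(50):
    x = rng.uniform(-1.0, 1.0)
    points.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

# The points are already in random order, so a simple 80/20 slice suffices.
train_set, val_set = points[:40], points[40:]

candidate_lams = [0.0, 0.1, 1.0, 10.0, 100.0]  # hypothetical search grid
val_errors = {}
for lam in candidate_lams:
    w = fit_ridge_slope(train_set, lam)                # learn on training data
    val_errors[lam] = mean_squared_error(val_set, w)   # score on validation data

best_lam = min(val_errors, key=val_errors.get)
print("validation MSE per lambda:", val_errors)
print("selected lambda:", best_lam)
```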


Furthermore, the validation set aids in model selection by allowing us to compare the performance of different models. By training multiple models on the training set and evaluating them on the validation set, we can identify the best-performing model, which is the one most likely to generalize well to unseen data.
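
A minimal sketch of this comparison, assuming two toy candidate models (a constant predictor and a least-squares line) and synthetic data generated in the snippet. The names `fit_constant`, `fit_linear`, and `candidates` are hypothetical and chosen only for illustration.

```python
import random

def fit_constant(points):
    """Baseline model: always predict the mean training target."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def fit_linear(points):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_squared_error(model, points):
    """Average squared error of `model` on `points`."""
    return sum((y - model(x)) ** 2 for x, y in points) / len(points)

# Synthetic data (an assumption for illustration): y = 3x + 1 plus noise.
rng = random.Random(1)
points = []
for _ in range(50):
    x = rng.uniform(0.0, 10.0)
    points.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.5)))
train_set, val_set = points[:40], points[40:]

candidates = {"constant": fit_constant, "linear": fit_linear}
val_errors = {}
for name, fit in candidates.items():
    model = fit(train_set)                                 # train on training set
    val_errors[name] = mean_squared_error(model, val_set)  # compare on validation set

best_name = min(val_errors, key=val_errors.get)
print("validation MSE per model:", val_errors)
print("selected model:", best_name)
```

Because the winning model was chosen using the validation set, its validation score tends to be somewhat optimistic; the separate test set is what gives an unbiased final estimate.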


Purpose of Validation Data


1. Model Evaluation: Helps assess how well the model is performing and whether it is learning the right patterns.


2. Hyperparameter Tuning: Assists in finding the best hyperparameters (like learning rate, regularization parameters, etc.) that improve the model's performance.


3. Model Selection: Aids in choosing the best model among a set of models by comparing their performance on validation data.


Examples in Practice 


• Neural Networks: Validation data helps decide the number of layers, neurons, learning rate, etc.


• Decision Trees: Validation data is used to determine the depth of the tree or the minimum number of samples required at a leaf node.


• Support Vector Machines (SVMs): Validation data helps choose the kernel type and regularization parameter.
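

To make the decision tree case concrete, here is a rough sketch that picks the tree depth by validation error. It uses a toy one-dimensional regression tree written from scratch for illustration, not a library implementation; the names `build_tree`, `predict`, `max_depth`, and `min_leaf`, as well as the synthetic data, are assumptions.

```python
import math
import random

def sse(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(points, max_depth, min_leaf=1):
    """Greedy 1-D regression tree: a leaf is a float, a node is a tuple."""
    ys = [y for _, y in points]
    leaf = sum(ys) / len(ys)
    if max_depth == 0 or len(points) < 2 * min_leaf:
        return leaf
    pts = sorted(points)
    best = None  # (total squared error, split index)
    for i in range(min_leaf, len(pts) - min_leaf + 1):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no threshold separates identical x values
        err = sse([y for _, y in pts[:i]]) + sse([y for _, y in pts[i:]])
        if best is None or err < best[0]:
            best = (err, i)
    if best is None:
        return leaf
    i = best[1]
    threshold = (pts[i - 1][0] + pts[i][0]) / 2.0
    return (threshold,
            build_tree(pts[:i], max_depth - 1, min_leaf),
            build_tree(pts[i:], max_depth - 1, min_leaf))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def mse(tree, points):
    """Average squared prediction error of `tree` on `points`."""
    return sum((y - predict(tree, x)) ** 2 for x, y in points) / len(points)

# Synthetic data (an assumption for illustration): y = sin(x) plus noise.
rng = random.Random(2)
points = []
for _ in range(90):
    x = rng.uniform(0.0, 6.0)
    points.append((x, math.sin(x) + rng.gauss(0.0, 0.2)))
train_set, val_set = points[:60], points[60:]

depths = list(range(1, 9))  # hypothetical candidate depths
val_errors = {}
for depth in depths:
    tree = build_tree(train_set, max_depth=depth)  # fit on training data
    val_errors[depth] = mse(tree, val_set)         # score on validation data

best_depth = min(val_errors, key=val_errors.get)
print("validation MSE per depth:", val_errors)
print("selected depth:", best_depth)
```

Training error can only go down as the tree gets deeper, so it cannot be used to choose the depth; the validation error is what reveals when extra depth starts fitting noise. The same loop works for `min_leaf` by varying that argument instead.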


