With the development of cheap electronics, measuring and recording all kinds of personal data has become a growing trend. One type of such data is accelerometer readings taken by fitness trackers during exercise. In this study I am asked to develop a model that classifies how well an exercise is performed, based on these readings.

The data is provided by the Groupware@LES team of the Pontifical Catholic University of Rio de Janeiro. It contains readings gathered while six people performed the Unilateral Dumbbell Biceps Curl exercise in one correct way and in four incorrect ways. The goal of the study is to correctly classify the types of mistakes.

Exploratory analysis

A brief look at the dataset suggests that many columns contain almost no useful data. To check this, I plot the number of non-empty cells in each column.

As the plot shows, about 100 of the 160 variables have less than 3% meaningful data in them. In addition, the timestamp, entry index and measurement window columns carry no useful information for this task. Deleting all of these should speed up training and is likely to improve the accuracy of the model. In the end I am left with 53 predictors and 1 dependent variable.
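The column check described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for this report: the toy rows, the column names, the set of "empty" markers and the 0.5 threshold are all assumptions made for the example.

```python
# Toy table standing in for the real data set; names and values are made up.
rows = [
    {"roll_belt": "1.41", "kurtosis_roll_belt": "",    "classe": "A"},
    {"roll_belt": "1.42", "kurtosis_roll_belt": "NA",  "classe": "A"},
    {"roll_belt": "1.48", "kurtosis_roll_belt": "",    "classe": "B"},
    {"roll_belt": "1.45", "kurtosis_roll_belt": "0.3", "classe": "B"},
]

# Assumed encodings of a missing cell; adjust to whatever the raw file uses.
EMPTY_MARKERS = {"", "NA", "#DIV/0!"}


def non_empty_fraction(rows, column):
    """Share of rows whose value in `column` is not an empty marker."""
    filled = sum(1 for row in rows if row[column] not in EMPTY_MARKERS)
    return filled / len(rows)


def keep_columns(rows, min_fraction):
    """Names of columns filled in at least `min_fraction` of the rows."""
    return [c for c in rows[0] if non_empty_fraction(rows, c) >= min_fraction]


fractions = {c: non_empty_fraction(rows, c) for c in rows[0]}
kept = keep_columns(rows, min_fraction=0.5)
print(fractions)
print(kept)
```

Because the sparse columns in this data set are almost entirely empty, any threshold well above 3% would separate them from the fully populated ones; the exact cut-off is not critical.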

Models

I am going to use brute force: train a number of models using 7-fold cross-validation and choose the one with the best predictive power. The models I have chosen to test are:
- Gradient boosting model (mdlGbm)
- Linear discriminant analysis (mdlLda)
- Multinomial logistic regression (mdlMulti)
- Naive Bayes (mdlNb)
- Random Forest (mdlRf)
Parallel processing will be used to speed things up.
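
To make the selection procedure concrete, here is a minimal sketch of k-fold cross-validation used to compare candidate models. It is plain Python written for illustration only, not the code behind this report: the two toy "models" (a majority-class baseline and a one-feature nearest-neighbour rule), the made-up data and all function names are assumptions.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(fit, rows, labels, k=7):
    """Mean held-out accuracy of `fit` over k folds.

    `fit(train_rows, train_labels)` must return a function row -> label.
    """
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held_out_set = set(held_out)
        train = [i for i in range(len(rows)) if i not in held_out_set]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(rows[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


def fit_majority(train_rows, train_labels):
    """Baseline: always predict the most frequent training label."""
    majority = max(sorted(set(train_labels)), key=train_labels.count)
    return lambda row: majority


def fit_nearest_neighbour(train_rows, train_labels):
    """1-nearest-neighbour on a single numeric feature."""
    def predict(row):
        nearest = min(range(len(train_rows)),
                      key=lambda i: abs(train_rows[i] - row))
        return train_labels[nearest]
    return predict


# Made-up data: small values belong to class "A", large ones to class "B".
rows = [0.1 * i for i in range(14)] + [5 + 0.1 * i for i in range(14)]
labels = ["A"] * 14 + ["B"] * 14

candidates = {"majority": fit_majority, "nearest": fit_nearest_neighbour}
scores = {name: cross_validated_accuracy(fit, rows, labels)
          for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Each fold is fitted and scored independently of the others, which is why this kind of procedure lends itself to parallel processing.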

Training Results

After training is complete, I obtain the results below. The two winning models are gradient boosting and random forest; the other three stay below 77% accuracy. Moreover, although the Gbm model's accuracy is also high, it is not accurate enough to pass the final Quiz.

##               mdlGbm    mdlLda    mdlMulti      mdlNb       mdlRf
## accuracy   0.9621132 0.7244790   0.7639026  0.7407141   0.9943595
## kappa      0.9526059 0.6557569   0.7047451  0.6758402   0.9929440
## elapsed  194.8250000 2.6740000 146.6670000 61.4790000 592.9420000
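
For readers unfamiliar with the kappa row: Cohen's kappa rescales accuracy by the agreement expected from chance alone, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the chance agreement implied by the row and column totals of the confusion matrix. The sketch below is an illustrative plain-Python computation on a made-up 2x2 matrix; the function name and the orientation of the matrix (rows are predictions, columns are true classes) are assumptions of the example.

```python
def accuracy_and_kappa(matrix):
    """Accuracy and Cohen's kappa for a square confusion matrix.

    matrix[i][j] counts samples predicted as class i whose true class is j.
    """
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(size)) for j in range(size)]
    # Agreement expected if predictions and truth were independent.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


toy = [[40, 10],
       [5, 45]]
acc, kappa = accuracy_and_kappa(toy)
print(round(acc, 4), round(kappa, 4))
```

In this toy case the accuracy is 0.85 but chance agreement is already 0.5, so kappa is only 0.7. For the weaker models in the table the gap between accuracy and kappa is similarly large.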

Final Model

Based on accuracy, the best model is the random forest. Let’s examine it more closely. Below are its out-of-sample performance statistics.

##                cm$overall
## Accuracy        0.9994902
## Kappa           0.9993553
## AccuracyLower   0.9985110
## AccuracyUpper   0.9998949
## AccuracyNull    0.2844520
## AccuracyPValue  0.0000000
## McnemarPValue         NaN

##                       Class: A  Class: B  Class: C  Class: D  Class: E
## Sensitivity          0.9982079 1.0000000 1.0000000 1.0000000 1.0000000
## Specificity          1.0000000 0.9995786 1.0000000 0.9997968 1.0000000
## Pos Pred Value       1.0000000 0.9982472 1.0000000 0.9989637 1.0000000
## Neg Pred Value       0.9992881 1.0000000 1.0000000 1.0000000 1.0000000
## Precision            1.0000000 0.9982472 1.0000000 0.9989637 1.0000000
## Recall               0.9982079 1.0000000 1.0000000 1.0000000 1.0000000
## F1                   0.9991031 0.9991228 1.0000000 0.9994816 1.0000000
## Prevalence           0.2844520 0.1935429 0.1743415 0.1638063 0.1838573
## Detection Rate       0.2839422 0.1935429 0.1743415 0.1638063 0.1838573
## Detection Prevalence 0.2839422 0.1938828 0.1743415 0.1639762 0.1838573
## Balanced Accuracy    0.9991039 0.9997893 1.0000000 0.9998984 1.0000000

The model performs excellently, scoring close to 100% on every per-class metric.
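
The per-class rows in the table above are one-vs-rest statistics derived from a single multi-class confusion matrix. The following sketch shows how a few of them are computed. It is plain Python on a made-up 3-class matrix, with hypothetical names, and it assumes rows are predictions and columns are true classes.

```python
def per_class_metrics(matrix, labels):
    """One-vs-rest metrics; matrix[i][j] = predicted class i, true class j."""
    n = sum(sum(row) for row in matrix)
    result = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fp = sum(matrix[k]) - tp                 # predicted k, truly another class
        fn = sum(row[k] for row in matrix) - tp  # truly k, predicted another class
        tn = n - tp - fp - fn
        sensitivity = tp / (tp + fn)             # same quantity as recall
        specificity = tn / (tn + fp)
        result[label] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "precision": tp / (tp + fp),         # same as positive predictive value
            "balanced_accuracy": (sensitivity + specificity) / 2,
        }
    return result


toy = [
    [8, 1, 0],
    [2, 9, 0],
    [0, 0, 10],
]
metrics = per_class_metrics(toy, ["A", "B", "C"])
for label, values in metrics.items():
    print(label, {name: round(v, 4) for name, v in values.items()})
```

The same relations can be read off the table: the Sensitivity and Recall rows are identical, as are Pos Pred Value and Precision, and Balanced Accuracy is the mean of Sensitivity and Specificity, for example (0.9982079 + 1) / 2 ≈ 0.9991039 for Class A.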

Variable importance

Below is the variable importance plot for the model.

It might be worthwhile to refit the earlier models using only the most important third of the predictors, but this is outside the scope of this article.
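
Should someone want to try that, picking the top third of predictors from an importance ranking is straightforward. The sketch below is plain Python with made-up variable names and importance scores; nothing in it comes from the fitted model.

```python
import math


def top_share(importance, denominator=3):
    """Names of the most important 1/denominator of the variables (rounded up)."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    keep = math.ceil(len(ranked) / denominator)
    return ranked[:keep]


# Hypothetical importance scores, scaled so that the top variable is 100.
importance = {
    "var_a": 100.0, "var_b": 61.5, "var_c": 40.2,
    "var_d": 12.0, "var_e": 3.3, "var_f": 0.4,
}
selected = top_share(importance)
print(selected)
```

With the 53 predictors used here, this rule would keep 18 of them.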

Conclusion

In this study I have successfully fitted a model that classifies exercise performance with very high accuracy. Moreover, the model does not require any summary parameters calculated at the end of an exercise set, or even over fixed time windows. This means that the model could be applied to real-time exercise supervision and feedback as it is, with minimal adjustments.