Results and Discussion¶
Evaluation Metrics¶
Before discussing the results, let us revisit some commonly used metrics for binary classification, explained using this assignment as an example:
| Actual \ Predicted | Negative | Positive |
|---|---|---|
| Negative | \(\color{red}{\text{TN}}\) | \(\color{blue}{\text{FP}}\) |
| Positive | \(\color{green}{\text{FN}}\) | \(\color{orange}{\text{TP}}\) |
\(\text{Precision}\): Measures the proportion of articles predicted as popular that are actually popular. Higher values indicate greater trust in the model's predictions for popular articles. Formula: $$ \text{Precision} = \dfrac{\color{orange}{\text{TP}}}{\color{blue}{\text{FP}} + \color{orange}{\text{TP}}} $$
\(\text{Recall}\): Measures the proportion of actual popular articles that are correctly predicted by the model. Also known as True Positive Rate (TPR) or Sensitivity. Higher values indicate the model's ability to capture actual popular articles. Formula: $$ \text{Recall} = \dfrac{\color{orange}{\text{TP}}}{\color{green}{\text{FN}} + \color{orange}{\text{TP}}} $$
\(\text{Specificity}\): Measures the proportion of actual non-popular articles that are correctly predicted by the model. Also known as True Negative Rate (TNR). Higher values indicate the model's ability to capture actual non-popular articles. Formula: $$ \text{Specificity} = \dfrac{\color{red}{\text{TN}}}{\color{red}{\text{TN}}+\color{blue}{\text{FP}}} $$
\(\text{F1-score}\): A harmonic mean of \(\text{Precision}\) and \(\text{Recall}\), ranging from \(0\) to \(1\), with higher values being better. Formula: $$ \text{F1-score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
\(\text{Balanced Acc.}\): A combined metric of \(\text{TPR}\) and \(\text{TNR}\), ranging from \(0\) to \(1\), with higher values being better. Formula: $$ \text{Balanced Acc.} = \dfrac{\text{TNR} + \text{TPR}}{2} $$
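These quantities can be checked quickly with scikit-learn; the toy labels below are purely illustrative (1 = popular, 0 = non-popular):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

# Toy labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                       # TNR
print(precision_score(y_true, y_pred))             # TP / (FP + TP)
print(recall_score(y_true, y_pred))                # TP / (FN + TP)
print(specificity)
print(f1_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))     # (TNR + TPR) / 2
```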
When using `GridSearchCV` to find the best parameter combination, we record these five metrics and select the best combination based on the f1-score. Example code:
```python
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from imblearn.metrics import specificity_score

# Track all five metrics during cross-validation, but refit on the best f1-score.
scoring = {
    'precision': 'precision',
    'recall': 'recall',
    'specificity': make_scorer(specificity_score),
    'balanced_accuracy': 'balanced_accuracy',
    'f1_score': 'f1',
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring=scoring, refit='f1_score',
                           n_jobs=-1, cv=3, return_train_score=True)
```
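After fitting, the cross-validated mean of each metric is available in `cv_results_`; the snippet below is a sketch, assuming `X_train` and `y_train` hold the prepared features and labels:

```python
import pandas as pd

grid_search.fit(X_train, y_train)  # X_train / y_train: placeholders for the prepared data

# One row per parameter combination, with the mean CV score of each metric.
cv_results = pd.DataFrame(grid_search.cv_results_)
metric_cols = [f'mean_test_{name}' for name in scoring]
print(cv_results[['params'] + metric_cols].sort_values('mean_test_f1_score', ascending=False))
print(grid_search.best_params_)
```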
Experimental Results¶
Info
The best model is an `AdaBoostClassifier` without any resampling, consisting of 100 decision trees with a maximum depth of 2. The average f1-score during cross-validation is 0.56, while the f1-score on the public test set is 0.53. Detailed prediction information is as follows:
Note
```text
===================GETTING CONNECTOR START!==================
============================DONE!============================
====================GETTING TABLES START!====================
posts_test Total: 225,986 rows, 3 columns
post_shared_test Total: 83,376 rows, 3 columns
post_comment_created_test Total: 607,251 rows, 3 columns
post_liked_test Total: 908,910 rows, 3 columns
post_collected_test Total: 275,073 rows, 3 columns
============================DONE!============================
====================MERGING TABLES START!====================
============================DONE!============================
================PREPROCESSING TOTAL_DF START!================
============================DONE!============================
==================PREDICTING TESTSET START!==================
f1-score = 0.53
balanced acc = 0.70
              precision    recall  f1-score   support

           0       0.99      1.00      0.99    221479
           1       0.75      0.40      0.53      4507

    accuracy                           0.99    225986
   macro avg       0.87      0.70      0.76    225986
weighted avg       0.98      0.99      0.98    225986
============================DONE!============================
```
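For reference, a minimal sketch of the winning configuration (variable names such as `X_train` and `X_test` are placeholders; in scikit-learn versions before 1.2 the `estimator` argument is called `base_estimator`):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Best configuration: AdaBoost over 100 depth-2 decision trees, no resampling step.
best_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100,
)
best_model.fit(X_train, y_train)                  # placeholders for the training data
proba = best_model.predict_proba(X_test)[:, 1]    # probability of "popular" for the test set
```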
Now, let us analyze the experimental results. (All figures below are based on cross-validation results, not the entire training set or public test set.)
Resampler¶
First, let us examine how different resampling strategies affect the f1-score:
Info
Different resampling strategies indeed affect the f1-score:
- NearMiss (undersampling) has the lowest f1-score, likely due to excessive removal of non-popular articles, losing too much majority class information.
- SMOTE (oversampling) achieves a moderate f1-score.
- No resampling achieves the highest f1-score.
Next, we investigate how these resampling strategies impact precision and recall:
Info
- NearMiss and SMOTE significantly increase the model's focus on the minority class, resulting in excellent recall scores of 0.91 and 0.95, respectively. However, this comes at the cost of precision, which drops to 0.07 and 0.20, respectively.
- In other words, resampling strategies can capture actual popular articles but reduce the trustworthiness of the predicted popular articles.
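As a sketch of how a resampler is combined with a classifier (the step names and parameters here are illustrative), an `imblearn` pipeline applies the resampler only to the training folds:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from xgboost import XGBClassifier

# The resampler runs only when fitting; validation folds are scored untouched.
pipe = Pipeline([
    ('resampler', SMOTE(random_state=42)),    # swap in NearMiss() or 'passthrough'
    ('clf', XGBClassifier(n_estimators=100)),
])
```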
We further explore whether resampling strategies interact with different classifiers to influence the f1-score:
Info
- Under "SMOTE" and "No Resampling" strategies, different classifiers do not significantly affect the f1-score.
- However, under the NearMiss strategy,
XGBClassifier
achieves the highest f1-score (0.18), whileAdaBoostClassifier
has the lowest (0.07).AdaBoostClassifier
performs poorly because it relies on weak classifiers, which struggle with limited majority class information.XGBClassifier
outperformsGradientBoostingClassifier
due to its optimized GBDT implementation.
Classifier¶
Next, let us examine how different classifiers affect the f1-score:
Info
- Different classifiers have minimal impact on the f1-score. On average, `XGBClassifier` achieves the highest score (0.35), primarily due to its performance under the NearMiss strategy.
Finally, we analyze whether the number of internal classifiers in ensemble models affects the f1-score:
Clearly, the number of classifiers has little impact. Similarly, the tree depth for `AdaBoostClassifier` and the learning rate for the other two models also have minimal impact on the f1-score (figures omitted).
Future Directions¶
The experimental results are summarized above. Due to time constraints, additional attempts were not included. Potential future directions are outlined below:
Explore Other Resampling Techniques¶
Resampling techniques can increase the model's focus on the minority class. Although the experimental results were not ideal, we can continue fine-tuning hyperparameters or exploring other resampling techniques; refer to the "Over-sampling" and "Under-sampling" sections of the `imblearn` User Guide for potential directions.
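For example, other samplers from `imblearn` can be dropped into the same pipeline slot; the candidates below are only a few of the possibilities listed in the User Guide:

```python
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
from imblearn.combine import SMOTETomek

# Candidate resamplers to try in the 'resampler' step of the earlier pipeline sketch.
param_grid = {
    'resampler': [ADASYN(), BorderlineSMOTE(), TomekLinks(),
                  EditedNearestNeighbours(), SMOTETomek()],
}
```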
Consider Other Evaluation Metrics¶
The assignment requires using the f1-score as the evaluation metric. However, if we use balanced accuracy instead, the best model would be a `GradientBoostingClassifier` trained with SMOTE, consisting of 120 classifiers with a learning rate of 0.1, achieving a balanced accuracy of 0.93.
The impact of different resampling strategies on balanced accuracy is shown below:
Info
SMOTE achieves the highest balanced accuracy. If the goal is to preliminarily identify potentially popular articles for subsequent workflows, balanced accuracy might be a better evaluation metric.
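In practice this only requires changing the `refit` target of the earlier grid search, for example:

```python
from sklearn.model_selection import GridSearchCV

# Same scoring dictionary as before, but select the best candidate by balanced accuracy.
grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring=scoring,
                           refit='balanced_accuracy', n_jobs=-1, cv=3,
                           return_train_score=True)
```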
Explore Other Feature Transformations and Classifiers¶
The experiment only considered tree-based ensemble models, which require minimal feature transformation. However, we could explore logistic regression, support vector machines, Poisson regression, etc., combined with effective feature transformations. For example, converting `weekday` and `hour` into circular coordinates (refer to this post) could improve model performance.
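A minimal sketch of such a circular (sine/cosine) encoding, assuming `weekday` takes values 0-6 and `hour` 0-23:

```python
import numpy as np
import pandas as pd

def add_cyclical_features(df: pd.DataFrame) -> pd.DataFrame:
    """Map weekday/hour onto the unit circle so that hour 23 and hour 0 end up close together."""
    df = df.copy()
    df['weekday_sin'] = np.sin(2 * np.pi * df['weekday'] / 7)
    df['weekday_cos'] = np.cos(2 * np.pi * df['weekday'] / 7)
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    return df
```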
Incorporate Sequential Information¶
The experiment ignored the "time trends" of shares, comments, likes, and collections within 10 hours of posting. One potential direction is to use recurrent neural networks (RNN, LSTM, GRU, etc.) to capture these trends and nonlinear relationships between variables.
A simple approach is to combine the four count variables into a 4-dimensional vector (e.g., `[4, 23, 17, 0]` for 4 shares, 23 comments, 17 likes, and 0 collections), with a sequence length of 10. Each article's sequential information would then be a `(10, 4)` matrix, which can be fed into the model for training.
For details on LSTM models, refer to my notes.
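A rough sketch of such a sequence model in PyTorch (the hidden size and class name are arbitrary placeholders), taking the `(10, 4)` count sequences as input:

```python
import torch
import torch.nn as nn

class PopularityLSTM(nn.Module):
    """Classify an article from 10 hourly steps of (shares, comments, likes, collections)."""
    def __init__(self, hidden_size: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):               # x: (batch, 10, 4)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_size)
        return self.head(h_n[-1])       # logit for "popular"

model = PopularityLSTM()
logits = model(torch.randn(8, 10, 4))   # dummy batch of 8 articles
```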
Explore Other Hyperparameter Optimization Methods¶
The experiment used `GridSearchCV` for hyperparameter optimization. However, `RandomizedSearchCV` might be a better choice for optimizing a large number of hyperparameter combinations. Refer to this 2012 JMLR paper for details.
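A sketch of such a randomized search, reusing the pipeline and scoring dictionary from before (the parameter distributions and step name `clf` are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Sample 50 random candidates instead of exhausting the full grid.
param_distributions = {
    'clf__n_estimators': randint(50, 300),
    'clf__learning_rate': uniform(0.01, 0.3),
}
random_search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                                   n_iter=50, scoring=scoring, refit='f1_score',
                                   n_jobs=-1, cv=3, random_state=42)
```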
Additionally, consider the Bayesian optimization implementations provided by `optuna` or `hyperopt`. Watch this video for details, and compare the two libraries in this article.
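With `optuna`, for instance, the search becomes an objective over cross-validated f1-scores; this is only a sketch under the same pipeline and data assumptions as above:

```python
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Suggest hyperparameters for the classifier step of the pipeline.
    pipe.set_params(
        clf__n_estimators=trial.suggest_int('n_estimators', 50, 300),
        clf__learning_rate=trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
    )
    return cross_val_score(pipe, X_train, y_train, scoring='f1', cv=3, n_jobs=-1).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)
```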
Incorporate Text Data and User Behavior¶
The assignment does not include text data or user behavior. Since the ultimate goal is to "recommend articles more accurately to users," consider incorporating Latent Dirichlet Allocation (LDA) topic modeling to enrich article topic information. Refer to my presentation for details on LDA.
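A minimal sketch of deriving topic features with scikit-learn's LDA implementation (the `posts['text']` column and the number of topics are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Turn article text into bag-of-words counts, then into per-article topic proportions.
vectorizer = CountVectorizer(max_features=5000)
counts = vectorizer.fit_transform(posts['text'])    # placeholder text column

lda = LatentDirichletAllocation(n_components=10, random_state=42)
topic_features = lda.fit_transform(counts)          # shape: (n_articles, 10)
```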
Additionally, combining user behavior data could enable more refined personalized text recommendations. Refer to this video and this paper for details.