
Exploratory Data Analysis (EDA)

The dataset is divided into training and testing sets. To avoid data leakage, only the training set is analyzed during EDA, leaving the testing set aside.

When we first receive a dataset, the initial step is to examine its details, including the number of records and columns in each table. Below is the dataset information as of the update on 2020/04/13:

posts_train                Contains 793,751 records and 3 columns
post_shared_train          Contains 304,260 records and 3 columns
post_comment_created_train Contains 2,372,228 records and 3 columns
post_liked_train           Contains 3,395,903 records and 3 columns
post_collected_train       Contains 1,235,126 records and 3 columns
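A first look like this can be produced with a few lines of pandas. The sketch below is illustrative only: the helper name `summarize_tables`, the column names, and the tiny stand-in frames are assumptions, not the actual competition data or loading code.

```python
import pandas as pd


def summarize_tables(tables):
    """Return one row per table with its record and column counts."""
    rows = [
        {"table": name, "records": len(df), "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows).set_index("table")


# Stand-in frames with hypothetical column names; in practice each
# frame would be loaded from the corresponding training table.
demo_tables = {
    "posts_train": pd.DataFrame({
        "post_key": ["a", "b"],
        "created_at_hour": ["2019-04-01 00:00", "2019-04-01 01:00"],
        "like_count_36_hour": [3, 1200],
    }),
    "post_liked_train": pd.DataFrame({
        "post_key": ["a", "b", "b"],
        "created_at_hour": ["2019-04-01 00:00"] * 3,
        "count": [1, 40, 75],
    }),
}

summary = summarize_tables(demo_tables)
print(summary)
```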

The training set covers posts from April 1, 2019, to the end of October 2019, spanning approximately seven months with around 793,000 posts. The goal is to build a predictive model that uses 10-hour post metrics (e.g., shares, comments, likes, and saves) to predict whether a post will receive 1,000 likes within 36 hours, classifying it as a "popular post."

Approximately 2.32% of the training posts are popular, or about 18,000 posts. With classes this imbalanced, preprocessing calls for techniques such as oversampling or undersampling, and model assessment needs metrics other than plain accuracy, since a model that labels every post as not popular would already be about 97.7% accurate.
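To make the imbalance concrete, the sketch below first reproduces the arithmetic behind the "about 18,000" figure and then shows one simple form of random undersampling with NumPy. The function `undersample_majority` and its `ratio` parameter are hypothetical names for illustration; this is not necessarily the resampling method used later, and it should only ever be applied to training data, never to validation or test data.

```python
import numpy as np

# Arithmetic from the text: 2.32% of 793,751 training posts.
n_posts = 793_751
popular_rate = 0.0232
n_popular = round(n_posts * popular_rate)
print(n_popular)


def undersample_majority(y, ratio=1.0, seed=0):
    """Keep every positive row and `ratio` random negatives per positive.

    Returns the sorted row indices of the retained sample.
    """
    y = np.asarray(y).astype(bool)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y)
    neg_idx = np.flatnonzero(~y)
    n_keep = min(len(neg_idx), int(round(ratio * len(pos_idx))))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos_idx, kept_neg]))


# Toy labels: 20 popular posts among 1,000.
y_demo = np.array([1] * 20 + [0] * 980)
idx = undersample_majority(y_demo, ratio=1.0)
print(len(idx), y_demo[idx].mean())
```

A `ratio` above 1 keeps more negatives per positive, which discards less data at the cost of a less balanced sample.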

Problem Definition

The task can be approached in four ways, based on "whether sequence information is considered" and "whether the problem is framed as regression or binary classification":

                       Regression                                                             Binary Classification
With Sequence Info     RNNs (e.g., GRU), traditional time series models (e.g., ARMA, ARIMA)   Same as left
Without Sequence Info  Poisson regression, SVM, tree-based models, etc.                       Logistic regression, SVM, tree-based models, etc.

For simplicity and because of time constraints, we take the "without sequence info" and "binary classification" approach: we aggregate the 10-hour metrics into per-post totals and build a binary classifier to predict popular posts. The discussion focuses on handling imbalanced data and on tree-based models.
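The aggregation step can be sketched as follows. The column names (`post_key`, `created_at_hour`, `count`, `like_count_36_hour`) and the helper `total_per_post` are assumptions made for illustration, as is reading "1,000 likes" as "at least 1,000"; the real schema may differ.

```python
import pandas as pd


def total_per_post(events, name):
    """Sum the hourly counts of one event table into a per-post total."""
    return events.groupby("post_key")["count"].sum().rename(name)


# Toy stand-ins for the posts table and two of the event tables.
posts = pd.DataFrame({
    "post_key": ["a", "b", "c"],
    "like_count_36_hour": [1500, 40, 0],
})
liked = pd.DataFrame({
    "post_key": ["a", "a", "b"],
    "created_at_hour": ["2019-04-01 01:00", "2019-04-01 02:00", "2019-04-01 01:00"],
    "count": [300, 450, 12],
})
shared = pd.DataFrame({
    "post_key": ["a"],
    "created_at_hour": ["2019-04-01 01:00"],
    "count": [25],
})

features = posts.set_index("post_key")
for name, events in {"likes_10h": liked, "shares_10h": shared}.items():
    features = features.join(total_per_post(events, name))

# Posts that never appear in an event table had no such events.
metric_cols = ["likes_10h", "shares_10h"]
features[metric_cols] = features[metric_cols].fillna(0).astype(int)

# Binary target (assumed threshold: at least 1,000 likes in 36 hours).
features["is_popular"] = features["like_count_36_hour"] >= 1000
print(features)
```

Collapsing the hourly rows into one total per post is what discards the sequence information.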

Relationships Between Variables

We simplify the dataset to include total shares, comments, likes, and saves within 10 hours and use a heatmap to observe their relationships with the total likes within 36 hours:

Info

Key observations from the heatmap:

  • Total likes within 36 hours moderately correlate with total likes within 10 hours (.58), shares (.36), and saves (.36), but weakly with comments (.17).
  • Total likes within 10 hours moderately correlate with shares (.63) and saves (.61).
  • Shares and saves within 10 hours moderately correlate (.48).

In simple terms, posts with more likes within 10 hours tend to have more shares and saves. However, the strongest predictor of total likes within 36 hours is the likes within 10 hours. Comments show little correlation with total likes.
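A correlation heatmap of this kind can be sketched with pandas and seaborn. The example below assumes Pearson correlation (the text does not say which coefficient was used) and runs on synthetic data, so its numbers are unrelated to the values reported above. The helper names are hypothetical, and the plotting imports sit inside `plot_heatmap` so that the correlation step runs even without plotting libraries installed.

```python
import numpy as np
import pandas as pd


def correlation_table(features):
    """Pairwise Pearson correlations, rounded to two decimals."""
    return features.corr(method="pearson").round(2)


def plot_heatmap(corr, show_colorbar=False):
    """Draw an annotated heatmap; cbar=False hides the color bar."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(5, 4))
    sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
                cbar=show_colorbar, ax=ax)
    return ax


# Synthetic per-post totals, for illustration only.
rng = np.random.default_rng(0)
likes_10h = rng.poisson(20, size=200)
demo = pd.DataFrame({
    "likes_10h": likes_10h,
    "comments_10h": rng.poisson(4, size=200),
    "likes_36h": 3 * likes_10h + rng.poisson(5, size=200),
})

corr = correlation_table(demo)
print(corr)
# plot_heatmap(corr)  # uncomment to draw the figure
```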

Heatmaps of Key Metrics

Danger

To protect Dcard's proprietary information, color bars (cbar=False) are omitted, showing only relative relationships.

Total Posts by Time

We examine whether the number of posts varies across different time periods:

The x-axis represents 24 hours, and the y-axis represents days of the week.

Info

Observations:

  • Posts are concentrated from midday through early evening (12:00–18:00), with weekdays slightly higher than weekends.
  • The second-highest posting period is weekday mornings (05:00–12:00).
  • Posts are relatively fewer during the evening and late-night hours (18:00–01:00) on both weekdays and weekends.

These trends are reasonable, as students primarily post during the day. The relatively high number of early morning posts might be due to companies posting content before students wake up.
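The weekday-by-hour grid behind such a heatmap can be built with `pd.crosstab`. This is a minimal sketch: the function name `weekday_hour_counts` is hypothetical, and it assumes each post has a creation timestamp already expressed in local time.

```python
import pandas as pd


def weekday_hour_counts(created_at):
    """Count posts per (weekday, hour) slot; weekday 0 is Monday."""
    ts = pd.to_datetime(pd.Series(created_at))
    counts = pd.crosstab(ts.dt.dayofweek, ts.dt.hour,
                         rownames=["weekday"], colnames=["hour"])
    # Make sure every slot appears, even those with no posts.
    return counts.reindex(index=range(7), columns=range(24), fill_value=0)


# 2019-04-01 was a Monday and 2019-04-06 a Saturday.
demo_times = ["2019-04-01 12:15", "2019-04-01 12:40", "2019-04-06 03:05"]
counts = weekday_hour_counts(demo_times)
print(counts.loc[0, 12], counts.loc[5, 3])
```

The resulting 7 x 24 frame has days of the week as rows and hours as columns, matching the axes described above.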

Popular Post Proportion by Time

Next, we analyze whether certain time periods have a higher proportion of popular posts:

Info

Observations:

  • Posts during late-night and early-morning hours on weekends have a higher likelihood of being popular, likely due to increased user activity during these times.
  • The heatmap confirms that the proportion of popular posts varies by time.
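The proportion per time slot is simply the mean of a boolean label within each (weekday, hour) group. The sketch below assumes hypothetical columns `created_at` and `is_popular`; the helper `slot_mean` is illustrative and not taken from the original analysis.

```python
import pandas as pd


def slot_mean(df, value_col, ts_col="created_at"):
    """Average `value_col` per (weekday, hour) slot; empty slots are NaN."""
    ts = pd.to_datetime(df[ts_col])
    weekday = ts.dt.dayofweek.rename("weekday")
    hour = ts.dt.hour.rename("hour")
    grid = (
        df[value_col].astype(float)
        .groupby([weekday, hour])
        .mean()
        .unstack("hour")
    )
    return grid.reindex(index=range(7), columns=range(24))


demo_posts = pd.DataFrame({
    "created_at": ["2019-04-06 03:05", "2019-04-06 03:50", "2019-04-01 12:15"],
    "is_popular": [True, False, False],
})
grid = slot_mean(demo_posts, "is_popular")
print(grid.loc[5, 3], grid.loc[0, 12])
```

Slots with only a handful of posts give noisy proportions, so it may be worth checking the post count per slot before reading much into a single cell. The same helper averages any numeric column, such as a per-post like total.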

Average Likes Within 10 Hours by Time

We then examine the average likes within 10 hours for posts made at different times:

Info

Observations:

  • Posts made between 21:00 and 11:00 the next morning generally receive more likes within 10 hours.
  • Posts made between 11:00 and 21:00, especially during late afternoon and dinner hours, receive fewer likes on average.

One possible explanation is that students are less active in the late afternoon and around dinner time but more active later in the evening. Posts made in the early morning are also still visible to students the next day.

Average Likes Within 36 Hours by Time

Finally, we analyze the average likes within 36 hours for posts made at different times:

The trends are consistent with the 10-hour analysis and are not elaborated further.

Summary

Info

Key takeaways:

  • Most variables are moderately correlated with one another (.36 to .63 in the correlation heatmap), with comments the weakest at .17. Polynomial transformations (PolynomialFeatures) may not yield significant improvements during feature engineering.
  • Posting time significantly impacts the proportion of popular posts and the number of likes, and this information should be incorporated into the model.
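One way to feed posting time to a model is sketched below. The encoding choices here (raw hour and weekday, a weekend flag, and a sine/cosine pair for the hour) are assumptions for illustration and not necessarily the features used later; the column name `created_at` and the helper `add_time_features` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_time_features(df, ts_col="created_at"):
    """Return a copy of `df` with posting-time features appended."""
    ts = pd.to_datetime(df[ts_col])
    out = df.copy()
    out["hour"] = ts.dt.hour
    out["weekday"] = ts.dt.dayofweek        # 0 = Monday
    out["is_weekend"] = out["weekday"] >= 5
    # Cyclical encoding so that 23:00 and 00:00 end up close together.
    angle = 2 * np.pi * out["hour"] / 24
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    return out


demo_posts = pd.DataFrame({"created_at": ["2019-04-06 06:30", "2019-04-01 18:00"]})
timed = add_time_features(demo_posts)
print(timed[["hour", "weekday", "is_weekend", "hour_sin", "hour_cos"]])
```

Tree-based models can usually split on the raw `hour` and `weekday` columns directly; the sine/cosine pair mainly helps models that treat features as continuous, such as logistic regression.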