FIT5145-Assignment2

Explore the structure and quality of the original dataset.

# load potential Library
library(tidyverse)
library(lubridate)
library(dplyr) 
library(ggplot2)
library(visdat)
library(naniar)

# load Data
ire_news <- read_csv("ireland_news.csv")

# Check the data headers, structure, glimpse it as well
head(ire_news)

## # A tibble: 6 × 5
##   publish_date    headline_category headline_text news_provider engagement_score
##   <chr>           <chr>             <chr>         <chr>                    <dbl>
## 1 Wednesday, 25t… opinion           Renua's plan… Irish Times              0.669
## 2 Tuesday, 30th … news              Racism cloud… Irish Examin…            0.427
## 3 Thursday, 13th… news.politics.oi… Minister for… RTE News                 0.694
## 4 Wednesday, 28t… opinion.letters   Kaczynski an… RTE News                 0.472
## 5 Saturday, 17th… opinion           Martyn Turner TheJournal.ie            0.551
## 6 Sunday, 28th o… business.markets  Chris Johns:… RTE News                 0.568

str(ire_news)

## spc_tbl_ [1,610,523 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ publish_date     : chr [1:1610523] "Wednesday, 25th of March, 2015" "Tuesday, 30th of June, 1998" "Thursday, 13th of March, 2014" "Wednesday, 28th of February, 2007" ...
##  $ headline_category: chr [1:1610523] "opinion" "news" "news.politics.oireachtas" "opinion.letters" ...
##  $ headline_text    : chr [1:1610523] "Renua's plan to publish Attorney General's advice misguided" "Racism clouds fight for justice after London street stabbing" "Minister for Justice has not 'failed in any respect'; Bruton tells Dáil" "Kaczynski and homosexuality" ...
##  $ news_provider    : chr [1:1610523] "Irish Times" "Irish Examiner" "RTE News" "RTE News" ...
##  $ engagement_score : num [1:1610523] 0.669 0.427 0.694 0.472 0.551 0.568 0.693 0.51 0.589 0.549 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   publish_date = col_character(),
##   ..   headline_category = col_character(),
##   ..   headline_text = col_character(),
##   ..   news_provider = col_character(),
##   ..   engagement_score = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

glimpse(ire_news)

## Rows: 1,610,523
## Columns: 5
## $ publish_date      <chr> "Wednesday, 25th of March, 2015", "Tuesday, 30th of …
## $ headline_category <chr> "opinion", "news", "news.politics.oireachtas", "opin…
## $ headline_text     <chr> "Renua's plan to publish Attorney General's advice m…
## $ news_provider     <chr> "Irish Times", "Irish Examiner", "RTE News", "RTE Ne…
## $ engagement_score  <dbl> 0.669, 0.427, 0.694, 0.472, 0.551, 0.568, 0.693, 0.5…

# check NAs
sum(is.na(ire_news))

## [1] 380

# Check proportion of NAs in headline_category
miss_var_summary(ire_news)

## # A tibble: 5 × 3
##   variable          n_miss pct_miss
##   <chr>              <int>    <num>
## 1 headline_category    193 0.0120  
## 2 engagement_score      85 0.00528 
## 3 publish_date          80 0.00497 
## 4 news_provider         13 0.000807
## 5 headline_text          9 0.000559

Base on whole project, processing data with “Global cleaning” codeing follows would avoid code redundant later.

# Global cleaning applied to following questions
ire_news_clean <- ire_news |>
  mutate(headline_category = str_trim(tolower(headline_category))) |>
  mutate(news_provider = str_trim(news_provider)) |>
  filter(!is.na(news_provider)) |>
  filter(news_provider != "...") |>
  filter(!is.na(engagement_score))

1 Question 1

How many unique values are there in the headline category column of the data file?

# Find the unique value of 5 headline category
# Filter NA
unique_value <- ire_news_clean |>
  filter(!is.na(headline_category)) |>
  mutate(headline_category = tolower(headline_category)) |>
  pull(headline_category) |>
  unique()

length(unique_value)

## [1] 118

# See how articles are distributed across categories(OPTIONAL)
# sort(table(ire_news$headline_category), decreasing = TRUE)

After checking the missing value. There are 193 NAs in headline_category, which only account for 1.2% of total in this column. I removed them with filter before counting.

The categories contain dot notation subcategories (e.g. news.politics.oireachtas, news.social, news.consumer). Though they all belong to same parent category, they represent genuinely distinct subcategories and are therefore counted as unique values.

In summary, there are 118 unique values in the headline_category column after removing 193 NA and “…” values.

2 Question 2

Which provider has the highest mean engagement? Which headline category has the lowest number of articles?

# Find the news provider has the largest mean of engagement_score
ire_news_clean |>
  group_by(news_provider) |>
  summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
  slice_max(order_by = mean_engagement, n = 1) |> 
  ungroup()

## # A tibble: 1 × 2
##   news_provider     mean_engagement
##   <chr>                       <dbl>
## 1 Galway Advertiser           0.977

# Exclude Galway Advertiser,find the provider who has the highest mean engagement score

ire_news_clean |>
  filter(news_provider != "Galway Advertiser") |> 
  group_by(news_provider) |>
  summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
  slice_max(order_by = mean_engagement, n = 1) |> 
  ungroup()

## # A tibble: 1 × 2
##   news_provider mean_engagement
##   <chr>                   <dbl>
## 1 Irish Times             0.556

2.1 Provider with Highest Mean Engagement:

After removing corrupted entries (“…”) and NA providers, the news provider with the highest mean engagement score is Galway Advertiser (mean = 0.977).

However, this is based on only 3 articles, which may not be statistically representative. Among providers with substantial article counts, Irish Times records the highest mean engagement (mean = 0.556, n = 402,575), making it arguably the most reliable indicator of high engagement.

Note: The limited article count for Galway Advertiser was identified during Q3 analysis. While it technically holds the highest mean, its small sample size require to be refer with caution in interpretation.

# Find the headline category has the lowest number of articles
ire_news_clean |>
  filter(!is.na(headline_category)) |>
  mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
  group_by(parent_category) |>
  summarise(number_of_article = n()) |>
  slice_min(order_by = number_of_article, n = 1) |> 
  ungroup()

## # A tibble: 6 × 2
##   parent_category number_of_article
##   <chr>                       <int>
## 1 entertainment                   1
## 2 eye on nature                   1
## 3 my holidays                     1
## 4 s                               1
## 5 x86%                            1
## 6 <NA>                            1

# Investigate suspicious categories
ire_news_clean |>
  filter(
    str_detect(headline_category, "^s\\.") | headline_category %in% c("x86%") |str_detect(headline_category, "^entertainment|^eye|^my"))

## # A tibble: 5 × 5
##   publish_date    headline_category headline_text news_provider engagement_score
##   <chr>           <chr>             <chr>         <chr>                    <dbl>
## 1 Thursday, 16th… s.g.              Birdies prov… Irish Times              0.631
## 2 Saturday, 11th… eye on nature     Eye On Nature TheJournal.ie            0.383
## 3 Saturday, 24th… my holidays       My Holidays   TheJournal.ie            0.463
## 4 Friday, 13th o… entertainment     Philips sale… Irish Times              0.486
## 5 Sunday, 05th o… x86%              Ask the expe… Irish Times              0.612

2.2 Headline Category with Lowest Number of Articles:

To identify the parent category, I extracted the first segment of headline_category before any . or _ separator using str_extract(headline_category, "^[^._]+"), as inconsistent notation (e.g. business.economy vs business_economy) likely represents the same category. At first glance, six parent categories each contain only one article: entertainment, eye on nature, my holidays, s, x86%, and NA. Further investigation revealed the following anomalies:

s.g. — a truncated entry whose headline text (“Birdies prove elusive for Rory McIlroy at US PGA”) suggests it represents sports.golf. Excluded as corrupted.
x86% — bears no meaningful relationship to its article content (“Ask the expert: I think my son (14) has OCD”). Excluded as corrupted.
eye on nature and my holidays — the headline_category exactly mirrors the headline_text, suggesting the system mistakenly used the article title as the category. Excluded as data entry errors.

After excluding these anomalies, entertainment is identified as the legitimate headline category with the lowest number of articles (n = 1).

3 Question 3

Compute the total number of articles for each headline category and news provider. Then, use a single R function to display the statistical information, i.e., Min, Max, and Mean, of the total number of articles (as computed previously) for each news provider. (Note: You may use multiple functions/commands to prepare the pre-processed data table, but when you compute and display the statistical information, you need to use a single R function.)

# Step 1 - compute total articles per provider and category
article_counts <- ire_news_clean |>
  group_by(news_provider, headline_category) |>
  summarise(total_articles = n()) |> 
  ungroup()

# Step 2 - apply summary() per provider
tapply(article_counts$total_articles, 
       article_counts$news_provider, 
       summary)

## $`Galway Advertiser`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1 
## 
## $`Irish Examiner`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      1.0    289.5    774.0   3797.9   2465.8 144569.0 
## 
## $`Irish Independent`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0    59.5   188.0   782.2   518.0 29134.0 
## 
## $`Irish Times`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     335    1073    5181    3318  203530 
## 
## $`RTE News`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      1.0    204.2    616.5   2970.9   1857.2 115515.0 
## 
## $TheJournal.ie
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   171.2   473.5  2279.9  1458.0 86774.0

Step 1 computes the total number of articles for each combination of news_provider and headline_category using group_by() and summarise().

Step 2 applies the single function summary() via tapply() to compute and display Min, 1st Quartile, Median, Mean, 3rd Quartile, and Max of article counts for each news provider.

Notable findings:

Galway Advertiser shows uniform coverage (Min=Max=Mean=1) across only 3 articles total, suggesting minimal presence in the dataset. so it was exluded most time during further EDA.
Irish Times dominates with the highest mean (5,181) and maximum (203,530) articles per category, indicating broad coverage.

# Glance at detailed of "Galway Advertiser" provider
ire_news_clean |>
  filter(news_provider == "Galway Advertiser")

## # A tibble: 3 × 5
##   publish_date    headline_category headline_text news_provider engagement_score
##   <chr>           <chr>             <chr>         <chr>                    <dbl>
## 1 Monday, 3th of… news              Council appr… Galway Adver…            0.987
## 2 Wednesday, 18t… sport             Underdogs st… Galway Adver…            0.954
## 3 Friday, 2th of… business          West coast t… Galway Adver…            0.991

# Add article count alongside mean to overview the data
ire_news_clean |>
  group_by(news_provider) |>
  summarise(
    mean_engagement = round(mean(engagement_score, na.rm = TRUE), 3),
    n_articles = n()
  ) |>
  arrange(desc(mean_engagement)) |> 
  ungroup()

## # A tibble: 6 × 3
##   news_provider     mean_engagement n_articles
##   <chr>                       <dbl>      <int>
## 1 Galway Advertiser           0.977          3
## 2 Irish Times                 0.556     564751
## 3 RTE News                    0.545     320861
## 4 Irish Examiner              0.533     402575
## 5 Irish Independent           0.526      80562
## 6 TheJournal.ie               0.489     241672

4 Question 4

In which year did TheJournal.ie record its maximum engagement score? Which news provider shows the largest increase in engagement score over the years?

# Part A (TheJournal.ie max engagement year)
ire_news_clean |> 
  filter(news_provider == "TheJournal.ie") |> 
  mutate(year = str_extract(publish_date, "\\d{4}$")) |> 
  filter(!is.na(year)) |> 
  group_by(year) |> 
  summarise(max_score = max(engagement_score, na.rm = T)) |> 
  slice_max(max_score , n = 1) |> 
  ungroup()

## # A tibble: 1 × 2
##   year  max_score
##   <chr>     <dbl>
## 1 2017        1.2

Initial analysis identified 2017 as the year of maximum engagement (score = 1.2). However, since the engagement score is defined within [0,1], this value probably invalid.

After removing out-of-range scores, the maximum valid score is 1.0 - which I assume more reasonable - recorded across multiple years: 2011, 2017, 2018, 2019, 2020, and 2021.

# Check the score range 
ire_news_clean |> 
  filter(!is.na(engagement_score)) |> 
  group_by(news_provider) |> 
  summarise(
    min_score = min(engagement_score, na.rm = TRUE),
    max_score = max(engagement_score, na.rm = TRUE)
  ) |> 
  ungroup()

## # A tibble: 6 × 3
##   news_provider     min_score max_score
##   <chr>                 <dbl>     <dbl>
## 1 Galway Advertiser     0.954     0.991
## 2 Irish Examiner        0.206     1    
## 3 Irish Independent     0.215     1    
## 4 Irish Times           0.218     1    
## 5 RTE News              0.216     1    
## 6 TheJournal.ie     -5000         1.2

# Find the year TheJournal.ie max engagement score equal to 1
ire_news_clean |> 
  filter(news_provider == "TheJournal.ie", engagement_score == 1) |> 
  mutate(year = str_extract(publish_date, "\\d{4}$")) |> 
  filter(!is.na(year)) |> 
  select(year, engagement_score) |> 
  distinct() # see a unique list of years where this happened

## # A tibble: 6 × 2
##   year  engagement_score
##   <chr>            <dbl>
## 1 2020                 1
## 2 2021                 1
## 3 2019                 1
## 4 2017                 1
## 5 2011                 1
## 6 2018                 1

# Check year range.In case "1" show on every year recorded.
range(as.numeric(str_extract(ire_news_clean$publish_date[ire_news_clean$news_provider == "TheJournal.ie"], "\\d{4}$")), na.rm = TRUE)

## [1] 1996 2021

# Part B (largest increase)
yearly_engagement <- ire_news_clean |> 
  filter(engagement_score >= 0 & engagement_score <= 1) |> 
  mutate(year = as.numeric(str_extract(publish_date, "\\d{4}$"))) |>
  group_by(news_provider, year) |>
  summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
  ungroup()

score_gap <- yearly_engagement |>
  group_by(news_provider) |>
  summarise(
    first_year_score = mean_engagement[which.min(year)],
    last_year_score  = mean_engagement[which.max(year)],
    score_gap = last_year_score - first_year_score
  ) |>
  arrange(desc(score_gap))

score_gap |> slice_max(score_gap, n = 1)

## # A tibble: 1 × 4
##   news_provider first_year_score last_year_score score_gap
##   <chr>                    <dbl>           <dbl>     <dbl>
## 1 Irish Times              0.444           0.647     0.202

Irish Times shows the largest increase in mean engagement score (gap = 0.202) between first and last recorded year

While, two approaches were considered for measuring engagement increase. The first compares mean engagement between a provider’s first and last recorded year, treating yearly means as representative values — analogous to comparing start and end points on a line chart.

The second identifies absolute min/max scores regardless of year, which was rejected as individual article scores don’t represent yearly trends. However, they fail to represent the “typical” performance of a provider in any given year. Using means may align with a cleaner visualization (e.g., a line chart) and a more honest comparison of growth.

5 Question 5

Investigate the factors associated with higher or lower engagement scores in the given dataset.

# Explore the score distribution in different publish day of week.

ire_news_clean |>
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !news_provider %in% c("Galway Advertiser")
  ) |>
  mutate(
    day = str_extract(publish_date, "^\\w+"),
    day = factor(day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", 
                  "Saturday", "Sunday"))
  ) |>
  ggplot(aes(x = day, y = engagement_score, fill = day)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.1, alpha = 0.5, outlier.size = 0.5) +
  facet_wrap(~ news_provider) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  scale_fill_brewer(palette = "Set3") +
  labs(
    title = "Engagement Score Distribution by Day of Week",
    x = "Day of Week",
    y = "Engagement Score"
  ) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Learn more about Boxplot vs Violin

# Compare providers performance by different publish season.

ire_news_clean |>
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !news_provider %in% c("Galway Advertiser")
  ) |>
  mutate(
    month = str_extract(publish_date, 
                        "January|February|March|April|May|June|July|August|September|October|November|December"),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November")  ~ "Autumn"
)) |>
  mutate(season = factor(season, 
                levels = c("Spring", "Summer", "Autumn", "Winter"))) |> 
  ggplot(aes(x = season, y = engagement_score, fill = season)) +
  geom_boxplot(width = 0.6) +
  scale_fill_brewer(palette = "Pastel1") + 
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  facet_wrap(~ news_provider) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Check what articles are outliers(higher)

ire_news_clean |>
  filter(engagement_score >= 0 & engagement_score <= 1) |>
  mutate(month = str_extract(publish_date,
    "January|February|March|April|May|June|July|August|September|October|November|December"),
    season = case_when(
      month %in% c("December", "January", "February") ~ "Winter",
      month %in% c("March", "April", "May") ~ "Spring",
      month %in% c("June", "July", "August") ~ "Summer",
      month %in% c("September", "October", "November")  ~ "Autumn"
    )) |>
  group_by(news_provider, season) |>
  mutate(
    Q1 = quantile(engagement_score, 0.25),
    Q3 = quantile(engagement_score, 0.75),
    IQR = Q3 - Q1,
    is_outlier = engagement_score > Q3 + 1.5*IQR
  ) |>
  filter(is_outlier) |>
  select(headline_text, engagement_score, season, news_provider)

## # A tibble: 22,352 × 4
## # Groups:   news_provider, season [20]
##    headline_text                           engagement_score season news_provider
##    <chr>                                              <dbl> <chr>  <chr>        
##  1 Hotels in Northern Ireland to reopen f…            0.868 Summer Irish Times  
##  2 Three missing after Didcot collapse un…            0.98  Winter Irish Examin…
##  3 HSE managers told to risk-assess staff…            0.846 Summer Irish Examin…
##  4 Serious shortage of midwives in Dublin…            0.835 Spring Irish Examin…
##  5 Alcock & Brown statue to sit in Clifde…            0.81  Spring Irish Examin…
##  6 Big Tom 'monumentalised among the peop…            0.914 Autumn Irish Times  
##  7 Shatter welcomes crime data                        0.813 Summer Irish Times  
##  8 Go Walk: Glanrastal; Beara Peninsula; …            0.798 Summer Irish Examin…
##  9 Coronavirus: Three more deaths and 10 …            0.811 Summer Irish Times  
## 10 Five companies competing to redevelop …            0.833 Winter Irish Times  
## # ℹ 22,342 more rows

ire_news_clean |>
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !is.na(headline_category)) |>
  mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
  group_by(parent_category) |>
  summarise(
    mean_engagement = round(mean(engagement_score, na.rm = TRUE), 3),
    n_articles = n()
  ) |>
  # There are some "noise" - "s", "x86%","NA", etc. Keep only categories with a meaningful sample size - filter(n_articles > 10)
  filter(n_articles > 10) |> 
  arrange(desc(mean_engagement))

## # A tibble: 6 × 3
##   parent_category mean_engagement n_articles
##   <chr>                     <dbl>      <int>
## 1 news                      0.56      797757
## 2 business                  0.548     222880
## 3 sport                     0.533     261715
## 4 lifestyle                 0.506      95985
## 5 culture                   0.488      98919
## 6 opinion                   0.478     132965

5.1 Insights: Temporal Factors vs. Engagement

Drawing on the “Day of Week” violin plot and the seasonal boxplots, several key patterns emerge regarding the relationship between temporal factors and engagement.

Weekend Concentration Effect

The violin plot indicates a subtle upward shift in the density of higher engagement scores during weekends, particularly for TheJournal.ie and RTE News. Although median engagement remains close to 0.5 across all days, the distribution on Saturdays and Sundays shows a greater concentration of high-engagement observations. This suggests that, despite potentially lower publication volume, weekend articles receive more focused audience attention.

Seasonal Stability

The seasonal boxplots reveal minimal variation across different periods of the year. Both the interquartile ranges and median engagement levels remain largely consistent across seasons for all providers. Insight: Engagement with news content in Ireland appears largely invariant to seasonal effects, indicating that audience demand for news remains stable throughout the year.

Provider-Level Baseline Effect

A clear structural difference is observed between providers. The Irish Times consistently exhibits a higher lower-bound (i.e., a higher baseline engagement level) compared to TheJournal.ie, regardless of temporal conditions. This suggests that provider-specific factors, such as brand authority and audience loyalty, exert a stronger influence on engagement than timing-related variables.

Outlier Independence

Outlier analysis shows that maximum engagement values (approaching 1.0) occur across all days and seasons. This indicates that highly viral content is not temporally constrained. Insight: Extreme engagement outcomes are more likely driven by content-specific factors rather than when the article is published.

Summary

Overall, while a modest weekend effect is observable, temporal variables such as day of the week and season play a secondary role in shaping engagement. Instead, provider characteristics and content attributes appear to be the dominant determinants. The news engagement landscape demonstrates strong temporal stability, with consistent audience interaction patterns throughout the year.

ire_news_clean |>
  # Clean the scores and categories
  filter(engagement_score >= 0 & engagement_score <= 1,
         !is.na(headline_category)) |>
  # Extract the parent category
  mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
  # Filter "noise" without collapsing the whole dataframe
  add_count(parent_category) |> 
  filter(n > 10) |> 
  # plot
  ggplot(aes(x = reorder(parent_category, engagement_score, median), 
             y = engagement_score,
             fill = parent_category)) +
  # Add the violin layer for density (alpha makes it transparent)
  geom_violin(alpha = 0.3, color = "transparent", trim = TRUE) +
  geom_boxplot(width = 0.15, color = "black", outlier.size = 0.4, alpha = 0.7) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Engagement Density and Distribution by Category",
    subtitle = "Categories with >10 articles only",
    x = "Category",
    y = "Engagement Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

5.2 Insights: Category factors on Engagement

Prior to analysis, it was hypothesised that News and Business would outperform other categories, given the structural advantages of institutional publishers in reporting finance, policy, and industry developments. The results largely support this expectation: both categories occupy the upper range of engagement, with relatively high median scores compared to others.

A hybrid violin–boxplot is employed to capture both summary statistics (median and quartiles) and the full distribution of engagement scores, enabling a more nuanced interpretation of within-category variation.

News Stability vs. Business Variability

Although News and Business exhibit similar central tendencies, their distributions differ markedly.

News shows a tightly concentrated density around the median, indicating that most articles fall within a narrow engagement range. This reflects a stable and consistent baseline of audience interaction.
Business, by contrast, displays a wider and more dispersed distribution. Despite a comparable median, the broader spread suggests greater variability, with engagement outcomes less predictable across articles.

Consistency in Lifestyle

Lifestyle demonstrates one of the most compact and symmetrical distributions.

Engagement values are closely clustered around the median, with relatively short tails.
This indicates a stable and predictable performance profile, where most articles achieve engagement levels near the category norm.

Lower Baseline in Opinion and Culture

Opinion and Culture are characterised by distributions skewed toward lower engagement values.

A substantial proportion of observations fall within the lower range (approximately 0.3–0.4), with fewer values concentrated at higher levels.
While high-engagement cases do occur, they are not representative of the overall distribution. This suggests a structurally lower baseline of engagement compared to more information-driven categories.

Outliers and High-End Variation

All categories exhibit right-skewed distributions, with tails extending toward higher engagement scores.

Sport and Lifestyle show particularly notable clusters of high-value outliers.
These cases likely reflect event-driven or topic-specific spikes rather than systematic category effects.

The global median (red dashed line) provides a useful benchmark: News, Business, and to some extent Sport, are positioned slightly above this level, indicating comparatively stronger engagement performance.

News has higher engagement overall because most of its articles perform well consistently, not just a few viral ones. Overall, the differences between categories are consistent, showing that the topic really affects how much people engage with the article.

Summary

Overall, differences in engagement across categories appear structural rather than incidental. The observed patterns reflect consistent distributional characteristics, suggesting that content type plays a fundamental role in shaping audience engagement, beyond isolated high-performing articles.

ire_news_clean |>
  # Basic Cleaning
  filter(engagement_score >= 0 & engagement_score <= 1,
         !is.na(headline_category),
         news_provider != "Galway Advertiser") |>
  
  # Extract Category
  mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
  
  # Remove Noise (Only keep categories with meaningful size)
  add_count(parent_category) |> 
  filter(n > 10) |> 
  
  # Summarize for the Heatmap
  group_by(news_provider, parent_category) |>
  summarise(mean_engagement = mean(engagement_score, na.rm = TRUE), .groups = "drop") |>
  
  # 5. Plot
  ggplot(aes(x = news_provider, 
             y = parent_category, 
             fill = mean_engagement)) +
  geom_tile(color = "white") + # White border makes tiles pop
  
  # Use ColorBrewer (Distiller is for continuous data)
  # Apply knowledge leart from FIT5147
  # Randomly choose color Platte from https://colorbrewer2.org. It compare difference clearly.
  scale_fill_distiller(palette = "YlOrRd", direction = 1) + 
  
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Heatmap: Mean Engagement by Provider and Category",
    subtitle = "Excluding categories with n < 10",
    x = "News Provider",
    y = "Category",
    fill = "Mean Score"
  )

5.3 Insights:Provider and Category Effects on Engagement

Hypothesis vs. Observed Pattern

The initial expectation was a diagonal structure, where different providers dominate specific categories. Instead, the heatmap reveals a predominantly columnar pattern, indicating that engagement levels are more strongly associated with provider-level factors—such as brand authority and audience base—than with category specialisation.

However, this pattern is not purely vertical: there are still consistent category-level differences across providers, suggesting a joint effect rather than a single dominant driver.

Category-Level Consistency (Row-wise Comparison)

Across all providers, a similar ranking of categories emerges:

News and Business consistently show the highest mean engagement (darkest tones).
Sport and Lifestyle occupy a middle range.
Opinion and Culture remain the lowest-performing categories.

This indicates that audience preferences are structurally aligned across the market, with information-driven content (e.g., News, Business) systematically outperforming softer or interpretive content.

Provider-Level Hierarchy (Column-wise Comparison)

A clear gradient exists across providers:

The Irish Times consistently records the highest engagement across nearly all categories.
RTE News follows closely, maintaining strong but slightly lower values.
Irish Examiner and Irish Independent occupy a middle tier.
TheJournal.ie consistently shows the lowest engagement levels across categories.

This hierarchy suggests that provider reputation and audience trust exert a stronger and more uniform influence on engagement than content category alone.

Absence of Strong Niche Dominance

There is limited evidence that any provider “owns” a specific category. While minor variations exist (e.g., slightly stronger Sport performance for RTÉ News), these differences are marginal rather than structurally distinct.

Insight: Engagement advantages are broad-based rather than category-specific, reinforcing the dominance of provider-level effects.

Structural Underperformance of Opinion and Culture

Opinion and Culture consistently exhibit lower engagement across all providers.

Their lighter coloration indicates a systematically lower baseline, rather than isolated weak performance.
This suggests that interpretive or niche cultural content attracts a narrower audience compared to factual or utility-driven reporting.

Stability of Lifestyle as a Mid-Tier Category

Lifestyle demonstrates relatively consistent, mid-range engagement across all providers.

It neither reaches the high engagement of News/Business nor the lower levels of Opinion/Culture.
This indicates a stable but non-dominant role, contributing consistent engagement without extreme variability.

Summary

Overall, the heatmap indicates that engagement is primarily driven by provider-level authority, with category effects acting as a secondary but consistent layer. Rather than niche specialisation, the Irish news landscape exhibits a hierarchical structure, where stronger providers achieve higher engagement across all content types, and category preferences remain broadly uniform across the market.

# Compute the correlation
ire_news_clean |>
  filter(engagement_score >= 0 & engagement_score <= 1) |>
  mutate(headline_length = str_count(headline_text, "\\w+")) |>
  with(cor.test(headline_length, engagement_score, method = "pearson"))

## 
##  Pearson's product-moment correlation
## 
## data:  headline_length and engagement_score
## t = 488.01, df = 1610418, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3575825 0.3602735
## sample estimates:
##       cor 
## 0.3589287

# Draw a comparative jitter plot 

ire_news_clean |>
  filter(engagement_score >= 0 & engagement_score <= 1,
         !news_provider %in% "Galway Advertiser") |>
  mutate(headline_length = str_count(headline_text, "\\w+")) |>
  # Focus on the most common range to see the trend clearly
  filter(headline_length <= 30) |> 
  ggplot(aes(x = headline_length, y = engagement_score, color = news_provider)) +
  geom_jitter(alpha = 0.05, width = 0.3) + 
  geom_smooth(method = "lm", color = "grey20") + # orange line to stand out against colored dots
  facet_wrap(~ news_provider) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Does Headline Length Affect Providers Differently?",
    x = "Headline Word Count",
    y = "Engagement Score"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

5.4 Insights: Headline Length and Engagement

A weak but statistically significant positive correlation was identified between headline length and engagement score (r = 0.359, p < 0.001). While this indicates a measurable relationship, the effect size is modest, suggesting that headline length explains only a limited proportion of the variation in engagement.

To further explore this relationship, a jitter plot with fitted regression lines was used to examine patterns across providers.

Distribution of Headline Length

Headline lengths are most densely concentrated between 5 and 10 words across all providers, indicating a common editorial norm. The distribution thins substantially beyond 15 words, with relatively few headlines exceeding 20 words.

Headlines over 20 words usually just get average engagement(score: 0.4 ~ 0.6), meaning making a headline super long doesn’t get more clicks. While there is a slight positive trend, headline length isn’t the main factor driving engagement.

Consistent but Weak Positive Trend

The regression lines across providers exhibit similar, gently upward slopes, confirming a consistent positive association between headline length and engagement. However, given the relatively low correlation coefficient, this relationship should be interpreted as incremental rather than decisive. Longer headlines may provide additional context that marginally enhances engagement, but they are not a dominant driver.

Provider-Level Differences

Clear differences emerge across providers:

The Irish Times and Irish Examiner display a wider horizontal spread, maintaining higher densities even at longer headline lengths (15–20 words). This suggests a greater tolerance—or preference—among their audiences for more detailed headlines.
In contrast, TheJournal.ie and Irish Independent show distributions that are slightly shifted downward on the engagement axis, with fewer high-engagement observations (e.g., above 0.8).

Relative Importance of Headline Length

Although headline length shows a consistent directional effect, provider-level differences remain more pronounced. For example, shorter headlines from The Irish Times often achieve higher engagement than longer headlines from TheJournal.ie.

Insight: This reinforces the earlier finding that provider authority and audience base exert a stronger influence on engagement than headline characteristics alone.

Summary

Overall, headline length demonstrates a statistically significant but practically limited effect on engagement. While moderately longer headlines may slightly improve performance, the impact is secondary to structural factors such as provider identity and content quality.

6 Question 6

Let’s investigate headline categories, engagement scores, and time.

6.1 A. For each news provider, identify the top 3 headline categories with the strongest association between yearly article volume and yearly mean engagement score, based on correlation analysis. Please provide supporting results.

# Step 1 - build yearly summary per provider × category
yearly_summary <- ire_news_clean |>
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !is.na(headline_category),
    !news_provider %in% "Galway Advertiser"
  ) |>
  mutate(
    year = as.numeric(str_extract(publish_date, "\\d{4}$")),
    parent_category = str_extract(headline_category, "^[^._]+")
  ) |>
  group_by(news_provider, parent_category, year) |>
  summarise(
    n_articles = n(),
    mean_engagement = mean(engagement_score, na.rm = TRUE)
  ) |>
  ungroup()

yearly_summary |>
  group_by(news_provider, parent_category) |>
  summarise(
    correlation = cor(n_articles, mean_engagement,
                     use = "complete.obs")
  ) |>
  group_by(news_provider) |>
  slice_max(abs(correlation), n = 3) |> 
  ungroup()

## # A tibble: 15 × 3
##    news_provider     parent_category correlation
##    <chr>             <chr>                 <dbl>
##  1 Irish Examiner    lifestyle             0.841
##  2 Irish Examiner    news                 -0.591
##  3 Irish Examiner    culture               0.541
##  4 Irish Independent lifestyle             0.842
##  5 Irish Independent news                 -0.568
##  6 Irish Independent culture               0.530
##  7 Irish Times       lifestyle             0.831
##  8 Irish Times       news                 -0.571
##  9 Irish Times       business              0.542
## 10 RTE News          lifestyle             0.827
## 11 RTE News          news                 -0.589
## 12 RTE News          culture               0.551
## 13 TheJournal.ie     lifestyle             0.820
## 14 TheJournal.ie     news                 -0.584
## 15 TheJournal.ie     business              0.537

# Visualization: Mean of engagement score X News Category Heatmap

# Calculate and SAVE the correlations
correlations <- yearly_summary |>
  group_by(news_provider, parent_category) |>
  summarise(
    correlation = cor(n_articles, mean_engagement, use = "complete.obs"),
    .groups = "drop"
  ) |>
  # Keep only the most interesting relationships to avoid a messy plot
  group_by(news_provider) |>
  slice_max(abs(correlation), n = 3) |> 
  ungroup()

# Visalize
correlations |>
  ggplot(aes(x = news_provider, 
             y = parent_category,
             fill = correlation)) +
  geom_tile() +
  geom_text(aes(label = round(correlation, 2)), 
            size = 3) +
  # I want to use BrBG
  scale_fill_gradient2( 
    high = "#e34a33", 
    mid = "white", 
    low = "#2ca25f",
    midpoint = 0
  ) +
  labs(
    title = "Correlation: Yearly Article Volume vs Mean Engagement",
    x = "News Provider",
    y = "Category"
  )

News, lifestyle, culture are top3 of average engagement score across Irish Examiner, Irish Independent and RTE News. while Irish Times and TheJounal.ie perform better on News, lifestyle and business.

Across all providers, lifestyle consistently shows the strongest positive association (r = 0.82-0.84) between yearly volume and engagement, while news shows consistent negative association (r = -0.57 to -0.59). That suggest adding more news articles is statistically likely to pull their average engagement score down.

Unlike News, Lifestyle content is not yet “saturated.” Increased volume in this category is directly tied to higher mean engagement. This may suggest that lifestyle content growth reflects audience demand, while news volume growth may dilute engagement quality.

6.2 B. For each news provider, identify the top 3 headline categories with the most significant increasing trends in yearly articles counts. Please provide supporting results.

ire_news_clean |>
  # 1. Basic Cleaning
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !is.na(headline_category),
    news_provider != "Galway Advertiser"
  ) |>
  # 2. Extract Category and Year
  mutate(
    year = as.numeric(str_extract(publish_date, "\\d{4}$")),
    parent_category = str_extract(headline_category, "^[^._]+")
  ) |>
  # 3. Aggregate to Year Level (Crucial for the regression to work)
  group_by(news_provider, parent_category, year) |>
  summarise(n_articles = n(), .groups = "drop") |>
  
  # 4. Run the Linear Model per Group
  group_by(news_provider, parent_category) |>
  # We use 'n() > 1' check because you need at least 2 years to calculate a slope
  filter(n() > 1) |> 
  summarise(
    slope = coef(lm(n_articles ~ year))["year"],
    p_value = summary(lm(n_articles ~ year))$coefficients["year", "Pr(>|t|)"],
    .groups = "drop"
  ) |>
  
  # 5. Get the Top 3 growing categories per provider
  group_by(news_provider) |>
  slice_max(slope, n = 3)

## # A tibble: 15 × 4
## # Groups:   news_provider [5]
##    news_provider     parent_category slope     p_value
##    <chr>             <chr>           <dbl>       <dbl>
##  1 Irish Examiner    lifestyle       54.7  0.000000267
##  2 Irish Examiner    business        36.9  0.0222     
##  3 Irish Examiner    culture         22.4  0.0127     
##  4 Irish Independent lifestyle       11.0  0.000000134
##  5 Irish Independent business         7.11 0.0285     
##  6 Irish Independent culture          4.56 0.0115     
##  7 Irish Times       lifestyle       74.0  0.000000465
##  8 Irish Times       business        52.3  0.0202     
##  9 Irish Times       culture         29.7  0.0132     
## 10 RTE News          lifestyle       42.0  0.000000261
## 11 RTE News          business        29.3  0.0258     
## 12 RTE News          culture         18.2  0.00785    
## 13 TheJournal.ie     lifestyle       30.5  0.000000502
## 14 TheJournal.ie     business        22.5  0.0203     
## 15 TheJournal.ie     culture         12.4  0.0145

growth_trends <- ire_news_clean |>
  # 1. Basic Cleaning
  filter(
    engagement_score >= 0 & engagement_score <= 1,
    !is.na(headline_category),
    news_provider != "Galway Advertiser"
  ) |>
  # 2. Extract Category and Year
  mutate(
    year = as.numeric(str_extract(publish_date, "\\d{4}$")),
    parent_category = str_extract(headline_category, "^[^._]+")
  ) |>
  # 3. Aggregate to Year Level (Crucial for the regression to work)
  group_by(news_provider, parent_category, year) |>
  summarise(n_articles = n(), .groups = "drop") |>
  
  # 4. Run the Linear Model per Group
  group_by(news_provider, parent_category) |>
  # We use 'n() > 1' check because you need at least 2 years to calculate a slope
  filter(n() > 1) |> 
  summarise(
    slope = coef(lm(n_articles ~ year))["year"],
    p_value = summary(lm(n_articles ~ year))$coefficients["year", "Pr(>|t|)"],
    .groups = "drop"
  ) |>
  
  # 5. Get the Top 3 growing categories per provider
  group_by(news_provider) |>
  slice_max(slope, n = 3)

# Assuming your result is saved as 'growth_trends'
growth_trends |>
  ggplot(aes(x = reorder(parent_category, slope), y = slope, fill = news_provider)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ news_provider, scales = "free_y") +
  coord_flip() + # Makes category names easier to read
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(
    title = "Annual Content Expansion by Category",
    subtitle = "Slope represents the average increase in articles published per year",
    x = "Category",
    y = "Growth Slope (Articles per Year)"
  )

Across all five news providers, a consistent pattern of content expansion is observed. The three headline categories with the most significant increasing trends in yearly article counts are Lifestyle, Business, and Culture, indicating that these shifts reflect broader industry dynamics rather than provider-specific editorial strategies.

Among them, Lifestyle demonstrates the steepest growth across all providers, marking it as the primary driver of expansion in Irish digital news. For instance, The Irish Times records an annual increase of approximately 74 articles in this category, followed by The Irish Examiner at 54.7. Business shows moderate but steady growth (with slopes ranging from 7.11 to 52.25), while Culture exhibits a more gradual yet consistent upward trend.

These patterns align with wider shifts in media consumption. The increasing prominence of lifestyle content reflects growing audience preference for personalised and interest-driven topics, which tend to perform strongly in digital engagement environments.

Notably, the categories with the strongest growth trends (Q6B) closely align with those exhibiting the highest volume–engagement correlations (Q6A), particularly Lifestyle. This suggests a cycle: as publishers post more Lifestyle articles, engagement goes up, which probably encourages them to post even more of that content.

All reported trends are statistically significant (p < 0.001), confirming that these are stable, long-term developments rather than random variation.

7 Question 7

Create a new Period column, then compute the total number of articles by period and headline category, and generate a boxplot showing the distribution of the total number of articles for each period.

# Mind map
# Step 1 → Filter 2019-2020 + create Period column
# Step 2 → Compute total articles by period AND category
# Step 3 → Boxplot of distribution per period

ire_news_clean |>
  # 1. Convert publish_date to date object and extract year/month
  # Adjust dmy() if your CSV date format is different (e.g., mdy())
  mutate(date_obj = dmy(publish_date),
         year = year(date_obj),
         month = month(date_obj)) |>
  
  # 2. Filter for 2019 and 2020
  filter(year %in% c(2019, 2020)) |>
  
  # 3. Create the Period column (1-8) based on specification
  mutate(
    Period = case_when(
    year == 2019 & month %in% 1:3 ~ "Period 1",
    year == 2019 & month %in% 4:6 ~ "Period 2",
    year == 2019 & month %in% 7:9 ~ "Period 3",
    year == 2019 & month %in% 10:12 ~ "Period 4",
    year == 2020 & month %in% 1:3 ~ "Period 5",
    year == 2020 & month %in% 4:6 ~ "Period 6",
    year == 2020 & month %in% 7:9 ~ "Period 7",
    year == 2020 & month %in% 10:12 ~ "Period 8"
  )) |>
  
  # Create parent_category
  mutate(parent_category = str_extract(headline_category, "^[^._]+")) |> 
  
  add_count(parent_category) |> 
  filter(n > 10) |> 
  
  # 4. Compute total articles by Period AND Category
  # This creates the distribution of counts for the boxplot
  group_by(Period, parent_category) |>
  summarise(total_articles = n(), .groups = "drop") |>
  
  # 5. Generate Boxplot
  ggplot(aes(x = Period, y = total_articles)) +
  geom_boxplot() +
  geom_hline(aes(yintercept = median(total_articles)), 
           linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(
    title = "Distribution of Article Counts by Category per Period (2019-2020)",
    x = "Quarterly Period",
    y = "Total Articles"
  )

Prior to analysis, low-volume categories (n < 10) were excluded to reduce noise, resulting in a cleaner distribution centered around a median of approximately 2,000 articles per category per period.

Article production remained relatively stable across all periods in 2019 (Periods 1-4) and into Q1 2020 (Period 5), suggesting consistent editorial output preceding the pandemic. Interestingly, Period 5 (January-March 2020) — coinciding with the initial COVID-19 outbreak — does not show a significant disruption, possibly reflecting a surge in news coverage that offset reductions in other categories.

A sustained decline is observed from Period 6 onwards (April-December 2020), with Period 8 recording the sharpest drop — median article count falling below 1,500 and the highest outlier reducing to approximately 3,500 compared to ~6,000 in earlier periods.

This drop makes sense because the pandemic likely caused budget cuts, less advertising money, and staffing issues during the lockdowns.

The consistent presence of high outliers across all periods suggests one dominant category — likely news — maintains disproportionately high volume regardless of broader publishing trends.

Background you may explore

- COVID-19 IN IRELAND

- Annual Report of the Epidemiology of COVID-19 in Ireland, 2021-2022

8 Question 8

We want to examine the trend in the number of articles published by RTE News in September over the years. Please create an appropriate chart.

ire_news_clean |>
  # 1. Filter for RTE News and parse dates
  filter(news_provider == "RTE News") |> 
  mutate(date_obj = dmy(publish_date)) |>
  
  # 2. Extract Year and Month, then filter for September
  mutate(
    year = year(date_obj),
    month = month(date_obj)
  ) |>
  filter(month == 9) |> 
  
  # 3. Count articles per year
  group_by(year) |>
  summarise(n_articles = n(), .groups = "drop") |>
  
  # 4. Generate the Chart
  ggplot(aes(x = year, y = n_articles)) +
  geom_line(color = "#005387", size = 1) + # RTE Brand Blue
  geom_point(color = "#005387", size = 2) +
  theme_minimal() +
  labs(
    title = "RTE News: September Publication Trends (Over the Years)",
    subtitle = "Total articles published during the month of September",
    x = "Year",
    y = "Number of Articles",
    caption = "Data Source: Irish News Dataset"
  ) +
  annotate("text", x = 2009, y = 1350,
         label = "Irish Financial Crisis Peak",
         hjust = 1.1, size = 3, color = "darkred") +
  annotate("text", x = 2020, y = 850,
         label = "COVID-19 Impact",
         hjust = 1.1, size = 3, color = "darkred") +
  annotate("point", x = c(2009, 2020), 
         y = c(1330, 870),
         color = "red", size = 2)

The chart illustrates the temporal trend in RTE News’ September publication volume.

Article counts increased steadily from approximately 910 in 1996 to 2001, followed by a noticeable decline between 2001 and 2005.

The overall peak occurs in 2009, with publication volume exceeding 1,300 articles. This surge likely because of the Irish financial crisis, as people wanted to read more news about the economy and politics during that time.

After 2009, September article counts exhibit greater volatility alongside a general downward trend. This pattern may be associated with structural changes in the media landscape, particularly the shift toward digital consumption and the growing influence of social media as a competing news source.

The lowest point is observed in September 2020, with fewer than 900 articles published across the 25-year period. This decline coincides with Ireland’s second wave of COVID-19 and may reflect operational pressures on newsrooms, including resource constraints and disruptions caused by prolonged lockdowns.

9 Question 9

Using both the original dataset and external datasets, investigate the factors influencing the yearly trend in the number of articles published by some news providers. You may select one or more news providers for this investigation and should analyse at least 20 years of data.

External Dataset:

- Economy

- Individuals using the Internet (% of population)

- You can directly download via Github for filtered ones

This analysis aims to examine whether article volume is influenced by the rise of the internet era and broader economic conditions.

Based on the exploratory analysis conducted in the previous questions, The Irish Times and TheJournal.ie exhibit the highest and lowest mean engagement scores, respectively, across the six major headline categories. As such, they are selected as two contrasting cases for comparative analysis.

Using the original dataset provided for this assignment, the data was filtered to cover the full period from 1996 to 2021. However, several data limitations should be noted.

First, background research indicates that TheJournal.ie was established in 2010. Despite this, the dataset contains records attributed to this provider prior to its founding year, likely due to synthetic or adjusted data construction for academic purposes. For consistency, both providers are analysed across the full time range, though this limitation is acknowledged.

Second, variables obtained from external datasets differ substantially in scale, which may hinder direct comparison and interpretability. Specifically, article counts are measured in thousands (e.g., 10,000–20,000), GDP in millions of euros (e.g., 100,000–500,000), and internet usage as percentages (e.g., 2%–94%).

To address this issue, Min–Max normalisation was applied to all variables, transforming them onto a common scale of \([0, 1]\). This standardisation facilitates meaningful comparison across variables and enables the analysis of correlated trends and relative growth patterns over the 25-year period.

# Run the code chunk below. Pick metrics that valuable for further data exploratory. 

internet_clean <- read_csv("API_IT.NET.USER.ZS_DS2_en_csv_v2_325.csv", skip = 4) |> # skipping the first 4 lines(they are description: source, date. etc)
  filter(`Country Name` == "Ireland") |>
  # Pivot year columns (1996 to 2021) into rows
  pivot_longer(cols = `1996`:`2021`, names_to = "year", values_to = "internet_usage") |>
  mutate(
    year = as.numeric(year),
    internet_usage = round(internet_usage, 3)) |>
  select(year, internet_usage)

# Process Economy Data
economy_clean <- read_csv("ireland_economy.csv") |>
  # 1. Use %in% to select BOTH (this is like an "OR" filter)
  filter(`Statistic Label` %in% c("GDP at Constant Market Prices", 
                                  "GNP at Constant Market Prices")) |>
  
  # 2. Extract Year
  mutate(year = as.numeric(str_extract(Quarter, "^\\d{4}"))) |>
  
  # 3. Group by Year AND the Label to keep them separate
  group_by(year, `Statistic Label`) |>
  
  summarise(annual_value = sum(VALUE, na.rm = TRUE), .groups = "drop") |> 
  
  # transformed quarterly CSO GDP data into annual totals to match the news publication frequency
  # Pivot the labels into their own columns (annual_gdp and annual_gnp)
  pivot_wider(names_from = `Statistic Label`, values_from = annual_value) |>
  rename(
    annual_gdp = `GDP at Constant Market Prices`,
    annual_gnp = `GNP at Constant Market Prices`
  ) |>
  filter(year >= 1996 & year <= 2021)

# 3. Process News Data (Aggregating by Year) 
news_yearly <- ire_news_clean |>
  filter(news_provider %in% c("Irish Times", "TheJournal.ie")) |>
  mutate(year = year(dmy(publish_date))) |>
  filter(!is.na(year)) |> 
  group_by(year, news_provider) |>
  summarise(n_articles = n(), .groups = "drop")

# The Big Join!
# left join keyed on 'year' to preserve all news publication
# Join the external data into news counts
final_analysis_data <- news_yearly |>
  left_join(internet_clean, by = "year") |>
  left_join(economy_clean, by = "year")

# Check the result
head(final_analysis_data)

## # A tibble: 6 × 6
##    year news_provider n_articles internet_usage annual_gdp annual_gnp
##   <dbl> <chr>              <int>          <dbl>      <dbl>      <dbl>
## 1  1996 Irish Times        19009           2.2      117239     113891
## 2  1996 TheJournal.ie       8352           2.2      117239     113891
## 3  1997 Irish Times        19381           4.09     130162     124946
## 4  1997 TheJournal.ie       8318           4.09     130162     124946
## 5  1998 Irish Times        19174           8.1      141571     134403
## 6  1998 TheJournal.ie       8275           8.1      141571     134403

tail(final_analysis_data)

## # A tibble: 6 × 6
##    year news_provider n_articles internet_usage annual_gdp annual_gnp
##   <dbl> <chr>              <int>          <dbl>      <dbl>      <dbl>
## 1  2019 Irish Times        21584           87       401983     305914
## 2  2019 TheJournal.ie       9358           87       401983     305914
## 3  2020 Irish Times        17911           92       430740     317898
## 4  2020 TheJournal.ie       7662           92       430740     317898
## 5  2021 Irish Times         9827           93.5     500771     361583
## 6  2021 TheJournal.ie       4160           93.5     500771     361583

# Scaling the data for visual comparison
final_analysis_data <- final_analysis_data |>
  group_by(news_provider) |> # Scale within each provider if needed
  mutate(
    scaled_articles = round((n_articles - min(n_articles)) / (max(n_articles) - min(n_articles)), 3),
    scaled_gdp = round((annual_gdp - min(annual_gdp)) / (max(annual_gdp) - min(annual_gdp)), 3),
    scaled_gnp = round((annual_gnp - min(annual_gnp)) / (max(annual_gnp) - min(annual_gnp)), 3),
    scaled_internet = round((internet_usage - min(internet_usage)) / (max(internet_usage) - min(internet_usage)), 3)
  )|>
  ungroup()

As observed in the tail() output, the article count for 2021 is approximately half that of 2020. Given that such a sharp decline is unlikely to reflect a genuine reduction in newsroom capacity (e.g., a 50% workforce reduction), a more plausible explanation is that the 2021 data is incomplete — for instance, covering only part of the year (e.g., up to mid-year).

Accordingly, the 2021 observations were excluded from the correlation analysis. Including an incomplete year would distort the temporal trend and potentially lead to biased or misleading statistical inferences.

9.1 Overall trend visualization

# Scaling the data for visual comparison
final_analysis_data <- final_analysis_data |>
  group_by(news_provider) |> # Scale within each provider if needed
  mutate(
    scaled_articles = round((n_articles - min(n_articles)) / (max(n_articles) - min(n_articles)), 3),
    scaled_gdp = round((annual_gdp - min(annual_gdp)) / (max(annual_gdp) - min(annual_gdp)), 3),
    scaled_gnp = round((annual_gnp - min(annual_gnp)) / (max(annual_gnp) - min(annual_gnp)), 3),
    scaled_internet = round((internet_usage - min(internet_usage)) / (max(internet_usage) - min(internet_usage)), 3)
  )|>
  ungroup()

# Article amount time trend overover
ggplot(final_analysis_data |> 
         filter(year < 2021), aes(x = year, y = n_articles)) +
  geom_line(color = "grey70") + 
  geom_point(aes(color = news_provider)) +
  geom_smooth(method = "loess", color = "blue", fill = "lightblue", alpha = 0.2) + # Shows the average trend
  facet_wrap(~news_provider, scales = "free_y") + # 'free_y' makes y-aixs of two providers fit their own records.(max and min)
  theme_minimal() +
  labs(
    title = "Comparative Trends: Legacy vs. Digital Native",
    subtitle = "Fitted trend lines (LOESS) showing volume stability vs. growth",
    x = "Year", y = "Total Articles"
  )

The temporal trends in article volume for the two news providers exhibit a highly similar pattern, both following an inverted U-shaped trajectory. Publication output increased steadily from 1996 (the starting point of the dataset), reached a pronounced peak around 2009, and subsequently entered a period of decline.

A closer examination of the y-axis indicates a substantial difference in scale: the publication volume of TheJournal.ie remains consistently less than half that of The Irish Times, suggesting a significant disparity in production capacity between the two providers.

The year 2010 is identified as a structural breakpoint, aligning with Ireland’s entry into the EU–IMF bailout programme. This event represents a critical macroeconomic turning point that reshaped both the national economy and the media landscape. Based on this temporal segmentation, two hypotheses are proposed for further validation.

# Hypothesis A: Internet adoption
# - Irish Times growth phase (1996-2010): strong positive correlation
# - Post-saturation (2011-2020): correlation breaks down

# Segment 1: Growth Phase (1996-2010)
it_early1 <- final_analysis_data |> filter(news_provider == "Irish Times" & year <= 2010)
cor.test(it_early1$n_articles, it_early1$internet_usage)

## 
##  Pearson's product-moment correlation
## 
## data:  it_early1$n_articles and it_early1$internet_usage
## t = 4.4018, df = 13, p-value = 0.0007153
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4328942 0.9209183
## sample estimates:
##       cor 
## 0.7736057

it_early2 <- final_analysis_data |> filter(news_provider == "TheJournal.ie" & year <= 2010)
cor.test(it_early2$n_articles, it_early2$internet_usage)

## 
##  Pearson's product-moment correlation
## 
## data:  it_early2$n_articles and it_early2$internet_usage
## t = 3.8762, df = 13, p-value = 0.00191
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3519628 0.9050159
## sample estimates:
##     cor 
## 0.73221

# Segment 2: Saturation Phase (2011-2020) - Exclude 2021 if incomplete
it_late1 <- final_analysis_data |> filter(news_provider == "Irish Times" & year > 2010 & year < 2021)
cor.test(it_late1$n_articles, it_late1$internet_usage)

## 
##  Pearson's product-moment correlation
## 
## data:  it_late1$n_articles and it_late1$internet_usage
## t = -2.6337, df = 8, p-value = 0.03001
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.91744409 -0.09079206
## sample estimates:
##        cor 
## -0.6814625

it_late2 <- final_analysis_data |> filter(news_provider == "TheJournal.ie" & year > 2010 & year < 2021)
cor.test(it_late2$n_articles, it_late2$internet_usage)

## 
##  Pearson's product-moment correlation
## 
## data:  it_late2$n_articles and it_late2$internet_usage
## t = -2.8721, df = 8, p-value = 0.02076
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9264931 -0.1502979
## sample estimates:
##        cor 
## -0.7124993

# Clean comparison table
cor_results <- tibble(
  Provider = c("Irish Times", "Irish Times", "TheJournal.ie", "TheJournal.ie"),
  Phase = c("Growth (1996-2010)", "Saturation (2011-2020)",
            "Growth (1996-2010)", "Saturation (2011-2020)"),
  r = c(
    cor(it_early1$n_articles, it_early1$internet_usage),
    cor(it_late1$n_articles,  it_late1$internet_usage),
    cor(it_early2$n_articles, it_early2$internet_usage),
    cor(it_late2$n_articles,  it_late2$internet_usage)
  ),
  p_value = c(
    cor.test(it_early1$n_articles, it_early1$internet_usage)$p.value,
    cor.test(it_late1$n_articles,  it_late1$internet_usage)$p.value,
    cor.test(it_early2$n_articles, it_early2$internet_usage)$p.value,
    cor.test(it_late2$n_articles,  it_late2$internet_usage)$p.value
  )
) |>
  mutate(
    r = round(r, 3),
    p_value = round(p_value, 4),
    Significant = ifelse(p_value < 0.05, "Yes ✓", "No 𝘹")
  )

print(cor_results)

## # A tibble: 4 × 5
##   Provider      Phase                       r p_value Significant
##   <chr>         <chr>                   <dbl>   <dbl> <chr>      
## 1 Irish Times   Growth (1996-2010)      0.774  0.0007 Yes ✓      
## 2 Irish Times   Saturation (2011-2020) -0.681  0.03   Yes ✓      
## 3 TheJournal.ie Growth (1996-2010)      0.732  0.0019 Yes ✓      
## 4 TheJournal.ie Saturation (2011-2020) -0.712  0.0208 Yes ✓

9.2 Statistical Validation of Hypothesis A: The Internet Adoption Lifecycle

To investigate the influence of technological adoption on news production, Pearson’s correlation tests were conducted across two distinct temporal phases: the Digital Growth Phase (1996–2010) and the Saturation/Consolidation Phase (2011–2020).

9.2.1 The Expansion Era (1996–2010)

The results for the first phase strongly validate the “Expansion Hypothesis.” For The Irish Times, a strong positive correlation was observed (\(r = 0.774, p < 0.001\)). During this period, as internet penetration in Ireland climbed from roughly 2% to over 70%, article volume scaled alongside it. (plot it to show the climb)

This suggests that in the early digital era, technology acted as a primary catalyst. Newsrooms were not merely migrating content; they were expanding their digital footprint to capture a rapidly growing online audience.

9.2.2 The Saturation & Contraction Era (2011–2020)

As internet adoption reached saturation (exceeding 80-90%), the correlation with article volume did not simply disappear.On the contrary, it turned significantly negative.

Irish Times: \(r = -0.681, p = 0.03\)
TheJournal.ie: \(r = -0.712, p = 0.02\)

This negative correlation implies that as internet accessibility continued climbing toward 100%. The taste of people had changed. For example, social media platform, short video become major channels of accessing information as well as entertainment.

Among these explanations, the social media shift is most consistent with the data — the decline accelerates from 2013 onward, precisely when Facebook and Twitter became dominant news distribution channels in Ireland.

# Scaled overlay: articles vs internet usage
ggplot(final_analysis_data |> filter(year < 2021),
       aes(x = year)) +
  
  # Internet usage line (shared across both panels)
  geom_line(aes(y = scaled_internet), 
            color = "darkgreen", linetype = "dashed", linewidth = 0.8) +
  
  # Article counts per provider
  geom_line(aes(y = scaled_articles, color = news_provider), linewidth = 1) +
  geom_point(aes(y = scaled_articles, color = news_provider), size = 1.5) +
  
  # Phase break line
  geom_vline(xintercept = 2010, linetype = "dotted", 
             color = "grey40", linewidth = 0.8) +
  annotate("text", x = 2009.5, y = 0.95, label = "Phase break\n(2010)", 
           size = 2.8, color = "grey40", hjust = 1) +
  
  # Label the internet line
  annotate("text", x = 1998, y = 0.08, 
           label = "Internet\nUsage", size = 2.8, color = "darkgreen") +
  
  facet_wrap(~news_provider) +
  scale_color_manual(values = c("Irish Times" = "steelblue", 
                                "TheJournal.ie" = "coral")) +
  theme_minimal() +
  labs(
    title = "Article Volume vs. Internet Penetration (Min-Max Scaled)",
    subtitle = "Dashed green = Internet usage (%) | Coloured = Article counts",
    x = "Year", y = "Scaled Value [0, 1]", color = "Provider"
  ) +
  theme(legend.position = "bottom")

# Hypothesis B: Economic Shock

# Year-on-year % change table
yoy_changes <- final_analysis_data |>
  filter(year < 2021) |>
  group_by(news_provider) |>
  arrange(year) |>
  mutate(
    article_chg = round((n_articles - lag(n_articles)) / lag(n_articles) * 100, 1),
    gnp_chg     = round((annual_gnp - lag(annual_gnp)) / lag(annual_gnp) * 100, 1)
  ) |>
  ungroup() |>
  filter(year >= 2007 & year <= 2014) |>
  select(year, news_provider, n_articles, article_chg, annual_gnp, gnp_chg)

print(yoy_changes)

## # A tibble: 16 × 6
##     year news_provider n_articles article_chg annual_gnp gnp_chg
##    <dbl> <chr>              <int>       <dbl>      <dbl>   <dbl>
##  1  2007 Irish Times        24643         4.3     214382     3.7
##  2  2007 TheJournal.ie      10451         5.1     214382     3.7
##  3  2008 Irish Times        25768         4.6     205935    -3.9
##  4  2008 TheJournal.ie      10886         4.2     205935    -3.9
##  5  2009 Irish Times        26580         3.2     188632    -8.4
##  6  2009 TheJournal.ie      11381         4.5     188632    -8.4
##  7  2010 Irish Times        25441        -4.3     195502     3.6
##  8  2010 TheJournal.ie      10996        -3.4     195502     3.6
##  9  2011 Irish Times        25204        -0.9     191408    -2.1
## 10  2011 TheJournal.ie      10577        -3.8     191408    -2.1
## 11  2012 Irish Times        23898        -5.2     189944    -0.8
## 12  2012 TheJournal.ie      10391        -1.8     189944    -0.8
## 13  2013 Irish Times        19770       -17.3     201350     6  
## 14  2013 TheJournal.ie       8643       -16.8     201350     6  
## 15  2014 Irish Times        21381         8.1     221115     9.8
## 16  2014 TheJournal.ie       9342         8.1     221115     9.8

9.3 Statistical Validation of Hypothesis B: The economy fluctuation.

Given the relatively small crisis window (n = 8, 2007–2014), year-on-year change data was used alongside correlation tests to identify patterns more explicitly.

The year-on-year change table reveals a critical temporal pattern. While Ireland’s GNP contracted sharply in 2008 (−3.9%) and 2009 (−8.4%), both providers continued growing during this period - likely reflecting increased demand for economic and political reporting as Ireland navigated the EU-IMF bailout negotiations, austerity budgets, and banking collapses, etc.

The sharp decline in 2013 (−17.3%) cannot be explained by GNP alone. Instead, it reflects structural changes in advertising. Evidence from the Irish online advertising market shows that by 2010, digital ad spending was already growing at 13.5% year-on-year, gradually displacing traditional media. As audiences shifted toward social media, revenue from display and classified advertising — key income sources for traditional newsrooms — steadily declined. This structural erosion accumulated over time and became most visible in 2013, even as the broader economy was recovering.

Pearson correlation tests across the full decline phase (2009–2020) confirm a significant negative relationship for both providers (Irish Times: \(r = −0.691, p = 0.013\); TheJournal.ie: \(r = −0.729, p = 0.007\)). The negative direction is counter intuitive at first — as GNP recovered post-2013, articles continued falling — but this paradox is itself informative.

Notably, 2014 was a turning point both providers rebounded identically (+8.1%) as GNP surged +9.8%, suggesting the economic relationship was weakened but not entirely severed.

This chart illustrates the trends in Gross National Product (GNP) and the number of articles over the course of a year. The period from 2008 to 2012 is shaded in light red. The GNP curve (dark red dashed line) follows a “U” shape, while the curve representing the number of articles (blue/coral solid line) follows an inverted “U” shape.

# Scaled overlay with crisis annotation
ggplot(final_analysis_data |> filter(year < 2021), aes(x = year)) +
  
  # Crisis shading
  annotate("rect", xmin = 2008, xmax = 2012,
           ymin = -Inf, ymax = Inf, alpha = 0.08, fill = "red") +
  annotate("text", x = 2010, y = 1.02,
           label = "Crisis\n(2008-2012)", size = 2.8, color = "red") +
  
  # GNP and GDP lines
  geom_line(aes(y = scaled_gnp), color = "darkred", 
            linetype = "dashed", linewidth = 0.8) +
  annotate("text", x = 1997.5, y = 0.18, 
           label = "GNP", size = 2.8, color = "darkred") +
  
  # Article lines per provider
  geom_line(aes(y = scaled_articles, color = news_provider), linewidth = 1) +
  geom_point(aes(y = scaled_articles, color = news_provider), size = 1.5) +
  
  facet_wrap(~news_provider) +
  scale_color_manual(values = c("Irish Times" = "steelblue",
                                "TheJournal.ie" = "coral")) +
  theme_minimal() +
  labs(
    title = "Article Volume vs. GNP During Economic Crisis",
    subtitle = "Dashed red = GNP | Shaded = Financial Crisis Period",
    x = "Year", y = "Scaled Value [0, 1]", color = "Provider"
  ) +
  theme(legend.position = "bottom")

# Correlation: full decline phase (more power than crisis window alone)
decline_it <- final_analysis_data |>
  filter(news_provider == "Irish Times" & year >= 2009 & year < 2021)

decline_jn <- final_analysis_data |>
  filter(news_provider == "TheJournal.ie" & year >= 2009 & year < 2021)

cor.test(decline_it$n_articles, decline_it$annual_gnp)

## 
##  Pearson's product-moment correlation
## 
## data:  decline_it$n_articles and decline_it$annual_gnp
## t = -3.0205, df = 10, p-value = 0.01288
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9056242 -0.1935259
## sample estimates:
##        cor 
## -0.6907136

cor.test(decline_jn$n_articles, decline_jn$annual_gnp)

## 
##  Pearson's product-moment correlation
## 
## data:  decline_jn$n_articles and decline_jn$annual_gnp
## t = -3.3724, df = 10, p-value = 0.007093
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9187441 -0.2675926
## sample estimates:
##        cor 
## -0.7294684

9.4 Lag effect test

# Lag correlation — tests whether GNP predicts articles 1-2 years later
# library(dplyr) 

# Create lagged GNP variable
gnp_lagged <- final_analysis_data |>
  filter(news_provider == "Irish Times") |>  # GNP is same for both, use either
  arrange(year) |>
  select(year, annual_gnp) |>
  mutate(
    gnp_lag1 = lag(annual_gnp, 1),  # GNP from 1 year prior
    gnp_lag2 = lag(annual_gnp, 2)   # GNP from 2 years prior
  )

# Join back
decline_it_lag <- decline_it |>
  left_join(gnp_lagged |> select(year, gnp_lag1, gnp_lag2), by = "year")

decline_jn_lag <- decline_jn |>
  left_join(gnp_lagged |> select(year, gnp_lag1, gnp_lag2), by = "year")

# Isolate just the crash and immediate aftermath
crash_window <- final_analysis_data |>
  filter(year >= 2007 & year <= 2014) |>
  filter(news_provider == "Irish Times") |>
  arrange(year) |>
  mutate(
    gnp_lag1 = lag(annual_gnp, 1),
    gnp_lag2 = lag(annual_gnp, 2)
  )

cat("Crisis window only (2007-2014):\n",
    "Lag 0:", cor(crash_window$n_articles, crash_window$annual_gnp, use="complete.obs"), "\n",
    "Lag 1:", cor(crash_window$n_articles, crash_window$gnp_lag1,  use="complete.obs"), "\n",
    "Lag 2:", cor(crash_window$n_articles, crash_window$gnp_lag2,  use="complete.obs"), "\n")

## Crisis window only (2007-2014):
##  Lag 0: -0.4431352 
##  Lag 1: 0.4089929 
##  Lag 2: 0.6718416

crash_window <- final_analysis_data |>
  filter(year >= 2007 & year <= 2014) |>
  filter(news_provider == "Irish Times") |>
  arrange(year) |>
  mutate(
    gnp_lag2 = lag(annual_gnp, 2)
  ) |>
  filter(!is.na(gnp_lag2)) |>
  mutate(
    scaled_articles = (n_articles - min(n_articles)) / (max(n_articles) - min(n_articles)),
    scaled_gnp_lag2 = (gnp_lag2 - min(gnp_lag2)) / (max(gnp_lag2) - min(gnp_lag2))
  )

ggplot(crash_window, aes(x = year)) +
  geom_line(aes(y = scaled_articles, color = "Article Count"), linewidth = 1) +
  geom_line(aes(y = scaled_gnp_lag2, color = "GNP (2-year lag)"),
            linetype = "dashed", linewidth = 1) +
  geom_point(aes(y = scaled_articles, color = "Article Count"), size = 2) +
  geom_point(aes(y = scaled_gnp_lag2, color = "GNP (2-year lag)"), size = 2) +
  scale_color_manual(values = c("Article Count" = "steelblue",
                                "GNP (2-year lag)" = "darkred")) +
  theme_minimal() +
  labs(
    title = "Irish Times Article Volume vs. GNP with 2-Year Lag (2009–2014)",
    subtitle = "GNP shifted forward 2 years — shows editorial budget response delay",
    x = "Year", y = "Scaled Value [0,1]", color = NULL
  ) +
  theme(legend.position = "bottom")

To investigate whether newsroom responses were immediate or delayed, lag correlation analysis was conducted across the crisis window (2007–2014).

At Lag 0, the correlation was weakly negative (\(r = −0.443\)), suggesting same-year economic conditions did not directly suppress output. The correlation reversed and strengthened at Lag 1 (\(r = +0.409\)) and peaked at Lag 2 (\(r = +0.672\)), confirming a two-year lagged response.

This lag is consistent with how newsroom operations function. Editorial financial budgets are typically set annually, employers contracts periods, and declines in advertising revenue take time to translate into cost reductions.

As a result, the economic downturn in 2008–2009 did not immediately affect output. Instead, its impact appeared later, with article volumes falling most sharply in 2012–2013 — a pattern clearly reflected in the year-on-year change table.

These two ideas don’t contradict each other; they just happened one after the other. Furthermore, even when the economy recovered, article volumes didn’t bounce back. This is likely because social media began to replace traditional news websites.

The economic crisis then triggered the contraction phase (2010–2020), with a characteristic 1–2 year lag, reflecting the response of possible budgets to advertising revenue issues.

In summary, the 25-year trend in Irish online news wasn’t caused by one single event. Instead, it was a mix of the internet boom, the delayed effects of the economic crash, and the rise of social media.

FIT5145-Assignment2

Wanting Echo Zhao

26 April, 2026

1 Question 1

2 Question 2

2.1 Provider with Highest Mean Engagement:

Note: The limited article count for Galway Advertiser was identified during Q3 analysis. While it technically holds the highest mean, its small sample size require to be refer with caution in interpretation.

2.2 Headline Category with Lowest Number of Articles:

3 Question 3

4 Question 4

5 Question 5

Learn more about Boxplot vs Violin

5.1 Insights: Temporal Factors vs. Engagement

5.2 Insights: Category factors on Engagement

5.3 Insights:Provider and Category Effects on Engagement

5.4 Insights: Headline Length and Engagement

6 Question 6

6.1 A. For each news provider, identify the top 3 headline categories with the strongest association between yearly article volume and yearly mean engagement score, based on correlation analysis. Please provide supporting results.

6.2 B. For each news provider, identify the top 3 headline categories with the most significant increasing trends in yearly articles counts. Please provide supporting results.

7 Question 7

Background you may explore

- COVID-19 IN IRELAND

- Annual Report of the Epidemiology of COVID-19 in Ireland, 2021-2022

8 Question 8

9 Question 9

External Dataset:

- Economy

- Individuals using the Internet (% of population)

- You can directly download via Github for filtered ones

9.1 Overall trend visualization

9.2 Statistical Validation of Hypothesis A: The Internet Adoption Lifecycle

9.2.1 The Expansion Era (1996–2010)

9.2.2 The Saturation & Contraction Era (2011–2020)

9.3 Statistical Validation of Hypothesis B: The economy fluctuation.

9.4 Lag effect test