Explore the structure and quality of the original dataset.
# Load the required libraries
library(tidyverse)
library(lubridate)
library(dplyr)
library(ggplot2)
library(visdat)
library(naniar)
# load Data
ire_news <- read_csv("ireland_news.csv")
# Check the data headers, structure, glimpse it as well
head(ire_news)
## # A tibble: 6 × 5
## publish_date headline_category headline_text news_provider engagement_score
## <chr> <chr> <chr> <chr> <dbl>
## 1 Wednesday, 25t… opinion Renua's plan… Irish Times 0.669
## 2 Tuesday, 30th … news Racism cloud… Irish Examin… 0.427
## 3 Thursday, 13th… news.politics.oi… Minister for… RTE News 0.694
## 4 Wednesday, 28t… opinion.letters Kaczynski an… RTE News 0.472
## 5 Saturday, 17th… opinion Martyn Turner TheJournal.ie 0.551
## 6 Sunday, 28th o… business.markets Chris Johns:… RTE News 0.568
## spc_tbl_ [1,610,523 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ publish_date : chr [1:1610523] "Wednesday, 25th of March, 2015" "Tuesday, 30th of June, 1998" "Thursday, 13th of March, 2014" "Wednesday, 28th of February, 2007" ...
## $ headline_category: chr [1:1610523] "opinion" "news" "news.politics.oireachtas" "opinion.letters" ...
## $ headline_text : chr [1:1610523] "Renua's plan to publish Attorney General's advice misguided" "Racism clouds fight for justice after London street stabbing" "Minister for Justice has not 'failed in any respect'; Bruton tells Dáil" "Kaczynski and homosexuality" ...
## $ news_provider : chr [1:1610523] "Irish Times" "Irish Examiner" "RTE News" "RTE News" ...
## $ engagement_score : num [1:1610523] 0.669 0.427 0.694 0.472 0.551 0.568 0.693 0.51 0.589 0.549 ...
## - attr(*, "spec")=
## .. cols(
## .. publish_date = col_character(),
## .. headline_category = col_character(),
## .. headline_text = col_character(),
## .. news_provider = col_character(),
## .. engagement_score = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
## Rows: 1,610,523
## Columns: 5
## $ publish_date <chr> "Wednesday, 25th of March, 2015", "Tuesday, 30th of …
## $ headline_category <chr> "opinion", "news", "news.politics.oireachtas", "opin…
## $ headline_text <chr> "Renua's plan to publish Attorney General's advice m…
## $ news_provider <chr> "Irish Times", "Irish Examiner", "RTE News", "RTE Ne…
## $ engagement_score <dbl> 0.669, 0.427, 0.694, 0.472, 0.551, 0.568, 0.693, 0.5…
## [1] 380
## # A tibble: 5 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 headline_category 193 0.0120
## 2 engagement_score 85 0.00528
## 3 publish_date 80 0.00497
## 4 news_provider 13 0.000807
## 5 headline_text 9 0.000559
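The calls that produced the missing-value table above are not shown. The table has the shape of a per-variable missing summary (such as the one naniar provides), where `pct_miss` is expressed in percent. The following is a minimal base-R sketch of the same calculation on a hypothetical toy data frame (`toy` and its values are assumptions for illustration only):

```r
# Toy data standing in for ire_news (hypothetical values)
toy <- data.frame(
  headline_category = c("news", NA, "sport", "opinion"),
  engagement_score  = c(0.5, 0.6, NA, NA)
)

n_miss   <- colSums(is.na(toy))        # number of missing values per column
pct_miss <- 100 * n_miss / nrow(toy)   # share of rows missing, in percent

miss_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = unname(n_miss),
  pct_miss = unname(pct_miss)
)
# Sort so the variable with the most missing values comes first
miss_summary <- miss_summary[order(-miss_summary$n_miss), ]
miss_summary
```

Read this way, 193 missing values out of 1,610,523 rows is roughly 0.012% of the column.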
Because the same cleaning steps are needed throughout the project, the “global cleaning” code below is applied once up front to avoid repeating it in later questions.
# Global cleaning applied to all following questions
ire_news_clean <- ire_news |>
mutate(headline_category = str_trim(tolower(headline_category))) |>
mutate(news_provider = str_trim(news_provider)) |>
filter(!is.na(news_provider)) |>
filter(news_provider != "...") |>
filter(!is.na(engagement_score))
How many unique values are there in the headline category column of the data file?
# Count the unique values in headline_category
# Drop NA values first
unique_value <- ire_news_clean |>
filter(!is.na(headline_category)) |>
mutate(headline_category = tolower(headline_category)) |>
pull(headline_category) |>
unique()
length(unique_value)
## [1] 118
# See how articles are distributed across categories (optional)
# sort(table(ire_news$headline_category), decreasing = TRUE)
The missing-value check shows 193 NAs in headline_category, which account for only about 0.012% of the rows in this column. I removed them with filter() before counting.
The categories contain dot-notation subcategories (e.g. news.politics.oireachtas, news.social, news.consumer). Although these belong to the same parent category, they represent genuinely distinct subcategories and are therefore counted as unique values.
In summary, there are 118 unique values in the headline_category column after removing the NA values (193 in the raw data) and the corrupted “…” entries.
Which provider has the highest mean engagement? Which headline category has the lowest number of articles?
# Find the news provider with the highest mean engagement_score
ire_news_clean |>
group_by(news_provider) |>
summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
slice_max(order_by = mean_engagement, n = 1) |>
ungroup()
## # A tibble: 1 × 2
## news_provider mean_engagement
## <chr> <dbl>
## 1 Galway Advertiser 0.977
# Exclude Galway Advertiser, then find the provider with the highest mean engagement score
ire_news_clean |>
filter(news_provider != "Galway Advertiser") |>
group_by(news_provider) |>
summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
slice_max(order_by = mean_engagement, n = 1) |>
ungroup()
## # A tibble: 1 × 2
## news_provider mean_engagement
## <chr> <dbl>
## 1 Irish Times 0.556
After removing corrupted entries (“…”) and NA providers, the news provider with the highest mean engagement score is Galway Advertiser (mean = 0.977).
However, this is based on only 3 articles, which is too few to be statistically representative. Among providers with substantial article counts, Irish Times records the highest mean engagement (mean = 0.556, n = 564,751), making it arguably the more reliable answer.
Note: The limited article count for Galway Advertiser was identified during the Q3 analysis. While it technically holds the highest mean, its small sample size means the result should be interpreted with caution.
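One way to guard against such small-sample winners is to report the article count next to the mean and apply a minimum-count threshold before ranking. The sketch below uses hypothetical toy data; the provider labels and the threshold `min_articles` are assumptions for illustration, not values from the project:

```r
library(dplyr)

# Hypothetical toy data: provider "C" has a single, very high score
toy <- tibble(
  news_provider    = c("A", "A", "A", "A", "B", "B", "C"),
  engagement_score = c(0.50, 0.60, 0.55, 0.55, 0.40, 0.50, 0.98)
)

min_articles <- 2  # assumed threshold, chosen only for this illustration

provider_means <- toy |>
  group_by(news_provider) |>
  summarise(
    mean_engagement = mean(engagement_score),
    n_articles      = n()
  ) |>
  filter(n_articles >= min_articles) |>   # drop providers with too few articles
  slice_max(mean_engagement, n = 1)

provider_means
```

Here provider "C" has the highest raw mean but is dropped by the threshold, so "A" is returned instead. The right threshold for the real data is a judgment call.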
# Find the headline category with the lowest number of articles
ire_news_clean |>
filter(!is.na(headline_category)) |>
mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
group_by(parent_category) |>
summarise(number_of_article = n()) |>
slice_min(order_by = number_of_article, n = 1) |>
ungroup()
## # A tibble: 6 × 2
## parent_category number_of_article
## <chr> <int>
## 1 entertainment 1
## 2 eye on nature 1
## 3 my holidays 1
## 4 s 1
## 5 x86% 1
## 6 <NA> 1
# Investigate suspicious categories
ire_news_clean |>
filter(
str_detect(headline_category, "^s\\.") |
headline_category %in% c("x86%") |
str_detect(headline_category, "^entertainment|^eye|^my")
)
## # A tibble: 5 × 5
## publish_date headline_category headline_text news_provider engagement_score
## <chr> <chr> <chr> <chr> <dbl>
## 1 Thursday, 16th… s.g. Birdies prov… Irish Times 0.631
## 2 Saturday, 11th… eye on nature Eye On Nature TheJournal.ie 0.383
## 3 Saturday, 24th… my holidays My Holidays TheJournal.ie 0.463
## 4 Friday, 13th o… entertainment Philips sale… Irish Times 0.486
## 5 Sunday, 05th o… x86% Ask the expe… Irish Times 0.612
To identify the parent category, I extracted the first segment of
headline_category before any . or _ separator using
str_extract(headline_category, "^[^._]+"), as inconsistent
notation (e.g. business.economy vs business_economy) likely represents
the same category. At first glance, six parent categories each contain
only one article: entertainment, eye on nature, my holidays, s, x86%,
and NA. Further investigation revealed the following anomalies:
After excluding these anomalies, entertainment is identified as the legitimate headline category with the lowest number of articles (n = 1).
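To make the parent-category extraction concrete, the short sketch below applies the same pattern to a few example strings. The vector `categories` is made up for illustration, although the strings themselves appear in the text or output above:

```r
library(stringr)

# Example category strings, including two of the suspicious values
categories <- c("news.politics.oireachtas", "business_economy",
                "eye on nature", "s.g.", "x86%")

# "^[^._]+" reads: from the start of the string, take every character
# up to (but not including) the first "." or "_"
parent <- str_extract(categories, "^[^._]+")
parent
```

Strings without a separator (such as "eye on nature") are returned unchanged, which is why one-off labels show up as their own parent categories.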
Compute the total number of articles for each headline category and news provider. Then, use a single R function to display the statistical information, i.e., Min, Max, and Mean, of the total number of articles (as computed previously) for each news provider. (Note: You may use multiple functions/commands to prepare the pre-processed data table, but when you compute and display the statistical information, you need to use a single R function.)
# Step 1 - compute total articles per provider and category
article_counts <- ire_news_clean |>
group_by(news_provider, headline_category) |>
summarise(total_articles = n()) |>
ungroup()
# Step 2 - apply summary() per provider
tapply(article_counts$total_articles,
article_counts$news_provider,
summary)
## $`Galway Advertiser`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
##
## $`Irish Examiner`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 289.5 774.0 3797.9 2465.8 144569.0
##
## $`Irish Independent`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.0 59.5 188.0 782.2 518.0 29134.0
##
## $`Irish Times`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 335 1073 5181 3318 203530
##
## $`RTE News`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 204.2 616.5 2970.9 1857.2 115515.0
##
## $TheJournal.ie
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 171.2 473.5 2279.9 1458.0 86774.0
Step 1 computes the total number of articles for
each combination of news_provider and headline_category using
group_by() and summarise().
Step 2 applies the single function
summary() via tapply()
to compute and display Min, 1st Quartile, Median, Mean, 3rd Quartile,
and Max of article counts for each news provider.
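The two steps can be illustrated on a small hypothetical table, where the providers "A" and "B" and their counts are made up:

```r
# Hypothetical article counts per provider and category
article_counts_toy <- data.frame(
  news_provider  = c("A", "A", "A", "B", "B"),
  total_articles = c(10, 20, 60, 5, 15)
)

# tapply() splits total_articles by provider and calls summary() on each group
stats_by_provider <- tapply(
  article_counts_toy$total_articles,
  article_counts_toy$news_provider,
  summary
)

stats_by_provider[["A"]]   # Min, quartiles, Mean and Max for provider "A"
```

Each element of the result is a named summary, so individual statistics can be picked out with, for example, `stats_by_provider[["A"]][["Mean"]]`.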
Notable findings:
Galway Advertiser shows uniform coverage (Min = Max = Mean = 1) across only 3 articles in total, suggesting minimal presence in the dataset, so it was excluded from most of the further EDA.
Irish Times dominates with the highest mean (5,181) and maximum (203,530) articles per category, indicating broad coverage.
# Take a closer look at the "Galway Advertiser" provider
ire_news_clean |>
filter(news_provider == "Galway Advertiser")
## # A tibble: 3 × 5
## publish_date headline_category headline_text news_provider engagement_score
## <chr> <chr> <chr> <chr> <dbl>
## 1 Monday, 3th of… news Council appr… Galway Adver… 0.987
## 2 Wednesday, 18t… sport Underdogs st… Galway Adver… 0.954
## 3 Friday, 2th of… business West coast t… Galway Adver… 0.991
# Add article count alongside mean to overview the data
ire_news_clean |>
group_by(news_provider) |>
summarise(
mean_engagement = round(mean(engagement_score, na.rm = TRUE), 3),
n_articles = n()
) |>
arrange(desc(mean_engagement)) |>
ungroup()
## # A tibble: 6 × 3
## news_provider mean_engagement n_articles
## <chr> <dbl> <int>
## 1 Galway Advertiser 0.977 3
## 2 Irish Times 0.556 564751
## 3 RTE News 0.545 320861
## 4 Irish Examiner 0.533 402575
## 5 Irish Independent 0.526 80562
## 6 TheJournal.ie 0.489 241672
In which year did TheJournal.ie record its maximum engagement score? Which news provider shows the largest increase in engagement score over the years?
# Part A (TheJournal.ie max engagement year)
ire_news_clean |>
filter(news_provider == "TheJournal.ie") |>
mutate(year = str_extract(publish_date, "\\d{4}$")) |>
filter(!is.na(year)) |>
group_by(year) |>
summarise(max_score = max(engagement_score, na.rm = TRUE)) |>
slice_max(max_score, n = 1) |>
ungroup()
## # A tibble: 1 × 2
## year max_score
## <chr> <dbl>
## 1 2017 1.2
Initial analysis identified 2017 as the year of maximum engagement (score = 1.2). However, since the engagement score is defined within [0, 1], this value is probably invalid.
After removing out-of-range scores, the maximum valid score is 1.0, which I assume is more reasonable. It was recorded in multiple years: 2011, 2017, 2018, 2019, 2020, and 2021.
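The validity check can be sketched on a few hypothetical rows. The values below are made up, and the [0, 1] range is the assumption stated above:

```r
library(dplyr)

# Hypothetical scores, including values outside the assumed [0, 1] range
toy <- tibble(
  year             = c(2016, 2017, 2017, 2018),
  engagement_score = c(0.80, 1.20, 1.00, -5000)
)

# Keep only scores inside the valid range
valid <- toy |>
  filter(between(engagement_score, 0, 1))

# Year(s) holding the maximum valid score
top <- valid |>
  slice_max(engagement_score, n = 1)
top
```

Because `slice_max()` keeps ties by default, several years are returned when they share the same maximum, which matches the situation described above.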
# Check the score range
ire_news_clean |>
filter(!is.na(engagement_score)) |>
group_by(news_provider) |>
summarise(
min_score = min(engagement_score, na.rm = TRUE),
max_score = max(engagement_score, na.rm = TRUE)
) |>
ungroup()
## # A tibble: 6 × 3
## news_provider min_score max_score
## <chr> <dbl> <dbl>
## 1 Galway Advertiser 0.954 0.991
## 2 Irish Examiner 0.206 1
## 3 Irish Independent 0.215 1
## 4 Irish Times 0.218 1
## 5 RTE News 0.216 1
## 6 TheJournal.ie -5000 1.2
# Find the year TheJournal.ie max engagement score equal to 1
ire_news_clean |>
filter(news_provider == "TheJournal.ie", engagement_score == 1) |>
mutate(year = str_extract(publish_date, "\\d{4}$")) |>
filter(!is.na(year)) |>
select(year, engagement_score) |>
distinct() # see a unique list of years where this happened
## # A tibble: 6 × 2
## year engagement_score
## <chr> <dbl>
## 1 2020 1
## 2 2021 1
## 3 2019 1
## 4 2017 1
## 5 2011 1
## 6 2018 1
# Check the year range, in case a score of 1 appears in every recorded year
range(as.numeric(str_extract(ire_news_clean$publish_date[ire_news_clean$news_provider == "TheJournal.ie"], "\\d{4}$")), na.rm = TRUE)
## [1] 1996 2021
# Part B (largest increase)
yearly_engagement <- ire_news_clean |>
filter(engagement_score >= 0 & engagement_score <= 1) |>
mutate(year = as.numeric(str_extract(publish_date, "\\d{4}$"))) |>
group_by(news_provider, year) |>
summarise(mean_engagement = mean(engagement_score, na.rm = TRUE)) |>
ungroup()
score_gap <- yearly_engagement |>
group_by(news_provider) |>
summarise(
first_year_score = mean_engagement[which.min(year)],
last_year_score = mean_engagement[which.max(year)],
score_gap = last_year_score - first_year_score
) |>
arrange(desc(score_gap))
score_gap |> slice_max(score_gap, n = 1)
## # A tibble: 1 × 4
## news_provider first_year_score last_year_score score_gap
## <chr> <dbl> <dbl> <dbl>
## 1 Irish Times 0.444 0.647 0.202
Irish Times shows the largest increase in mean engagement score (gap = 0.202) between its first and last recorded years.
Two approaches were considered for measuring the increase in engagement. The first compares mean engagement between a provider’s first and last recorded year, treating yearly means as representative values, analogous to comparing the start and end points on a line chart.
The second takes the absolute minimum and maximum scores regardless of year. It was rejected because individual article scores do not represent yearly trends and fail to capture the “typical” performance of a provider in any given year. Using yearly means aligns with a cleaner visualisation (e.g., a line chart) and gives a more honest comparison of growth.
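A first-versus-last comparison depends entirely on two years, so it can be worth cross-checking against a measure that uses every year. The sketch below compares the endpoint gap with a simple linear-regression slope on hypothetical yearly means. The slope is an alternative shown for illustration, not the method used in this analysis, and all values are made up:

```r
library(dplyr)

yearly_toy <- tibble(
  news_provider   = rep(c("A", "B"), each = 4),
  year            = rep(2018:2021, times = 2),
  mean_engagement = c(0.40, 0.45, 0.50, 0.55,   # A: steady rise
                      0.40, 0.60, 0.35, 0.60)   # B: noisy, larger endpoint gap
)

trend_toy <- yearly_toy |>
  group_by(news_provider) |>
  summarise(
    # Gap between the last and the first recorded year
    endpoint_gap = mean_engagement[which.max(year)] -
                   mean_engagement[which.min(year)],
    # Average change per year, estimated from all years
    slope        = unname(coef(lm(mean_engagement ~ year))["year"])
  )
trend_toy
```

In this toy case provider "B" has the larger endpoint gap while "A" has the steeper overall trend, which shows how the two measures can disagree when yearly means are noisy.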
Investigate the factors associated with higher or lower engagement scores in the given dataset.
# Explore the score distribution by day of the week of publication.
ire_news_clean |>
filter(
engagement_score >= 0 & engagement_score <= 1,
!news_provider %in% c("Galway Advertiser")
) |>
mutate(
day = str_extract(publish_date, "^\\w+"),
day = factor(day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
"Saturday", "Sunday"))
) |>
ggplot(aes(x = day, y = engagement_score, fill = day)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.1, alpha = 0.5, outlier.size = 0.5) +
facet_wrap(~ news_provider) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
scale_fill_brewer(palette = "Set3") +
labs(
title = "Engagement Score Distribution by Day of Week",
x = "Day of Week",
y = "Engagement Score"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Learn more about Boxplot vs Violin
# Compare provider performance across publication seasons.
ire_news_clean |>
filter(
engagement_score >= 0 & engagement_score <= 1,
!news_provider %in% c("Galway Advertiser")
) |>
mutate(
month = str_extract(publish_date,
"January|February|March|April|May|June|July|August|September|October|November|December"),
season = case_when(
month %in% c("December", "January", "February") ~ "Winter",
month %in% c("March", "April", "May") ~ "Spring",
month %in% c("June", "July", "August") ~ "Summer",
month %in% c("September", "October", "November") ~ "Autumn"
)) |>
mutate(season = factor(season,
levels = c("Spring", "Summer", "Autumn", "Winter"))) |>
ggplot(aes(x = season, y = engagement_score, fill = season)) +
geom_boxplot(width = 0.6) +
scale_fill_brewer(palette = "Pastel1") +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
facet_wrap(~ news_provider) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Check which articles are high-side outliers
ire_news_clean |>
filter(engagement_score >= 0 & engagement_score <= 1) |>
mutate(month = str_extract(publish_date,
"January|February|March|April|May|June|July|August|September|October|November|December"),
season = case_when(
month %in% c("December", "January", "February") ~ "Winter",
month %in% c("March", "April", "May") ~ "Spring",
month %in% c("June", "July", "August") ~ "Summer",
month %in% c("September", "October", "November") ~ "Autumn"
)) |>
group_by(news_provider, season) |>
mutate(
Q1 = quantile(engagement_score, 0.25),
Q3 = quantile(engagement_score, 0.75),
IQR = Q3 - Q1,
is_outlier = engagement_score > Q3 + 1.5*IQR
) |>
filter(is_outlier) |>
select(headline_text, engagement_score, season, news_provider)
## # A tibble: 22,352 × 4
## # Groups: news_provider, season [20]
## headline_text engagement_score season news_provider
## <chr> <dbl> <chr> <chr>
## 1 Hotels in Northern Ireland to reopen f… 0.868 Summer Irish Times
## 2 Three missing after Didcot collapse un… 0.98 Winter Irish Examin…
## 3 HSE managers told to risk-assess staff… 0.846 Summer Irish Examin…
## 4 Serious shortage of midwives in Dublin… 0.835 Spring Irish Examin…
## 5 Alcock & Brown statue to sit in Clifde… 0.81 Spring Irish Examin…
## 6 Big Tom 'monumentalised among the peop… 0.914 Autumn Irish Times
## 7 Shatter welcomes crime data 0.813 Summer Irish Times
## 8 Go Walk: Glanrastal; Beara Peninsula; … 0.798 Summer Irish Examin…
## 9 Coronavirus: Three more deaths and 10 … 0.811 Summer Irish Times
## 10 Five companies competing to redevelop … 0.833 Winter Irish Times
## # ℹ 22,342 more rows
ire_news_clean |>
filter(
engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category)) |>
mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
group_by(parent_category) |>
summarise(
mean_engagement = round(mean(engagement_score, na.rm = TRUE), 3),
n_articles = n()
) |>
# Some parent categories are noise ("s", "x86%", NA, etc.); keep only categories with a meaningful sample size
filter(n_articles > 10) |>
arrange(desc(mean_engagement))
## # A tibble: 6 × 3
## parent_category mean_engagement n_articles
## <chr> <dbl> <int>
## 1 news 0.56 797757
## 2 business 0.548 222880
## 3 sport 0.533 261715
## 4 lifestyle 0.506 95985
## 5 culture 0.488 98919
## 6 opinion 0.478 132965
Drawing on the “Day of Week” violin plot and the seasonal boxplots, several key patterns emerge regarding the relationship between temporal factors and engagement.
The violin plot indicates a subtle upward shift in the density of higher engagement scores during weekends, particularly for TheJournal.ie and RTE News. Although median engagement remains close to 0.5 across all days, the distribution on Saturdays and Sundays shows a greater concentration of high-engagement observations. This suggests that, despite potentially lower publication volume, weekend articles receive more focused audience attention.
The seasonal boxplots reveal minimal variation across different periods of the year. Both the interquartile ranges and median engagement levels remain largely consistent across seasons for all providers. Insight: Engagement with news content in Ireland appears largely invariant to seasonal effects, indicating that audience demand for news remains stable throughout the year.
A clear structural difference is observed between providers. The Irish Times consistently exhibits a higher lower-bound (i.e., a higher baseline engagement level) compared to TheJournal.ie, regardless of temporal conditions. This suggests that provider-specific factors, such as brand authority and audience loyalty, exert a stronger influence on engagement than timing-related variables.
Outlier analysis shows that maximum engagement values (approaching 1.0) occur across all days and seasons. This indicates that highly viral content is not temporally constrained. Insight: Extreme engagement outcomes are more likely driven by content-specific factors rather than when the article is published.
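The outlier flag used in this analysis follows the usual boxplot convention: a score counts as a high outlier when it lies above Q3 + 1.5 × IQR within its group. A minimal sketch on hypothetical scores for a single group (the values are made up):

```r
scores <- c(0.40, 0.45, 0.50, 0.55, 0.60, 0.95)  # hypothetical scores

q1 <- quantile(scores, 0.25)
q3 <- quantile(scores, 0.75)

# Upper fence: anything above this is flagged as a high outlier
upper_fence <- q3 + 1.5 * (q3 - q1)

high_outliers <- scores[scores > upper_fence]
high_outliers
```

With these values the upper fence is 0.775, so only the score of 0.95 is flagged.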
Summary
Overall, while a modest weekend effect is observable, temporal variables such as day of the week and season play a secondary role in shaping engagement. Instead, provider characteristics and content attributes appear to be the dominant determinants. The news engagement landscape demonstrates strong temporal stability, with consistent audience interaction patterns throughout the year.
ire_news_clean |>
# Clean the scores and categories
filter(engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category)) |>
# Extract the parent category
mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
# Filter "noise" without collapsing the whole dataframe
add_count(parent_category) |>
filter(n > 10) |>
# plot
ggplot(aes(x = reorder(parent_category, engagement_score, median),
y = engagement_score,
fill = parent_category)) +
# Add the violin layer for density (alpha makes it transparent)
geom_violin(alpha = 0.3, color = "transparent", trim = TRUE) +
geom_boxplot(width = 0.15, color = "black", outlier.size = 0.4, alpha = 0.7) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red") +
coord_flip() +
labs(
title = "Engagement Density and Distribution by Category",
subtitle = "Categories with >10 articles only",
x = "Category",
y = "Engagement Score"
) +
theme_minimal() +
theme(legend.position = "none")
Prior to analysis, it was hypothesised that News and Business would outperform other categories, given the structural advantages of institutional publishers in reporting finance, policy, and industry developments. The results largely support this expectation: both categories occupy the upper range of engagement, with relatively high median scores compared to others.
A hybrid violin–boxplot is employed to capture both summary statistics (median and quartiles) and the full distribution of engagement scores, enabling a more nuanced interpretation of within-category variation.
Although News and Business exhibit similar central tendencies, their distributions differ markedly.
Lifestyle demonstrates one of the most compact and symmetrical distributions.
Opinion and Culture are characterised by distributions skewed toward lower engagement values.
All categories exhibit right-skewed distributions, with tails extending toward higher engagement scores.
The red dashed reference line at 0.5 provides a useful benchmark: News, Business, and to some extent Sport are positioned slightly above this level, indicating comparatively stronger engagement performance.
News has higher engagement overall because most of its articles perform well consistently, not just a few viral ones. Overall, the differences between categories are consistent, showing that the topic really affects how much people engage with the article.
Summary
Overall, differences in engagement across categories appear structural rather than incidental. The observed patterns reflect consistent distributional characteristics, suggesting that content type plays a fundamental role in shaping audience engagement, beyond isolated high-performing articles.
ire_news_clean |>
# Basic Cleaning
filter(engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category),
news_provider != "Galway Advertiser") |>
# Extract Category
mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
# Remove Noise (Only keep categories with meaningful size)
add_count(parent_category) |>
filter(n > 10) |>
# Summarize for the Heatmap
group_by(news_provider, parent_category) |>
summarise(mean_engagement = mean(engagement_score, na.rm = TRUE), .groups = "drop") |>
# Plot
ggplot(aes(x = news_provider,
y = parent_category,
fill = mean_engagement)) +
geom_tile(color = "white") + # White border makes tiles pop
# Use ColorBrewer (Distiller is for continuous data)
# Apply knowledge learnt from FIT5147
# Colour palette picked arbitrarily from https://colorbrewer2.org; it makes the differences easy to compare.
scale_fill_distiller(palette = "YlOrRd", direction = 1) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
title = "Heatmap: Mean Engagement by Provider and Category",
subtitle = "Excluding categories with n <= 10",
x = "News Provider",
y = "Category",
fill = "Mean Score"
)
Hypothesis vs. Observed Pattern
The initial expectation was a diagonal structure, where different providers dominate specific categories. Instead, the heatmap reveals a predominantly columnar pattern, indicating that engagement levels are more strongly associated with provider-level factors—such as brand authority and audience base—than with category specialisation.
However, this pattern is not purely vertical: there are still consistent category-level differences across providers, suggesting a joint effect rather than a single dominant driver.
Across all providers, a similar ranking of categories emerges:
This indicates that audience preferences are structurally aligned across the market, with information-driven content (e.g., News, Business) systematically outperforming softer or interpretive content.
A clear gradient exists across providers:
This hierarchy suggests that provider reputation and audience trust exert a stronger and more uniform influence on engagement than content category alone.
There is limited evidence that any provider “owns” a specific category. While minor variations exist (e.g., slightly stronger Sport performance for RTÉ News), these differences are marginal rather than structurally distinct.
Insight: Engagement advantages are broad-based rather than category-specific, reinforcing the dominance of provider-level effects.
Opinion and Culture consistently exhibit lower engagement across all providers.
Lifestyle demonstrates relatively consistent, mid-range engagement across all providers.
Summary
Overall, the heatmap indicates that engagement is primarily driven by provider-level authority, with category effects acting as a secondary but consistent layer. Rather than niche specialisation, the Irish news landscape exhibits a hierarchical structure, where stronger providers achieve higher engagement across all content types, and category preferences remain broadly uniform across the market.
# Compute the correlation
ire_news_clean |>
filter(engagement_score >= 0 & engagement_score <= 1) |>
mutate(headline_length = str_count(headline_text, "\\w+")) |>
with(cor.test(headline_length, engagement_score, method = "pearson"))
##
## Pearson's product-moment correlation
##
## data: headline_length and engagement_score
## t = 488.01, df = 1610418, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3575825 0.3602735
## sample estimates:
## cor
## 0.3589287
# Draw a comparative jitter plot
ire_news_clean |>
filter(engagement_score >= 0 & engagement_score <= 1,
!news_provider %in% "Galway Advertiser") |>
mutate(headline_length = str_count(headline_text, "\\w+")) |>
# Focus on the most common range to see the trend clearly
filter(headline_length <= 30) |>
ggplot(aes(x = headline_length, y = engagement_score, color = news_provider)) +
geom_jitter(alpha = 0.05, width = 0.3) +
geom_smooth(method = "lm", color = "grey20") + # dark grey line to stand out against the coloured dots
facet_wrap(~ news_provider) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Does Headline Length Affect Providers Differently?",
x = "Headline Word Count",
y = "Engagement Score"
) +
theme_minimal() +
theme(legend.position = "none")
A weak but statistically significant positive correlation was identified between headline length and engagement score (r = 0.359, p < 0.001). While this indicates a measurable relationship, the effect size is modest, suggesting that headline length explains only a limited proportion of the variation in engagement.
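The “limited proportion” can be put into numbers by squaring the correlation coefficient. For a simple linear relationship, r² is the share of variance in engagement that is linearly associated with headline length:

```r
r <- 0.3589287        # Pearson correlation reported by cor.test() above
r_squared <- r^2      # proportion of variance linearly associated
round(r_squared, 3)
```

This gives about 0.129, i.e. roughly 13% of the variance, which leaves most of the variation in engagement to other factors.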
To further explore this relationship, a jitter plot with fitted regression lines was used to examine patterns across providers.
Headline lengths are most densely concentrated between 5 and 10 words across all providers, indicating a common editorial norm. The distribution thins substantially beyond 15 words, with relatively few headlines exceeding 20 words.
Headlines over 20 words usually achieve only average engagement (scores of roughly 0.4 to 0.6), so making a headline very long does not attract more engagement. While there is a slight positive trend, headline length is not the main factor driving engagement.
The regression lines across providers exhibit similar, gently upward slopes, confirming a consistent positive association between headline length and engagement. However, given the relatively low correlation coefficient, this relationship should be interpreted as incremental rather than decisive. Longer headlines may provide additional context that marginally enhances engagement, but they are not a dominant driver.
Clear differences emerge across providers:
Although headline length shows a consistent directional effect, provider-level differences remain more pronounced. For example, shorter headlines from The Irish Times often achieve higher engagement than longer headlines from TheJournal.ie.
Insight: This reinforces the earlier finding that provider authority and audience base exert a stronger influence on engagement than headline characteristics alone.
Summary
Overall, headline length demonstrates a statistically significant but practically limited effect on engagement. While moderately longer headlines may slightly improve performance, the impact is secondary to structural factors such as provider identity and content quality.
Let’s investigate headline categories, engagement scores, and time.
# Step 1 - build yearly summary per provider × category
yearly_summary <- ire_news_clean |>
filter(
engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category),
!news_provider %in% "Galway Advertiser"
) |>
mutate(
year = as.numeric(str_extract(publish_date, "\\d{4}$")),
parent_category = str_extract(headline_category, "^[^._]+")
) |>
group_by(news_provider, parent_category, year) |>
summarise(
n_articles = n(),
mean_engagement = mean(engagement_score, na.rm = TRUE)
) |>
ungroup()
yearly_summary |>
group_by(news_provider, parent_category) |>
summarise(
correlation = cor(n_articles, mean_engagement,
use = "complete.obs")
) |>
group_by(news_provider) |>
slice_max(abs(correlation), n = 3) |>
ungroup()
## # A tibble: 15 × 3
## news_provider parent_category correlation
## <chr> <chr> <dbl>
## 1 Irish Examiner lifestyle 0.841
## 2 Irish Examiner news -0.591
## 3 Irish Examiner culture 0.541
## 4 Irish Independent lifestyle 0.842
## 5 Irish Independent news -0.568
## 6 Irish Independent culture 0.530
## 7 Irish Times lifestyle 0.831
## 8 Irish Times news -0.571
## 9 Irish Times business 0.542
## 10 RTE News lifestyle 0.827
## 11 RTE News news -0.589
## 12 RTE News culture 0.551
## 13 TheJournal.ie lifestyle 0.820
## 14 TheJournal.ie news -0.584
## 15 TheJournal.ie business 0.537
# Visualization: Mean of engagement score X News Category Heatmap
# Calculate and SAVE the correlations
correlations <- yearly_summary |>
group_by(news_provider, parent_category) |>
summarise(
correlation = cor(n_articles, mean_engagement, use = "complete.obs"),
.groups = "drop"
) |>
# Keep only the most interesting relationships to avoid a messy plot
group_by(news_provider) |>
slice_max(abs(correlation), n = 3) |>
ungroup()
# Visualize
correlations |>
ggplot(aes(x = news_provider,
y = parent_category,
fill = correlation)) +
geom_tile() +
geom_text(aes(label = round(correlation, 2)),
size = 3) +
# I want to use BrBG
scale_fill_gradient2(
high = "#e34a33",
mid = "white",
low = "#2ca25f",
midpoint = 0
) +
labs(
title = "Correlation: Yearly Article Volume vs Mean Engagement",
x = "News Provider",
y = "Category"
)
Lifestyle, news, and culture show the three strongest correlations between yearly article volume and mean engagement for Irish Examiner, Irish Independent, and RTE News, while for Irish Times and TheJournal.ie the top three are lifestyle, news, and business.
Across all providers, lifestyle consistently shows the strongest positive association (r = 0.82 to 0.84) between yearly volume and engagement, while news shows a consistent negative association (r = -0.57 to -0.59). This suggests that years with more news articles tend to have lower average engagement scores.
Unlike News, Lifestyle content does not yet appear “saturated”: increased volume in this category goes together with higher mean engagement. This may suggest that lifestyle content growth reflects audience demand, while news volume growth may dilute engagement quality.
ire_news_clean |>
# 1. Basic Cleaning
filter(
engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category),
news_provider != "Galway Advertiser"
) |>
# 2. Extract Category and Year
mutate(
year = as.numeric(str_extract(publish_date, "\\d{4}$")),
parent_category = str_extract(headline_category, "^[^._]+")
) |>
# 3. Aggregate to Year Level (Crucial for the regression to work)
group_by(news_provider, parent_category, year) |>
summarise(n_articles = n(), .groups = "drop") |>
# 4. Run the Linear Model per Group
group_by(news_provider, parent_category) |>
# We use 'n() > 1' check because you need at least 2 years to calculate a slope
filter(n() > 1) |>
summarise(
slope = coef(lm(n_articles ~ year))["year"],
p_value = summary(lm(n_articles ~ year))$coefficients["year", "Pr(>|t|)"],
.groups = "drop"
) |>
# 5. Get the Top 3 growing categories per provider
group_by(news_provider) |>
slice_max(slope, n = 3)
## # A tibble: 15 × 4
## # Groups: news_provider [5]
## news_provider parent_category slope p_value
## <chr> <chr> <dbl> <dbl>
## 1 Irish Examiner lifestyle 54.7 0.000000267
## 2 Irish Examiner business 36.9 0.0222
## 3 Irish Examiner culture 22.4 0.0127
## 4 Irish Independent lifestyle 11.0 0.000000134
## 5 Irish Independent business 7.11 0.0285
## 6 Irish Independent culture 4.56 0.0115
## 7 Irish Times lifestyle 74.0 0.000000465
## 8 Irish Times business 52.3 0.0202
## 9 Irish Times culture 29.7 0.0132
## 10 RTE News lifestyle 42.0 0.000000261
## 11 RTE News business 29.3 0.0258
## 12 RTE News culture 18.2 0.00785
## 13 TheJournal.ie lifestyle 30.5 0.000000502
## 14 TheJournal.ie business 22.5 0.0203
## 15 TheJournal.ie culture 12.4 0.0145
growth_trends <- ire_news_clean |>
# 1. Basic Cleaning
filter(
engagement_score >= 0 & engagement_score <= 1,
!is.na(headline_category),
news_provider != "Galway Advertiser"
) |>
# 2. Extract Category and Year
mutate(
year = as.numeric(str_extract(publish_date, "\\d{4}$")),
parent_category = str_extract(headline_category, "^[^._]+")
) |>
# 3. Aggregate to Year Level (Crucial for the regression to work)
group_by(news_provider, parent_category, year) |>
summarise(n_articles = n(), .groups = "drop") |>
# 4. Run the Linear Model per Group
group_by(news_provider, parent_category) |>
# We use 'n() > 1' check because you need at least 2 years to calculate a slope
filter(n() > 1) |>
summarise(
slope = coef(lm(n_articles ~ year))["year"],
p_value = summary(lm(n_articles ~ year))$coefficients["year", "Pr(>|t|)"],
.groups = "drop"
) |>
# 5. Get the Top 3 growing categories per provider
group_by(news_provider) |>
slice_max(slope, n = 3)
# Plot the top three growth slopes per provider stored in 'growth_trends'
growth_trends |>
ggplot(aes(x = reorder(parent_category, slope), y = slope, fill = news_provider)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ news_provider, scales = "free_y") +
coord_flip() + # Makes category names easier to read
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
labs(
title = "Annual Content Expansion by Category",
subtitle = "Slope represents the average increase in articles published per year",
x = "Category",
y = "Growth Slope (Articles per Year)"
)
Across all five news providers, a consistent pattern of content expansion is observed. The three headline categories with the most significant increasing trends in yearly article counts are Lifestyle, Business, and Culture, indicating that these shifts reflect broader industry dynamics rather than provider-specific editorial strategies.
Among them, Lifestyle demonstrates the steepest growth across all providers, marking it as the primary driver of expansion in Irish digital news. For instance, The Irish Times records an annual increase of approximately 74 articles in this category, followed by The Irish Examiner at 54.7. Business shows moderate but steady growth (with slopes ranging from 7.11 to 52.25), while Culture exhibits a more gradual yet consistent upward trend.
These patterns align with wider shifts in media consumption. The increasing prominence of lifestyle content reflects growing audience preference for personalised and interest-driven topics, which tend to perform strongly in digital engagement environments.
Notably, the categories with the strongest growth trends (Q6B) closely align with those exhibiting the highest volume–engagement correlations (Q6A), particularly Lifestyle. This suggests a cycle: as publishers post more Lifestyle articles, engagement goes up, which probably encourages them to post even more of that content.
All reported trends are statistically significant at the 5% level (p < 0.05), and the Lifestyle trends are significant at p < 0.001, which suggests that these are stable, long-term developments rather than random variation.
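The slope and p-value in the table come from two separate `lm()` calls on the same data. As a minimal sketch, assuming invented yearly counts (the `toy` tibble and the `fit_trend()` helper are hypothetical and not part of the original analysis), the model can be fitted once per group and each slope labelled with the significance threshold it clears:

```r
library(dplyr)

# Hypothetical toy data: yearly article counts for two categories
toy <- tibble(
  parent_category = rep(c("lifestyle", "business"), each = 6),
  year            = rep(2010:2015, times = 2),
  n_articles      = c(100, 152, 201, 248, 303, 351,  # near-linear growth
                      80, 95, 70, 110, 85, 100)      # noisy, weak trend
)

# Hypothetical helper: fit lm() once and return slope and p-value together
fit_trend <- function(n_articles, year) {
  fit <- lm(n_articles ~ year)
  tibble(
    slope   = unname(coef(fit)["year"]),
    p_value = summary(fit)$coefficients["year", "Pr(>|t|)"]
  )
}

trend_summary <- toy |>
  group_by(parent_category) |>
  summarise(fit_trend(n_articles, year), .groups = "drop") |>
  mutate(sig_level = case_when(
    p_value < 0.001 ~ "p < 0.001",
    p_value < 0.05  ~ "p < 0.05",
    TRUE            ~ "not significant"
  ))

print(trend_summary)
```

Reporting the threshold per row avoids quoting a single p-value bound for trends whose significance differs by several orders of magnitude.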
Create a new Period column, then compute the
total number of articles by period and headline category, and generate a
boxplot showing the distribution of the total number of articles for
each period.
# Mind map
# Step 1 → Filter 2019-2020 + create Period column
# Step 2 → Compute total articles by period AND category
# Step 3 → Boxplot of distribution per period
ire_news_clean |>
# 1. Convert publish_date to date object and extract year/month
# Adjust dmy() if your CSV date format is different (e.g., mdy())
mutate(date_obj = dmy(publish_date),
year = year(date_obj),
month = month(date_obj)) |>
# 2. Filter for 2019 and 2020
filter(year %in% c(2019, 2020)) |>
# 3. Create the Period column (1-8) based on specification
mutate(
Period = case_when(
year == 2019 & month %in% 1:3 ~ "Period 1",
year == 2019 & month %in% 4:6 ~ "Period 2",
year == 2019 & month %in% 7:9 ~ "Period 3",
year == 2019 & month %in% 10:12 ~ "Period 4",
year == 2020 & month %in% 1:3 ~ "Period 5",
year == 2020 & month %in% 4:6 ~ "Period 6",
year == 2020 & month %in% 7:9 ~ "Period 7",
year == 2020 & month %in% 10:12 ~ "Period 8"
)) |>
# Create parent_category
mutate(parent_category = str_extract(headline_category, "^[^._]+")) |>
add_count(parent_category) |>
filter(n > 10) |>
# 4. Compute total articles by Period AND Category
# This creates the distribution of counts for the boxplot
group_by(Period, parent_category) |>
summarise(total_articles = n(), .groups = "drop") |>
# 5. Generate Boxplot
ggplot(aes(x = Period, y = total_articles)) +
geom_boxplot() +
geom_hline(aes(yintercept = median(total_articles)),
linetype = "dashed", color = "red") +
theme_minimal() +
labs(
title = "Distribution of Article Counts by Category per Period (2019-2020)",
x = "Quarterly Period",
y = "Total Articles"
)
Prior to analysis, low-volume categories (10 or fewer articles across 2019-2020) were excluded to reduce noise, resulting in a cleaner distribution centered around a median of approximately 2,000 articles per category per period.
Article production remained relatively stable across all periods in 2019 (Periods 1-4) and into Q1 2020 (Period 5), suggesting consistent editorial output preceding the pandemic. Interestingly, Period 5 (January-March 2020) — coinciding with the initial COVID-19 outbreak — does not show a significant disruption, possibly reflecting a surge in news coverage that offset reductions in other categories.
A sustained decline is observed from Period 6 onwards (April-December 2020), with Period 8 recording the sharpest drop — median article count falling below 1,500 and the highest outlier reducing to approximately 3,500 compared to ~6,000 in earlier periods.
This drop makes sense because the pandemic likely caused budget cuts, less advertising money, and staffing issues during the lockdowns.
The consistent presence of high outliers across all periods suggests one dominant category — likely news — maintains disproportionately high volume regardless of broader publishing trends.
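The guess that one category sits behind the high outliers can be checked directly rather than inferred from the boxplot. The following is a minimal sketch on invented totals (the `period_totals` tibble is hypothetical and only mimics the shape of the summarised data), keeping the largest category in each period:

```r
library(dplyr)

# Hypothetical per-period totals, shaped like the summarised data
# that feeds the boxplot (values are invented for illustration)
period_totals <- tibble(
  Period          = rep(c("Period 1", "Period 2"), each = 4),
  parent_category = rep(c("news", "sport", "business", "culture"), times = 2),
  total_articles  = c(6000, 2100, 1900, 1500,
                      5800, 2000, 1950, 1400)
)

# For each period, keep the single category with the largest total
top_per_period <- period_totals |>
  group_by(Period) |>
  slice_max(total_articles, n = 1, with_ties = FALSE) |>
  ungroup()

print(top_per_period)
```

Applied to the real summary, the same `slice_max()` step would show whether the top category is the same in all eight periods.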
Background you may explore
- COVID-19 IN IRELAND
- Annual Report of the Epidemiology of COVID-19 in Ireland, 2021-2022
We want to examine the trend in the number of articles published by RTE News in September over the years. Please create an appropriate chart.
ire_news_clean |>
# 1. Filter for RTE News and parse dates
filter(news_provider == "RTE News") |>
mutate(date_obj = dmy(publish_date)) |>
# 2. Extract Year and Month, then filter for September
mutate(
year = year(date_obj),
month = month(date_obj)
) |>
filter(month == 9) |>
# 3. Count articles per year
group_by(year) |>
summarise(n_articles = n(), .groups = "drop") |>
# 4. Generate the Chart
ggplot(aes(x = year, y = n_articles)) +
geom_line(color = "#005387", linewidth = 1) + # RTE Brand Blue
geom_point(color = "#005387", size = 2) +
theme_minimal() +
labs(
title = "RTE News: September Publication Trends (Over the Years)",
subtitle = "Total articles published during the month of September",
x = "Year",
y = "Number of Articles",
caption = "Data Source: Irish News Dataset"
) +
annotate("text", x = 2009, y = 1350,
label = "Irish Financial Crisis Peak",
hjust = 1.1, size = 3, color = "darkred") +
annotate("text", x = 2020, y = 850,
label = "COVID-19 Impact",
hjust = 1.1, size = 3, color = "darkred") +
annotate("point", x = c(2009, 2020),
y = c(1330, 870),
color = "red", size = 2)
The chart illustrates the temporal trend in RTE News’ September publication volume.
Article counts increased steadily from approximately 910 in 1996 until 2001, followed by a noticeable decline between 2001 and 2005.
The overall peak occurs in 2009, with publication volume exceeding 1,300 articles. This surge is likely linked to the Irish financial crisis, when readers sought more news about the economy and politics.
After 2009, September article counts exhibit greater volatility alongside a general downward trend. This pattern may be associated with structural changes in the media landscape, particularly the shift toward digital consumption and the growing influence of social media as a competing news source.
The lowest point in the 25-year period is observed in September 2020, with fewer than 900 articles published. This decline coincides with Ireland’s second wave of COVID-19 and may reflect operational pressures on newsrooms, including resource constraints and disruptions caused by prolonged lockdowns.
Using both the original dataset and external datasets, investigate the factors influencing the yearly trend in the number of articles published by some news providers. You may select one or more news providers for this investigation and should analyse at least 20 years of data.
External Dataset:
- Economy
- Individuals using the Internet (% of population)
- You can directly download via Github for filtered ones
This analysis aims to examine whether article volume is influenced by the rise of the internet era and broader economic conditions.
Based on the exploratory analysis conducted in the previous questions, The Irish Times and TheJournal.ie exhibit the highest and lowest mean engagement scores, respectively, across the six major headline categories. As such, they are selected as two contrasting cases for comparative analysis.
Using the original dataset provided for this assignment, the data was filtered to cover the full period from 1996 to 2021. However, several data limitations should be noted.
First, background research indicates that TheJournal.ie was established in 2010. Despite this, the dataset contains records attributed to this provider prior to its founding year, likely due to synthetic or adjusted data construction for academic purposes. For consistency, both providers are analysed across the full time range, though this limitation is acknowledged.
Second, variables obtained from external datasets differ substantially in scale, which may hinder direct comparison and interpretability. Specifically, article counts are measured in thousands (e.g., 10,000–20,000), GDP in millions of euros (e.g., 100,000–500,000), and internet usage as percentages (e.g., 2%–94%).
To address this issue, Min–Max normalisation was applied to all variables, transforming them onto a common scale of \([0, 1]\). This standardisation facilitates meaningful comparison across variables and enables the analysis of correlated trends and relative growth patterns over the 25-year period.
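Min–Max normalisation maps each value to \(x' = (x - \min x) / (\max x - \min x)\). As a minimal sketch, assuming a hypothetical helper named `min_max()` and illustrative values chosen only to mimic the three scales mentioned above, the transformation could be written once and reused for every variable:

```r
# Hypothetical helper: rescale a numeric vector onto [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant vector has no spread; return zeros rather than dividing by 0
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Illustrative values only, not taken from the datasets
articles <- c(10000, 15000, 20000)
gdp      <- c(100000, 300000, 500000)
internet <- c(2, 48, 94)

min_max(articles)
min_max(gdp)
min_max(internet)
```

All three vectors land on the same 0 to 1 scale, so their shapes can be compared on one axis even though the raw units differ.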
# Run the code chunk below. Pick metrics that are valuable for further exploratory analysis.
internet_clean <- read_csv("API_IT.NET.USER.ZS_DS2_en_csv_v2_325.csv", skip = 4) |> # skipping the first 4 lines(they are description: source, date. etc)
filter(`Country Name` == "Ireland") |>
# Pivot year columns (1996 to 2021) into rows
pivot_longer(cols = `1996`:`2021`, names_to = "year", values_to = "internet_usage") |>
mutate(
year = as.numeric(year),
internet_usage = round(internet_usage, 3)) |>
select(year, internet_usage)
# Process Economy Data
economy_clean <- read_csv("ireland_economy.csv") |>
# 1. Use %in% to select BOTH (this is like an "OR" filter)
filter(`Statistic Label` %in% c("GDP at Constant Market Prices",
"GNP at Constant Market Prices")) |>
# 2. Extract Year
mutate(year = as.numeric(str_extract(Quarter, "^\\d{4}"))) |>
# 3. Group by Year AND the Label to keep them separate
group_by(year, `Statistic Label`) |>
summarise(annual_value = sum(VALUE, na.rm = TRUE), .groups = "drop") |>
# transformed quarterly CSO GDP data into annual totals to match the news publication frequency
# Pivot the labels into their own columns (annual_gdp and annual_gnp)
pivot_wider(names_from = `Statistic Label`, values_from = annual_value) |>
rename(
annual_gdp = `GDP at Constant Market Prices`,
annual_gnp = `GNP at Constant Market Prices`
) |>
filter(year >= 1996 & year <= 2021)
# 3. Process News Data (Aggregating by Year)
news_yearly <- ire_news_clean |>
filter(news_provider %in% c("Irish Times", "TheJournal.ie")) |>
mutate(year = year(dmy(publish_date))) |>
filter(!is.na(year)) |>
group_by(year, news_provider) |>
summarise(n_articles = n(), .groups = "drop")
# The Big Join!
# left join keyed on 'year' to preserve all news publication
# Join the external data into news counts
final_analysis_data <- news_yearly |>
left_join(internet_clean, by = "year") |>
left_join(economy_clean, by = "year")
# Check the result
head(final_analysis_data)
tail(final_analysis_data)
## # A tibble: 6 × 6
## year news_provider n_articles internet_usage annual_gdp annual_gnp
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 1996 Irish Times 19009 2.2 117239 113891
## 2 1996 TheJournal.ie 8352 2.2 117239 113891
## 3 1997 Irish Times 19381 4.09 130162 124946
## 4 1997 TheJournal.ie 8318 4.09 130162 124946
## 5 1998 Irish Times 19174 8.1 141571 134403
## 6 1998 TheJournal.ie 8275 8.1 141571 134403
## # A tibble: 6 × 6
## year news_provider n_articles internet_usage annual_gdp annual_gnp
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 2019 Irish Times 21584 87 401983 305914
## 2 2019 TheJournal.ie 9358 87 401983 305914
## 3 2020 Irish Times 17911 92 430740 317898
## 4 2020 TheJournal.ie 7662 92 430740 317898
## 5 2021 Irish Times 9827 93.5 500771 361583
## 6 2021 TheJournal.ie 4160 93.5 500771 361583
# Scaling the data for visual comparison
final_analysis_data <- final_analysis_data |>
group_by(news_provider) |> # Scale within each provider if needed
mutate(
scaled_articles = round((n_articles - min(n_articles)) / (max(n_articles) - min(n_articles)), 3),
scaled_gdp = round((annual_gdp - min(annual_gdp)) / (max(annual_gdp) - min(annual_gdp)), 3),
scaled_gnp = round((annual_gnp - min(annual_gnp)) / (max(annual_gnp) - min(annual_gnp)), 3),
scaled_internet = round((internet_usage - min(internet_usage)) / (max(internet_usage) - min(internet_usage)), 3)
)|>
ungroup()
As observed in the tail() output, the article count for
2021 is approximately half that of 2020. Given that such a sharp decline
is unlikely to reflect a genuine reduction in newsroom capacity (e.g., a
50% workforce reduction), a more plausible explanation is that the 2021
data is incomplete — for instance, covering only part of the year (e.g.,
up to mid-year).
Accordingly, the 2021 observations were excluded from the correlation analysis. Including an incomplete year would distort the temporal trend and potentially lead to biased or misleading statistical inferences.
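Whether a year is only partly covered can be checked rather than assumed. The sketch below uses invented dates (the `toy_dates` tibble is hypothetical; in the real data the parsed `publish_date` column would take its place) and counts how many distinct months appear in each year:

```r
library(dplyr)
library(lubridate)

# Hypothetical parsed dates: 2020 spans the full year, 2021 stops in June
toy_dates <- tibble(
  date_obj = c(
    seq(ymd("2020-01-15"), ymd("2020-12-15"), by = "month"),
    seq(ymd("2021-01-15"), ymd("2021-06-15"), by = "month")
  )
)

coverage <- toy_dates |>
  mutate(year = year(date_obj)) |>
  group_by(year) |>
  summarise(
    months_covered = n_distinct(month(date_obj)),
    last_date      = max(date_obj),
    .groups = "drop"
  )

print(coverage)
```

A year with fewer than twelve covered months, or a `last_date` well before 31 December, would support treating that year as incomplete.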
# Article volume over time
ggplot(final_analysis_data |>
filter(year < 2021), aes(x = year, y = n_articles)) +
geom_line(color = "grey70") +
geom_point(aes(color = news_provider)) +
geom_smooth(method = "loess", color = "blue", fill = "lightblue", alpha = 0.2) + # Shows the average trend
facet_wrap(~news_provider, scales = "free_y") + # 'free_y' lets each provider's y-axis fit its own range (min and max)
theme_minimal() +
labs(
title = "Comparative Trends: Legacy vs. Digital Native",
subtitle = "Fitted trend lines (LOESS) showing volume stability vs. growth",
x = "Year", y = "Total Articles"
)
The temporal trends in article volume for the two news providers exhibit a highly similar pattern, both following an inverted U-shaped trajectory. Publication output increased steadily from 1996 (the starting point of the dataset), reached a pronounced peak around 2009, and subsequently entered a period of decline.
A closer examination of the y-axis indicates a substantial difference in scale: the publication volume of TheJournal.ie remains consistently less than half that of The Irish Times, suggesting a significant disparity in production capacity between the two providers.
The year 2010 is identified as a structural breakpoint, aligning with Ireland’s entry into the EU–IMF bailout programme. This event represents a critical macroeconomic turning point that reshaped both the national economy and the media landscape. Based on this temporal segmentation, two hypotheses are proposed for further validation.
# Hypothesis A: Internet adoption
# - Irish Times growth phase (1996-2010): strong positive correlation
# - Post-saturation (2011-2020): correlation breaks down
# Segment 1: Growth Phase (1996-2010)
it_early1 <- final_analysis_data |> filter(news_provider == "Irish Times" & year <= 2010)
cor.test(it_early1$n_articles, it_early1$internet_usage)
##
## Pearson's product-moment correlation
##
## data: it_early1$n_articles and it_early1$internet_usage
## t = 4.4018, df = 13, p-value = 0.0007153
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4328942 0.9209183
## sample estimates:
## cor
## 0.7736057
it_early2 <- final_analysis_data |> filter(news_provider == "TheJournal.ie" & year <= 2010)
cor.test(it_early2$n_articles, it_early2$internet_usage)
##
## Pearson's product-moment correlation
##
## data: it_early2$n_articles and it_early2$internet_usage
## t = 3.8762, df = 13, p-value = 0.00191
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3519628 0.9050159
## sample estimates:
## cor
## 0.73221
# Segment 2: Saturation Phase (2011-2020) - Exclude 2021 if incomplete
it_late1 <- final_analysis_data |> filter(news_provider == "Irish Times" & year > 2010 & year < 2021)
cor.test(it_late1$n_articles, it_late1$internet_usage)
##
## Pearson's product-moment correlation
##
## data: it_late1$n_articles and it_late1$internet_usage
## t = -2.6337, df = 8, p-value = 0.03001
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.91744409 -0.09079206
## sample estimates:
## cor
## -0.6814625
it_late2 <- final_analysis_data |> filter(news_provider == "TheJournal.ie" & year > 2010 & year < 2021)
cor.test(it_late2$n_articles, it_late2$internet_usage)
##
## Pearson's product-moment correlation
##
## data: it_late2$n_articles and it_late2$internet_usage
## t = -2.8721, df = 8, p-value = 0.02076
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9264931 -0.1502979
## sample estimates:
## cor
## -0.7124993
# Clean comparison table
cor_results <- tibble(
Provider = c("Irish Times", "Irish Times", "TheJournal.ie", "TheJournal.ie"),
Phase = c("Growth (1996-2010)", "Saturation (2011-2020)",
"Growth (1996-2010)", "Saturation (2011-2020)"),
r = c(
cor(it_early1$n_articles, it_early1$internet_usage),
cor(it_late1$n_articles, it_late1$internet_usage),
cor(it_early2$n_articles, it_early2$internet_usage),
cor(it_late2$n_articles, it_late2$internet_usage)
),
p_value = c(
cor.test(it_early1$n_articles, it_early1$internet_usage)$p.value,
cor.test(it_late1$n_articles, it_late1$internet_usage)$p.value,
cor.test(it_early2$n_articles, it_early2$internet_usage)$p.value,
cor.test(it_late2$n_articles, it_late2$internet_usage)$p.value
)
) |>
mutate(
r = round(r, 3),
p_value = round(p_value, 4),
Significant = ifelse(p_value < 0.05, "Yes ✓", "No 𝘹")
)
print(cor_results)
## # A tibble: 4 × 5
## Provider Phase r p_value Significant
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Irish Times Growth (1996-2010) 0.774 0.0007 Yes ✓
## 2 Irish Times Saturation (2011-2020) -0.681 0.03 Yes ✓
## 3 TheJournal.ie Growth (1996-2010) 0.732 0.0019 Yes ✓
## 4 TheJournal.ie Saturation (2011-2020) -0.712 0.0208 Yes ✓
To investigate the influence of technological adoption on news production, Pearson’s correlation tests were conducted across two distinct temporal phases: the Digital Growth Phase (1996–2010) and the Saturation/Consolidation Phase (2011–2020).
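The phase split can also be expressed as a single grouped summary instead of four separate filtered data frames. The following is a minimal sketch on invented yearly values for one provider (the `toy` tibble is hypothetical; the 2010 cut-off mirrors the breakpoint used in this analysis):

```r
library(dplyr)

# Hypothetical yearly data for one provider (values invented for illustration)
toy <- tibble(
  year           = 2005:2016,
  n_articles     = c(100, 110, 125, 130, 150, 160, 155, 140, 120, 115, 100, 90),
  internet_usage = c(40, 45, 52, 58, 63, 68, 73, 76, 78, 80, 82, 84)
)

phase_cor <- toy |>
  mutate(phase = if_else(year <= 2010, "Growth", "Saturation")) |>
  group_by(phase) |>
  summarise(
    n       = n(),                                             # years in the phase
    r       = cor(n_articles, internet_usage),                 # Pearson correlation
    p_value = cor.test(n_articles, internet_usage)$p.value,    # two-sided test
    .groups = "drop"
  )

print(phase_cor)
```

Keeping `n` in the output makes it visible how few yearly observations sit behind each coefficient.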
The results for the first phase support the “Expansion Hypothesis.” Both providers show a strong positive correlation (The Irish Times: \(r = 0.774, p < 0.001\); TheJournal.ie: \(r = 0.732, p = 0.002\)). During this period, as internet penetration in Ireland climbed from roughly 2% to over 70%, article volume scaled alongside it.
This suggests that in the early digital era, technology acted as a primary catalyst. Newsrooms were not merely migrating content; they were expanding their digital footprint to capture a rapidly growing online audience.
As internet adoption reached saturation (exceeding 80-90%), the correlation with article volume did not simply disappear. On the contrary, it turned significantly negative.
This negative correlation suggests that, as internet access continued climbing toward 100%, audience habits changed. For example, social media platforms and short-form video became major channels for both information and entertainment.
This social media shift is consistent with the data: the decline accelerates from 2013 onward, around the time Facebook and Twitter became major news distribution channels in Ireland.
# Scaled overlay: articles vs internet usage
ggplot(final_analysis_data |> filter(year < 2021),
aes(x = year)) +
# Internet usage line (shared across both panels)
geom_line(aes(y = scaled_internet),
color = "darkgreen", linetype = "dashed", linewidth = 0.8) +
# Article counts per provider
geom_line(aes(y = scaled_articles, color = news_provider), linewidth = 1) +
geom_point(aes(y = scaled_articles, color = news_provider), size = 1.5) +
# Phase break line
geom_vline(xintercept = 2010, linetype = "dotted",
color = "grey40", linewidth = 0.8) +
annotate("text", x = 2009.5, y = 0.95, label = "Phase break\n(2010)",
size = 2.8, color = "grey40", hjust = 1) +
# Label the internet line
annotate("text", x = 1998, y = 0.08,
label = "Internet\nUsage", size = 2.8, color = "darkgreen") +
facet_wrap(~news_provider) +
scale_color_manual(values = c("Irish Times" = "steelblue",
"TheJournal.ie" = "coral")) +
theme_minimal() +
labs(
title = "Article Volume vs. Internet Penetration (Min-Max Scaled)",
subtitle = "Dashed green = Internet usage (%) | Coloured = Article counts",
x = "Year", y = "Scaled Value [0, 1]", color = "Provider"
) +
theme(legend.position = "bottom")
# Hypothesis B: Economic Shock
# Year-on-year % change table
yoy_changes <- final_analysis_data |>
filter(year < 2021) |>
group_by(news_provider) |>
arrange(year) |>
mutate(
article_chg = round((n_articles - lag(n_articles)) / lag(n_articles) * 100, 1),
gnp_chg = round((annual_gnp - lag(annual_gnp)) / lag(annual_gnp) * 100, 1)
) |>
ungroup() |>
filter(year >= 2007 & year <= 2014) |>
select(year, news_provider, n_articles, article_chg, annual_gnp, gnp_chg)
print(yoy_changes)
## # A tibble: 16 × 6
## year news_provider n_articles article_chg annual_gnp gnp_chg
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 2007 Irish Times 24643 4.3 214382 3.7
## 2 2007 TheJournal.ie 10451 5.1 214382 3.7
## 3 2008 Irish Times 25768 4.6 205935 -3.9
## 4 2008 TheJournal.ie 10886 4.2 205935 -3.9
## 5 2009 Irish Times 26580 3.2 188632 -8.4
## 6 2009 TheJournal.ie 11381 4.5 188632 -8.4
## 7 2010 Irish Times 25441 -4.3 195502 3.6
## 8 2010 TheJournal.ie 10996 -3.4 195502 3.6
## 9 2011 Irish Times 25204 -0.9 191408 -2.1
## 10 2011 TheJournal.ie 10577 -3.8 191408 -2.1
## 11 2012 Irish Times 23898 -5.2 189944 -0.8
## 12 2012 TheJournal.ie 10391 -1.8 189944 -0.8
## 13 2013 Irish Times 19770 -17.3 201350 6
## 14 2013 TheJournal.ie 8643 -16.8 201350 6
## 15 2014 Irish Times 21381 8.1 221115 9.8
## 16 2014 TheJournal.ie 9342 8.1 221115 9.8
Given the relatively small crisis window (n = 8, 2007–2014), year-on-year change data was used alongside correlation tests to identify patterns more explicitly.
The year-on-year change table reveals a critical temporal pattern. While Ireland’s GNP contracted sharply in 2008 (−3.9%) and 2009 (−8.4%), both providers continued growing during this period, likely reflecting increased demand for economic and political reporting as Ireland navigated banking collapses, austerity budgets, and the EU-IMF bailout negotiations.
The sharp decline in 2013 (−17.3% for The Irish Times, −16.8% for TheJournal.ie) cannot be explained by GNP alone. Instead, it likely reflects structural changes in advertising. Evidence from the Irish online advertising market shows that by 2010, digital ad spending was already growing at 13.5% year-on-year, gradually displacing traditional media. As audiences shifted toward social media, revenue from display and classified advertising (key income sources for traditional newsrooms) steadily declined. This structural erosion accumulated over time and became most visible in 2013, even as the broader economy was recovering.
Pearson correlation tests across the full decline phase (2009–2020) confirm a significant negative relationship for both providers (Irish Times: \(r = -0.691, p = 0.013\); TheJournal.ie: \(r = -0.729, p = 0.007\)). The negative direction is counterintuitive at first, since articles continued falling as GNP recovered post-2013, but this paradox is itself informative.
Notably, 2014 was a turning point: both providers rebounded identically (+8.1%) as GNP surged by 9.8%, suggesting the economic relationship was weakened but not entirely severed.
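As a quick arithmetic check of the year-on-year figures, the percentage change is \((\text{current} - \text{previous}) / \text{previous} \times 100\). The sketch below applies it to the Irish Times counts for 2012 to 2014 as printed in the year-on-year table (the vector name `it_counts` is introduced here only for illustration):

```r
# Irish Times yearly article counts copied from the year-on-year table
it_counts <- c(`2012` = 23898, `2013` = 19770, `2014` = 21381)

# Year-on-year percentage change: (current - previous) / previous * 100
yoy_pct <- round(diff(it_counts) / head(it_counts, -1) * 100, 1)
yoy_pct
```

The two results correspond to the 2013 drop and the 2014 rebound reported for The Irish Times.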
This chart illustrates the trends in Gross National Product (GNP) and the number of articles over time. The period from 2008 to 2012 is shaded in light red. The GNP curve (dark red dashed line) follows a “U” shape, while the curve representing the number of articles (blue/coral solid line) follows an inverted “U” shape.
# Scaled overlay with crisis annotation
ggplot(final_analysis_data |> filter(year < 2021), aes(x = year)) +
# Crisis shading
annotate("rect", xmin = 2008, xmax = 2012,
ymin = -Inf, ymax = Inf, alpha = 0.08, fill = "red") +
annotate("text", x = 2010, y = 1.02,
label = "Crisis\n(2008-2012)", size = 2.8, color = "red") +
# GNP line
geom_line(aes(y = scaled_gnp), color = "darkred",
linetype = "dashed", linewidth = 0.8) +
annotate("text", x = 1997.5, y = 0.18,
label = "GNP", size = 2.8, color = "darkred") +
# Article lines per provider
geom_line(aes(y = scaled_articles, color = news_provider), linewidth = 1) +
geom_point(aes(y = scaled_articles, color = news_provider), size = 1.5) +
facet_wrap(~news_provider) +
scale_color_manual(values = c("Irish Times" = "steelblue",
"TheJournal.ie" = "coral")) +
theme_minimal() +
labs(
title = "Article Volume vs. GNP During Economic Crisis",
subtitle = "Dashed red = GNP | Shaded = Financial Crisis Period",
x = "Year", y = "Scaled Value [0, 1]", color = "Provider"
) +
theme(legend.position = "bottom")
# Correlation: full decline phase (more power than crisis window alone)
decline_it <- final_analysis_data |>
filter(news_provider == "Irish Times" & year >= 2009 & year < 2021)
decline_jn <- final_analysis_data |>
filter(news_provider == "TheJournal.ie" & year >= 2009 & year < 2021)
cor.test(decline_it$n_articles, decline_it$annual_gnp)
cor.test(decline_jn$n_articles, decline_jn$annual_gnp)
##
## Pearson's product-moment correlation
##
## data: decline_it$n_articles and decline_it$annual_gnp
## t = -3.0205, df = 10, p-value = 0.01288
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9056242 -0.1935259
## sample estimates:
## cor
## -0.6907136
##
## Pearson's product-moment correlation
##
## data: decline_jn$n_articles and decline_jn$annual_gnp
## t = -3.3724, df = 10, p-value = 0.007093
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9187441 -0.2675926
## sample estimates:
## cor
## -0.7294684
# Lag correlation — tests whether GNP predicts articles 1-2 years later
# library(dplyr)
# Create lagged GNP variable
gnp_lagged <- final_analysis_data |>
filter(news_provider == "Irish Times") |> # GNP is same for both, use either
arrange(year) |>
select(year, annual_gnp) |>
mutate(
gnp_lag1 = lag(annual_gnp, 1), # GNP from 1 year prior
gnp_lag2 = lag(annual_gnp, 2) # GNP from 2 years prior
)
# Join back
decline_it_lag <- decline_it |>
left_join(gnp_lagged |> select(year, gnp_lag1, gnp_lag2), by = "year")
decline_jn_lag <- decline_jn |>
left_join(gnp_lagged |> select(year, gnp_lag1, gnp_lag2), by = "year")
# Isolate just the crash and immediate aftermath
crash_window <- final_analysis_data |>
filter(year >= 2007 & year <= 2014) |>
filter(news_provider == "Irish Times") |>
arrange(year) |>
mutate(
gnp_lag1 = lag(annual_gnp, 1),
gnp_lag2 = lag(annual_gnp, 2)
)
cat("Crisis window only (2007-2014):\n",
"Lag 0:", cor(crash_window$n_articles, crash_window$annual_gnp, use="complete.obs"), "\n",
"Lag 1:", cor(crash_window$n_articles, crash_window$gnp_lag1, use="complete.obs"), "\n",
"Lag 2:", cor(crash_window$n_articles, crash_window$gnp_lag2, use="complete.obs"), "\n")
## Crisis window only (2007-2014):
## Lag 0: -0.4431352
## Lag 1: 0.4089929
## Lag 2: 0.6718416
crash_window <- final_analysis_data |>
filter(year >= 2007 & year <= 2014) |>
filter(news_provider == "Irish Times") |>
arrange(year) |>
mutate(
gnp_lag2 = lag(annual_gnp, 2)
) |>
filter(!is.na(gnp_lag2)) |>
mutate(
scaled_articles = (n_articles - min(n_articles)) / (max(n_articles) - min(n_articles)),
scaled_gnp_lag2 = (gnp_lag2 - min(gnp_lag2)) / (max(gnp_lag2) - min(gnp_lag2))
)
ggplot(crash_window, aes(x = year)) +
geom_line(aes(y = scaled_articles, color = "Article Count"), linewidth = 1) +
geom_line(aes(y = scaled_gnp_lag2, color = "GNP (2-year lag)"),
linetype = "dashed", linewidth = 1) +
geom_point(aes(y = scaled_articles, color = "Article Count"), size = 2) +
geom_point(aes(y = scaled_gnp_lag2, color = "GNP (2-year lag)"), size = 2) +
scale_color_manual(values = c("Article Count" = "steelblue",
"GNP (2-year lag)" = "darkred")) +
theme_minimal() +
labs(
title = "Irish Times Article Volume vs. GNP with 2-Year Lag (2009–2014)",
subtitle = "GNP shifted forward 2 years — shows editorial budget response delay",
x = "Year", y = "Scaled Value [0,1]", color = NULL
) +
theme(legend.position = "bottom")
To investigate whether newsroom responses were immediate or delayed, lag correlation analysis was conducted across the crisis window (2007–2014).
At Lag 0, the correlation was negative (\(r = -0.443\)), suggesting same-year economic conditions did not directly suppress output. The correlation turned positive at Lag 1 (\(r = +0.409\)) and peaked at Lag 2 (\(r = +0.672\)), which is consistent with a two-year lagged response, although each estimate rests on only six to eight yearly observations.
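With so few yearly observations, it helps to check how much evidence a correlation of this size carries. The sketch below uses the standard t-test for a Pearson correlation with \(n - 2\) degrees of freedom; the helper `cor_p_value()` is hypothetical and introduced only for illustration:

```r
# Hypothetical helper: two-sided p-value for a Pearson correlation r
# computed from n paired observations (t-test with n - 2 degrees of freedom)
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Lag 2 keeps 2009-2014 only, because the first two years of the
# 2007-2014 window have no lagged GNP value
p_lag2 <- cor_p_value(r = 0.672, n = 6)
p_lag2
```

Running `cor.test()` on the lagged columns directly would return the same p-value together with a confidence interval, which is a useful companion to the raw coefficients printed by `cat()`.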
This lag is consistent with how newsrooms operate. Editorial budgets are typically set annually, staff contracts run for fixed periods, and declines in advertising revenue take time to translate into cost reductions.
As a result, the economic downturn in 2008–2009 did not immediately affect output. Instead, its impact appeared later, with article volumes falling most sharply in 2012–2013, a pattern clearly reflected in the year-on-year change table.
The two hypotheses (internet adoption and economic shock) do not contradict each other; they describe consecutive phases. Furthermore, even when the economy recovered, article volumes did not bounce back, likely because social media began to replace traditional news websites.
The economic crisis then triggered the contraction phase (2010–2020), with a characteristic 1–2 year lag, reflecting the delayed response of newsroom budgets to falling advertising revenue.
In summary, the 25-year trend in Irish online news wasn’t caused by one single event. Instead, it was a mix of the internet boom, the delayed effects of the economic crash, and the rise of social media.