Newspaper data will be used for three different purposes:
Extracting protest events, which will represent the treatment. For this, only articles containing keywords such as "protest", "demonstration", etc. are relevant. The full article text may not be necessary; snippets around the keywords may suffice (see the sketch after this list).
Representing unknown confounders. Here I want to use the full texts of all articles from one, several, or all newspapers. This is much more data than the above purpose requires, and finding a good data source for it will be challenging.
Extracting discourse characteristics, which will represent the effect. Here it could be useful to use full texts and apply topic models, but counting keywords could also suffice, at least as a first attempt.
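To make the snippet idea concrete, here is a minimal sketch of keyword-based filtering with snippet extraction; the keyword list and window size are illustrative assumptions, not the project's final choices:

```python
import re

# illustrative German protest keywords; the real list would need careful curation
PROTEST_KEYWORDS = ["protest", "demonstration", "kundgebung", "streik"]


def keyword_snippets(text: str, window: int = 200) -> list[str]:
    """Return snippets of +/- `window` characters around each keyword match."""
    pattern = re.compile("|".join(PROTEST_KEYWORDS), flags=re.IGNORECASE)
    return [
        text[max(0, m.start() - window) : m.end() + window]
        for m in pattern.finditer(text)
    ]


article = "Tausende Menschen nahmen an der Demonstration in Leipzig teil."
print(keyword_snippets(article))
```

The same matching logic could double as a crude keyword counter for the discourse-characteristics purpose, by replacing the snippet extraction with `len(pattern.findall(text))`.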
I want to focus:
on Germany: because I am familiar with the language, the politics, and the protest movements of this country,
during the past decade (2013-2022): because I want to reduce time bias as much as possible, and because I expect good coverage of online media for the last decade,
on all protest movements, and especially the climate justice movements: mostly because I am personally interested in them,
via either print or online news or both: wherever I can find the best consistent dataset.
How much data is needed?
Before comparing potential data sources, it is useful to estimate how much data I will need to retrieve, because this will have an effect on the cost and thereby the feasibility of some sources.
```python
from datetime import date, timedelta

from faker import Faker

seed = 20230116
Faker.seed(seed)
fake = Faker()

# generate random dates in the past decade
dates = [
    fake.date_between(start_date="-10y", end_date=date(2022, 12, 31))
    for _ in range(10)
]
dates
```
```python
from protest_impact.data.news.sources.bing import search as bing_search
from protest_impact.data.news.sources.google import search as google_search
from protest_impact.data.news.sources.mediacloud import search as mediacloud_search

engines = [
    ("google", google_search),
    ("mediacloud", mediacloud_search),
]

# some hand-selected news sites, with their Media Cloud id
sites = [
    ("bild.de", 22009),
    ("spiegel.de", 19831),
    ("zeit.de", 22119),
    ("bz-berlin.de", 144263),
    ("stuttgarter-zeitung.de", 39696),
    ("lvz.de", 518104),
]
```
```python
import pandas as pd

data = []
for date in dates[:3]:
    for site in sites:
        for name, engine in engines:
            print(f"{name} {site[0]} {date}")
            site_args = (
                dict(media_id=site[1]) if name == "mediacloud" else dict(site=site[0])
            )
            n_articles = len(engine(None, date=date, **site_args))
            data.append(
                {
                    "engine": name,
                    "site": site[0],
                    "date": str(date),
                    "n_articles": n_articles,
                }
            )
df = pd.DataFrame(data)
df.head()
```
```
google bild.de 2021-06-08
mediacloud bild.de 2021-06-08
google spiegel.de 2021-06-08
mediacloud spiegel.de 2021-06-08
google zeit.de 2021-06-08
mediacloud zeit.de 2021-06-08
google bz-berlin.de 2021-06-08
mediacloud bz-berlin.de 2021-06-08
google stuttgarter-zeitung.de 2021-06-08
mediacloud stuttgarter-zeitung.de 2021-06-08
google lvz.de 2021-06-08
mediacloud lvz.de 2021-06-08
google bild.de 2017-02-16
mediacloud bild.de 2017-02-16
google spiegel.de 2017-02-16
mediacloud spiegel.de 2017-02-16
google zeit.de 2017-02-16
mediacloud zeit.de 2017-02-16
google bz-berlin.de 2017-02-16
mediacloud bz-berlin.de 2017-02-16
google stuttgarter-zeitung.de 2017-02-16
mediacloud stuttgarter-zeitung.de 2017-02-16
google lvz.de 2017-02-16
mediacloud lvz.de 2017-02-16
google bild.de 2016-02-10
mediacloud bild.de 2016-02-10
google spiegel.de 2016-02-10
mediacloud spiegel.de 2016-02-10
google zeit.de 2016-02-10
mediacloud zeit.de 2016-02-10
google bz-berlin.de 2016-02-10
mediacloud bz-berlin.de 2016-02-10
google stuttgarter-zeitung.de 2016-02-10
mediacloud stuttgarter-zeitung.de 2016-02-10
google lvz.de 2016-02-10
mediacloud lvz.de 2016-02-10
```
```
       engine        site        date  n_articles
0      google     bild.de  2021-06-08          87
1  mediacloud     bild.de  2021-06-08         168
2      google  spiegel.de  2021-06-08          66
3  mediacloud  spiegel.de  2021-06-08         114
4      google     zeit.de  2021-06-08          72
```
```python
import altair as alt

alt.Chart(df).mark_boxplot().encode(
    x="engine:N",
    y="n_articles:Q",
    column="site:N",
)
```
One newspaper publishes roughly \(32 \cdot 365 \approx 12{\small,}000\) articles per year (call it 10k), so about 100k articles over the envisioned 10-year period. Most search engines limit results to 100 articles per query, so this corresponds to about 1,000 queries per newspaper. The minimum number of newspapers is probably 5, so that different regions can be compared. More regions (say, 10) with 3 newspapers each on average, plus 20 national newspapers, would result in 50 newspapers and thus 50k queries.
Under the (unverified) assumption that an article contains about 1 kilobyte of text, the full texts would amount to data on the order of 5 GB.
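A quick back-of-envelope check of these numbers:

```python
articles_per_year = 32 * 365        # ~11,700; rounded down to 10k below
articles_per_decade = 10_000 * 10   # ~100k articles per newspaper
queries_per_newspaper = articles_per_decade // 100  # 100 results/query -> 1,000
n_newspapers = 10 * 3 + 20          # 10 regions x 3 regional papers + 20 national
total_queries = queries_per_newspaper * n_newspapers         # 50,000 queries
total_gb = n_newspapers * articles_per_decade * 1_000 / 1e9  # 1 KB/article -> 5 GB
print(total_queries, total_gb)
```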
Which data sources are suitable?
The following potential sources of newspaper articles have been considered. The 💰 symbol gives the price for 50k queries.
Data suitability: 🟡 perfect 🟠 okay 🔴 not suitable
🟡 Media Cloud - open source portal with many corpora, very convenient, quality should be tested, completely free
🟠 Google News via third-party APIs (max. 100 results per query each)
🟠 Süddeutsche - article lists for each year/topic, could be scraped
🟠 Welt - article lists for each day, could be scraped
🟠 Bild - article lists for each day, could be scraped
🔴 FAZ - can send them statistical queries, no full-text access
🔴 Tagesspiegel - cannot find archive
🟠 Internet Archive - could be scraped, bigger newspaper = more coverage: Zeit coverage goes ~10y back, Süddeutsche ~5y, Berliner Zeitung ~3y, Gäubote ~1y
🟠 Deutsches Referenzkorpus - print publications up to 2020, access via KorAP; it is designed for retrieving linguistic statistics, so retrieving full texts seems possible but would be hacky
🔴 Bing News - only goes 14 days into the past
🔴 LexisNexis / Nexis Uni - great selection of print newspapers with full texts, no proper API
🔴 Genios (via Berlin Library) - great content, but no API; even BibBot uses Puppeteer to access individual articles
🔴 Google Search / News - possible to search just by site and date, but no API
🔴 RSS feeds - very shallow history
🔴 Reuters archives - can only find images and videos, no press statements
🔴 DPA - cannot find archive
What is the quality of the Media Cloud data?
How consistent is the Media Cloud data?
```python
data = []
for name, engine in engines:
    for site in sites[:3]:
        for i, start_date in enumerate(dates[:1]):
            print(f"{name} {site[0]} date {i}")
            # choose the Monday before the date
            start_date_ = start_date - timedelta(days=start_date.weekday())
            num_days = 14
            end_date = start_date_ + timedelta(days=num_days)
            site_args = (
                dict(media_id=site[1]) if name == "mediacloud" else dict(site=site[0])
            )
            results = engine(None, date=start_date_, end_date=end_date, **site_args)
            for num_days_ in range(num_days):
                date_ = start_date_ + timedelta(days=num_days_)
                n_articles = len([r for r in results if r.date == date_])
                data.append(
                    {
                        "engine": name,
                        "site": site[0],
                        "date": date_,
                        "n_articles": n_articles,
                    }
                )
df = pd.DataFrame(data)
df.head()
```
```
google bild.de date 0
google spiegel.de date 0
google zeit.de date 0
mediacloud bild.de date 0
last_processed_stories_id: 2365040910
last_processed_stories_id: 2371291110
mediacloud spiegel.de date 0
last_processed_stories_id: 2368085021
mediacloud zeit.de date 0
```
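One way to inspect the day-by-day consistency is to plot daily counts per engine; a minimal altair sketch (the encoding choices are assumptions, not the original chart):

```python
import altair as alt

# assumption: compare daily article counts per engine, faceted by site
alt.Chart(df).mark_line(point=True).encode(
    x="date:T",
    y="n_articles:Q",
    color="engine:N",
    column="site:N",
)
```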
How does the Media Cloud data compare to other data sources?
Can all data be collected?
Can all full texts be scraped?
```python
# sources = {}
# items = {}
# for name, search in engines:
#     items[name] = search("protest", date.fromisoformat("2022-11-27"))
#     print(name, len(items[name]))
#     sources[name] = {website_name(a.url) for a in items[name]}
# core = set.intersection(*sources.values())
# print("core:", *core)
# extended_core = set.union(
#     sources["bing"].intersection(sources["google"]),
#     sources["bing"].intersection(sources["mediacloud"]),
#     sources["google"].intersection(sources["mediacloud"]),
# )
# print("extended_core:", *extended_core)
# for name, sources_ in sources.items():
#     print(name.upper())
#     print()
#     print("additional")
#     print()
#     additional = sources_.difference(extended_core)
#     print(*additional)
#     print()
#     stories_add = [
#         f"{website_name(item.url)}: {item.title}"
#         for item in items[name]
#         if website_name(item.url) in additional
#     ]
#     print("\n".join(stories_add))
#     print()
#     print("missing")
#     print()
#     missing = extended_core.difference(sources_)
#     print(*missing)
#     print()
#     # stories_miss = [
#     #     f"{website_name(item.url)}: {item.title}"
#     #     for item in items[name]
#     #     if website_name(item.url) in missing
#     # ]
#     # print("\n".join(stories_miss))
#     # print()
```
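As a first probe of the full-text question, one could run a generic extractor over a sample of result URLs and measure the success rate; a minimal sketch using the trafilatura library (the choice of extractor is an assumption, not necessarily what this project will use):

```python
import trafilatura


def try_fulltext(url: str) -> str | None:
    """Download a page and extract its main text; returns None on failure."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)


# e.g., estimate the scrape success rate over a sample of result URLs
# sample_urls = [r.url for r in results[:50]]
# ok = sum(try_fulltext(u) is not None for u in sample_urls) / len(sample_urls)
```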