Dutch general elections are coming up at the end of this month. A lot of people in the Netherlands use the Stemwijzer website to figure out which political party to vote for. You answer 30 statements with Eens (agree), Oneens (disagree), or Geen van beide (neutral), and it shows which parties align most with your answers.
For each statement you can see which political parties agree, disagree, or are neutral about it, and why.
Still, even with tools like this, voting often feels like picking the least bad option. That frustration got me wondering: how different are these parties, really? And in what way? To explore that, I scraped the positions of the various parties on those statements to do some data exploration. I’ll visualize parties in 2D using t-SNE to see if any clusters arise. I’ll also perform a PCA and try to interpret the first two principal components and map them to underlying ideologies.
How the data was collected
I wrote a Python script that scrapes the party responses to the 30 statements from the Stemwijzer website. For each statement we get:
- The theme (broad topic label)
- The statement text
- A short info blurb
- Each party’s stance: Eens (agree), Oneens (disagree), Geen van beide (neutral)
- A short explanation per party (why they answered that way)
The script saves the scraped data as tweedekamer2025.json.
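For reference, given how the file is flattened below (with theme, title, and info as metadata and a positions list per statement), each entry should look roughly like this sketch. The party names and texts are illustrative, taken from the table further down:

[
  {
    "theme": "Abortus",
    "title": "Abortus moet uit het Wetboek van Strafrecht.",
    "info": "Abortus is het afbreken van een zwangerschap ...",
    "positions": [
      {"party": "VVD", "position": "Eens", "explanation": "Abortus is geen misdrijf. ..."},
      {"party": "BBB", "position": "Oneens", "explanation": "Abortus hoort in het Wetboek van Strafrecht thuis. ..."}
    ]
  }
]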
Importing libraries
We are going to need the following dependencies.

import json
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
from wordcloud import WordCloud
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
Loading the JSON
We start our analysis by loading the JSON file produced by the scraper. Each entry contains a statement plus per‑party responses. We’ll reshape it into a dataframe where each row is one issue and columns hold party positions and their explanation texts.
with open('tweedekamer2025.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# We flatten the 'positions' list, keeping 'theme', 'title', and 'info' as identifying metadata.
df_long = pd.json_normalize(
    data,
    record_path='positions',
    meta=['theme', 'title', 'info']
)
# Rename the columns for clarity before pivoting
df_long['position_col'] = df_long['party'] + '_position'
df_long['explanation_col'] = df_long['party'] + '_explanation'

# We create one table for 'position' data
df_position = df_long.pivot_table(
    index=['theme', 'title', 'info'],
    columns='position_col',
    values='position',
    aggfunc='first'
).reset_index()

# We create a second table for 'explanation' data
df_explanation = df_long.pivot_table(
    index=['theme', 'title', 'info'],
    columns='explanation_col',
    values='explanation',
    aggfunc='first'
).reset_index()

# We merge on the common descriptive columns: 'theme', 'title', and 'info'
df_merged = pd.merge(
    df_position,
    df_explanation,
    on=['theme', 'title', 'info']
)

# Sort the columns to get party columns adjacent (e.g., SP_position, SP_explanation)
party_cols = [col for col in df_merged.columns if col not in ['theme', 'title', 'info']]
party_cols.sort(key=lambda x: (x.split('_')[0], x.split('_')[1] == 'explanation'))

# Combine the descriptive columns with the sorted party columns
final_columns = ['theme', 'title', 'info'] + party_cols
df_final = df_merged[final_columns]
A quick peek at the data
Before diving into visualizations, let’s sanity-check what we have so far by looking at the first few rows.
df_final.head()
theme | title | info | 50PLUS_position | 50PLUS_explanation | BBB_position | BBB_explanation | BIJ1_position | BIJ1_explanation | BVNL_position | ... | SP_position | SP_explanation | VVD_position | VVD_explanation | Volt_position | Volt_explanation | Vrede voor Dieren_position | Vrede voor Dieren_explanation | Vrij Verbond_position | Vrij Verbond_explanation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aanwezigheid bij demonstratie | De politie moet bijhouden wie aanwezig is bij ... | In Nederland heb je het recht om te demonstrer... | Eens | Betere handhaving en strenger optreden bij ove... | Oneens | Standaard moeten bijhouden wie een verboden de... | Oneens | Iedereen moet veilig kunnen demonstreren. Het ... | Oneens | ... | Oneens | Demonstreren is een grondrecht. Iedereen heeft... | Eens | De VVD wil dat de politie alle mogelijkheden k... | Oneens | Volt beschermt het recht op protest. Demonstra... | Oneens | Demonstratierecht is een groot goed. | Oneens | Vrij Verbond kiest voor maximale bescherming v... |
1 | Abortus | Abortus moet uit het Wetboek van Strafrecht . | Abortus is het afbreken van een zwangerschap d... | Eens | 50PLUS ziet abortus niet als strafrecht maar m... | Oneens | Abortus hoort in het Wetboek van Strafrecht th... | Eens | Abortus is een mensenrecht en hoort bij basisg... | Oneens | ... | Eens | Abortuszorg is zorg en hoort daarom niet thuis... | Eens | Abortus is geen misdrijf. Als VVD vinden we he... | Eens | Volt vindt dat vrouwen het recht hebben om ove... | Eens | Zelfbeschikking is geen misdaad maar een recht. | Eens | Abortus is een medische keuze, geen misdaad. V... |
2 | Bouwen op landbouwgrond | De regering moet het bouwen van woningen op la... | Gemeentes die woningen willen laten bouwen, mo... | Eens | Als een paar % van de totale landbouwgrond voo... | Oneens | Onze vruchtbare landbouwgrond is waardevol. BB... | Eens | Met name de (melk)veehouderij neemt ontzettend... | Oneens | ... | Eens | De woningnood is groot en vraagt om nieuwe bou... | Eens | De VVD wil flink schrappen in het aantal bouwr... | Eens | Volt wil sneller woningen bouwen, ook op landb... | Eens | We willen landbouwgrond voor 90% omzetten in n... | Eens | Meer woningen zijn hard nodig. Laat grondeigen... |
3 | Controle op religieuze les | De overheid moet strenger controleren wat jong... | De overheid heeft regels gemaakt voor scholen ... | Eens | Onderwijs controle wordt uitgevoerd door de in... | Eens | Buitenlandse mogendheden steken graag hun lang... | Oneens | De surveillance van religieuze groepen belemme... | Eens | ... | Eens | Iedereen mag geloven wat hij wil, maar onderwi... | Eens | Voor het aanleren van onvrije waarden is in ge... | Oneens | Volt staat voor gelijke behandeling. Discrimin... | Oneens | Nederland heeft vrijheid van godsdienst. | Oneens | Vrijheid van onderwijs en geloof staat centraa... |
4 | Eigen risico zorgverzekering | Het eigen risico bij zorgverzekeringen wordt p... | In Nederland ben je verplicht een zorgverzeker... | Oneens | Dit leidt tot toename koopkracht. Zeer wenseli... | Oneens | Het is belangrijk dat zorg toegankelijk en bet... | Oneens | We zetten een Nationaal Zorgfonds op dat de pr... | Eens | ... | Oneens | Het eigen risico is een boete op ziek zijn. Da... | Eens | Het eigen risico is voor ons belangrijk om de ... | Eens | Volt wil dat zorg betaalbaar blijft voor ieder... | Oneens | De zorg moet voor iedereen betaalbaar zijn. | Eens | Vrij Verbond kiest voor een eerlijk zorgstelse... |
5 rows × 51 columns
Each row equals one statement (issue) from Stemwijzer. For every party we have two columns: one categorical position (_position) and one short text explanation (_explanation).
We should have 30 issues for 24 parties.
num_parties = len([col for col in df_position.columns if col.endswith('_position')])
num_issues = df_position.shape[0]
print(f"Number of parties: {num_parties}")
print(f"Number of issues: {num_issues}")
Number of parties: 24
Number of issues: 30
What do the various parties think about each issue?
We’ll plot a bar chart per statement showing counts of parties that agree, disagree, or stay neutral.
We add logos to each bar so we can see which party thinks what.
# List of parties in the order they appear in the image
parties = [
    'PVV', 'GroenLinks-PvdA', 'VVD', 'NSC', 'D66', 'BBB', 'CDA',
    'SP', 'DENK', 'Partij voor de Dieren', 'FvD', 'SGP', 'ChristenUnie',
    'Volt', 'JA21', 'Vrede voor Dieren', 'BVNL', 'BIJ1', 'Libertaire Partij',
    '50PLUS', 'Piratenpartij', 'FNP', 'Vrij Verbond', 'De Linie'
]

# Load the image
img = Image.open('parties.png')

# Get image dimensions
width, height = img.size

# Find the logo's height:
num_parties = len(parties)
logo_height = height // num_parties

# Cut each logo and store in a dictionary
party_logos = {}
for i in range(num_parties):
    top = i * logo_height
    bottom = (i + 1) * logo_height
    logo = img.crop((0, top, width, bottom))
    party_logos[parties[i]] = logo
Next we create a bar chart for each of the 30 issues.
# Get all party position columns
position_cols = [col for col in df_position.columns if col.endswith('_position')]

# Plotting
fig, axes = plt.subplots(10, 3, figsize=(12, 60), constrained_layout=True)
axes = axes.flatten()

def text_wrap(text, width=40):
    """Wrap text at word boundaries."""
    import textwrap
    return "\n".join(textwrap.wrap(text, width=width))

def bold_text_with_spaces(text):
    """Make text bold while preserving spaces"""
    return ' '.join(f'$\\mathbf{{{word}}}$' for word in text.split())
def add_party_logos(ax, row):
    """Add party logos for each position"""
    y_offsets = [0] * 3  # Track y-offset for each column (Eens, Oneens, Geen van beide)

    # First, count how many parties are in each position to calculate bar heights
    counts = {'Eens': 0, 'Oneens': 0, 'Geen van beide': 0}
    for pos in [val for col, val in row.items() if col.endswith('_position')]:
        if pd.notna(pos):
            counts[pos] += 1

    # Draw the bars
    x = np.arange(3)
    ax.bar(x, [((counts['Eens'] + 1) // 2) * 2.2, ((counts['Oneens'] + 1) // 2) * 2.2, ((counts['Geen van beide'] + 1) // 2) * 2.2],
           color='white', edgecolor=['green', 'red', 'grey'], alpha=0.8, width=0.65)

    # Add counts to the top of each bar (except when count is 0)
    for idx, (label, count) in enumerate(counts.items()):
        if count > 0:
            ax.text(
                idx,
                ((count + 1) // 2) * 2.2 + 0.2,
                str(count),
                ha='center',
                va='bottom',
                fontsize=12,
                fontweight='bold'
            )

    # Now add logos
    for party, pos in [(col.replace('_position', ''), val) for col, val in row.items() if col.endswith('_position')]:
        if party in party_logos and pd.notna(pos):
            if pos == 'Eens':
                x = 0
            elif pos == 'Oneens':
                x = 1
            else:  # Geen van beide
                x = 2

            # Convert logo to array and add to plot
            logo = party_logos[party]
            imagebox = OffsetImage(logo, zoom=0.2)
            if y_offsets[x] % 2 == 0:
                shift_x = x + 0.15
            else:
                shift_x = x - 0.15
            y = (y_offsets[x] // 2) * 2.2 + 1.1
            ab = AnnotationBbox(imagebox, (shift_x, y),
                                frameon=False,
                                box_alignment=(0.5, 0.5))
            ax.add_artist(ab)
            y_offsets[x] += 1
for i, ax in enumerate(axes):
    if i < len(df_position.index):
        row = df_position.iloc[i]

        # Set theme as main title and wrapped title as subtitle
        ax.set_title(f"{bold_text_with_spaces(row['theme'])}\n{text_wrap(row['title'])}",
                     fontsize=10,
                     pad=15,
                     wrap=True)

        # Set up the axes
        ax.set_xlim(-0.5, 2.5)
        ax.set_ylim(0.0, len(position_cols))
        ax.set_xticks([0, 1, 2])
        ax.set_xticklabels(['Eens', 'Oneens', 'Geen van beide'], rotation=0)

        # Add party logos
        add_party_logos(ax, row)

        # Remove y-axis
        ax.set_yticks([])
    else:
        ax.axis('off')

plt.suptitle('Distribution of party positions per theme', fontsize=16)
plt.show()
The theme Illegaal verblijf (“illegal stay”) is the most divisive issue, with an even split between agreeing and disagreeing parties. Conversely, Heffing op plastic (“plastic tax”) is the least divisive, showing a strong consensus among most parties.
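We can back this up with a quick sketch: rank the issues by how evenly the agreeing and disagreeing camps are split (a small gap means a divisive issue; this ignores the neutral answers):

# Count agreeing and disagreeing parties per issue and rank by the gap
splits = pd.DataFrame({
    'eens': (df_position[position_cols] == 'Eens').sum(axis=1),
    'oneens': (df_position[position_cols] == 'Oneens').sum(axis=1),
})
splits.index = df_position['theme']
splits['gap'] = (splits['eens'] - splits['oneens']).abs()
print(splits.sort_values('gap').head(3))   # most divisive issues
print(splits.sort_values('gap').tail(3))   # least divisive issues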
What reasons do parties give when they explain their positions?
To get an idea of the reasons parties give for agreeing or disagreeing with a statement, I’ll plot a word cloud for each issue. In a word cloud, the size of a word indicates how frequently it is used.
We remove stopwords to make sure the word clouds aren’t crowded with irrelevant words.
# Ensure NLTK stopwords are downloaded and get Dutch stopwords
nltk.download('stopwords')
dutch_stopwords = set(stopwords.words('dutch'))
[nltk_data] Downloading package stopwords to /home/aswin/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
I’m color-coding the words as follows: a word is shown in green if it’s used more frequently by agreeing parties, and in red if disagreeing parties use it more.
def generate_color_func(agree_words, disagree_words):
    """Generate a color function based on word frequencies in agree/disagree groups"""
    def color_func(word, **kwargs):
        # Get frequencies (default to 0 if word not in dict)
        agree_freq = agree_words.get(word, 0)
        disagree_freq = disagree_words.get(word, 0)

        # Calculate ratio (avoiding division by zero)
        total = agree_freq + disagree_freq
        if total == 0:
            return 'grey'
        ratio = agree_freq / total

        # Create RGB color based on ratio (green to red gradient)
        r = int(255 * (1 - ratio))
        g = int(255 * ratio)
        b = 0

        return f'rgb({r}, {g}, {b})'
    return color_func

def create_word_frequencies(text_list):
    """Create word frequency dictionary from list of texts"""
    from collections import Counter
    import string

    # Create translation table to remove punctuation
    translator = str.maketrans('', '', string.punctuation)

    words = []
    for text in text_list:
        if isinstance(text, str):
            # Remove punctuation and split into words
            clean_text = text.translate(translator)
            # Split text into words and filter out stopwords
            words.extend([word.lower() for word in clean_text.split()
                          if word.lower() not in dutch_stopwords
                          and len(word) > 2])
    return Counter(words)
# Create word clouds for each issue
fig, axes = plt.subplots(6, 5, figsize=(18, 23))
axes = axes.flatten()

for idx, (_, row) in enumerate(df_final.iterrows()):
    if idx >= len(axes):
        break

    # Get explanations for parties that agree and disagree
    agree_explanations = []
    disagree_explanations = []

    for party in parties:
        pos_col = f'{party}_position'
        exp_col = f'{party}_explanation'

        if pos_col in row and exp_col in row and pd.notna(row[pos_col]) and pd.notna(row[exp_col]):
            if row[pos_col] == 'Eens':
                agree_explanations.append(row[exp_col])
            elif row[pos_col] == 'Oneens':
                disagree_explanations.append(row[exp_col])

    # Create word frequencies for each group
    agree_words = create_word_frequencies(agree_explanations)
    disagree_words = create_word_frequencies(disagree_explanations)

    # Combine all words for the word cloud
    all_words = {}
    for word in set(list(agree_words.keys()) + list(disagree_words.keys())):
        all_words[word] = max(agree_words.get(word, 0), disagree_words.get(word, 0))

    if all_words:  # Only create word cloud if we have words
        # Create and generate word cloud
        wc = WordCloud(width=400, height=300, background_color='white',
                       color_func=generate_color_func(agree_words, disagree_words),
                       max_words=50)
        wc.generate_from_frequencies(all_words)

        # Display word cloud
        axes[idx].imshow(wc, interpolation='bilinear')
        axes[idx].axis('off')
        axes[idx].set_title(f"{bold_text_with_spaces(row['theme'])}\n{text_wrap(row['title'])}",
                            fontsize=10,
                            pad=15,
                            wrap=True)

# Remove empty subplots
for idx in range(len(df_final), len(axes)):
    axes[idx].axis('off')

plt.suptitle('Words used in party explanations\n(Green: Words from agreeing parties, Red: Words from disagreeing parties)',
             fontsize=16, y=0.99)
plt.tight_layout()
plt.show()
When we look at the word cloud for the issue of kerncentrales (“nuclear power plants”), for instance, parties that support building extra nuclear power plants use words like reliable, stable, and clean, while opponents use words like costly and slow, referring to the building process.
How similarly do parties answer across all 30 statements?
For each pair of parties we compute the percentage of statements on which both gave the same response (agree, disagree, or neutral).
# Get the raw positions (without encoding)
df_positions = df_position[position_cols]

# Clean up party names by removing '_position' suffix
party_names = [col.replace('_position', '') for col in position_cols]

# Initialize agreement matrix with zeros, explicitly setting dtype to float64
agreement_matrix = pd.DataFrame(0.0, index=party_names, columns=party_names, dtype='float64')

# Calculate agreement percentages
for i, party1 in enumerate(position_cols):
    for j, party2 in enumerate(position_cols):
        # Get positions for both parties where neither is null
        mask = df_positions[party1].notna() & df_positions[party2].notna()
        positions1 = df_positions[party1][mask]
        positions2 = df_positions[party2][mask]

        if len(positions1) > 0:  # Only calculate if we have valid positions
            # Count where both agree:
            # - both Eens
            # - both Oneens
            # - both Geen van beide
            agreements = ((positions1 == 'Eens') & (positions2 == 'Eens')) | \
                         ((positions1 == 'Oneens') & (positions2 == 'Oneens')) | \
                         ((positions1 == 'Geen van beide') & (positions2 == 'Geen van beide'))

            # Calculate percentage
            agreement_pct = (agreements.sum() / len(positions1)) * 100

            # Store in matrix using cleaned party names
            agreement_matrix.iloc[i, j] = agreement_pct

# Round to 1 decimal place
agreement_matrix = agreement_matrix.round(1)
We use a heatmap to visualize the result.
plt.figure(figsize=(15, 15))
im = plt.imshow(agreement_matrix, cmap='RdYlBu', vmin=0, vmax=100)
plt.colorbar(im, fraction=0.046, pad=0.04, label='Agreement %')
plt.xticks(range(len(party_names)), party_names, rotation=90)
plt.yticks(range(len(party_names)), party_names)
plt.title('Percentage of issues on which parties agree')

# Add percentage text to each cell
for i in range(len(party_names)):
    for j in range(len(party_names)):
        text = plt.text(j, i, f'{agreement_matrix.iloc[i, j]}%',
                        ha='center', va='center')

plt.tight_layout()
plt.show()
Cells show % agreement. The 100% values on the main diagonal are expected, as a party always agrees with its own answers.
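A minimal one-line sanity check on that diagonal:

# Every party should agree with itself on 100% of the issues
assert np.allclose(np.diag(agreement_matrix.values), 100.0)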
What parties are the most and the least in agreement with each other?
Ignoring the 100% self-agreement, what are the highest and lowest agreement pairs?
min_agreement = agreement_matrix.values.min()

# Find the second largest unique agreement value
unique_agreements = np.unique(agreement_matrix.values)
if len(unique_agreements) > 1:
    max_agreement = unique_agreements[-2]
else:
    max_agreement = unique_agreements[0]

print(f"Minimal agreement percentage: {min_agreement}%")
print(f"Maximal agreement percentage: {max_agreement}%")
Minimal agreement percentage: 13.3%
Maximal agreement percentage: 93.3%
These are the parties with highest agreement levels:
# Find all party pairs with the second largest agreement
second_max_pairs = []
for i in range(len(party_names)):
    for j in range(i + 1, len(party_names)):
        if agreement_matrix.iloc[i, j] == max_agreement:
            second_max_pairs.append((party_names[i], party_names[j]))

for p1, p2 in second_max_pairs:
    print(f"{p1} ❤️ {p2}: {max_agreement}%")
BIJ1 ❤️ Piratenpartij: 93.3%
The parties with the lowest agreement levels are:
# Find party pairs for min_agreement
party_pairs = []
for i in range(len(party_names)):
    for j in range(i + 1, len(party_names)):
        value = agreement_matrix.iloc[i, j]
        if value == min_agreement:
            party_pairs.append((party_names[i], party_names[j]))

for p1, p2 in party_pairs:
    print(f"{p1} 💔 {p2}: {min_agreement}%")
FvD 💔 GroenLinks-PvdA: 13.3%
PVV 💔 Volt: 13.3%
We have to take into consideration that the statements on the Stemwijzer website are chosen such that we can differentiate between parties. It’s likely that on many issues, seemingly very dissimilar parties like FvD and GroenLinks-PvdA actually agree with each other. It wouldn’t make sense to use those issues as statements for Stemwijzer, as they wouldn’t help a voter distinguish between parties.
Squeezing 30 issues into 2D using t-SNE
By encoding Eens (agree) as 1, Oneens (disagree) as -1, and Geen van beide (neutral) as 0, we transform each party’s stance into a quantifiable, 30-dimensional vector.
pd.set_option('future.no_silent_downcasting', True)

# Create encoded dataframe by mapping positions to numerical values
position_mapping = {
    'Eens': 1.0,
    'Oneens': -1.0,
    'Geen van beide': 0.0
}

# Get only the position columns and convert to numeric values
position_cols = [col for col in df_position.columns if col.endswith('_position')]
df_encoded = df_position[position_cols].copy()
df_encoded = df_encoded.replace(position_mapping)

# Convert to float64
df_encoded = df_encoded.astype('float64')
We can now project the parties into 2D using t-SNE (t-distributed Stochastic Neighbor Embedding).
The idea behind t-SNE is that local neighborhoods are preserved: points that are close together in the original space stay close together in 2D.
Let’s run it to see if clusters emerge.
# Prepare the encoded party positions
X = df_encoded.values

# Run t-SNE
tsne = TSNE(n_components=2, random_state=4, perplexity=5)
X_embedded = tsne.fit_transform(X.T)  # transpose so shape is (n_parties, n_issues)

# Plot the parties in 2D space
plt.figure(figsize=(10, 10))

# Add party logos and points
for i, name in enumerate(party_names):
    # Add the scatter point
    plt.scatter(X_embedded[i, 0], X_embedded[i, 1], color='white', edgecolor='black', s=100)

    # Add the party logo
    if name in party_logos:
        imagebox = OffsetImage(party_logos[name], zoom=0.25)
        ab = AnnotationBbox(imagebox, (X_embedded[i, 0], X_embedded[i, 1]),
                            frameon=False,
                            box_alignment=(0.5, 0.5))
        plt.gca().add_artist(ab)

    # Add party name below the logo
    plt.text(X_embedded[i, 0], X_embedded[i, 1] - 5.5, name,
             fontsize=8, ha='center', va='top')

plt.title('t-SNE visualization of party positions')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.grid(True)
plt.tight_layout()
plt.show()
Nothing too surprising here. Parties that were similar in the heatmap above lie close together in this t-SNE plot as well. Two shortcomings of t-SNE are that it sometimes creates artificial spacing and that its axes don’t represent anything in particular. Different runs also produce different projections, as the sketch below shows.
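Here is a small sketch that re-runs t-SNE with a few different seeds (the seeds 0–2 are arbitrary). The global layout changes per run; only the local neighborhoods should be roughly stable:

# Re-run t-SNE with different seeds: the overall layout changes per run,
# while nearby parties tend to stay near each other
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for ax, seed in zip(axs, [0, 1, 2]):
    emb = TSNE(n_components=2, random_state=seed, perplexity=5).fit_transform(X.T)
    ax.scatter(emb[:, 0], emb[:, 1])
    for i, name in enumerate(party_names):
        ax.annotate(name, (emb[i, 0], emb[i, 1]), fontsize=6, ha='center', va='bottom')
    ax.set_title(f'random_state={seed}')
plt.tight_layout()
plt.show()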
Doing a Principal Component Analysis (PCA)
PCA (Principal Component Analysis) finds orthogonal axes capturing maximal variance. In contrast to t-SNE, axes do have somewhat interpretable meanings. Early components might capture broad ideological splits. Let’s see if this is the case.
# Prepare the encoded party positions (already created above)
X = df_encoded.values

# Run PCA
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X.T)  # transpose so shape is (n_parties, n_issues)

# Plot the parties in 2D space
plt.figure(figsize=(10, 10))
for i, name in enumerate(party_names):
    plt.scatter(X_pca[i, 0], X_pca[i, 1])

    # Add the party logo
    if name in party_logos:
        imagebox = OffsetImage(party_logos[name], zoom=0.25)
        ab = AnnotationBbox(imagebox, (X_pca[i, 0], X_pca[i, 1]),
                            frameon=False,
                            box_alignment=(0.5, 0.5))
        plt.gca().add_artist(ab)

    # Add party name below the logo
    plt.text(X_pca[i, 0], X_pca[i, 1] - 0.20, name,
             fontsize=8, ha='center', va='top')

plt.title('PCA visualization of party positions')
plt.xlabel('PCA dimension 1')
plt.ylabel('PCA dimension 2')
plt.grid(True)
plt.tight_layout()
plt.show()
At first glance PC1 looks like a left–right axis, but we’ll validate that by inspecting which issues load strongly on it. PC2 might reflect a libertarian–authoritarian or progressive–traditional divide. We’ll try to interpret these components below.
How much variance do components capture?
The scree plot below shows how much variance is explained by each principal component (PC).
# Fit PCA with all possible components (up to the number of parties)
pca_full = PCA(n_components=len(position_cols), random_state=42)
X = df_encoded.values
pca_full.fit(X.T)  # shape: (n_parties, n_issues)

# Explained variance ratio for each component
explained_variance = pca_full.explained_variance_ratio_

# Plot the explained variance ratio
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o', label='Individual')
plt.plot(range(1, len(explained_variance) + 1), explained_variance.cumsum(), marker='s', label='Cumulative')
plt.bar(range(1, len(explained_variance) + 1), explained_variance, color='skyblue')
plt.xlabel('Number of PCA components')
plt.ylabel('Explained variance ratio')

# Annotate each component with its percentage of explained variance
for i, v in enumerate(explained_variance):
    plt.text(i + 1, v + 0.005, f"{v*100:.1f}%", ha='center', va='bottom', fontsize=9)

plt.title('Explained variance by PCA components')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Most variance is captured by PC1 (38.2%). Explanatory power then drops quickly, but PC2 still captures 13.3% of the variance.
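As a side note, we can read off how many components a given variance budget requires; the 80% threshold below is just an arbitrary example:

# How many components are needed to explain at least 80% of the variance?
cumulative = explained_variance.cumsum()
n_for_80 = int(np.argmax(cumulative >= 0.8)) + 1
print(f"Components needed for 80% of the variance: {n_for_80}")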
Interpreting the axes
Let’s look at the loading scores of the first two PCs.
# Get PCA loadings (components) for each issue on each factor
# Create DataFrame with loadings scaled by the square root of the explained variance
loadings = pd.DataFrame(
    data=pca.components_.T * np.sqrt(pca.explained_variance_),
    columns=[f"PC{i}" for i in range(1, pca.n_components_ + 1)],
    index=df_position['theme']
)
fig, axs = plt.subplots(2, 1, figsize=(10, 8), sharex=True, sharey=True)
colors = ["#1C3041", "#9B1D20"]

for i, ax in enumerate(axs):
    if i >= pca.n_components_:
        ax.axis('off')
        continue
    explained_variance = pca.explained_variance_ratio_[i] * 100
    pc = f"PC{i+1}"
    bars = ax.bar(range(len(loadings.index)), loadings[pc], color=colors[i], edgecolor="#000000", linewidth=1.2)
    ax.set_title(f"{pc} Loading Scores ({explained_variance:.2f}% Explained Variance)", fontdict={"weight": "bold"}, pad=20)
    if i == len(axs) - 1:  # Only add xlabel to the bottom subplot
        ax.set_xlabel("Thema")
    ax.set_ylabel("Loading Score")
    ax.grid(axis="y")
    ax.set_xticks(range(len(loadings.index)))
    if i == len(axs) - 1:  # Only add x-tick labels to the bottom subplot
        ax.set_xticklabels(loadings.index, rotation=90, ha='right', fontsize=9)
    ax.set_ylim(-1, 1)

    for j, bar in enumerate(bars):
        yval = bar.get_height()
        if abs(yval) > 0.1:  # Only show labels for significant values
            offset = yval + 0.02 if yval > 0 else yval - 0.02
            ax.text(bar.get_x() + bar.get_width() / 2, offset, f"{yval:.2f}",
                    ha="center", va="bottom" if yval > 0 else "top", fontsize=7)

plt.tight_layout()
plt.show()
Strong loadings tell us which statements “pull” most on a component and can help us discover underlying ideological factors.
Let’s look at the five strongest absolute loadings of PC1 and PC2.
# Get absolute loadings for PC1 and PC2
pc1_loadings = loadings['PC1'].abs()
pc2_loadings = loadings['PC2'].abs()

# Get top 5 themes for each component
top5_pc1 = pc1_loadings.sort_values(ascending=False).head(5)
top5_pc2 = pc2_loadings.sort_values(ascending=False).head(5)

print("Top 5 most important themes for PC1:")
for theme in top5_pc1.index:
    print(f"- {theme} (loading: {loadings.loc[theme, 'PC1']:.2f})")

print("\nTop 5 most important themes for PC2:")
for theme in top5_pc2.index:
    print(f"- {theme} (loading: {loadings.loc[theme, 'PC2']:.2f})")
Top 5 most important themes for PC1:
- Illegaal verblijf (loading: 0.88)
- Geld naar publieke omroep (loading: 0.87)
- Minder stikstof (loading: -0.87)
- Wolven (loading: 0.79)
- Profileren op nationaliteit (loading: -0.77)
Top 5 most important themes for PC2:
- Leeftijdsgrens sociale media (loading: 0.65)
- Sekswerk (loading: 0.61)
- Aanwezigheid bij demonstratie (loading: 0.59)
- Meer geld voor defensie (loading: 0.51)
- Meer geld voor ontwikkelingshulp (loading: 0.51)
PC1 loads strongly on issues like immigration, environmental rules, and cultural liberalism. In the past, “left” and “right” mainly referred to views on how much the state should intervene in the market. Today, though, these terms also cover social and cultural topics like immigration, religion, and the environment. So I would say that PC1 does indeed reflect a left-right divide, though not the classic economic one.
PC2 loads primarily on issues involving social regulation, moral permissiveness, and authority (e.g., restrictions on social media use, sex work, and protest). I would therefore interpret it as a libertarian-authoritarian dimension, reflecting preferences for individual freedom and permissiveness versus state control and traditional morality.
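As a rough check on these labels, we can order the parties by their component scores. Note that the sign of a principal component is arbitrary, so which end of each ordering corresponds to “left” or “libertarian” has to be read off the plot above:

# Order parties along each principal component as a sanity check
pc_scores = pd.DataFrame(X_pca, index=party_names, columns=['PC1', 'PC2'])
print("Parties ordered along PC1 (left-right?):")
print(pc_scores['PC1'].sort_values().round(2))
print("\nParties ordered along PC2 (libertarian-authoritarian?):")
print(pc_scores['PC2'].sort_values().round(2))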
Some final remarks
I want to end this post with a few remarks, a disclaimer of sorts.
I am not a political scientist. I tried to interpret the PCs to the best of my understanding, but someone more knowledgeable might interpret the loading scores differently and come up with more accurate labels for the PCs.
I don’t know how the 30 issues on the Stemwijzer website are selected, but it’s unlikely that they are an accurate representation of a party’s viewpoints. By reducing positions to three possible answers we lose any nuance in a party’s stance. Furthermore, the 30 issues are selected to be discriminative: they’re meant to help you choose between parties. It’s likely that parties are more similar than the distance matrix/heatmap above suggests.
Another thing to keep in mind is that when parties supply their positions to the Stemwijzer website, they likely do so strategically by aligning with what they think most voters would prefer. They’re trying to rake in votes and might misrepresent their actual stance on certain issues. Politicians can be manipulative and dishonest. Shocking, I know.
Despite these shortcomings, I did find it insightful to see how the Stemwijzer data reveals underlying left-right and libertarian-authoritarian axes. It helped me realize that my vote shouldn’t just be about agreeing or disagreeing with a list of statements, but also about where on those left-right and libertarian-authoritarian axes I’m comfortable.
You can find the code for both the analysis and webscraper in the following GitHub repository:
Don’t forget to vote!