Skip to content

Model Topology

This page explains how the models used in VersoVector are connected across the notebook pipeline.

The topology is based on the public repository documentation in docs/model_topology.md and connects the conceptual ML workflow with the executable notebooks.

Project question

VersoVector explores the relationship between the semantic and emotional meaning of poetry through modern NLP and machine learning workflows.

It combines two learning approaches:

  • Unsupervised learning: clustering poems by style, tone, topic, or semantic proximity.
  • Supervised learning: classifying poems by emotion, theme, or poetic tone.

The project seeks to answer:

Can a language model perceive the emotion behind a poem, as a human reader does?

Notebook-to-model map

NotebookMain responsibilityTopology stage
01_cleaning_pipeline.ipynbCleans poems, normalizes text, prepares processed datasetsOriginal poems → preprocessing
02_feature_pipeline.ipynbBuilds the shared feature representationFeatureUnion, CountVectorizer, TF-IDF, DictVectorizer, Normalize
03_embeddings_supervised.ipynbTrains supervised multilabel modelsStackingClassifier, OneVsRest, Naive Bayes, Logistic Regression
04_embeddings_unsupervised.ipynbBuilds unsupervised analysisClustering, LDA topics, cosine/correlation similarity, projections
05_supervised_unsupervised_integration.ipynbIntegrates both branchesSupervised + unsupervised results integration
06_visualizations.ipynbProduces final interpretation figures2D/3D plots, labels, interpretation assets

General model workflow

Stage 1 — Cleaning

Implemented mainly in:

text
notebook/01_cleaning_pipeline.ipynb

This stage prepares raw poems and metadata for modeling.

Typical responsibilities:

  • normalize text;
  • clean punctuation or formatting noise;
  • prepare tokenization;
  • build stable poem identifiers;
  • generate processed datasets.

Stage 2 — Feature representation

Implemented mainly in:

text
notebook/02_feature_pipeline.ipynb

The project builds a combined representation using FeatureUnion.

The feature union combines:

  • CountVectorizer;
  • TfidfVectorizer;
  • custom dictionary features through TextToDictTransformer;
  • DictVectorizer;
  • normalization.

This stage produces the shared representation used by both supervised and unsupervised branches.

Stage 3 — Supervised branch

Implemented mainly in:

text
notebook/03_embeddings_supervised.ipynb

The supervised branch predicts emotion, theme, or poetic tone labels.

The topology includes:

  • OneVsRest;
  • StackingClassifier;
  • MultinomialNB;
  • ComplementNB;
  • LogisticRegression as final estimator.

This branch gives names to the patterns learned from labeled examples.

Stage 4 — Unsupervised branch

Implemented mainly in:

text
notebook/04_embeddings_unsupervised.ipynb

The unsupervised branch discovers emergent relationships without relying only on previous labels.

It includes:

  • clustering with KMeans, GaussianMixture, DBSCAN, and AgglomerativeClustering;
  • topic modeling with LatentDirichletAllocation;
  • similarity search with cosine similarity and correlation;
  • dimensionality reduction or projections using PCA, t-SNE, or UMAP.

Note: UMAP and t-SNE are dimensionality reduction methods, not clustering algorithms.

Note: LDA needs an interpretable document-term matrix, so it uses an independent count-based or TF-IDF representation.

Stage 5 — Results integration

Implemented mainly in:

text
notebook/05_supervised_unsupervised_integration.ipynb

This stage connects what the model predicts with what the model discovers.

It integrates:

  • predicted labels;
  • clusters;
  • topics;
  • topic terms;
  • semantic neighbors;
  • projection coordinates.

Conceptually:

text
Clustering + LDA + Similarity + Supervised classification

Results integration

Final interpretation: emotion + emergent themes

Stage 6 — Visual interpretation

Implemented mainly in:

text
notebook/06_visualizations.ipynb

The visualization stage helps interpret the combined results.

Useful outputs include:

  • 2D/3D projections;
  • cluster plots;
  • topic charts;
  • tag distributions;
  • integrated result tables;
  • nearest-neighbor examples.

Why supervised and unsupervised models are combined

Poetry often blends emotions, symbols, and tones within the same text.

The unsupervised approach discovers emergent patterns through clustering, topics, and similarity.

The supervised approach predicts emotional or thematic labels from known categories.

Together, they make it possible to compare:

ApproachRole
ClusteringDiscovers emergent groups
Topic modelingSuggests latent thematic structures
Similarity searchFinds semantic neighbors
Supervised classificationAssigns interpretable tags
IntegrationCompares discovered patterns with predicted labels

The unsupervised approach discovers resonances. The supervised approach gives them names.

Language note

Although the project was originally described in Spanish, the datasets and models are trained with poems in English because NLP resources are more available in English.

Built with VitePress and deployed with GitHub Pages.