
Notebook Guide

VersoVector takes a notebook-first approach to exploration and validation.

The notebooks should be read and executed in order because each stage depends on outputs from the previous one.

Before running the notebooks, review the Model Topology. It explains how the feature union, supervised branch, unsupervised branch, and integration stage connect across the full workflow.

Before running notebooks

Make sure the dataset files exist at:

```text
data/vallejo_poems_en.csv
data/PoetryFoundationData.csv
```

See Dataset for download instructions.

The first notebook, 01_cleaning_pipeline.ipynb, expects this raw CSV and produces the processed corpus used by the following notebooks.
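A quick pre-flight check can catch a missing download before the first notebook fails partway through. The sketch below is not part of the project; it only assumes the two paths listed above are relative to the directory you run it from, and the helper name `find_missing` is hypothetical.

```python
from pathlib import Path

# Paths copied from the list above. Which notebook reads which file is not
# stated in this guide, so both are checked.
EXPECTED_FILES = [
    "data/vallejo_poems_en.csv",
    "data/PoetryFoundationData.csv",
]


def find_missing(paths, root="."):
    """Return the expected paths that are not regular files under root."""
    base = Path(root)
    return [p for p in paths if not (base / p).is_file()]


if __name__ == "__main__":
    missing = find_missing(EXPECTED_FILES)
    if missing:
        print("Missing dataset files:")
        for p in missing:
            print(f"  {p}")
    else:
        print("All expected dataset files are present.")
```

Run it from the repository root, or pass a different `root` if your working directory differs.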

Notebook sequence

| Step | Notebook | Purpose | Related topology stage |
| --- | --- | --- | --- |
| 1 | 01_cleaning_pipeline.ipynb | Clean raw poetry data and generate the processed corpus | Original poems → preprocessing |
| 2 | 02_feature_pipeline.ipynb | Fit and inspect the shared feature pipeline | FeatureUnion, CountVectorizer, TF-IDF, DictVectorizer |
| 3 | 03_embeddings_supervised.ipynb | Train and evaluate supervised multilabel tag prediction | StackingClassifier + OneVsRest |
| 4 | 04_embeddings_unsupervised.ipynb | Generate similarity, topics, clustering, and projections | LDA, clustering, similarity, PCA/t-SNE/UMAP |
| 5 | 05_supervised_unsupervised_integration.ipynb | Combine supervised and unsupervised outputs | Results integration |
| 6 | 06_visualizations.ipynb | Build final visualizations and interpretation assets | Final interpretation |

```text
01_cleaning_pipeline.ipynb
02_feature_pipeline.ipynb
03_embeddings_supervised.ipynb
04_embeddings_unsupervised.ipynb
05_supervised_unsupervised_integration.ipynb
06_visualizations.ipynb
```
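If you prefer to execute the sequence non-interactively, one option is to drive `jupyter nbconvert --execute` from a small script. This is a sketch under assumptions, not the project's own runner: the `notebooks/` directory is a guess at where the files live, and `build_command` and `run_all` are hypothetical helper names. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path

# Execution order from the sequence above.
NOTEBOOKS = [
    "01_cleaning_pipeline.ipynb",
    "02_feature_pipeline.ipynb",
    "03_embeddings_supervised.ipynb",
    "04_embeddings_unsupervised.ipynb",
    "05_supervised_unsupervised_integration.ipynb",
    "06_visualizations.ipynb",
]


def build_command(notebook, notebook_dir="notebooks"):
    """Build the nbconvert command that executes one notebook in place.

    notebook_dir is an assumption; adjust it to the actual location.
    """
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(Path(notebook_dir) / notebook),
    ]


def run_all(notebooks=NOTEBOOKS, notebook_dir="notebooks", dry_run=True):
    """Run the notebooks in order; stop at the first failure."""
    commands = []
    for nb in notebooks:
        cmd = build_command(nb, notebook_dir)
        commands.append(cmd)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises on a non-zero exit, so later notebooks never
            # run against missing outputs from a failed stage.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

Stopping at the first failure matters here because each stage consumes the outputs of the one before it.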

Viewing notebooks

Some notebook outputs may not render fully on GitHub, especially HTML diagrams or rich model visualizations.

Use nbviewer when you want a more reliable rendered view of the executed notebooks.
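For notebooks hosted on GitHub, nbviewer addresses follow the pattern `https://nbviewer.org/github/<owner>/<repo>/blob/<branch>/<path>`. The helper below is a small illustrative sketch; `OWNER`, `REPO`, the `main` branch, and the `notebooks/` path are placeholders, not the project's real coordinates.

```python
from urllib.parse import quote


def nbviewer_url(owner, repo, path, branch="main"):
    """Build an nbviewer URL for a notebook in a public GitHub repository."""
    parts = [quote(owner), quote(repo), "blob", quote(branch), quote(path)]
    return "https://nbviewer.org/github/" + "/".join(parts)


if __name__ == "__main__":
    # Placeholder values; substitute the actual owner, repository, and path.
    print(nbviewer_url("OWNER", "REPO", "notebooks/06_visualizations.ipynb"))
```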

What to inspect

When reviewing the notebooks, focus on:

  • how the text is cleaned and normalized;
  • how the feature pipeline is built;
  • how sparse matrices are handled;
  • how the supervised branch predicts poetic tags;
  • how clustering and topics are generated;
  • how semantic neighbors are computed;
  • how supervised and unsupervised results are joined;
  • what final visualizations reveal.
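To make the sparse-matrix and semantic-neighbor items concrete, here is a dependency-free sketch of the underlying idea: store only the terms that occur in each text, then rank other texts by cosine similarity. It is a stand-in for illustration only; the notebooks use their own feature pipeline, and the three toy lines and the function names below are made up for this example.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sparse term-count vector: only terms that actually occur are stored."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors held as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_index, vectors, k=2):
    """Return (index, similarity) pairs for the k most similar other vectors."""
    query = vectors[query_index]
    scored = [
        (cosine(query, v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    # Highest similarity first; ties broken by lower index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(i, s) for s, i in scored[:k]]


if __name__ == "__main__":
    # Toy lines invented for this example, not taken from the dataset.
    texts = [
        "black stone on white stone",
        "white stone by the river",
        "rain falls on the city",
    ]
    vectors = [bag_of_words(t) for t in texts]
    for idx, sim in nearest_neighbors(0, vectors, k=2):
        print(f"{texts[idx]!r}: {sim:.3f}")
```

When reading the notebooks, the same questions apply at scale: which representation is compared, which similarity measure is used, and whether the matrices stay sparse through each step.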

Expected local artifacts

The notebooks and scripts may generate artifacts under:

```text
artifacts/
data/
figs/
```

Heavy binary artifacts should not be committed to Git. Regenerate them locally by rerunning the notebooks.
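Before committing, it can help to list large files sitting in the generated directories. The sketch below is an optional convenience, not a project script; the 10 MiB threshold and the function name `large_files` are arbitrary choices for illustration.

```python
from pathlib import Path

# Directories named above as locations for generated artifacts.
ARTIFACT_DIRS = ["artifacts", "data", "figs"]


def large_files(root=".", dirs=ARTIFACT_DIRS, min_bytes=10 * 1024 * 1024):
    """Return sorted (relative_path, size_in_bytes) pairs for large files."""
    root_path = Path(root)
    found = []
    for d in dirs:
        base = root_path / d
        if not base.is_dir():
            continue  # directory not created yet
        for p in base.rglob("*"):
            if p.is_file():
                size = p.stat().st_size
                if size >= min_bytes:
                    found.append((p.relative_to(root_path).as_posix(), size))
    return sorted(found)


if __name__ == "__main__":
    for path, size in large_files():
        print(f"{size / (1024 * 1024):8.1f} MiB  {path}")
```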
