Dataset
VersoVector expects the main poetry dataset to be available locally under the repository data/ directory.
Expected file
data/PoetryFoundationData.csv
This file is not committed to the public repository.
It should be downloaded locally before running the notebook pipeline.
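Because the CSV is not shipped with the repository, a notebook can fail in a confusing way if the download step was skipped. The following is a minimal sketch of a pre-flight check, not part of the project itself: the helper name `require_dataset` and the constant `DATASET_RELATIVE_PATH` are hypothetical, and only the relative path `data/PoetryFoundationData.csv` comes from this document.

```python
from pathlib import Path

# Relative path taken from this README; the names below are hypothetical.
DATASET_RELATIVE_PATH = Path("data") / "PoetryFoundationData.csv"


def require_dataset(repo_root: Path) -> Path:
    """Return the dataset path, or raise if the CSV has not been downloaded yet."""
    dataset_path = repo_root / DATASET_RELATIVE_PATH
    if not dataset_path.is_file():
        raise FileNotFoundError(
            f"Expected dataset at {dataset_path}. "
            "Download it from Kaggle before running the notebooks."
        )
    return dataset_path
```

Calling `require_dataset(Path.cwd())` from the repository root at the top of a notebook would surface a missing file immediately, with a message that says what to do about it.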
Source
The dataset used by the public repository is:
tgdivy/poetry-foundation-poems
Kaggle page:
https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems/data
The dataset contains poems from Poetry Foundation with metadata such as poem title, poet, poem text, and tags.
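To confirm that a downloaded file has the expected shape, it can help to look at its header and row count before running any notebook. The sketch below uses only the standard library; the helper name `summarize_csv` is hypothetical, and the UTF-8 encoding is an assumption about the file. The exact column names are not stated here, so the function reports whatever header it finds instead of checking for specific names.

```python
import csv
from pathlib import Path


def summarize_csv(csv_path: Path) -> tuple[list[str], int]:
    """Return the header names and the number of data rows in a CSV file."""
    # newline="" lets the csv module handle line breaks inside quoted poem text.
    # UTF-8 is an assumption; adjust the encoding if the file uses another one.
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `columns, n_rows = summarize_csv(Path("data/PoetryFoundationData.csv"))` should list columns corresponding to the title, poet, poem text, and tags described above.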
Why the dataset is not versioned
The public repository avoids committing raw dataset files and heavy generated artifacts.
This keeps the repository lightweight and makes the data acquisition step explicit for reproducibility.
Expected local-only files include:
data/PoetryFoundationData.csv
artifacts/
figs/
*.joblib
*.pkl
*.npy
*.npz
Option 1: Manual download
Download the dataset from Kaggle and place the CSV here:
VersoVector/data/PoetryFoundationData.csv
Then run the notebook sequence starting from:
notebook/01_cleaning_pipeline.ipynb
Option 2: Download with kagglehub
Install kagglehub:
pip install kagglehub
Then run:
import os
import ssl
import shutil
from pathlib import Path

import kagglehub

# SSL workaround for local environments with certificate issues.
# Do not use this workaround in CI or production environments.
if not os.environ.get("CI"):
    ssl._create_default_https_context = ssl._create_unverified_context

# Run this from the repository root so that data/ resolves to VersoVector/data/.
here: Path = Path.cwd().absolute()
data_dir: Path = here / "data"
data_dir.mkdir(exist_ok=True)
target_file: Path = data_dir / "PoetryFoundationData.csv"

if not target_file.is_file():
    downloaded_path = Path(
        kagglehub.dataset_download(
            "tgdivy/poetry-foundation-poems"
        )
    )
    # The returned path may be the CSV itself or a directory that contains it.
    if downloaded_path.is_file():
        source_file = downloaded_path
    else:
        source_file = downloaded_path / "PoetryFoundationData.csv"
    shutil.copyfile(source_file, target_file)
    print("Dataset ready at:", target_file)
else:
    print("Dataset already exists at:", target_file)
Expected project layout after download
VersoVector/
├── data/
│   ├── vallejo_poems_en.csv
│   └── PoetryFoundationData.csv
├── notebook/
│   ├── 01_cleaning_pipeline.ipynb
│   ├── 02_feature_pipeline.ipynb
│   ├── 03_embeddings_supervised.ipynb
│   ├── 04_embeddings_unsupervised.ipynb
│   ├── 05_supervised_unsupervised_integration.ipynb
│   └── 06_visualizations.ipynb
└── src/
Next step
After placing the dataset in data/, continue with:
notebook/01_cleaning_pipeline.ipynb
The cleaning notebook prepares the processed corpus used by the rest of the pipeline.