I lead a team of undergraduate researchers in software engineering and NLP tool development at the Neural Tuning of Reading Lab, advised by Dr. Jeremy Purcell at the University of Maryland. My work centers on building computational linguistics pipelines for analyzing spelling-sound relationships, developing open source tooling for the research community, and running technical analyses for upcoming publications on reading and language processing.

Research Focus

The Neural Tuning of Reading Lab investigates how neural mechanisms tune to spelling-sound relationships during reading acquisition. A core focus is the computational modeling of phonographeme mappings – the systematic correspondences between phonemes (speech sounds) and graphemes (letters or letter sequences) – at multiple linguistic grain sizes (phoneme-grapheme, onset-rime, syllable level). English has an opaque orthography, meaning its spelling-sound mappings are inconsistent, which makes it a particularly rich domain for building NLP tools that quantify and analyze these irregularities.

Key research areas include:

  • Orthography-phonology coupling in the brain
  • Developmental trajectories of reading-related brain regions
  • Sublexical vs. lexical processing during reading
  • Pseudoword spelling as a window into sublexical representations

The Sublexical Toolkit

My primary engineering contribution is the Sublexical Toolkit, a forthcoming open source language analysis package with both R and Python interfaces. The toolkit provides researchers with comprehensive measures for analyzing spelling-sound relationships at multiple grain sizes.

Technical Contributions

My most impactful technical contributions to the toolkit involve building the computational linguistics infrastructure from the ground up:

  • NLP Data Pipeline: Built an end-to-end data ingestion and processing pipeline in Python that tokenizes words into phoneme-grapheme mapping units, resolves ambiguous segmentations, and produces structured representations suitable for downstream probability computation. This involved webscraping and parsing tens of thousands of words from several accredited dictionaries, handling edge cases in linguistic data, and ensuring data quality across multiple sources.

  • Phonographeme Probability Engine: Engineered the system that computes mapping and probability tables across syllabic positions and grain sizes – effectively a specialized language model for sublexical structure. The engine processes each word through syllabification (Maximum Onset Principle), segments it into phoneme-grapheme units, and accumulates positional frequency and probability statistics.

  • Dataset Expansion: Scaled the Toolkit’s dataset from 20,000 to 40,000+ words during the webscraping phase, ultimately curating a 22,474-word master list after deduplication and pronunciation conflict resolution.

  • Backend Refactoring: Refactored the Toolkit’s computational backend from R to Python, improving performance and maintainability while preserving the existing R interface for users who prefer it.
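
As a rough illustration of the syllabification step mentioned above, here is a minimal sketch of the Maximum Onset Principle: consonants between two vowels are assigned to the following syllable as long as they form a legal onset. This is not the Toolkit's code. The function name `syllabify`, the ARPAbet-style symbols, and the tiny `VOWELS` and `LEGAL_ONSETS` inventories are all assumptions made for the example.

```python
# Hypothetical, deliberately tiny inventories for illustration only.
VOWELS = {"AE", "AH", "EH", "IH", "IY", "EY", "AY", "OW", "UW", "ER", "AA"}
LEGAL_ONSETS = {
    (), ("B",), ("D",), ("K",), ("L",), ("M",), ("N",), ("P",), ("R",),
    ("S",), ("T",), ("B", "R"), ("K", "R"), ("P", "L"), ("S", "T"),
    ("T", "R"), ("S", "T", "R"),
}


def syllabify(phonemes):
    """Split a phoneme list into syllables using the Maximum Onset Principle."""
    nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
    if not nuclei:
        return [list(phonemes)]
    boundaries = [0]  # start index of each syllable
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phonemes[prev + 1:nxt]  # consonants between two nuclei
        split = len(cluster)
        # Try the longest candidate onset first; the empty onset always matches.
        for k in range(len(cluster) + 1):
            if tuple(cluster[k:]) in LEGAL_ONSETS:
                split = k
                break
        boundaries.append(prev + 1 + split)
    boundaries.append(len(phonemes))
    return [list(phonemes[a:b]) for a, b in zip(boundaries, boundaries[1:])]


print(syllabify(["P", "EY", "S", "T", "R", "IY"]))  # [['P', 'EY'], ['S', 'T', 'R', 'IY']]
print(syllabify(["N", "AE", "P", "K", "IH", "N"]))  # [['N', 'AE', 'P'], ['K', 'IH', 'N']]
```

In the second call the cluster "P K" is not a legal onset in this toy inventory, so only "K" moves to the second syllable and "P" stays behind as a coda.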

Multi-Level Word Decomposition

The toolkit parses each word at multiple linguistic grain sizes simultaneously. The diagram below illustrates how a single word flows through the pipeline – from raw letters to structured phoneme-grapheme, onset-rime, and onset-nucleus-coda representations.

Example word: brain

  • Letters: b · r · a · i · n
  • Phonemes: /b/ · /ɹ/ · /eɪ/ · /n/
  • PG Units: b→/b/ · r→/ɹ/ · ai→/eɪ/ · n→/n/
  • OR (onset-rime): br (onset) · ain (rime)
  • ONC (onset-nucleus-coda): br (onset) · ai (nucleus) · n (coda)

This multi-level decomposition is the foundation of the toolkit’s 80+ derived measures. Each grain size captures different aspects of spelling-sound structure, and the probability engine computes positional statistics at every level.
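
To make the grain sizes concrete, the sketch below derives the onset-rime and onset-nucleus-coda views of "brain" from its phoneme-grapheme units. It assumes the word has already been aligned into (grapheme, phoneme) pairs and has exactly one syllable. The function name `decompose` and the small `VOWEL_PHONEMES` set are hypothetical and not part of the Toolkit.

```python
VOWEL_PHONEMES = {"eɪ", "ɪ", "æ", "i", "oʊ", "u", "ɛ", "ʌ", "ɑ"}  # illustrative subset


def decompose(pg_units):
    """Derive OR and ONC spellings from aligned (grapheme, phoneme) pairs
    of a one-syllable word."""
    nucleus_idx = [i for i, (_, ph) in enumerate(pg_units) if ph in VOWEL_PHONEMES]
    if len(nucleus_idx) != 1:
        raise ValueError("this sketch handles exactly one vowel nucleus")
    n = nucleus_idx[0]

    def spell(units):
        return "".join(g for g, _ in units)

    onset = spell(pg_units[:n])
    nucleus = spell(pg_units[n:n + 1])
    coda = spell(pg_units[n + 1:])
    return {
        "letters": list(spell(pg_units)),
        "phonemes": [ph for _, ph in pg_units],
        "OR": {"onset": onset, "rime": nucleus + coda},
        "ONC": {"onset": onset, "nucleus": nucleus, "coda": coda},
    }


brain = [("b", "b"), ("r", "ɹ"), ("ai", "eɪ"), ("n", "n")]
result = decompose(brain)
print(result["OR"])   # {'onset': 'br', 'rime': 'ain'}
print(result["ONC"])  # {'onset': 'br', 'nucleus': 'ai', 'coda': 'n'}
```

Multi-syllable words would first need a syllabification pass, with this decomposition applied per syllable.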

Data Curation Pipeline

Building a research-quality dataset required significant data engineering. The pipeline below shows the journey from raw dictionary scrapes to the curated master word list.

Data Curation Pipeline

  1. Web Scraping: CMU Dict, Cambridge Dict, SUBTLEX-US (40,000+ words)
  2. Tokenization & Parsing: phoneme IDs, grapheme segmentation, IPA standardization (3 source formats)
  3. Syllabification: Maximum Onset Principle, boundaries set (1-7 syllables)
  4. Conflict Resolution: pronunciation variants, duplicate audit, manual review (1,380 resolved)
  5. Final Dataset: curated and validated (22,474 words, 80+ measures)

Data volume through pipeline: ~40,000 → 22,474 (44% reduction through curation)

Sources: CMU Pronouncing Dictionary · Cambridge Dictionary · SUBTLEX-US

Each stage involved custom tooling: the webscraper handled three different dictionary formats, the tokenizer resolved ambiguous phoneme-grapheme alignments, and the conflict resolution system flagged 1,380 pronunciation variants for manual review.
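
The deduplication and conflict-flagging idea can be sketched as follows. This is a simplified illustration under my own assumptions, not the pipeline's actual code: the function name `merge_sources`, the tuple layout, and the source labels are hypothetical, and real entries would need pronunciation normalization before comparison.

```python
from collections import defaultdict


def merge_sources(entries):
    """entries: iterable of (word, pronunciation, source) tuples.

    Returns (resolved, flagged): words with a single agreed pronunciation,
    and words whose sources disagree and therefore need manual review.
    """
    prons = defaultdict(set)
    for word, pron, _source in entries:
        prons[word.lower()].add(pron)  # case-folding removes duplicate spellings
    resolved = {w: next(iter(p)) for w, p in prons.items() if len(p) == 1}
    flagged = {w: sorted(p) for w, p in prons.items() if len(p) > 1}
    return resolved, flagged


entries = [
    ("cat", "kæt", "source_a"),
    ("cat", "kæt", "source_b"),   # exact duplicate across sources
    ("Read", "ɹid", "source_a"),
    ("read", "ɹɛd", "source_b"),  # conflicting pronunciation
]
resolved, flagged = merge_sources(entries)
print(resolved)         # {'cat': 'kæt'}
print(sorted(flagged))  # ['read']
```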

Reverse Mapping: Pseudoword Spelling

One of the toolkit’s most novel capabilities is pw_spell – a reverse phoneme-to-grapheme mapping function that generates plausible English spellings from phoneme sequences. This supports research on pseudoword spelling, where participants’ spelling choices reveal their internalized sublexical knowledge.

Reverse Phoneme-to-Grapheme Mapping (pw_spell: probabilistic spelling from phoneme input)

Candidate graphemes per phoneme, ordered by probability (the first is selected):

  • /s/: s, c, sc, ss
  • /k/: c, k, ck, ch
  • /ɹ/: r, rr, wr
  • /ɪ/: i, y, e
  • /m/: m, mm, mb

Output: scrim (s · c · r · i · m), formed by concatenating, for each phoneme, the grapheme with the highest probability in the PG mapping table.

The function selects graphemes probabilistically based on the toolkit’s learned positional frequency tables, producing spellings that reflect real English orthographic patterns.
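
A minimal sketch of this kind of reverse mapping is shown below. It is not the Toolkit's pw_spell implementation: the name `pw_spell_sketch`, its parameters, and every probability value are invented for illustration (only the candidate graphemes come from the example above), and the table ignores syllabic position for brevity. By default it picks the most probable grapheme for each phoneme; `sample=True` instead draws graphemes in proportion to their probabilities.

```python
import random

# Candidate graphemes follow the example above; the probabilities are invented.
PG_TABLE = {
    "s": {"s": 0.70, "c": 0.15, "sc": 0.10, "ss": 0.05},
    "k": {"c": 0.55, "k": 0.25, "ck": 0.15, "ch": 0.05},
    "ɹ": {"r": 0.90, "rr": 0.07, "wr": 0.03},
    "ɪ": {"i": 0.80, "y": 0.15, "e": 0.05},
    "m": {"m": 0.90, "mm": 0.07, "mb": 0.03},
}


def pw_spell_sketch(phonemes, table=PG_TABLE, sample=False, rng=None):
    """Spell a phoneme sequence by choosing one grapheme per phoneme."""
    rng = rng or random.Random()
    graphemes = []
    for ph in phonemes:
        candidates = table[ph]
        if sample:
            # Draw a grapheme with probability proportional to its table weight.
            choice = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
        else:
            # Deterministic: take the highest-probability grapheme.
            choice = max(candidates, key=candidates.get)
        graphemes.append(choice)
    return "".join(graphemes)


print(pw_spell_sketch(["s", "k", "ɹ", "ɪ", "m"]))  # scrim
```

Sampling mode can yield less common but still plausible spellings, which may be useful when comparing against the range of spellings participants actually produce.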

Dataset Characteristics

The toolkit’s 22,512-word vocabulary spans a wide range of word structures and frequency levels. The following visualizations were generated from the toolkit’s master sublexical unit spreadsheet.

Syllable Count Distribution (22,512 words)

  • 1 syllable: 5,562 words (24.7%)
  • 2 syllables: 9,797 words
  • 3 syllables: 4,886 words
  • 4 syllables: 1,755 words
  • 5 syllables: 456 words
  • 6+ syllables: 56 words

Two-syllable words dominate the dataset (43.5%), with single-syllable and three-syllable words forming the next largest groups. Words with 6+ syllables are rare in English and account for less than 0.3% of the vocabulary.
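
The percentages quoted here follow directly from the syllable counts in the chart. A short check (the variable names are my own):

```python
# Words per syllable count, as reported in the syllable count chart.
counts = {"1": 5562, "2": 9797, "3": 4886, "4": 1755, "5": 456, "6+": 56}

total = sum(counts.values())
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}

print(total)         # 22512
print(shares["2"])   # 43.5
print(shares["6+"])  # 0.2
```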

Phoneme-Grapheme Unit Count per Word (22,512 words)

  • 1-2 units: 349 words
  • 3 units: 1,986 words
  • 4 units: 3,444 words
  • 5 units: 4,164 words
  • 6 units: 4,178 words
  • 7 units: 3,133 words
  • 8 units: 2,365 words
  • 9 units: 1,435 words
  • 10 units: 769 words
  • 11 units: 418 words
  • 12 units: 187 words
  • 13+ units: 74 words

Phoneme-grapheme unit counts follow a roughly bell-shaped distribution that peaks at 5-6 units per word and has a longer right tail extending past 13 units. This reflects the typical complexity of English spelling-sound correspondences in the toolkit’s vocabulary.

Word Frequency Distribution (SUBTLEX-US frequency, log-scale bins)

  • 1-9: 555 words
  • 10-99: 9,252 words
  • 100-999: 9,903 words
  • 1K-10K: 2,313 words
  • 10K+: 489 words

Word frequencies from the SUBTLEX-US corpus are heavily skewed: about 85% of the dataset’s words fall in the 10-999 range, while only 489 words (about 2%) reach 10K+. Consistent with the expected Zipfian pattern, that small set of high-frequency words accounts for the bulk of everyday usage.

Toolkit Guide

I authored a comprehensive LaTeX reference document (the Toolkit Guide) that serves as the primary documentation for the toolkit’s measures, grain sizes, datasets, and functions. The guide covers the full taxonomy of 80+ derived measures, explains the syllabic position encoding scheme, and provides usage examples for both the R and Python interfaces.

Available Measures

The toolkit computes various sublexical measures including:

  • Phonographeme (PG) probabilities across syllable positions
  • Grapheme-phoneme (GP) mappings
  • Frequency tables for syllable-initial, syllable-medial, and syllable-final positions
  • Word-initial and word-final probability distributions
  • Age-of-acquisition statistics
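
To show how positional frequency and probability tables of this kind relate to one another, here is a small sketch. It assumes that a PG probability is P(grapheme | phoneme, position) and a GP probability is P(phoneme | grapheme, position), both estimated from unweighted type counts. Those definitions, the function name `probability_tables`, and the toy observations are my assumptions for the example, not the Toolkit's documented formulas.

```python
from collections import Counter


def probability_tables(observations):
    """observations: iterable of (position, phoneme, grapheme) tuples,
    one per phoneme-grapheme unit found in the word list."""
    obs = list(observations)
    freq = Counter(obs)                                        # frequency table
    phoneme_totals = Counter((pos, ph) for pos, ph, _ in obs)
    grapheme_totals = Counter((pos, g) for pos, _, g in obs)
    # PG: how often this phoneme is spelled with this grapheme in this position.
    pg = {(pos, ph, g): n / phoneme_totals[pos, ph] for (pos, ph, g), n in freq.items()}
    # GP: how often this grapheme is pronounced as this phoneme in this position.
    gp = {(pos, ph, g): n / grapheme_totals[pos, g] for (pos, ph, g), n in freq.items()}
    return freq, pg, gp


# Toy data: three words spell syllable-initial /k/ as "c", one as "k",
# and one word uses "c" for syllable-initial /s/.
toy = (
    [("syllable-initial", "k", "c")] * 3
    + [("syllable-initial", "k", "k")]
    + [("syllable-initial", "s", "c")]
)
freq, pg, gp = probability_tables(toy)
print(pg["syllable-initial", "k", "c"])  # 0.75
print(gp["syllable-initial", "s", "c"])  # 0.25
```

The same unit can be highly predictable in one direction and not the other: in the toy data /s/ is always spelled "c" (PG = 1.0), yet "c" is pronounced /s/ only a quarter of the time (GP = 0.25).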

NTR Orthogonalization Library

I developed the NTR Orthogonalization Library, a specialized MATLAB toolkit for constructing optimized word lists for fMRI block-based studies. The library ensures minimal correlation between words, helping researchers design more precise neuroimaging experiments.

Features

  • Minimized Word Correlation: Constructs word lists with minimal inter-word correlations for cleaner fMRI signals
  • Adjustable Parameters: Customize word list length while maintaining minimal correlation
  • Visualization Tools: Generate distance matrices and correlation visualizations to aid study design
  • Extensibility: Modular structure designed for easy adaptation to related projects

Algorithm

The core algorithm minimizes correlation between words by:

  1. Calculating pairwise correlations between words based on their semantic vectors (using GloVe embeddings)
  2. Creating a triangular distance matrix representing semantic distances
  3. Iteratively adjusting the word list to minimize overall correlation while maintaining desired length
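
The three steps above can be sketched in a few lines. The library itself is written in MATLAB; the Python below is only an illustration under my own assumptions: the function names are hypothetical, the 4-dimensional vectors are invented stand-ins for GloVe embeddings, and the greedy swap search is one simple way to carry out the iterative adjustment, not necessarily the one the library uses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_corr(vectors):
    """Upper triangle only: {(w1, w2): r} for each unordered pair, w1 < w2."""
    return {(a, b): pearson(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}


def cost(words, corr):
    """Mean absolute pairwise correlation of a candidate word list."""
    pairs = list(combinations(sorted(words), 2))
    return sum(abs(corr[p]) for p in pairs) / len(pairs)


def orthogonalize(vectors, k):
    """Greedy swap search for k words (k >= 2) with low mutual correlation."""
    corr = pairwise_corr(vectors)
    pool = sorted(vectors)
    chosen = set(pool[:k])
    improved = True
    while improved:
        improved = False
        for out_w in sorted(chosen):
            for in_w in pool:
                if in_w in chosen:
                    continue
                trial = (chosen - {out_w}) | {in_w}
                # Accept the first swap that strictly lowers the cost.
                if cost(trial, corr) < cost(chosen, corr) - 1e-12:
                    chosen, improved = trial, True
                    break
            if improved:
                break
    return sorted(chosen), cost(chosen, corr)


toy_vectors = {  # invented stand-ins for real embeddings
    "cat": [1.0, 0.9, 0.1, 0.0],
    "dog": [0.9, 1.0, 0.0, 0.1],
    "lamp": [0.0, 0.1, 1.0, 0.2],
    "river": [0.5, 0.0, 0.2, 1.0],
    "cloud": [0.1, 0.6, 0.0, 0.3],
}
words, score = orthogonalize(toy_vectors, k=3)
print(words, round(score, 3))
```

Because every accepted swap strictly lowers the cost, the search always terminates, but it can stop at a local minimum; restarting from several random initial lists is a common remedy.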

The library draws on data from multiple sources including the English Lexicon Project (ELP) for word frequency data, IPHOD for biphone probabilities, and custom grapheme-phoneme mapping tables.

NTR Developmental Project

I contributed to the NTR Developmental Project, which generates linguistic measures across different age bins to analyze how reading develops over time. This work supports research into how sublexical processing changes as children learn to read.

Tools and Scripts

The project includes several analysis tools written in R:

  • Age Bin Measures: Streamlined computation of measures across developmental age groups
  • PG Probability Calculator: Computes phonographeme probabilities for lexical items
  • GP Table Generator: Creates frequency and probability tables for grapheme-phoneme mappings at various syllable positions

Output tables cover syllable-initial, syllable-medial, syllable-final, word-initial, and word-final positions for frequency, PG, and GP measures.

Publications & Presentations

  • Poster presented at the 2023 Society for Neuroscience (SfN) conference on the NTR Orthogonalization Library
  • Poster presented at the 2025 Society for the Neurobiology of Language (SNL) conference on the Sublexical Toolkit (with Brooks as co-author)
  • Technical analyses I’ve run support upcoming publications including work on pseudoword spelling and insights into sublexical representations and lexical interactions. These studies examine how measures of onset/rime consistency relate to lexical skill compared to phonographeme-level measures.

SfN 2023 — NTR Orthogonalization Library

SfN 2023 Poster — Identifying an orthogonalized list of written words for an fMRI multi-voxel pattern analysis study

SNL 2025 — Sublexical Toolkit & Pseudoword Spelling

SNL 2025 Poster — Reading Skill Predicts Variability in Pseudoword Pronunciation: An Experience-Dependent Basis for Neuroimaging Analyses

Technologies

The lab’s computational work spans multiple languages and tools:

  • Python: NLP pipelines, tokenization, webscraping, data validation, backend development
  • R/RMarkdown: Statistical analysis, probability calculations, visualization
  • MATLAB: Orthogonalization algorithms, matrix operations, fMRI study design tools
  • LaTeX: Toolkit Guide documentation