PGR JGI Seed Corn Funding Project Blog 2023/24: Zhijin Guo
Introduction
The overall aim of this project is to analyse music scores with machine learning. Scores are of course different from sound recordings of music, since they are symbolic representations of what musicians play. But with encoded versions of these scores (in which the graphical symbols used by musicians are rendered as categorical data) we have the chance to turn these instructions into various sequences of pitches, harmonies, rhythms, and so on.
What were the aims of the seed corn project?
CRIM concerns a special genre of works from sixteenth-century Europe in which a composer took some pre-existing piece and adapted the various melodies and harmonies in it to create a new but related composition. More specifically, the CRIM Project is concerned with polyphonic music, in which several independent lines are combined in contrapuntal combinations. As in the case of any given style of music, the patterns that composers create follow certain rules: they write using stereotypical melodic and rhythmic patterns. And they combine these tunes (‘soggetti’, from the Italian word for ‘subject’ or ‘theme’) in stereotypical ways. So, we have the dimensions of melody (line), rhythm (time), and harmony (what we’d get if we sliced through the music at each instant).

We might thus ask the following kinds of questions about music:
- Starting from a given composition, what would be its nearest neighbour, based on any given set of patterns we might choose to represent? A machine would of course not know anything about the composer, genre, or borrowing involved in those pieces, but it would be revealing to compare what a machine might tell us about such ‘neighbours’ in light of what a human might know about them.
- What communities of pieces can we identify in a given corpus? That is, if we attempt to classify or group works in some way based on shared features, what kinds of communities emerge? Are these communities related to style? Genre? Composer? Borrowing?
- In contrast, if we take the various kinds of soggetti (or other basic ‘words’) as our starting point, what can we learn about their context? What soggetti happen before and after them? At the same time as them? What soggetti are most closely related to them? And through this what can we say about the ways each kind of pattern is used?
Intervals as Vectors (Music Ngrams)
How can we model these soggetti? Of course they are just sequences of pitches and durations. But since musicians move these melodies around, it will not work simply to look for strings of pitches (since as listeners we can recognize that G-A-B sounds exactly the same as C-D-E). What we need instead is to model these as distances between notes. Musicians call these ‘intervals’, and you could think of them like musical vectors: they have direction (up/down) and they have some length (X steps along the scale).
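The point can be shown in a few lines of Python. This is an illustrative sketch, not the CRIM Intervals library itself; the C-major scale indexing is a simplifying assumption for demonstration.

```python
# Map each diatonic note name to its position on the C-major scale
# (a deliberate simplification: no accidentals, no octaves).
SCALE = {name: i for i, name in enumerate(["C", "D", "E", "F", "G", "A", "B"])}

def melodic_intervals(notes):
    """Return the signed scale-step distances between successive notes."""
    steps = [SCALE[n] for n in notes]
    return [b - a for a, b in zip(steps, steps[1:])]

# G-A-B and C-D-E are different strings of pitches...
assert ["G", "A", "B"] != ["C", "D", "E"]
# ...but identical as interval vectors: up one step, then up one step.
assert melodic_intervals(["G", "A", "B"]) == melodic_intervals(["C", "D", "E"]) == [1, 1]
```

Transposition changes the pitches but not the vectors, which is exactly why intervals, not pitches, are the right ‘words’ for our ngrams.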
Here is an example of how we can use our CRIM Intervals tools (a Python/Pandas library) to harvest this kind of information from XML encodings of our scores. There is more to it than this, but the basic points are clear: the distances in the score are translated into a series of distances in a table. Each column represents the motions in one voice. Each row represents successive time intervals in the piece (1.0 = one quarter note).
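A hypothetical miniature of that kind of table can be built with Pandas. The note data and voice names below are invented for illustration; the real CRIM Intervals tools derive them from XML encodings of the scores.

```python
import pandas as pd

# Scale positions for two invented voices (Superius: G A B A; Tenor: C D E C).
notes = {
    "Superius": [4, 5, 6, 5],
    "Tenor":    [0, 1, 2, 0],
}

# Each column is one voice; each value is the melodic motion (in diatonic
# steps) that voice makes at that moment. The index marks successive time
# intervals in quarter notes (1.0 = one quarter note).
intervals = pd.DataFrame(
    {voice: pd.Series(p).diff().iloc[1:].astype(int).tolist()
     for voice, p in notes.items()},
    index=[1.0, 2.0, 3.0],
)
print(intervals)
```

Here both voices step up twice and then fall, the Tenor by a larger leap: the score's distances have become a table of distances.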

Link Prediction
We are interested in predicting unobserved or missing relations between pairs of ngrams in our musical graph. Given two ngrams (nodes in the graph), the goal is to ascertain the type and likelihood of a potential relationship (edge) between them, be it sequential, vertical, or based on thematic similarity.
- Sequential: ngrams that occur near each other in time. This is the kind of ‘context’ a Large Language Model computes, and training on it surfaces the semantic information latent in the data.
- Vertical: ngrams that sound at the same time. This is another kind of context.
- Thematic: ngrams linked by some measure of melodic similarity.
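The first two kinds of context can be sketched with a toy example. The voices and ngram ids below are invented; list position stands in for time, so ngrams at the same index in different voices sound together.

```python
from collections import Counter
from itertools import combinations

# Two invented 'voices', each a sequence of ngram ids.
voices = [
    ["a", "b", "c", "a"],
    ["x", "b", "y", "x"],
]

sequential = Counter()  # edges between temporal neighbours within a voice
vertical = Counter()    # edges between simultaneous ngrams across voices

for voice in voices:
    for u, v in zip(voice, voice[1:]):
        sequential[frozenset((u, v))] += 1

for column in zip(*voices):  # one column = one moment in time
    for u, v in combinations(column, 2):
        vertical[frozenset((u, v))] += 1

print(sequential[frozenset(("a", "b"))])  # 'a' neighbours 'b' once
print(vertical[frozenset(("a", "x"))])    # 'a' and 'x' sound together twice
```

A link predictor scores candidate edges from evidence like these counts (with a similarity measure supplying the thematic edges), then learns which unobserved pairs are likely to be connected and how.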
After training, the model’s performance is evaluated on a held-out test set, providing metrics such as precision, recall, and F1-score for each type of relationship. The model achieved a prediction accuracy of 78%.
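For readers unfamiliar with these metrics, here is how they are computed per relationship type, sketched with invented predictions (these are not the project's actual results).

```python
# Invented true and predicted edge labels for five candidate pairs.
true_labels = ["sequential", "vertical", "sequential", "thematic", "vertical"]
predicted   = ["sequential", "vertical", "vertical",   "thematic", "vertical"]

def scores(label):
    """Precision, recall, and F1 for one relationship type."""
    tp = sum(t == p == label for t, p in zip(true_labels, predicted))
    fp = sum(p == label != t for t, p in zip(true_labels, predicted))
    fn = sum(t == label != p for t, p in zip(true_labels, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = scores("vertical")
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 1.0 0.8
```

Precision asks how many predicted ‘vertical’ edges really were vertical; recall asks how many true vertical edges were found; F1 balances the two.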
Beyond its predictive capabilities, the model also generates embeddings for each ngram. These embeddings, which are high-dimensional vectors encapsulating the essence of each ngram in the context of the entire graph, can serve as invaluable tools for further musical analysis.
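One simple use of such embeddings is nearest-neighbour search via cosine similarity. The vectors below are invented stand-ins; real embeddings would come from the trained model and have many more dimensions.

```python
import math

# Invented three-dimensional embeddings for three ngrams.
embeddings = {
    "ngram_a": [0.9, 0.1, 0.3],
    "ngram_b": [0.8, 0.2, 0.4],
    "ngram_c": [-0.1, 0.9, -0.5],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, -1.0 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Find the nearest neighbour of ngram_a among the other embeddings.
nearest = max((k for k in embeddings if k != "ngram_a"),
              key=lambda k: cosine(embeddings["ngram_a"], embeddings[k]))
print(nearest)  # ngram_b: it points in nearly the same direction as ngram_a
```

The same machinery scales to whole pieces: averaging or pooling the ngram embeddings of a composition gives a piece-level vector, which is one route to the ‘nearest neighbour’ questions raised above.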