Cosine Similarity

There’s a phrase that people like to quote, “labels are for cans”. While the statement’s intentions — either “stereotyping is bad” or “I’m a special snowflake”–are good and relatively inoffensive, respectively, it makes for bad epistemology. It’s a terrible approach to organizing information.

To understand, we need to generalize. To understand the course of any field, we need a broader understanding, a concept of a movement or a style. Sometimes, this division can take on an objective aspect. At an extreme, an artistic group like the Pre-Raphaelite Brotherhood or the Wu-Tang Clan has a defined set of artists who comprise it. However, even that can quickly break down. Ford Madox Brown is stylistically part of the Pre-Raphaelite Brotherhood, he hung out with them a lot, and his work is displayed with theirs, but he was never a member. Broader characteristics run into issues like this too. Kanye West may be from the Midwest, but to the accepted meaning of a “Midwest Rap” style, as exemplified by Twista, Tech N9ne, Krizz Kaliko, Royce Da 5’9”, or Eminem, an emphasis on technical mastery, speed, and precision, with a smattering of themes from horrorcore, he’s certainly not that.

So when I try and discuss data-informed categorization, it’s important to clear up my intent up front. In terms of intellectual categories, I’m not trying to remove subjective judgements. I’m trying to inform. The goal here is to present another variable that can be incorporated into a broader stylistic judgement. The actual measured effect of a posek in terms of area of direct influence, implied or otherwise, should certainly factor into any intellectual taxonomy, and certainly ought to dominate a taxonomy of the landscape.

Our maps have their limits. When we have areas of influence that are completely disjunct, it’s trivial to draw the appropriate conclusions with the eyeball test. However, what to do with somewhat overlapping sets of a couple hundred points each? How do we meaningfully assess the relative similarities of multiple sets of a few hundred points of different sizes, all weaving in and out of each other?That’s where the math comes in.

Our basic metric is cosine similarity. For readers who don’t remember much about sines and cosines, here’s a little refresher. The cosine of 0 is 1, and the cosine of 90 degrees is 0. The more acute an angle, the closer it gets to 1.

Now, let’s imagine two poskim. Posek A writes responsa only to Minsk, and Posek B writes responsa only to Pinsk. Imagine that we plot this on a two-dimensional grid, with the X-axis representing responsa to Minsk, and the Y-axis representing responsa to Pinsk. Each posek can then be expressed as a point in the grid: Posek A as (M, 0) and Posek B as (0, P), because Posek A writes 0 responsa to Pinsk, and Posek B writes 0 to Minsk. That is, Posek A is expressed as a point on the Minsk axis, and Posek B as a point on the Pinsk axis. We can then think of our poskim as line segments, or “vectors”, from the origin to the grid coordinate. It’s obvious that the two vectors in our case are orthogonal. They form a right angle, and thus have a cosine of 0. This means that they are perfectly dissimilar; they have no places in common.

Now imagine Posek C who also writes only to Minsk. Her vector will form an angle of 0 degrees with Posek A’s vector, so they will have a similarity score of 1, which is the cosine of 0.

This exercise is meant to show how the cosine of two vectors provides a good metric for scoring similarity. It’s not perfect, but it’s good.

Two dimensional space is pretty easy to envision, but dealing with 500 place names requires a 500-dimensional vector space, which is impossible to envision. Fortunately, thanks to math, we don’t need to envision it. And since there cannot be any negative numbers (because it’s impossible to send a subzero number of responsa to a place), the angle between the vectors will always be between 0 and 90. We can compare any two poskim to obtain a similarity score between 0 and 1.

Let’s walk through the basic process again with vectors in 4 dimensions. We start with the data from two poskim.

So this table will become two vectors, for Posek D [7 4 7 1] and for Posek E [3 0 9 2]. The order of the cities doesn’t matter, provided that they are respective — that the nth place in each refers to the same place. We then take the angle between the two vectors. In this case, the angle is about 34.2 degrees, and the cosine of the angle is 0.827.[1] Since the cosine goes from 0 to 1, the similarity between the poskim is high. This passes the eyeball test, too; there is no city to which E writes that D does not write to, and only one that D writes to but not E. This is a lot more similar than we’d expect in reality. As we have seen, the career of a posek is dynamic; they move, and their sphere of authority grows and shrinks and shifts over time, and communities likewise change. When we divide a posek’s career in half chronologically[2] and compare the first half to the second, the cosine tends to be in about the 0.35-0.5 range (typically around 0.4). Therefore, when two poskim score 0.3 or above, it means they are very similar in terms of geographical reach. A score above 0.4 means that the geographic reach of the two poskim are as similar as two halves of the career of a single posek, or about as similar as can reasonably be expected.

We can formulate another version of this, where we turn the vectors into binary vectors — D now gets [1 1 1 1] and E [1 0 1 1], the effect here being to just ask about where, without regard to distribution. We call this “unweighted”. We use a third type here — “mixed” — a simple average of the two. Crucially, the size of the vector — the distance from the origin — doesn’t impact the angle, so we can measure between people with very different corpus sizes.

What can we get from this? For starters, can we justify traditional divisions? Let’s take a look.

We can see a pretty clear division here — with Hungarians and Galicianers following the expected division, and a showing for our Poles (Avnei Nezer and Divrei Malkiel) of “close but no cigar”. (Divrei Malkiel was the Rav of Łomża, close to Lithuania; some may protest his being lumped with the Poles.) Let’s drop them for now.

So we see a really clear division between the Hungarian and the Galicianers. The Galicianers are all quite similar to each other, and all mostly dissimilar to the Hungarians. But there’s another thing going here too. Let’s take a look at just the Hungarians. And we’ll rearrange it here.

We can tease out a couple more things here. We can see the Sofer family — Oberland cluster clearly; the Maharam Schick (who moved from Oberland to Unterland mid-career);belongs there too, but a little less; the two most Unterland Unterlanders pair nicely, and the Levushei Mordechai (who also moved) straddles the two.

Further data would be nice (and we’re working on it), but what we have thus far suggests that we can view Hungarians as a distinct species, as it were, with Unterland and Oberland subspecies (and those who straddle both).

I’ll add another point — preliminary results thus far delineate that Hasidic psak is a subgroup within regions. Hasidic poskim do not command broad loyalty beyond their region. This may also well be the case beyond just the geographical spread data, in terms of methodology and style as well; for one example that comes to mind immediately, we hypothesize that Chabad’s tendency to rule strictly about eruvin is linked with their being a Hasidic subset of Lithuanians, not a Lithuanian subset of Hasidim.

[*] Thank you to R. Dan Margulies, who hashed out the math here with me (Moshe).

[1] In reality, we don’t ever take the actual angle, we calculate it as dot(u,v)/(norm(u)*norm(v)).

[2] Meaningful divisions here are less similar than random divisions. People die, people move, and various other events mean that we see a grouping that is “lumpy” — dividing a corpus chronologically does not approximate an even division. In a random division, any discrepancy between the similarity and 1 would be pure noise. Here, there are signals going on too.

Tag: Cosine Similarity

Measuring the Geographic Similarity of Poskim*

Share this: