Many of our readers are probably familiar with JewishGen, the premier resource for Jewish genealogical research. For quite some time, we’ve had our eye on their Communities Database, which contains information on the history, names, coordinates, environs, and population for Jewish communities in Europe, North Africa, and the Middle East. We have often used it to help us identify places, which involves a lot of guesswork since their search engine only allows Latin characters without diacritics.
You may have noticed the JewishGen logo to the right. We put that there because we recently met with the good folks at JG, and we agreed to help each other out by sharing data and resources with one another and with the public. [Footnote 1: Be advised: Moshe will happily go the full NASCAR for datasets.]
What does this mean? It means new and better toys. For instance, that thing about not being able to search for places by Hebrew characters? Well, check out our searchable map of Hebrew place names:
As of now, this table has a bit under 4000 place name variants in Hebrew characters. Once we complete the merge of our list with JG’s list, that number will more than double. And we have also started merging these lists with Berl Kagan’s Sefer Prenumeranten. Play around with it. There’s nothing like it, and this is just an “alpha” version.
It also means that Moshe got to play around with the population data in the Communities Database. We have wondered for some time whether there is any relationship between the population of a community and the number of responsa sent there.
So is there a relationship? The short answer: It’s complicated.
Let’s compare some of our favorites. (A note: we used 1900 for availability reasons; surprisingly, there’s not a strong penalty for correlation when using earlier poskim.) We’ve also dropped communities with over 20,000 Jews from the graph, because there might be other effects going on over there. [Footnote 2: I have a very strong suspicion that this is subject to a major prewar / postwar gap.]
If this reads as a horrible mess to you, then you’ve read it correctly. This is the picture of statistical noise.
[We’re going to use a lot of numbers here, so for those who aren’t into mathy stuff, here’s the baalebatish version: A perfect positive correlation between number of responsa and population would mean that the bigger the city, the more responsa, no exceptions. It would have a score of 1. If it had a perfect relationship but it wasn’t a straight line, its Pearson correlation coefficient would be a bit lower while Spearman would remain at 1. A perfect negative correlation would mean that the bigger the city, the fewer responsa (or the more responsa, the smaller the city), no exceptions. It would have a score of -1 (again, with Pearson landing a bit closer to zero if it isn’t linear). A score of zero means that there’s no correlation at all. With this, the numbers that express the correlation should be basically intelligible, and they always fall between -1 and 1.]
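For those who are into mathy stuff, here is a minimal sketch of the Pearson/Spearman distinction described above. The numbers are made up purely for illustration (they are not from our dataset), and `pearson`, `ranks`, and `spearman` are hypothetical hand-rolled helpers written for this example, not the code behind the graphs in this post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: how well a straight line describes the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks, not the raw values."""
    return pearson(ranks(xs), ranks(ys))

# Made-up example: responsa always rise with population, but not in a straight line.
population = [1, 2, 3, 4, 5, 6, 7, 8]        # hypothetical, in thousands
responsa = [2 ** p for p in population]      # hypothetical counts

print("Pearson: ", round(pearson(population, responsa), 3))
print("Spearman:", round(spearman(population, responsa), 3))
```

In this toy case the ordering is perfect, so Spearman comes out at 1, while Pearson lands noticeably below 1 because the points curve away from a straight line.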
The strongest individual correlation here is Mahari Aszod at a whopping R=0.175, and he’s not even near contemporaneous. Among the poskim who were active around then, we have Avnei Nezer at R=0.04, Beit Yitzchak at R=0.11, Divrei Malkiel at R=-0.04(!), and leading the pack, Levushei Mordechai at R=0.14 (all Pearson). With Spearman, the numbers tease out a little higher, but still nothing awe-inspiring.
Let’s keep going: what happens when we sum the place counts together?
As evidenced by the trendline (or the eye test), it’s pretty grim.
Even just looking at the count of books we have, it doesn’t really get better. Regardless of whether you use Pearson, Kendall [Footnote 3: For the not mathematically inclined: yeah, you can forget about Kendall, don’t bother.], or Spearman, R<0.1. [Footnote 4: I thought of using more, but I’m scared of P-hacking it by throwing more metrics at it.]
I don’t really know quite what to make of it. The main thing I suspect: as a place becomes bigger and more independent, it needs to ask fewer questions (i.e., larger towns “clear the neighborhood”), which roughly offsets the increase in population. In that case, there would be a population “sweet spot” in which a town is big enough that it generates lots of questions but not so big that local talent can handle them adequately. And then we might see something like the curve we get if we wildly overfit a trendline:
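To see why a “sweet spot” would be hard to catch with a correlation coefficient, here is a toy sketch. The numbers are invented to form a clean hump (nothing here comes from our data), and `pearson` is a hypothetical helper written for this example. The point is only that a rise-then-fall shape can score close to zero overall even when each half has a strong trend.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented "sweet spot": responsa peak at a mid-sized town and fall off on both sides.
population = [1, 2, 3, 4, 5, 6, 7, 8, 9]              # hypothetical, in thousands
responsa = [25 - (p - 5) ** 2 for p in population]    # hypothetical hump-shaped counts

r_all = pearson(population, responsa)
r_rising = pearson(population[:5], responsa[:5])      # small towns up to the peak
r_falling = pearson(population[4:], responsa[4:])     # the peak and larger towns

print("whole range: ", round(r_all, 3))
print("rising half: ", round(r_rising, 3))
print("falling half:", round(r_falling, 3))
```

A real dataset would be far messier than this, so a near-zero score is not evidence for a hump; it just means that a straight-line measure would not rule one out.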
This remains an open question for me, but I still wanted to publish this. Let me explain myself. Firstly, given the amount of noise here, it’ll take a long time for us to fully clarify the issue.
Elli asked me the following questions when I showed him the draft, and I think they’re interesting:
- Maybe we should simply disregard towns that were known to have rabbis who wrote responsa, and then look at the rest?
- There’s a “nudnik effect”: Like Levushei Mordechai to his son-in-law in Galante.
- Or maybe it’s not about cities at all, but about people. The carryover we saw in Hungary – maybe it was really carryover of individuals, not cities.
With regard to the first, well, it wouldn’t bump off enough places to make a dent, and you’d probably just drop the correlation even further. As for the second and third, well, it’s actually all the more striking. These are both very real effects (look for Yaavetz’s over-the-top disses of some of his questioners(!) in She’elat Yaavetz), but strangely, even this doesn’t bear some obvious statistical linkage to population. These are all real questions, and it’s very possible the answer could change with more data, but it’s clear we’d need a lot more of it to truly get clarity on this issue.
So why discuss this at all? Well, one of the scourges of modern science is ‘P-Hacking’. To quote Wikipedia: “[P-hacking] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.”
For a simple example, if we take ‘statistically significant’ to mean P < 0.05 (less than a 5% probability of occurring by random chance), and we look at 50 different foods in a diet study, we’ve now got over a 90% chance of finding something ‘statistically significant’ by random chance alone. [Footnote 5: This is not a random example; those articles about diet studies showing ‘kale causes cancer’ or whatever are almost always p-hacking.]
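The arithmetic behind that 90% figure is short enough to sketch. This assumes the 50 tests are independent of one another, which is a simplification, and the function name `chance_of_false_positive` is just a label made up for this example.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    comes back 'significant' at level alpha when nothing real is going on."""
    chance_all_clean = (1 - alpha) ** num_tests
    return 1 - chance_all_clean

for num_tests in (1, 10, 50):
    print(num_tests, "tests:", round(chance_of_false_positive(num_tests), 3))
```

With 50 tests this comes to roughly 0.92, which is where “over a 90% chance” comes from.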
We’ve published stuff with attempts at very concrete findings — take our post on the handover of rabbinic leadership in Hungary, for example. Honesty dictates that we also on occasion say: ‘it’s hard to see a signal in the noise here’, even if you can’t get a journal to publish ‘nothing much to see here, folks’.
I wanted to title this post “Baby Keep It Real With His People”, referencing the hit song ‘Baby’ by Lil’ Baby (feat. DaBaby). Sadly, despite my best efforts, the number of fans of both responsa and Atlanta hip-hop remains small, so it went. Suffice it to say, in both data and rap, HaMapah supports Quality Control.