Perhaps malapropos for the NBA Finals, Prof. Michael Jordan gave the first plenary talk at ISIT. It was a great overview of nonparametric Bayesian modeling. In particular, he covered his favorite Chinese restaurant process (along with its relatives, the Pitman-Yor process and the stick-breaking construction), hierarchical Dirichlet priors, and all the other jargon-laden elements of modeling. At the end he surveyed some of the rather stunning successes of this approach in applications with lots of data to learn from. What was missing for me was a sense of how these approaches work in the data-poor regime, so I asked a question (foolishly) about sample complexity. Alas, since that is a “frequentist” question and Jordan is a “Bayesian,” I didn’t quite get the answer to the question I was trying to ask, but that’s what happens when you don’t phrase things properly.
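For anyone who hasn’t seen the model, here is a toy sketch (mine, not Jordan’s) of the basic Chinese restaurant process seating rule in Python: customer n+1 sits at an occupied table with probability proportional to the number of people already there, and starts a new table with probability proportional to a concentration parameter alpha. The tables are the clusters, and the model is “nonparametric” in the sense that the number of tables isn’t fixed in advance but grows with the data.

```python
# Toy sketch of the basic Chinese restaurant process (CRP) seating rule --
# my own illustration for this post, not code from the talk.
import random

def crp(num_customers, alpha=1.0, seed=0):
    """Return a table (cluster) index for each customer."""
    rng = random.Random(seed)
    table_counts = []     # occupancy of each existing table
    assignments = []
    for _ in range(num_customers):
        # join table k with probability proportional to its occupancy,
        # or open a new table with probability proportional to alpha
        weights = table_counts + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_counts):
            table_counts.append(0)    # a new table opens
        table_counts[table] += 1
        assignments.append(table)
    return assignments

print(crp(10, alpha=2.0))   # e.g. [0, 0, 1, 0, 2, ...]
```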
One nice thing that I learned was the connection to Kingman’s earlier work on characterizing random measures via non-homogeneous Poisson processes. Kingman has been popping up all over the place in my reading, from urn processes to exchangeable partition processes (also known as paintbox processes). When I get back to SD, it will be back to the classics for me!
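Before I forget, let me pin down the Kingman result for myself (reconstructed from memory, so the notation is mine and may be off in the details). A completely random measure $\mu$ on a space $\Theta$, once you strip off any deterministic component and fixed atoms, is purely atomic,

$$\mu = \sum_i w_i \, \delta_{\theta_i},$$

where the weight–location pairs $\{(w_i, \theta_i)\}$ form a Poisson process on $(0,\infty) \times \Theta$ with some intensity measure $\nu$. The gamma process, for example, corresponds to $\nu(dw\, d\theta) = \alpha\, w^{-1} e^{-w}\, dw\, H(d\theta)$, and normalizing it recovers the Dirichlet process with base measure $H$ and concentration $\alpha$.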
Cool. Kingman’s never come up for me; I’ll have to check him out.
What applications were you impressed by?
I’ve always felt the complex Bayesian stuff was much stronger in the data-poor regimes than in the data-rich ones. I think that’s more of an empirical statement than a theoretical one. I rarely, if ever, see sample complexity bounds in the Bayesian framework, possibly because the algorithms are too complex? Note that you’re almost always training via Monte Carlo, so not only do you not know how good the best fit is going to be as a function of how much data you have [because the model’s complex], but you usually can’t even tell how close what you’ve got is to the best fit for the data you have [because your optimization problem is #P-complete or something].
In practice, these hierarchical Bayesian models feel quite slow to train, which I view as at least something of a problem, although they have gotten much faster.
He had some work on speaker diarization and protein folding that was pretty great (at least judging from the plots he showed).
See, the intuition I have is that hierarchical Bayesian methods have a gazillion parameters and so require lots of data, but then are better at finding structure when they do have lots of data. But I haven’t seen (probably because I’m a n00b) a quantification of that idea, or a quantification of something that says the opposite.
Too much to read, too little time.
Wow! Sounds like a great talk. I’ll ask you for more details when you get back to SD.