# ISIT 2015 : statistics and learning

The advantage of flying to Hong Kong from the US is that the jet lag was such that I was actually more or less awake in the mornings. I didn’t take such great notes during the plenaries, but they were rather enjoyable, and I hope that the video will be uploaded to the ITSOC website soon.

There were several talks on entropy estimation in various settings that I did not take great notes on, to wit:

• OPTIMAL ENTROPY ESTIMATION ON LARGE ALPHABETS VIA BEST POLYNOMIAL APPROXIMATION (Yihong Wu, Pengkun Yang, University Of Illinois, United States)
• DOES DIRICHLET PRIOR SMOOTHING SOLVE THE SHANNON ENTROPY ESTIMATION PROBLEM? (Yanjun Han, Tsinghua University, China; Jiantao Jiao, Tsachy Weissman, Stanford University, United States)
• ADAPTIVE ESTIMATION OF SHANNON ENTROPY (Yanjun Han, Tsinghua University, China; Jiantao Jiao, Tsachy Weissman, Stanford University, United States)

I would highly recommend taking a look for those who are interested in this problem. In particular, it looks like we’re getting towards more efficient entropy estimators in difficult settings (online, large alphabet), which is pretty exciting.

QUICKEST LINEAR SEARCH OVER CORRELATED SEQUENCES
Javad Heydari, Ali Tajer, Rensselaer Polytechnic Institute, United States
This talk was about hypothesis testing where the observer can control the samples being taken by traversing a graph. We have an $n$-node graph (c.f. a graphical model) representing the joint distribution on $n$ variables. The data generated is i.i.d. across time according to either $F_0$ or $F_1$. At each time you get to observe the data from only one node of the graph. You can either observe the same node as before, explore by observing a different node, or make a decision about whether the data from from $F_0$ or $F_1$. By adopting some costs for different actions you can form a dynamic programming solution for the search strategy but it’s pretty heavy computationally. It turns out the optimal rule for switching has a two-threshold structure and can be quite a bit different than independent observations when the correlations are structured appropriately.

MISMATCHED ESTIMATION IN LARGE LINEAR SYSTEMS
Yanting Ma, Dror Baron, North Carolina State University, United States; Ahmad Beirami, Duke University, United States
The mismatch studied in this paper is a mismatch in the prior distribution for a sparse observation problem $y = Ax + \sigma_z z$, where $x \sim P$ (say a Bernoulli-Gaussian prior). The question is what happens when we do estimation assuming a different prior $Q$. The main result of the paper is an analysis of the excess MSE using a decoupling principle. Since I don’t really know anything about the replica method (except the name “replica method”), I had a little bit of a hard time following the talk as a non-expert, but thankfully there were a number of pictures and examples to help me follow along.

SEARCHING FOR MULTIPLE TARGETS WITH MEASUREMENT DEPENDENT NOISE
Yonatan Kaspi, University of California, San Diego, United States; Ofer Shayevitz, Tel-Aviv University, Israel; Tara Javidi, University of California, San Diego, United States
This was another search paper, but this time we have, say, $K$ targets $W_1, W_2, \ldots, W_K$ uniformly distributed in the unit interval, and what we can do is query at each time $n$ a set $S_n \subseteq [0,1]$ and get a response $Y_n = X_n \oplus Z_n$ where $X_n = \mathbf{1}( \exists W_k \in S_n )$ and $Z_n \sim \mathrm{Bern}( \mu(S_n) + b )$ where $\mu$ is the Lebesgue measure. So basically you can query a set and you get a noisy indicator of whether you hit any targets, where the noise depends on the size of the set you query. At some point $\tau$ you stop and guess the target locations. You are $(\epsilon,\delta)$ successful if the probability that you are within $\delta$ of each target is less than $\epsilon$. The targeting rate is the limit of $\log(1/\delta) / \mathbb{E}[\tau]$ as $\epsilon,\delta \to 0$ (I’m being fast and loose here). Clearly there are some connections to group testing and communication with feedback, etc. They show there is a significant gap between the adaptive and nonadaptive rate here, so you can find more targets if you can adapt your queries on the fly. However, since rate is defined for a fixed number of targets, we could ask how the gap varies with $K$. They show it shrinks.

ON MODEL MISSPECIFICATION AND KL SEPARATION FOR GAUSSIAN GRAPHICAL MODELS
Varun Jog, University of California, Berkeley, United States; Po-Ling Loh, University of Pennsylvania, United States
The graphical model for jointly Gaussian variables has no edge between nodes $i$ and $j$ if the corresponding entry $(\Sigma^{-1})_{ij} = 0$ in the inverse covariance matrix. They show a relationship between the KL divergence of two distributions and their corresponding graphs. The divergence is lower bounded by a constant if they differ in a single edge — this indicates that estimating the edge structure is important when estimating the distribution.

CONVERSES FOR DISTRIBUTED ESTIMATION VIA STRONG DATA PROCESSING INEQUALITIES
Aolin Xu, Maxim Raginsky, University of Illinois at Urbana–Champaign, United States
Max gave a nice talk on the problem of minimizing an expected loss $\mathbb{E}[ \ell(W, \hat{W}) ]$ of a $d$-dimensional parameter $W$ which is observed noisily by separate encoders. Think of a CEO-style problem where there is a conditional distribution $P_{X|W}$ such that the observation at each node is a $d \times n$ matrix whose columns are i.i.d. and where the $j$-th row is i.i.d. according to $P_{X|W_j}$. Each sensor gets independent observations from the same model and can compress its observations to $b$ bits and sends it over independent channels to an estimator (so no MAC here). The main result is a lower bound on the expected loss as s function of the number of bits latex $b$, the mutual information between $W$ and the final estimate $\hat{W}$. The key is to use the strong data processing inequality to handle the mutual information — the constants that make up the ratio between the mutual informations is important. I’m sure Max will blog more about the result so I’ll leave a full explanation to him (see what I did there?)

More on Shannon theory etc. later!

# Student Promotion: Signal Processing Society Provides Steep Price Slash

Or SPSPSPSPS, for short. I’ve been over-busy and lax on posting, but I’ll provide some recap of ITA soon, as well as some notes from the Bellairs workshop I just came back from. The winter is a bit jarring. To the point of the subject:

In case you hadn’t heard, the IEEE Signal Processing Society is currently running a campaign that allows IEEE Student and Graduate Student members to join the SPS for free for the 2015 membership year. The promotion is running now through 15 August 2015. Only IEEE Student and Graduate Students are eligible, as this offer does not apply to SPS Student or Graduate Student members renewing their membership for 2015.

This link directs to the IEEE website with both IEEE Student membership and the free SPS Student membership in the cart.

If a student is already an IEEE Student of Graduate Student member, he/she can use the code SP15STUAD at checkout to obtain his/her free membership.

If you have any questions regarding the SPS Free Student Membership campaign or other membership items, please don’t hesitate to contact Jessica Perry at jessica.perry@ieee.org.

# WIFS 2014

This week I took a quick jaunt down to Atlanta to attend part of WIFS 2014 (co-located with GlobalSIP 2014). Kamalika and I were invited to give a talk on differential privacy and machine learning, based on our IEEE Signal Processing Magazine article. I’ve uploaded the slides of the tutorial to my website and we’re planning on making a video (audio over slides) version for SigView as well as on YouTube.

Much like last year, GlobalSIP had a somewhat disjointed, semi-chaotic feel (exacerbated by tiredness, I am sure) — it’s really a collection of semi-interacting workshops in the same space, and I knew people in several of the other workshops. Since I was there for a day and giving a tutorial at WIFS, I decided to stick with WIFS for the day. To give a sense of how confusing it all was, here’s a picture of the guide to deciphering the program book:

Overly-complicated rules for encoding sessions

The keynote for GlobalSIP was given by Vince Poor on information-theoretic privacy via rate distortion (this is the work with Lalitha). Vince did a good job of not over-IT-ing it I think, which was good because the audience was pretty diverse and it’s not clear that many of the people there had even taken a course on information theory. This seems to be the big challenge in multi-disciplinary conferences like GlobalSIP (or large signal processing conferences in general) — everyone is in signal processing, but it’s a big tent and it’s hard to reach everyone.

Min Wu was the keynote speaker for the WIFS workshop on the day I attended. Her talk, on “Exploring Power Network Signatures for Information Forensics” was about how to glean information from power fluctuations in networks, or electronic network frequency (ENF). Different processes or operations have different power demands — by matching these signatures to an observed signal (e.g. a video), one can make inferences about the time/location/integrity of the data. For example, were the audio and visual tracks in a video taken at the same time or merged later? This whole area is quite interesting, and while I was sort of aware of this work I hadn’t really read up on much of it.

Perhaps it was the end of the semester kicking in, but I sort of took terrible notes on most of the talks and poster sessions at the conference, so I can’t really write coherently about the papers I saw. Unfortunately I had to run back to teach the penultimate lecture in my class. I guess now that I have a “real job” this is going to be the way it works from now on. Kind of sad, really.

# Transactions on Signal and Information Processing Over Networks: now accepting papers

I’m an Associate Editor for the new IEEE Transactions on Signal and Information Processing Over Networks, and we are accepting submissions now. The Editor-In-Chief is Petar M. Djurić.

The new IEEE Transactions on Signal and Information Processing over Networks publishes high-quality papers that extend the classical notions of processing of signals defined over vector spaces (e.g. time and space) to processing of signals and information (data) defined over networks, potentially dynamically varying. In signal processing over networks, the topology of the network may define structural relationships in the data, or may constrain processing of the data.

To submit a paper, go to Manuscript Central.

Topics of interest include, but are not limited to the following:

• Distributed detection and estimation
• Distributed learning over networks
• Distributed target tracking
• Bayesian learning; Bayesian signal processing
• Sequential learning over networks
• Decision making over networks
• Distributed dictionary learning
• Distributed game theoretic strategies
• Distributed information processing
• Graphical and kernel methods
• Consensus over network systems
• Optimization over network systems

Communications, Networking, and Sensing

• Distributed monitoring and sensing
• Signal processing for distributed communications and networking
• Signal processing for cooperative networking
• Signal processing for network security
• Optimal network signal processing and resource allocation

Modeling and Analysis

• Performance and bounds of methods
• Robustness and vulnerability
• Network modeling and identification
• Simulations of networked information processing systems
• Social learning
• Bio-inspired network signal processing
• Epidemics and diffusion in populations

Imaging and Media Applications

• Image and video processing over networks
• Media cloud computing and communication
• Multimedia streaming and transport
• Social media computing and networking
• Signal processing for cyber-physicalsystems
• Wireless/mobile multimedia

Data Analysis

• Processing, analysis, and visualization of big data
• Signal and information processing for crowd computing
• Signal and information processing for the Internet of Things
• Emergence of behavior

Emerging topics and applications

• Emerging topics
• Applications in life sciences, ecology, energy, social networks, economic networks, finance, social sciences, smart grids, wireless health, robotics, transportation, and other areas of science and engineering

# Detection and Estimation: book recommendations?

It’s confirmed that I will be teaching Detection and Estimation next semester so I figured I would use the blog to conjure up some book recommendations (or even debate, if I can be so hopeful). Some of the contenders:

• Steven M. Kay, Fundamentals of Statistical Signal Processing – Estimation Theory (Vol. 1), Prentice Hall, 1993.
• H. Vincent Poor, An Introduction to Signal Detection and Estimation, 2nd Edition, Springer, 1998.
• Harry L. Van Trees, Detection, Estimation, and Modulation Theory (in 4 parts), Wiley, 2001 (a reprint).
• M.D. Srinath, P.K. Rajasekaran, P. K. and R. Viswanathan, Introduction to Statistical Signal Processing with Applications, Prentice Hall, 1996.

Detection and estimation is a fundamental class for the ECE graduate curriculum, but these “standard” textbooks are around 20 years old, and I can’t help but think there might be more “modern” take on the subject (no I’m not volunteering). Venu Veeravalli‘s class doesn’t use a book, but just has notes. However, I think the students at Rutgers (majority MS students) would benefit from a textbook, at least as a grounding.

Srinath et al. is what my colleague Narayan Mandyam uses. Kay is what I was leaning to before (because it seems to be the most widely used), but Poor’s book is the one I read. I think I am putting up the Van Trees as a joke, mostly. I mean, it’s a great book but I think a bit much for a textbook. So what do the rest of you use? Also, if you are teaching this course next semester, perhaps we can share some ideas. I think the curriculum might be ripe for some shaking up. If not in core material, at least in the kinds of examples we use. For example, I’m certainly going to cover differential privacy as a connection to hypothesis testing.

# Towards multi-sensor characterizations of pianos

As an undergraduate I became interested in how timbre can be used to identify musical instruments. This was largely due to my first UROP (undergraduate research gig) with Keith Martin at the MIT Media Lab. Keith’s thesis was on identifying musical instruments from spectral features, and I worked a bit on this under Ryan Rifkin in a later UROP. I’ve been catching up on podcasts during my commute to campus this week, and a semi-recent Science Friday piece on the Steinway factory was on deck for this morning.

The piece talks about work in Agnieszka Roginska‘s lab at NYU, and in particular work from a paper from last year on measuring radiation patterns in piano soundboards. The radiation patterns are pretty but a bit hard to interpret, largely because I’m way out of the acoustical signal processing world. However, what’s interesting to me is that we’re still largely focused on overtones/cepstral coefficients. I wonder about how one might discover more interesting features to characterize this data. (I know someone will suggest deep learning but color me a little skeptical).

As a side note, one of the recent popular articles from JASA is on the acoustics of coffee roasting.

# SPCOM 2014: some more talks (and a plenary)

I did catch Greg Wornell’s plenary at SPCOM, which was called When Bits Absolutely, Positively, Have to be There as Soon as Possible, a riff on this FedEx commercial, which is older than I am. The talk was on link-aware PHY-layer design– basically looking at how ARQ enables incremental redundancy, and how to do a sort of layered superposition + incremental redundancy scheme in the sequential setting as well as a “multi-path” setting where blocks can arrive out of order. This was really digging into the signal issues in a way that a lot of non-communication engineering information theorists may get squeamish about. The nice thing is that I think the engineering problem is approachable without knowing a lot of heavy-duty math, but still requires some careful analysis.

Communication and Compression Via Sparse Linear Regression
Ramji Venkataramanan
This was on building codewords and codebooks out of a lower-complexity code dictionary $A \in \mathbb{R}^{n \times ML}$ where each codeword is a superposition of $L$ columns, one each from groups of size $M$. Thus encoding is $A \beta$ where $\beta$ is a sparse vector. I saw a talk by Barron and Joseph from a previous ISIT about this, but the framework extends to rate distortion (achieving the rate distortion function), and channel coding. The main point is to lower the complexity of the code at the expense of the gap to optimal rate — encoding and decoding are polynomial time but the rate gap for rate-distortion goes to zero as $1/\log n$. Ramji gave a really nice and clear talk on this — I hope he puts the slides up!

An Optimal Varentropy Bound for Log-Concave Distributions
Mokshay’s talk was also really clear and excellent. For a distribution $f(X)$ on $\mathbb{R}^n$, we can define $\tilde{h}(X) = - \log f(X)$. The entropy is the expectation of this random variable, and the varentropy is the variance. Their main result is a upper bound on the varentropu of log concave distributions $f(X)$. To wit, $\mathrm{Var}(\tilde{h}) \le n$. This bound doesn’t depend on the distribution and is sharp if $f$ is a product of exponentials. They then use this to prove a universal bound on the deviation of $\tilde{h}$ from its expectation, which gives a AEP that doesn’t really assume anything about the joint distribution of the variables except for log-concavity. There was more in the talk, but I eagerly await the paper.
The insertion/deletion channel model is notoriously hard. Ankur proposed a new model where $0$‘s are “indestructible” — they cannot be inserted or deleted. This asymmetric model leads to new asymptotic bounds on the capacity. I don’t really work on this channel model so I can’t get the finer points of the results, but once nice takeaway was that asymptotically, each indestructible $0$ in the codeword lets us correct around $1/2$ a deletion more.