Quick thoughts on Trailhead

If you’re attending ISIT then you probably got an email about Trailhead, a graphical system which links papers at ISIT “based on how many authors they have in common in the references, and each paper is linked to the 4 closest neighbors.” It’s written by Jonas Arnfred, a student at EPFL. The search feature doesn’t seem to be working, but it’s a fun little app.
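Out of curiosity, here is a minimal sketch in Python of the linking rule as described; the paper names and reference lists below are invented for illustration, and the real system presumably works from parsed bibliographies:

    # Toy version of the linking rule: papers are scored by how many authors
    # their reference lists share, and each paper keeps its k closest neighbors.
    papers = {
        "paper_A": {"Shannon", "Cover", "Thomas"},
        "paper_B": {"Shannon", "Gallager", "Cover"},
        "paper_C": {"Arikan", "Gallager"},
        "paper_D": {"Knuth"},  # shares no referenced authors with anyone
    }

    def neighbors(papers, k=4):
        """Map each paper to up to k papers, ranked by shared referenced authors."""
        links = {}
        for p, refs in papers.items():
            scored = sorted(
                ((len(refs & other), q) for q, other in papers.items() if q != p),
                reverse=True,
            )
            links[p] = [q for score, q in scored[:k] if score > 0]
        return links

    print(neighbors(papers))
    # paper_D maps to [] -- the analogue of the unconnected papers below.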

I wonder how different the graph would look using something like the Toronto Paper Matching System, which is used by NIPS and ICML to match papers to reviewers. One could even imagine a profiler to help you pick out papers of interest: upload 10 papers of your own, or 10 papers you find interesting, and it could re-visualize the conference from that viewpoint.

I was interested in the 19 papers which had no connections. Here are a few, randomly sampled:

  • Fish et al., Delay-Doppler Channel Estimation with Almost Linear Complexity
  • Song and İşcan, Network Coding for the Broadcast Rayleigh Fading Channel with Feedback
  • Bilal et al., Extensions of \mathbb{Z}_2\mathbb{Z}_4-Additive Self-Dual Codes Preserving Their Properties
  • Bernal-Buitrago and Simón, Partial Permutation Decoding for Abelian Codes
  • Kovalev and Pryadko, Improved quantum hypergraph-product LDPC codes
  • Price and Woodruff, Applications of the Shannon-Hartley Theorem to Data Streams and Sparse Recovery
  • Willems, Constrained Probability
  • Nowak and Kays, On Matching Short LDPC Codes with Spectrally-Efficient Modulation

They seem to run the gamut, topic-wise, but I think one would be hard-pressed to find many unlinked multi-user information theory papers.

On the other side, there’s a little cluster of quantum information theory papers which all have similar citations, unsurprisingly. They show up as a little clique-ish thing on the bottom right in my rendering (the placement may be random).

Who are my neighbors in the graph?

  • Strong Secrecy in Compound Broadcast Channels with Confidential Messages — Rafael F. Wyrembelski, Holger Boche
  • Lossy Source-Channel Communication over a Phase-Incoherent Interference Relay Channel — H. Ebrahimzadeh Saffar, M. Badiei Khuzani and P. Mitran
  • Shannon Entropy Convergence Results in the Countable Infinite Case — Jorge Silva, Patricio Parada
  • Non-adaptive Group Testing: Explicit bounds and novel algorithms — Chun Lam Chan, Sidharth Jaggi, Venkatesh Saligrama, Samar Agnihotri
  • Non-coherent Network Coding: An Arbitrarily Varying Channel Approach — Mahdi Jafari Siavoshani, Shenghao Yang, Raymond Yeung

Thresholds in random integer programs

Karthik Chandrasekaran gave a talk at TTI today on the feasibility of integer programs. Given a polytope defined by the inequalities A x \le b in dimension n, can we say whether the polytope contains an integer point? In general, the problem is NP-hard, but efficient algorithms are known for special sub-cases. The goal in this talk was to understand whether random instances of the problem are also hard.

The first thing to figure out is what we mean by a random instance. Consider a point x_0 and the sphere S of radius R around x_0. Now draw m vectors x_1, x_2, \ldots, x_m uniformly from the surface of the unit sphere, and consider the polytope whose i-th face is tangent to S at x_0 + R x_i for i = 1, 2, \ldots, m; that is, x_i is the outward normal of the face through the point x_0 + R x_i. This defines a random polytope whose distribution depends on the parameters (n, m, x_0, R). The goal is to find how R should scale with n and m so that with high probability the polytope contains an integer point for all x_0.
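Here is a quick sketch of how one might sample such an instance (the helper names are mine); on my reading of the construction, the constraint for face i works out to x_i^T (x - x_0) \le R:

    import numpy as np

    # Sample the random polytope described above: m unit normals x_i, each face
    # tangent to the radius-R sphere around x0, so the i-th constraint is
    # x_i^T (x - x0) <= R. Returns (A, b) for the polytope { x : A x <= b }.
    def random_polytope(n, m, x0, R, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((m, n))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # rows ~ uniform on sphere
        return X, R + X @ x0

    # Cheap feasibility heuristic (not a certificate): round x0 and test it.
    def rounded_point_feasible(A, b, x0):
        z = np.round(x0)
        return bool(np.all(A @ z <= b + 1e-9))

    n, m = 10, 50
    x0 = np.full(n, 0.5)
    A, b = random_polytope(n, m, x0, R=np.sqrt(n) / 2)
    print(rounded_point_feasible(A, b, x0))  # True: R = sqrt(n)/2 always works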

If the radius R is too small, the polytope may simply be too skinny to be guaranteed to capture an integer point for every x_0. On the other hand, if R = \sqrt{n}/2 we will always have an integer point: the ball of radius R around x_0 is inscribed in the polytope, and every point of \mathbb{R}^n is within distance \sqrt{n}/2 of the integer lattice. What should the real scaling for R look like?

The simple form of the main result is that if n \le m \le 2^{\sqrt{n}} and R \ge c_1 \sqrt{\log(m/n)}, then with high probability the polytope will have an integer point for every x_0. Conversely, if R \le c_0 \sqrt{\log(m/n)}, then with high probability the polytope “centered” at x_0 = (1/2, 1/2, \ldots, 1/2) will not contain an integer point. Note that if the number of faces is linear in n, then a constant radius is sufficient. I’m trying to square that with the “infinite-dimensional spheres have vanishing volume and expanding surface area” intuition, but I think the fact that the polytope is “pointy” means that L_1 geometry gives better intuition.

To prove these bounds on R, they make a connection to the discrepancy of random Gaussian matrices, which approximate the matrices with random unit-vector rows. The paper is on the arXiv for those who want to take a look.

EDM Forum 2012

On Saturday I attended the Electronic Data Methods (EDM) Forum Symposium in Orlando. The focus of the workshop was how to build infrastructure for sharing clinical data to improve patient care. This comes in two flavors: quality improvement (QI), which refers to learning from clinical data much like a feedback loop, and patient-centered outcomes research (PCOR) or comparative effectiveness research (CER), which looks at how patient outcomes vary across different treatments. There’s a lot of hope that moving to electronic health records (EHRs) can facilitate these kinds of studies, but the upshot of the workshop was that there are a lot of practical impediments.

One big issue that came up was essentially how EHRs are used, and how hard it is to get the data in them out in a consistent and quantifiable way. Physicians record results in idiosyncratic ways, and in order to get practicing physicians to buy in, the data format of EHRs is rather flexible, resulting in huge headaches for anyone trying to extract a data table out of a database of EHRs. Much of the data is in running text — NLP approaches are improving, but extraction is far from automated.

Once the data is extracted, it turns out to be quite noisy and poorly validated. Sometimes it’s a case of garbage in: the data was not recorded properly in the first place. Other times, it’s due to miscalibration. There were a number of talks (which I missed) dedicated to this. Then there are questions of whether the data you have collected is representative. If you are trying to draw inferences across multiple sites, how do you appropriately account for confounding factors such as demographic differences? This is the kind of thing that can plague even a single-site observational study, but it becomes particularly acute for multi-site investigations.

Finally, even if each site can extract a more-or-less clean data set, you have the problem of sharing this data. This raises headaches from a policy perspective as well as a technological one. On the policy side, each site has its own IRB and its own review, and many institutions are hesitant to cede authority to third-party or federated IRBs. For a small number of sites, a policy and technology framework can be worked out, but scaling these systems up and providing oversight is going to raise new challenges that we probably cannot anticipate. Even if two sites want to share data, they have to implement privacy protections, and depending on the kind of data being shared, technologies may not even exist to mask patient identities — biological samples are inherently problematic in this regard, but even sharing a data table is non-trivial. Apart from the privacy concerns, creating a common schema for the data to be shared sounds like an obvious thing to do, but if the two sites are using different EHR software… well, let’s say it’s not as easy as sharing PowerPoint from Mac to PC.

All in all, I came away feeling like the state of the art is both depressing and invigorating — there’s a lot to do, and I just hope that the short time frame that people go on about doesn’t result in half-baked partial solutions becoming the standard. There are a lot of questions from basic statistics through distributed system design here, so maybe after chewing on it a while I’ll get some new problem ideas.

Linkage

Another cool optical illusion.

I recently visited Taos, NM, and the sky there was so clear you could see countless stars. I was listening today to Debussy’s Arabesque #1 and it brought back memories of Jack Horkheimer’s Star Hustler (cf. this episode from 1991). Horkheimer passed away in 2010, but his show was a PBS staple.

A series of blog posts about quantitatively assessing whether America is becoming more secular: Parts one, two, and three.

Ian Hacking’s introduction to the new edition of Thomas Kuhn’s The Structure of Scientific Revolutions (via MeFi).

More reasons to miss California. I do like Chicago, but… dumplings!

CFP: IEEE Signal Processing Magazine Special Issue on Signal Processing for Cyber-security and Privacy

CALL FOR PAPERS
IEEE Signal Processing Society

IEEE SIGNAL PROCESSING MAGAZINE
Special Issue on Signal Processing for Cyber-security and Privacy

Aims and Scope:
Information technology and electronic communications have been rapidly applied to many spheres of human activity, including commerce, medicine and social networking. This has led to the creation of massive electronic repositories for distributed information storage and processing, which enable access by a large number of authorized users. The need for timely access to electronic data makes it imperative to guarantee the security and privacy of this data. Traditionally, electronic data security has been ensured via cryptographic techniques, but these distributed data systems require security and privacy mechanisms at all levels of the system. Thus, providing precise guarantees on the security and privacy of electronic information requires leveraging a range of information processing techniques beyond traditional cryptography to ensure secure distributed storage and access mechanisms.

The problems of information exchange, interaction, and access lend themselves to fundamental information processing abstractions and theoretical analysis. The tools of rate-distortion theory, distributed compression algorithms, distributed storage codes, machine learning for feature identification and suppression, and compressive sensing and sampling theory are fundamental and can be applied to precisely formulate and quantify the tradeoff between utility and privacy in a variety of domains. Thus, while rate-distortion theory and information-theoretic security can provide fundamental bounds on privacy and security leakage of distributed data systems, the information and signal processing techniques of compressive sensing, machine learning, and graphical models are the key ingredients necessary to achieve these performance limits in a variety of applications involving streaming data (smart grid, intelligent data collection), distributed data storage (cloud), and interactive data applications across a number of platforms.

This special issue seeks to provide a venue for ongoing research in information and signal processing for security and privacy applications across a wide variety of domains, from communication media (e.g., ranging from wireless networks at the edge to optical backbones at the core of the Internet) to computer systems (e.g., ranging from traditional computer architectures to distributed systems, including cloud computing).

Topics of Interest include (but are not limited to):

  • Signal processing for information-theoretic security
  • Data mining and analysis for anomaly and intrusion detection
  • Forensic analysis: device identification, recovery of lost/corrupted information
  • Information processing in the encrypted domain
  • Security in distributed storage systems
  • Codes for security in distributed storage and cloud computing
  • Location privacy and obfuscation of mobile device positioning
  • Physical layer security methods: confidentiality and authentication
  • Secure identity management
  • Formalized models for adversaries and threats
  • Techniques to achieve covert or stealthy communication
  • Stochastic models for large data repositories and for streaming data in cyber-physical systems

Submission Process:
Articles submitted to this special issue must have significant relevance to signal processing and its application to security and privacy. All submissions will be peer reviewed according to the IEEE and Signal Processing Society guidelines. Submitted articles should not have been published, or be under review, elsewhere. Manuscripts should be submitted online using the Manuscript Central interface. Submissions to this special issue of the IEEE SIGNAL PROCESSING MAGAZINE should have significant tutorial value. Prospective authors should consult the Magazine site for guidelines and information on paper submission.

Important Dates: Expected publication date for this special issue is September 2013.

Guest Editors:
Lalitha Sankar, Lead GE, Arizona State University, USA, lalithasankar@asu.edu
H. Vincent Poor, Princeton University, USA, poor@princeton.edu
Mérouane Debbah, Supelec, Gif-sur-Yvette, France, merouane.debbah@supelec.fr
Kannan Ramchandran, University of California Berkeley, USA, kannanr@eecs.berkeley.edu
Wade Trappe, Rutgers University, USA, trappe@winlab.rutgers.edu

Linkage

Somehow, I had never heard of the Arnold cat map. Meow.

I am definitely guilty of reading and walking at the same time.

Serious Eats Chicago ate all the things at Hot Doug’s, to which I have still not gone.

The Bombay Royale is an Australian band that covers 60s era Bollywood tunes. They have a new album and a video for the title track. You can also get the mp3.

PZ Myers takes Kevin Drum to task for lazy utilitarian arguments.

The Matrix Determinant Lemma

I had never really heard of this result, sometimes called the Matrix Determinant Lemma, but it came up in the process of answering a relatively simple question. Suppose I have an M-dimensional jointly Gaussian vector \mathbf{X} with covariance matrix A. The differential entropy of \mathbf{X} is \frac{1}{2} \log\left( (2 \pi e)^M \det(A) \right). Suppose now I consider some rank-1 perturbation B = A + u u^T. What choice of u maximizes the differential entropy?

On the face of it, this seems intuitively easy — diagonalize A and then pick u along the eigenvector corresponding to the smallest eigenvalue of A (for a covariance matrix, the eigenvalues and singular values coincide). But is there a simple way to see this analytically?

Matrix Determinant Lemma. Let A be an M \times M positive definite matrix and U and V be two M \times k matrices. Then

\det(A + U V^H) = \det(A) \det(I_k + V^H A^{-1} U).

To see this, note that

\left[ \begin{array}{cc} A & -U \\ V^{H} & I \end{array} \right] = \left[ \begin{array}{cc} A & 0 \\ V^{H} & I \end{array} \right] \cdot \left[ \begin{array}{cc} I & - A^{-1} U \\ 0 & I + V^H A^{-1} U \end{array} \right],

and take determinants on both sides: the two factors on the right are block triangular, with determinants \det(A) and \det(I_k + V^H A^{-1} U), while the determinant of the left side equals \det(A + U V^H) by the Schur complement formula applied to the lower-right identity block.

So now applying this to our problem,
\det(A + u u^T) = \det(A) \left( 1 + u^T A^{-1} u \right).
For fixed \|u\|, the right side is clearly maximized by choosing u along the eigenvector corresponding to the largest eigenvalue of A^{-1}, which is the eigenvector for the smallest eigenvalue of A. Ta-da!
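A quick numerical sanity check of both the lemma and the conclusion, with arbitrary test matrices (a sketch, not part of the original argument):

    import numpy as np

    rng = np.random.default_rng(1)
    M, k = 5, 2
    G = rng.standard_normal((M, M))
    A = G @ G.T + M * np.eye(M)      # a positive definite "covariance"
    U = rng.standard_normal((M, k))
    V = rng.standard_normal((M, k))

    # det(A + U V^T) = det(A) det(I_k + V^T A^{-1} U)
    lhs = np.linalg.det(A + U @ V.T)
    rhs = np.linalg.det(A) * np.linalg.det(np.eye(k) + V.T @ np.linalg.inv(A) @ U)
    print(np.isclose(lhs, rhs))      # True

    # Rank-1 case: among unit vectors u, det(A + u u^T) is largest for the
    # eigenvector of A with the smallest eigenvalue (eigh sorts ascending).
    w, Q = np.linalg.eigh(A)
    dets = [np.linalg.det(A + np.outer(u, u)) for u in Q.T]
    print(int(np.argmax(dets)) == 0)  # True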

Teatro Luna: Living Large in a Mini Kind of Way

I saw Teatro Luna’s Living Large in a Mini Kind of Way, by Diane Rodriguez. Teatro Luna usually does pieces written by their own collective, so it was a shift for them to do someone else’s play. Living Large tells the story of Lilly, a Latina who has made it into a nice neighborhood in LA and is running for head of the neighborhood watch, but who has just lost her husband, Joe, and cannot face the reality of her new lonely life — she hides bills in grocery bags in the closet and lives under the illusion that Joe has left her well cared for. In the meantime, she tries to teach English and refinement to two domestic workers, Big Maria and Little Maria. She’s sure they have their papers (they don’t), and she strikes upon a brilliant idea to get one or both of them to move in with her. As the prospect of this increased contact looms, the comfortable deceits start to unravel. The play is a refreshing tragicomedy and strikes at the heart of the class differences and divisions in the Latino community. It’s well worth seeing, even if it is a little out of the way (The Viaduct near Western and Belmont).

Readings

A Moveable Feast (Kenneth F. Kiple): The first half is a somewhat condensed version of the Cambridge World History of Food and covers different plants and foodstuffs from around the world. The rest of the book is about how food exchange first diversified and then homogenized our eating habits over time. The only problem with the book is that it has a fair bit of apocrypha and debunked origin stories, so YMMV. I enjoyed it.

Are You My Mother? (Alison Bechdel): Bechdel’s memoir about her relationship with her mother. It is stuffed to the brim with references to D.W. Winnicott, which can be a plus or a minus depending on whether you like psychoanalysis. I thought it was engaging and worth reading, but to be honest I am not sure to whom I would recommend it. I feel like if you read the synopsis and think it sounds interesting, you will like it, and if not, you won’t.

The Learners (Chip Kidd): This is a follow-up to The Cheese Monkeys, which I rather enjoyed. The Learners is a little leaner but still has those nerdy and fun (to me, tedious to others) asides on the art of graphic design and typography. The Milgram experiment features prominently, so if you are fascinated by that you might also like this as a piece of (sort of) historical fiction.

This Is A Bust (Ed Lin): A novel set in New York’s Chinatown in the 1970s, featuring Vietnam vet and alcoholic token Chinese cop Robert Chow as he struggles to turn his life around and find himself. It’s the first in a series and I will probably read the rest. Recommended for those who like detective novels.

Fisher, Neyman, and the Creation of Classical Statistics (Erich L. Lehmann): The title says it all. It’s more about the personalities and their history than about the math, but there’s a nice discussion at the end of Fisher’s confusing notion of fiducial probability and Neyman’s confidence intervals. I think it’s hard to put yourself back in that time period, when people really didn’t have a good sense of how to think about these kinds of statistical problems (maybe we still don’t). Fisher’s work has become near-dogma now (unfortunately), but it’s interesting to see how these basic frequentist methods came about historically. Plus you get to learn more about the enigmatic “Student”! Recommended for those with an interest in the history of statistics.

IEEE PDF “Express”

So IEEE wants PDFs that appear on IEEExplore to have two properties:

  • all fonts are embedded
  • the compatibility level is 1.4

Seems simple, right? Except that their instructions for PDF Express are for those who use Adobe Distiller, which I don’t have. You’d think there would be a simple workaround, but no…

This post suggests using ps2pdf command line options, which works if all of your figures are in EPS, but not if you have PDF or JPG figures. Daniel Lemire suggests converting the PDF to PS and then back to PDF.

That didn’t really work for me — I alternately got errors saying they wanted Adobe version 5 or higher (corresponding to compatibility level 1.4) or that fonts were not embedded. I blame Mac OS. On the 10th attempt at uploading, I finally got it to work. Here’s what I did:

  1. Generate the PDF however you like (command line or TeXShop)
  2. Open the PDF in Preview, duplicate, and save a copy. This will embed the fonts but make the PDF version 1.3 or something. Say the file is called copy.pdf.
  3. In a terminal, run pdf2ps copy.pdf to generate copy.ps. This will create a PS file with the fonts embedded.
  4. Run ps2pdf14 -dEmbedAllFonts=true copy.ps to generate a new version of copy.pdf that is both version 1.4 and has all fonts embedded. (A small script wrapping steps 3 and 4 is sketched below.)
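For anyone who wants to script the terminal half of this, here’s a small Python wrapper around steps 3 and 4 (the helper name is mine; it assumes Ghostscript’s pdf2ps and ps2pdf14 are on your PATH, and step 2 still has to be done by hand in Preview):

    import subprocess
    import sys

    # Wrap steps 3 and 4: PDF -> PS (keeping the embedded fonts) -> PDF 1.4.
    def ieee_ify(pdf_in, pdf_out):
        ps_tmp = pdf_in.replace(".pdf", ".ps")
        subprocess.run(["pdf2ps", pdf_in, ps_tmp], check=True)
        subprocess.run(["ps2pdf14", "-dEmbedAllFonts=true", ps_tmp, pdf_out],
                       check=True)

    if __name__ == "__main__":
        ieee_ify(sys.argv[1], sys.argv[2])  # e.g. python ieee_ify.py copy.pdf out.pdf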

This is dumb. I wasted about an hour on this idiocy and still don’t understand why it’s such a pain. It seems that on a Mac, dvips does not embed fonts properly by default, and pdflatex also cuts corners. Furthermore, it doesn’t seem like one can pass command line options (and make them default in TeXShop) to automate this process.

I am sure there are better ways of doing this, but for the time being, this at least works.