Mental health in graduate school

I recently posted a link to an article on mental health in graduate school on Facebook (via a grad school friend of mine), and it sparked a fair bit of discussion there. The article is worth reading, and I am sure it will resonate with many readers. The discussion veered towards the particular pressures of graduate school in STEM, and the contributors to mental stress that are driven by funding structures and the advisor/student relationship. The starting point comes from this part of the article:

In this advisor-advisee arrangement, the student trades her labor as a researcher for the advisor’s mentorship and, ultimately, the advisor’s approval of her degree before she can graduate. For students seeking an academic position after graduate school, an advisor’s letter of recommendation can be the difference between landing a job and being left out in the cold, a harsh reality given today’s sparse academic job market. All of these factors mean that the faculty advisors hold tremendous power in the advisor-advisee relationships. They are the gatekeepers of success in the graduate endeavor.

This notion of “trading labor for mentorship” is most directly monetized in grant-funded fields like engineering, where graduate students are “working in the lab” on a project that is (hopefully) related to their thesis topic. In some cases this works out fine, but in others the research for the grant-relevant project does not contribute directly to the thesis. Funding agencies want “deliverables,” and the pressure to produce results on schedule creates a tension. The advisor becomes a boss.

Some of the points raised in the discussion on Facebook seemed important to bring out to a wider audience. One suggestion is to disentangle NSF support for projects and research from grad student salaries: students could apply for NSF support themselves and then take their funding with them to find an advisor. In STEM this would be difficult, given the large number of international students who would not be eligible for such support, but it would give students some power to walk away from a bad situation and give PIs more incentive to be mentors rather than bosses. I am not entirely convinced it would help in terms of mental health though — students need more and better mentoring, not just the means to walk away. Also, as Roy pointed out, having the student and advisor both convinced that a problem is important and solvable creates a shared commitment that helps students feel less isolated. For postdocs, though, this model would be a significant improvement over the status quo. Right now, there is almost no consensus on what a postdoc should be, and I’ve seen postdoc jobs that range from factotum to co-PI.

Now that I am on the other side (post-PhD), it’s tempting to say that grad school would have been easier if I had been a bit more organized or had better time-management skills. Perhaps the difficulties one has can be solved with “one weird trick.” I think that’s terribly naïve. As advisors, we definitely can do things to help students learn to work better — that’s the transition from being a student to being a researcher. But the notion that depression comes about as a result of simply not being productive enough, or feeling behind, or any other “outcomes”-based reason, misses the environmental and social factors that are equally important.

Graduate research is often very isolating. Perhaps some STEM students actually enjoy this kind of solitary work, but generalizing is dangerous. Having a grad student social organization, weekly happy hour, softball league, or other “outlet” isn’t enough. I used my startup funds to help buy a table-tennis table for my department at Rutgers, and while the students seem pretty happy about it, it’s not actually creating a community. One important question to ask is how the faculty and the department can help create and support that kind of community so that it can sustain itself organically.

In a department like mine, the majority of graduate students are international and have a host of other stressors that come with being in a new (and often much more expensive) country. Using mental health resources may not be normalized in their home country or culture. Regardless of where they are from, however, the big challenge is this:

…awareness of the existing resources among the graduate student population remains frustratingly low, due in part to the insular nature of traditional academic departments. A broader culture of wellness may prove even more elusive in the face of a rigidly hierarchical academic culture that often rewards drive and sacrifice without encouraging balance. In this climate, graduate student mental health advocates—students, staff, and administrators—face an uphill struggle in the years to come. The consequences of this struggle tear at the very fabric of the academic experience and suggest fundamental misalignment of priorities.

It’s only a misalignment of priorities if we don’t interrogate our priorities. This isn’t two trains crashing into each other, but it does require a “structural” recognition that graduate students are a part of the family, as it were, and treating them as such.

Linkage

My friend Cynthia and her friends have a tumblr on inclusivity in STEM. See also the quarterly Model View Culture, which I think I had seen an article from but didn’t realize it was a whole journal. Thanks to Lily Irani for the link.

This list of streamable Errol Morris movies is dangerous.

Maybe when I am in Bangalore I will get to learn more about The Ugly Indian.

How Chicago’s neighborhoods got their names. It does not explain Mr. Wicker’s crazy hat though.

Alex Smola gave a talk at DIMACS recently where he talked about the alias method for generating biased random variables. I think he even snagged the figures from that website as well…
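
The alias method comes up often enough that it seems worth keeping a sketch handy. Here is my own minimal take on Vose’s version of the construction (nothing here is specific to the talk or to the website the figures came from):

```python
import numpy as np

def build_alias_table(weights):
    """Vose's alias method: O(n) setup, O(1) per sample afterwards."""
    n = len(weights)
    prob = np.asarray(weights, dtype=float) * n / np.sum(weights)
    alias = np.zeros(n, dtype=int)
    small = [i for i in range(n) if prob[i] < 1.0]
    large = [i for i in range(n) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                         # top up column s with mass from l
        prob[l] -= 1.0 - prob[s]             # l keeps whatever mass is left over
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:                  # leftovers are full columns (up to rounding)
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias, rng):
    """Pick a column uniformly, then flip a biased coin to stay or jump to the alias."""
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
    draws = [alias_sample(prob, alias, rng) for _ in range(100000)]
    print(np.bincount(draws) / len(draws))   # should be close to [0.1, 0.2, 0.3, 0.4]
```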

more not-so-recent hits from ArXiV

arXiv:1403.4696 [math.OC]
Design and Analysis of Distributed Averaging with Quantized Communication
Mahmoud El Chamie, Ji Liu, Tamer Başar

The goal of this paper is to analyze the “performance of a subclass of deterministic distributed averaging algorithms where the information exchange between neighboring nodes (agents) is subject to uniform quantization.” I was interested in the connections to Lavaei and Murray’s TAC paper. Here though, they consider a standard consensus setup with a doubly stochastic weight matrix W and deterministic, rather than randomized, quantization. They consider two types — rounding and truncation (essentially the floor operation). The update rule is x_i(t+1) = x_i(t) + \sum_{j} W_{ij} (Q(x_j(t)) - Q(x_i(t))), where Q is the quantization operation. They show that in finite time the agents either reach a consensus on the floor of the average of their initial values, or they cycle indefinitely in a neighborhood around the average. They then show how to control the size of the neighborhood in a decentralized way. There are a lot of works on quantized consensus that have appeared in the last 5 years, and to be honest I haven’t really kept up on the recent literature, so I’m not sure how to compare this to the other works that have appeared, but perhaps some of the readers of the blog have…
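
To get a feel for the dynamics, here is a toy simulation of the update rule above with a rounding quantizer. The ring graph and the particular doubly stochastic weight matrix are my own choices for illustration, not something from the paper:

```python
import numpy as np

def ring_weights(n, w=0.3):
    """Doubly stochastic weight matrix for a ring of n nodes (illustration only)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] += w
        W[i, (i + 1) % n] += w
        W[i, i] += 1.0 - 2 * w
    return W

def quantized_consensus(x0, W, Q=np.round, steps=200):
    """Iterate x(t+1) = x(t) + W Q(x(t)) - Q(x(t)) and return the trajectory."""
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        q = Q(x)
        # For row-stochastic W this equals x_i + sum_j W_ij (Q(x_j) - Q(x_i)).
        x = x + W @ q - q
        traj.append(x.copy())
    return np.array(traj)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.uniform(0, 10, size=8)
    traj = quantized_consensus(x0, ring_weights(8))
    print("average of initial values:", x0.mean())
    print("final states:", traj[-1])
```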

arXiv:1403.4699 [math.OC]
A Proximal Stochastic Gradient Method with Progressive Variance Reduction
Lin Xiao, Tong Zhang

This paper looks at convex optimization problems of the form

\min_{x \in \mathbb{R}^d} F(x) + R(x)

where the overall objective F(x) + R(x) is strongly convex, the regularizer R(x) is lower semicontinuous and convex, and the term F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) separates into a sum of functions f_i(x) whose gradients are Lipschitz continuous. The proximal gradient method is an iterative procedure for solving this program that does the following:

x_{t} = \mathrm{argmin}_{x \in \mathbb{R}^d} \left\{ \nabla F(x_{t-1})^{\top} x + \frac{1}{2 \eta_t} \| x - x_{t-1} \|^2 + R(x) \right\}

If we define the \mathrm{prox}_{R} function as

\mathrm{prox}_R(y) = \mathrm{argmin}_{x \in \mathbb{R}^d} \left\{ \frac{1}{2} \|x - y\|^2 + R(x) \right\}

then the step looks like:

x_t = \mathrm{prox}_{\eta_t R}(x_{t-1} - \eta_{t} \nabla F(x_{t-1}))

A stochastic gradient (SG) version of this is

x_t = \mathrm{prox}_{\eta_t R}(x_{t-1} - \eta_{t} \nabla f_{i_t}(x_{t-1}))

where i_t is sampled uniformly from \{1,2,\ldots, n\} at each time. The advantage of the SG variant is that each iteration takes less time, but it is also much noisier. The goal of this paper is to adapt a previous method/approach to variance reduction to improve the performance of the Prox-SG algorithm. The approach is one of resampling points according to the Lipschitz constants. This sort of “sampling-based adaptivity” was also used by my ex-colleague Samory Kpotufe and collaborators in their NIPS paper from 2012 (a longer version is under review). At least I think they’re related.
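
To make the plain Prox-SG step above concrete, here is a toy sketch for the lasso, where R(x) = \lambda \|x\|_1 and the prox operator reduces to soft-thresholding. This is just the baseline stochastic update, not the variance-reduced method the paper proposes:

```python
import numpy as np

def soft_threshold(y, tau):
    """prox of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_sg_lasso(A, b, lam, eta=0.01, iters=5000, seed=0):
    """Plain Prox-SG for (1/n) sum_i (a_i^T x - b_i)^2 / 2 + lam ||x||_1.

    One data point is sampled per step; no variance reduction."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for t in range(iters):
        i = rng.integers(n)                                # i_t ~ Uniform{1,...,n}
        grad_i = (A[i] @ x - b[i]) * A[i]                  # gradient of f_{i_t} at x_{t-1}
        x = soft_threshold(x - eta * grad_i, eta * lam)    # x_t = prox_{eta R}(x_{t-1} - eta grad)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 10))
    x_true = np.zeros(10)
    x_true[:3] = [2.0, -1.0, 0.5]
    b = A @ x_true + 0.1 * rng.standard_normal(200)
    print(np.round(prox_sg_lasso(A, b, lam=0.1), 2))
```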

arXiv:1403.5341v1 [cs.LG]
An Information-Theoretic Analysis of Thompson Sampling
Daniel Russo, Benjamin Van Roy

In a multi-armed bandit problem we have a set of actions (arms) A and at each time the learner picks an action a and observes an outcome Y_t(a) \in \mathcal{Y} which is assigned a reward by a function R: \mathcal{Y} \to \mathbb{R}. The rewards are assumed to be i.i.d. across time for each action with distributions p(y | a) that are unknown to the learner. The goal is to maximize the reward, which is the same as finding the arm with the largest expected reward. This leads to a classical explore/exploit tradeoff where the learner has to decide whether to explore new arms which may have higher expected reward, or continue exploiting the reward offered by the current arm. Thompson sampling is a Bayesian approach where the learner starts with a prior on the best action and then samples actions at each time according to its posterior belief on the best arm. The authors here analyze the regret of such a policy in terms of what they call the information gain of the system. This gain depends on the ratio between two quantities that are functions of the outcome distributions p(y | a). One is what they call the “divergence in mean,” namely the difference in expected reward between arms, and the other is the KL divergence.
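
For readers who have not seen Thompson sampling before, here is a sketch of the textbook special case of a Bernoulli bandit with independent Beta(1,1) priors. The paper’s analysis is far more general; this is only meant to show the sample-then-act structure:

```python
import numpy as np

def thompson_bernoulli(true_means, horizon=5000, seed=0):
    """Thompson sampling for a Bernoulli bandit with independent Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)   # 1 + number of observed successes per arm
    beta = np.ones(k)    # 1 + number of observed failures per arm
    total_reward = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)      # sample a mean for each arm from its posterior
        a = int(np.argmax(theta))          # play the arm that looks best under the sample
        reward = float(rng.random() < true_means[a])
        alpha[a] += reward
        beta[a] += 1.0 - reward
        total_reward += reward
    return total_reward, alpha, beta

if __name__ == "__main__":
    reward, alpha, beta = thompson_bernoulli([0.3, 0.5, 0.6])
    print("total reward:", reward)
    print("posterior means:", np.round(alpha / (alpha + beta), 3))
```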

Differential privacy and the AUC

One of the things I’m always asked when giving a talk on differential privacy is “how should we interpret \epsilon?” There are a lot of ways of answering this, but one way that seems to make more sense to people who actually think about risk, hypothesis testing, and prediction error is through the “area under the curve” metric, or AUC. This post came out of a discussion from a talk I gave recently at Boston University, and I’d like to thank Clem Karl for the more detailed questioning.

technical writing and the “language barrier”

One thing that strikes me about US graduate programs in electrical engineering is that the student population is overwhelmingly international. For most of these students, English is a second or third language, and so we need to adopt more “ESL”-friendly pedagogical approaches to teaching writing. I came across a blog post from ATTW by Meg Morgan from UNC Charlotte that raises a number of interesting issues. For one, the term “ESL” is perhaps problematic. The linguistic and social differences in pedagogy between other countries and the US mean that we need to use different methods for engaging the students.

In terms of teaching technical writing at the graduate level, the issues may be similar but the students are generally older — they may have even had some writing experience from undergraduate or masters-level research. How should the “ESL” issue affect how we teach technical writing?

How many people have “met Shannon?”

I saw a paper on ArXiV yesterday called Kalman meets Shannon, which got me thinking: in how many papers has someone met Shannon, anyway? Krish blogged about this a few years ago, but since then Shannon has managed to meet some more people. I plugged “meets Shannon” into Google Scholar, and out popped a whole list of results.

Sometimes people are meeting Shannon, and sometimes he is meeting them, but each meeting produces at least one paper.

tracks (a Maundy Mélange)

A bit of the new, a bit of the old, for this Maundy Thursday.

  1. You Can Never Hold Back Spring (Tom Waits)
  2. Le Gars qui vont à la fête (Stutzmann/Södergren, by Poulenc)
  3. Judas mercator pessimus (King’s Singers, by Gesualdo)
  4. Calling (Snorri Helgason)
  5. Hold Your Head (Hey Marseilles)
  6. Soutoukou (Mamadou Diabate)
  7. A Little Lost (Nat Baldwin)
  8. Gun Has No Trigger (Dirty Projectors)
  9. Stranger to My Happiness (Sharon Jones & The Dap-Kings)
  10. Dama Dam Mast Qalandar (Red Baraat)
  11. Libra Stripes (Polyrhythmics)
  12. Jaan Pehechan Ho (The Bombay Royale)
  13. Jolie Coquine (Caravan Palace)
  14. The Natural World (CYMBALS)
  15. Je Ne Vois Que Vous (Benjamin Schoos feat. Laetitia Sadier)
  16. Romance (Wild Flag)

Teaching technical (re-)writing

I think it would be great to have a more formal way of teaching technical writing for graduate students in engineering. It’s certainly not being taught at (most) undergraduate institutions, and the mistakes are so common across the examples that I’ve seen that there must be a way to formalize the process for students. Since we tend to publish smaller things a lot earlier in our graduate career, having a “checklist” approach to writing/editing could be very helpful to first-time authors. There are several coupled problems here:

  • students often don’t have a clear line of thought before they write,
  • they don’t think of who their audience is,
  • they don’t know how to rewrite, or indeed how important it is.

Adding to all of this is that they don’t know how to read a paper. In particular, they don’t know what to be reading for in terms of content or form. This makes the experience of reading “related work” sections incredibly frustrating.

What I was thinking of is a class where students learn to write a (small) literature review on a topic of their choosing. The first part would be how to read papers and make connections between them. What is the point of a literature review, anyway? The first objective is to develop a more systematic way of reading and processing papers. I think everyone I know professionally, myself included, learned how to do this in an ad-hoc way. I believe that developing a formula would help improve my own literature surveying. The second part of the course would be teaching about rewriting (rather than writing). That is, instead of providing rules like “don’t use the passive voice so much” we could focus on “how to revise your sentences to be more active.” I would also benefit from a systematic approach to this for my own writing.

I was thinking of a kind of once-a-week writing-seminar-style class. Has anyone seen a class like this in engineering programs? Are there tips/tricks from other fields/departments which do have such classes? Even though it is “for social scientists,” Howard Becker’s book is a really great resource.

Some (not-so-)recent hits from ArXiV

I always end up bookmarking a bunch of papers from ArXiV and then looking at them a bit later than I want. So here are a few notes on some papers from the last month. I have a backlog of reading to catch up on, so I’ll probably split this into a couple of posts.

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning
H. Brendan McMahan
This is a nice survey on online learning/optimization algorithms that adapt to the data. These are all variants of the Follow-The-Regularized-Leader algorithms. The goal is to provide a more unified analysis of online algorithms where the regularization is data dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you are learning it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve kind of checked in and out with the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?
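
As a concrete instance of the kind of data-dependent regularization the survey covers, here is a sketch of a diagonal AdaGrad-style update in the spirit of Duchi et al. (2011); the toy online least-squares stream is my own example, not something taken from the survey:

```python
import numpy as np

def adagrad_online(grad_fn, d, rounds, eta=0.5, eps=1e-8):
    """Diagonal AdaGrad-style online gradient method: per-coordinate step sizes
    shrink with the accumulated squared gradients, i.e. a data-dependent regularizer."""
    x = np.zeros(d)
    G = np.zeros(d)                        # running sum of squared gradients, per coordinate
    for t in range(rounds):
        g = grad_fn(x, t)
        G += g ** 2
        x -= eta * g / (np.sqrt(G) + eps)  # bigger past gradients -> smaller steps
    return x

if __name__ == "__main__":
    # Toy online least-squares stream: loss_t(x) = (a_t^T x - b_t)^2 / 2.
    rng = np.random.default_rng(0)
    d = 5
    x_true = rng.standard_normal(d)

    def grad_fn(x, t):
        a = rng.standard_normal(d)
        b = a @ x_true
        return (a @ x - b) * a

    x_hat = adagrad_online(grad_fn, d, rounds=3000)
    print("distance to x_true:", np.linalg.norm(x_hat - x_true))
```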

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective
Wee Peng Tay
This is a sort of learning from expert advice problem, though not in the setting in which machine learners would usually consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent (agent 0) and K experts (agents 1, 2, \ldots, K). The agent is trying to do an M-ary hypothesis test. The experts (and the agent) have access to local (private) observations Y_k[1], Y_k[2], \ldots, Y_k[n_k] for k \in \{0,1,2,\ldots,K\}. The observations come from a family of distributions determined by the true hypothesis m. Agent 0 needs to pick one of the K experts to hire — the analogy is that you are an investor picking an analyst to hire. Each expert has its own local loss function C_k which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert — for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the exponents with respect to their local loss functions, and then looking at how agent 0 should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.

arXiv:1403.3862 [math.OC] Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Ji Liu, Stephen J. Wright
This is another paper on lock-free optimization (cf. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector \hat{x} is a version of x that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a bird’s-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.
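
To get some intuition for what an “inconsistent read” means, here is a toy serial simulation (entirely my own construction, not the paper’s algorithm or analysis): each coordinate update is computed from a read vector \hat{x} whose coordinates may come from different, boundedly stale iterates, so \hat{x} itself may never have existed in memory at any single point in time.

```python
import numpy as np

def async_scd_simulation(A, b, steps=20000, max_delay=5, seed=0):
    """Serial simulation of stochastic coordinate descent for (1/2)||Ax - b||^2
    where each update is computed from an "inconsistent read" xhat whose
    coordinates are up to max_delay iterations stale."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.max(np.sum(A ** 2, axis=0))   # conservative coordinate-wise Lipschitz constant
    step = 1.0 / L
    x = np.zeros(d)
    history = [x.copy()]                 # recent iterates, oldest first
    for _ in range(steps):
        j = rng.integers(d)              # coordinate chosen uniformly at random
        # Build an inconsistent read: each coordinate comes from a (possibly different) past iterate.
        delays = rng.integers(0, min(max_delay, len(history)), size=d)
        xhat = np.array([history[-1 - delays[i]][i] for i in range(d)])
        g_j = A[:, j] @ (A @ xhat - b)   # partial gradient evaluated at the stale point
        x[j] -= step * g_j
        history.append(x.copy())
        if len(history) > max_delay + 1:
            history.pop(0)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((100, 20))
    b = A @ rng.standard_normal(20)
    x_hat = async_scd_simulation(A, b)
    print("final residual norm:", np.linalg.norm(A @ x_hat - b))
```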

Banaszczyk’s theorem on convex bodies

I meant to blog about this a while back, but somehow starting a new job and teaching are very time-consuming (who knew?). Luckily, it’s about an older result of Banaszczyk (pronounced bah-nahsh-chik, I think):

Wojciech Banaszczyk. Balancing vectors and gaussian measures of n-dimensional convex bodies. Random Structures & Algorithms, 12(4):351–360, 1998.

This result came to my attention from a talk given by Sasho Nikolov here at Rutgers on his paper with Kunal Talwar on approximating hereditary discrepancy (see Kunal’s post from last year). The result is pretty straightforward to state.

Banaszczyk’s Theorem. There exists a universal constant C such that the following holds. Let A = [a_1\ a_2\ \cdots \ a_n] be an m \times n real matrix such that the i-th column a_i satisfies \|a_i\|_2 \le 1 for i = 1, 2, \ldots, n, and let \mathcal{K} be a convex body in \mathbb{R}^m such that \mathbb{P}( G \in \mathcal{K} ) \ge 1/2, where G \sim \mathcal{N}(0, I_m). Then there exists a vector x \in \{-1,1\}^n such that Ax \in C \cdot \mathcal{K}.

This is a pretty cool result! Basically, if your convex body \mathcal{K} is big enough to capture half of the probability of a standard Gaussian, then after blowing it up by C to get C \cdot \mathcal{K}, for any arbitrary collection of sub-unit-norm vectors \{a_i\} I can find a way to add and subtract them from each other so that the result ends up in C \cdot \mathcal{K}.

I haven’t found a use for this result, but it’s a neat fact to keep in the bucket. Maybe it would be useful in alignment/beamforming schemes? Unfortunately, as far as I can tell he doesn’t tell you how to find this mysterious x, so…
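
For what it’s worth, on tiny instances you can at least check the statement by brute force: take \mathcal{K} to be the Euclidean ball whose radius is the median of \|G\|_2 (so \mathbb{P}(G \in \mathcal{K}) is about 1/2) and search over all \pm 1 signings. This is just a sanity check I cooked up, not anything from the paper:

```python
import numpy as np
from itertools import product

def best_signing(A):
    """Brute force over all x in {-1,+1}^n minimizing ||A x||_2 (tiny n only)."""
    m, n = A.shape
    best_x, best_norm = None, np.inf
    for signs in product([-1.0, 1.0], repeat=n):
        x = np.array(signs)
        norm = np.linalg.norm(A @ x)
        if norm < best_norm:
            best_x, best_norm = x, norm
    return best_x, best_norm

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 5, 12
    A = rng.standard_normal((m, n))
    A /= np.maximum(np.linalg.norm(A, axis=0), 1.0)   # ensure each column has norm at most 1
    # Radius of the Euclidean ball capturing about half the Gaussian mass (empirical median of ||G||).
    radius = np.median(np.linalg.norm(rng.standard_normal((100000, m)), axis=1))
    x, norm = best_signing(A)
    print(f"||Ax|| = {norm:.3f}, median ||G|| = {radius:.3f}, ratio = {norm / radius:.3f}")
```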