We are ending the AMA at this point with over 50 questions answered!

Thanks for the great questions! - Akshay

Thanks all, many good questions. -John

Hi Reddit, we are Microsoft researchers Dr. John Langford and Dr. Akshay Krishnamurthy. Looking forward to answering your questions about Reinforcement Learning!

Proof: Tweet

Ask us anything about:

*Latent state discovery

*Strategic exploration

*Real world reinforcement learning

*Batch RL

*Autonomous Systems/Robotics

*Gaming RL

*Responsible RL

*The role of theory in practice

*The future of machine learning research

John Langford is a computer scientist working in machine learning and learning theory at Microsoft Research New York, of which he was one of the founding members. He is well known for work on the Isomap embedding algorithm, CAPTCHA challenges, Cover Trees for nearest neighbor search, Contextual Bandits (which he coined) for reinforcement learning applications, and learning reductions.

John is the author of the blog hunch.net and the principal developer of Vowpal Wabbit. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor’s degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002.

Akshay Krishnamurthy is a principal researcher at Microsoft Research New York with recent work revolving around decision making problems with limited feedback, including contextual bandits and reinforcement learning. He is most excited about interactive learning, or learning settings that involve feedback-driven data collection.

Previously, Akshay spent two years as an assistant professor in the College of Information and Computer Sciences at the University of Massachusetts, Amherst and a year as a postdoctoral researcher at Microsoft Research, NYC. Before that, he completed a PhD in the Computer Science Department at Carnegie Mellon University, advised by Aarti Singh, and received his undergraduate degree in EECS at UC Berkeley.


MicrosoftResearch157 karma

What advice do you have for aspiring Undergraduates and others who want to pursue research in Reinforcement Learning?

The standard advice is to aim for a PhD. Let me add some details to that. The most important element of a PhD is your advisor(s), with the school a relatively distant second. I personally had two advisors, which I enjoyed---two different perspectives to learn from and two different ways to fund conference travel :-) Nevertheless, one advisor can be fine. Aside from finding a good advisor to work with, it's very good to maximize internship possibilities by visiting various other groups over the summers. Reinforcement Learning is a great topic, because it teaches you the value of exploration.

Aside from these things to do, the most important thing to learn in my experience is how to constructively criticize existing research work. Papers are typically not very good at listing their flaws and you can't fix things you can't see. For research, you need to cultivate an eye for the limitations, most importantly the limitations of your own work. This is somewhat contradictory, because to be a great researcher, you need to both thoroughly understand the limitations of your work and be enthusiastic about it. - John

nwmotogeek152 karma

A lot of the papers I have read are so difficult to follow and understand. What is your strategy for reading and understanding papers?

MicrosoftResearch136 karma

This becomes easier with experience, but it is important to have a solid foundation. - Akshay

xxgetrektxx299 karma

RL results from papers are known to be notoriously hard to reproduce. Why do you think that is, and how can we move towards results that are more feasible to reproduce?

MicrosoftResearch146 karma

There seem to be two issues here. An engineering solution is to export code and environments with all the hyperparameters (say, in a Docker image), so that someone else can grab the image and run the code to exactly reproduce the plots in the paper. But this is a band-aid that is covering up a more serious issue: Deep RL algorithms are notoriously unstable and non-robust (a precursor problem is that DL itself is not very robust). Naturally this has an effect on reproducibility, but it also suggests that these methods have limited real-world potential. The way to address both of these issues is to develop more robust algorithms. -Akshay

NeedzRehab96 karma

What do you believe about Stephen Hawking's suggestion that machine learning and AI would be the greatest threat that humanity faces?

MicrosoftResearch138 karma

The meaning of "human" is perhaps part of the debate here? There is much more that I-as-a-human can accomplish with a computer and an internet connection than I-as-a-human could do without. If our future looks more like man/machine hybrids that we choose to embrace, I don't fear that future. On the other hand, we have not yet really seen AI-augmented warfare, which could be transformative in the same sense as nuclear or biological weapons. Real concerns here seem valid but it's a tricky topic in a multipolar world. One scenario that I worry about less is the 'skynet' situation where AI attacks humanity. As far as we can tell research-wise, AI never beats crypto. -John

MicrosoftResearch44 karma

I might be an optimist but I like to think ML/AI and technology more broadly can create great value for humanity (technology arguably already has). Of course there are concerns/challenges/dangers here, but it seems to me like climate change is a much greater threat that is looming much more ominously on the horizon. - Akshay

MicrosoftResearch43 karma

What are some notable lesser-known applications of reinforcement learning?

Well, "the internet" is a little snarky, but there is some truth to it.  Much of the internet runs off targeted advertising (as opposed to blanket advertising).  It annoys me, so I use ad blockers all the time and prefer subscription based models.  Nevertheless, targeted advertising is obviously a big deal as a business model that powers much of the internet.   You should assume that any organization doing targeted advertising is doing a form of reinforcement learning.   Another category is 'nudging' applications.  How do you best encourage people to develop healthy habits around exercise for example?  There are quite a few studies suggesting that a reinforcement approach is helpful, although I'm unclear on the state of deployment. -John

-Ulkurz-30 karma

Thank you for doing this AMA! My question is around applying RL for real-world problems. As we already know, oftentimes it's difficult to build a simulator or a digital twin for most real-world processes or environments, which kind of nullifies the idea of using online RL.

But this is where offline/batch RL can be helpful in terms of using large datasets collected via some process, from which a policy can be learned offline. We've already seen a lot of success in a supervised learning setting where an optimal model is learned offline from large volumes of data.

Although there has been a lot of fundamental research around offline/batch RL, I have not seen much real-world applications. Could you please share some of your own experiences around this, if possible, with some use cases related to the application of batch/offline RL in the real-world? Thanks!

MicrosoftResearch10 karma

One of the previous answers seems very relevant here---I view real world reinforcement learning as something that exists as of 10 years ago and is routinely available today (see http://aka.ms/personalizer ).

With regards to the strategy of learning in a simulator and then deploying in the real world, the bonsai project https://www.microsoft.com/en-us/ai/autonomous-systems-project-bonsai?activetab=pivot%3aprimaryr7 is specifically focused on this. -John

SorrowInCoreOfWin30 karma

How would you deal with the states that are underrepresented in the dataset (especially in offline RL)? Any strategies to emphasize learning in those states instead of just throwing them away?

MicrosoftResearch26 karma

I've found that memorization approaches become more useful the fewer examples you have.   Other than that, I know that many offline RL approaches simply try to learn policies that avoid unknown regions. -John

ihatemyself32112326 karma

How would you recommend getting started in implementing ML programs for someone who doesn't necessarily want to go into research but is more interested in the functional aspect of programming it? Would a PhD still be a requirement? A masters? Or would you say experience counts just as much?

MicrosoftResearch45 karma

This depends a great deal on what you want to do programming-wise. If the goal is implementing things so that other people can use them (i.e. software engineering), then little background is needed as long as you can partner with someone who understands the statistical side.

If the goal is creating your own algorithms, then it seems pretty essential to become familiar with the statistical side of machine learning. This could be an undergrad-level course or there are many online courses available. For myself, I really enjoyed Yaser Abu-Mostafa's course as an undergrad---and this course is online now. Obviously, some mastery of the programming side is also essential, because ML often pushes the limits of hardware and embedding ML into other systems is nontrivial due to the stateful nature of learning processes. -John

MicrosoftResearch21 karma

There are so many methods in RL and there is little theoretical understanding of why they work and why they don't. What is the best way to solve this problem? How to get a job in MSR as a masters student working on RL in robotics?

This is why we're working on the theory =) But there are a couple of issues here. If you're talking about Deep-RL, well, deep supervised learning itself already has this issue to some (lesser) extent. Even in the supervised setting my sense is that there is a lot of art/intuition in getting large neural networks to work effectively. This issue is only exacerbated in the RL context, due to poor exploration, bootstrapping, and other issues.

On the other hand, my experience is that the non-deep-RL methods are extremely robust, but the issue is that they don't scale to large observation spaces. I have a fun story here. When this paper (https://arxiv.org/abs/1807.03765) came out, I implemented the algorithm and ran it on an extremely hard tabular exploration problem. The first time I ran it, with no tuning, it just immediately found the optimal policy. Truly incredible!

In my opinion the best way to solve this problem is to develop theoretically principled RL methods that can leverage deep learning capabilities. Ideally this would make it so that Deep-RL is roughly as difficult to get working as DL for supervised learning, but we're not quite there yet. So while we are cooking on the theory, my advice is to try to find ways to leverage the simpler methods as much as possible. For example, if you can hand-code a state abstraction (or a representation) using domain knowledge about your problem and then use a tabular method on top of it, this might be a more robust approach. I think something like this is happening here: https://sites.google.com/view/keypointsintothefuture/home.
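
To make that concrete, here is a minimal sketch of running a tabular method on top of a domain-specific state abstraction. The hand-coded abstraction `phi` and the gym-style environment interface are hypothetical stand-ins for whatever domain knowledge and simulator you actually have:

```python
import random
from collections import defaultdict

def phi(observation):
    # Hypothetical hand-coded state abstraction: use domain knowledge to
    # map a rich observation to a small discrete state, e.g. coarse agent
    # coordinates in a navigation task.
    return (round(observation["x"]), round(observation["y"]))

def tabular_q_learning(env, episodes=500, alpha=0.5, gamma=0.99, eps=0.1):
    # env is assumed to expose num_actions, reset() -> obs, and
    # step(action) -> (obs, reward, done).
    Q = defaultdict(float)
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            s = phi(obs)
            if random.random() < eps:  # epsilon-greedy over abstract states
                a = random.randrange(env.num_actions)
            else:
                a = max(range(env.num_actions), key=lambda act: Q[s, act])
            obs, r, done = env.step(a)
            best_next = 0.0 if done else max(
                Q[phi(obs), act] for act in range(env.num_actions))
            Q[s, a] += alpha * (r + gamma * best_next - Q[s, a])
    return Q
```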

On the job front, at MSR we rarely hire non-PhDs. So my advice would be to go for a PhD =) - Akshay

MicrosoftResearch12 karma

Is anyone at MSR seriously pursuing AGI and/or RL as a path to AGI?

It depends on what you mean by 'serious'. If you mean something like "giant models with zillions of parameters in an OpenAI style", yes there is work going on around that, although it tends to be more product-focused. If you mean something like "large groups of people engage in many deep philosophical discussions every day", not that I'm aware of. There are certainly some discussions ongoing though. If you mean something like "leading the world in developing AI", then I'd say yes and point at the personalizer service (http://aka.ms/personalizer ) which is pretty unique in the world as an interactive learning system. My personal belief is that the right path to AI is via developing useful systems capable of addressing increasingly complex classes of problems. Microsoft is certainly in the lead for some of these systems, so I regard Microsoft as very "serious". I expect you'll agree if you look past hype towards actual development paths. - John

MicrosoftResearch12 karma

The vast majority of RL papers benchmark on games or simulations. In your opinion what are the most impressive real world applications of RL? Let's exclude bandit stuff.

I really like the Loon project (https://psc-g.github.io/posts/research/rl/loon/), although Google recently discontinued the Loon effort entirely. Emma Brunskill's group has also done some cool work on using RL for curriculum planning in tutoring systems (http://grail.cs.washington.edu/projects/ordering/). There are also many examples in robotics, e.g., from Sergey Levine's group. The overarching theme is that these things take a lot of effort. - Akshay

MicrosoftResearch11 karma

Multi-agent RL seems to be a big part of the work that's being done at Microsoft and I've seen there's been a deep dive into complex games that feature multi-agent exploration or cooperation. While this is surely fascinating, it seems to me that the more complicated the environments, the more specific the solutions found by the agents are, which makes it difficult to extract meaningful information about how agents cooperate in general or how they develop behaviour and its relevance in the real world. Since the behaviours really are driven heavily by what types of interactions are even allowed in the first place, how much information can we really extract from these multi-agent games that is useful in the real world?

I think we will look back on our present state of knowledge for how to cooperate and consider it rather naive and simplistic. We obviously want generally applicable solutions and generally applicable solutions are obviously possible (see many social animals as well as humans as examples). As far as the path here, I'm not sure. Games may be a part of the path there, because they form a much safer/easier testbed than real life. It seems likely to me that games will not be the only element on that path, because cooperation is not a simple problem easily addressed by a single approach. - John

Berdas_10 karma

Hey guys, thank you for the contributions to the RL field, much appreciated!

I'm a ML engineer and we're trying to implement Contextual Bandits (and Conditional Contextual Bandits) in our personalization pipeline using VowpalWabbit.
What advice/recommendations do you have for someone in my position? Also, what are the most important design choices when thinking about the final, online pipeline?

Thank you!

MicrosoftResearch4 karma

Could you use aka.ms/personalizer? That uses VW (you can change the flags), and it has all the infrastructure necessary including dropping the logs into your account for you to play with.

My experience here is that infrastructure matters hugely. Without infrastructure you are on a multi-month odyssey trying to build it up and fix nasty statistical bugs. With infrastructure, it's a pretty straightforward project where you can simply focus on the integration and data science. - John
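
As a rough sketch of what such an integration can look like (the flags and the `action:cost:probability` label format below follow VW's contextual-bandit interface as we understand it; treat the details as assumptions to check against the VW documentation):

```python
import random
from vowpalwabbit import pyvw  # pip install vowpalwabbit

# Two actions with epsilon-greedy exploration; --quiet suppresses logging.
vw = pyvw.vw("--cb_explore 2 --epsilon 0.1 --quiet")

def choose(context):
    # With --cb_explore, predict() returns a probability for each action.
    pmf = vw.predict(f"| {context}")
    action = random.choices(range(1, len(pmf) + 1), weights=pmf)[0]
    return action, pmf[action - 1]

def learn(action, cost, prob, context):
    # Contextual-bandit label format: action:cost:probability | features
    vw.learn(f"{action}:{cost}:{prob} | {context}")

context = "user_segment=3 hour=20"  # illustrative context features
action, prob = choose(context)
cost = 0.0 if action == 1 else 1.0  # stand-in for the observed feedback
learn(action, cost, prob, context)
```

Logging the chosen action, the probability it was chosen with, and the context alongside the observed feedback is exactly the infrastructure point above: get that pipeline wrong and the statistics silently break.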

Own-Pattern81028 karma

Will it be possible to develop an artificial consciousness similar to human consciousness in digitized AI structures, particularly if those structures digitally rebuild artificial versions of the neurons and the entire central nervous system of humans?

MicrosoftResearch11 karma

One of the paths towards AI that people speculate about is simply reading off a brain and then simulating it. I'm skeptical about this approach because it seems very difficult, in an engineering sense, to accurately read the brain (even in a destructive fashion) at that level of detail. The state of the art in brain reading is presently many, many orders of magnitude less information than that. -John

MicrosoftResearch6 karma

There have been nice theory works recently on exploration in RL, particularly with policy gradient methods. Are these theoretical achievements ready to be turned into practical algorithms? Are there particular domains or experiments that would highlight how these achievements are impactful beyond the typical hard exploration problems, e.g., Kakade's chain and the combination lock?

There's a large spectrum in terms of how theory ideas make their way into practice, so there is some subjectivity here. On one hand, you could argue that count-based exploration (which has been integrated with Deep-RL) is already based on well-studied and principled theory ideas, like the E3 paper. I think something similar is true for the Go-Explore paper. But for keeping very close-to-the-theory, I think we are getting there. We have done some experiments with, e.g., Homer, on visual navigation type problems and seen some success. PC-PG has been shown to work quite well in continuous control settings and navigation settings (in the paper) and I think Mikael and Wen have run some experiments on Montezuma's revenge. So we're getting there and this is something we are actively pursuing in our group.

As far as domains or experiments, our experience from contextual bandits suggests that better exploration improves sample efficiency in a wide range of conditions (https://arxiv.org/abs/1802.04064), so I am hopeful we can see something similar in RL. As far as existing benchmarks, the obvious ones are Montezuma's revenge, Pitfall and the harder Atari games, as well as visual navigation tasks where exploration is quite critical. (For Homer and PC-PG, our group has done experiments on harder variations on the combination lock.) - Akshay

MicrosoftResearch5 karma

Different research groups have very different strengths, what would you say is the forte of MSR in terms of RL research?

Microsoft has two RL strengths at present: the strongest RL foundations research group in the world and the strongest RL product/service creation strategy in the world.   There is quite a bit more going on from the research side.  I'd particularly point out some of the Xbox games RL work, which seems to be uniquely feasible at Microsoft.  There are gaps as well of course that we are working to address. -John

payne7475 karma

Does u/thisisbillgates ever wander around the offices wondering what people are up to these days?

MicrosoftResearch11 karma

Well, both of us are in the New York City lab, so even if he were, we wouldn't see him too much. But we do have a yearly internal research conference (in non-pandemic years) that he attends and we have discussed our RL efforts and the personalizer service with him. -Akshay

MicrosoftResearch5 karma

Thank you so much for doing this AMA! Contextual bandits are clearly of great practical value, but the efficacy and general usefulness of deep RL is still an area fraught with difficulty. What, in your opinion, are the most practically useful parts of deep RL? Do you have any examples?

There are two dimensions to think about here. One is representational complexity---is it a simple linear model or something more complex? The other is the horizon---how many actions must be taken before a reward is observed? Representational complexity alone is something that deep learning has significantly tackled, and I've seen good applications of complex representations + shallow-to-1 horizon reinforcement learning.

Think of this as more-complex-than-the-simplest contextual bandit solutions.  Longer time horizon problems are more difficult, but I've seen some good results with real world applications around logistics using a history-driven simulator. -John

Jemoka4 karma

It seems like RL (or, for the matter, ML) models in general could sometimes be variable and uncontrolled in performance; what are some metrics (beyond good ol' machine validation) that y'all leverage to ensure that the model's performance is "up-to-par" especially in high-stakes/dangerous situations like the medical field or the financial sector?

MicrosoftResearch6 karma

In many applications, RL should be thought of as the "decision-maker of last resort". For example, in a medical domain, having an RL agent prescribe treatments seems like a catastrophically bad idea, but having an RL agent choose amongst treatments prescribed by multiple doctors seems potentially more viable.

Another strategy which seems important is explicitly competing with the alternative. Every alternative is fundamentally a decision-making system, and so RL approaches that guarantee competition with an arbitrary decision-making system provide an important form of robustness. - John

Bulky_Wurst3 karma

AI and ML are 2 different things. But to the observer, they seem basically the same thing (at least in my experience).

Where do you see the difference in real life applications of AI and ML?

MicrosoftResearch7 karma

I think the difference between AI and ML is mostly a historical artifact of the way research developed. AI research originally developed around a more ... platonic? approach where you try to think about what intelligence means and then create those capabilities. This included things like search, planning, SOAR, logic, etc... with machine learning considered perhaps one of those approaches.

As time has gone on machine learning has come to be viewed as more foundational---yes these other concerns exist, but they need to be addressed in a manner consistent with machine learning.   So, the remaining distinction (if there is one) is mostly about the solution elements: is it squarely in the "ML" category or does it incorporate other AI elements?  Or is it old school no-ML AI? Obviously, some applications are amenable to some categories of solution more than others. - John

MicrosoftResearch3 karma

Can you think of any applications of bandits (contextual or not) in the Oil & Gas/Manufacturing industry? I'm not thinking about recommender systems or A/B testing for websites - such companies have very few customers, which are themselves other companies. So the setting is very different with respect to a web company, for example, which has a huge crowd of individual customers. But bandits are such a beautiful framework 🙂 that I'd love to find an application for them in such a context. Any suggestions?

Almost certainly there are, although I am not super familiar with the industry (as John wrote elsewhere here, RL is a fundamental, essentially universal problem of optimizing for value). One nice application of RL more generally is in optimizing manufacturing pipelines and Microsoft has some efforts in this direction.

I have also seen this toy experiment (https://arxiv.org/pdf/1910.08151.pdf section 7.3) where an RL algorithm is used to make decisions about where to drill for oil, but I'm not sure how relevant this actually is to the industry. Bandit techniques are also pretty useful in pricing problems (they share many similar elements), so maybe one can think about adjusting prices in some way based on contextual information? Here is one recent paper we did on this topic if you are interested (https://arxiv.org/abs/2002.11650). -Akshay

ks19103 karma

How will the advent of quantum computing affect the way we do ML & AI?

MicrosoftResearch2 karma

I expect relatively little impact from quantum computing. Some learning problems may become more tractable with perhaps a few becoming radically more tractable. -John

deadlyhausfrau3 karma

What steps are you taking to prevent human biases from affecting your algorithms, to test whether they have, and to mitigate any biases you find developing?

What advice would you give others on how to account for biases?

MicrosoftResearch1 karma

One obvious answer is "research".  See for example this paper: https://arxiv.org/abs/1803.02453 which helped shift the concept of fair learning from per-algorithm papers to categories.  I regard this as far from solved though.   As machine learning (and reinforcement learning) become more important in the world, we simply need to spend more effort addressing these issues. -John

MicrosoftResearch3 karma

Hi, I am asking this from the perspective of an undergraduate student studying machine learning. I have worked on a robotics project using RL before but all the experimentation in that project involved pre-existing algorithms. I have a bunch of related questions and I do apologise if it might be a lot to get through. I am curious about how senior researchers in ML really go about finding and defining problem statements to work on? What sort of intuition do you have when deciding to try and solve a problem using RL over other approaches? For instance I read your paper on CATS. While I understood how the algorithm worked, I would never have been able to think of such proofs before actually reading them in the paper. What led you to that particular solution? Do you have any advice for an undergraduate student to really get to grips with the mathematics involved in meaningful research that helps move a field forward or really produces new solutions and algorithms?

  • Finding problems: For me, in some cases there is a natural next step to a project. A good example here is PCID (https://arxiv.org/abs/1901.09018) -> Homer (https://arxiv.org/abs/1911.05815). PCID made some undesirable assumptions so the natural next step was to try to eliminate those. In other cases it is about identifying gaps in the field and then iterating on the precise problem formulation. Of course this requires being aware of the state of the field. For theory research this is a back-and-forth process: you write down a problem formulation and then prove it's intractable or find a simple/boring algorithm, then you learn about what was wrong with the formulation, allowing you to write down a new one.

  • When to use RL: My prior is you should not use "full-blown" RL unless you have to and, when you do, you should leverage as much domain knowledge as you can. If you can break long-term dependencies (perhaps by reward shaping) and treat the problem like a bandit problem, that makes things much easier. If you can leverage domain knowledge to build a model or a state abstraction in advance, that helps too.

  • CATS was a follow-up to another paper, where a lot of the basic techniques were developed (a good example of how to select a problem, as the previous paper had an obvious gap of computational intractability). A bunch of the techniques are relatively well-known in the literature, so perhaps this is more about learning all of the related work. As is common, each new result builds on many many previous ideas, so having all of that knowledge really helps with developing algorithms and proofs. The particular solution is natural (a) because epsilon-greedy is simple and well understood, (b) because tree-based policies/classifiers have very nice computational properties, and (c) because smoothing provides a good bias-variance tradeoff for continuous action spaces.

  • Getting involved: I would try to read everything, starting with the classical textbooks. Look at the course notes in the areas you are interested in and build up a strong mathematical foundation in statistics, probability, optimization, learning theory, information theory etc. This will enable you to quickly pick up new mathematical ideas so that you can continue to grow. -Akshay

TheLastGiant3 karma

What field is possibly booming for AI applications in the future?

MicrosoftResearch2 karma

All of them.  This might sound like snark, but consider: what field benefits from computers? - John

TechnicalFuel23 karma

Hello, perhaps this is a slight bit off-topic, but I was wondering what your favorite films of all time are, and if those had any bearing on your careers?

MicrosoftResearch3 karma

I loved Star Wars when I was growing up. It was lots of fun. I actually found reading science fiction books broadly to be more formative---you see many different possibilities for the future and learn to debate the merits of different ones. This forms some foundation for thinking about how you want to change the future. -John

moldywhale2 karma

Can you describe the sorts of problems one could expect to solve/work on if they worked in Data Science at MS?

MicrosoftResearch1 karma

"All problems" is the simple answer in my experience. Microsoft is transforming into a data-driven company which seeks to improve everything systematically.  The use of machine learning is now pervasive.

shepanator2 karma

How do you detect & prevent over-fitting in your ML models? Do you have generic tests that you apply in all cases, or do you have to develop domain specific tests?

MicrosoftResearch3 karma

I mostly have worked in online settings where there is a neat trick: you evaluate one example ahead of where you train. This average evaluation ("progressive validation") deviates like a test set while still allowing you to benefit from it for learning purposes. In terms of tracking exactly what the performance of a model is, we typically use confidence intervals, which are domain-independent. Finding the best confidence intervals is an important area of research (see https://arxiv.org/abs/1906.03323 ). -John
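
A minimal sketch of that trick, assuming a hypothetical online learner with predict/learn methods:

```python
def progressive_validation(learner, stream):
    # Score each example with the current model *before* training on it;
    # the running average then deviates like held-out test error while
    # every example is still used for learning.
    total_loss, n = 0.0, 0
    for x, y in stream:
        prediction = learner.predict(x)      # evaluate first...
        total_loss += (prediction - y) ** 2  # squared loss, for example
        n += 1
        learner.learn(x, y)                  # ...then train on the example
    return total_loss / max(n, 1)
```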

MasterAgent472 karma

[1] I implemented RL for pacman and it was pretty fun! Just curious, why are researchers interested in gaming RL? [2] Are there any papers you'd recommend that cover recent efforts to make RL more explainable?

MicrosoftResearch1 karma

  1. Nice! I did the same thing in my undergrad AI course, definitely very fun =) Gaming is a huge business for Microsoft and gaming is also one of the main places where (general) RL has been shown to be quite successful, so it is natural to think about how RL can be applied to improve the business.

  2. If by explainable you mean that the agent makes decisions in some interpretable way, I don't know too much, but maybe this paper is a good place to start (https://arxiv.org/abs/2002.03478). If by explainable you mean accessible to you to understand the state of the field, I'd recommend this monograph (https://rltheorybook.github.io/) and checking out the tutorials in the ML conferences. -Akshay

livinGoat2 karma

How much of the research done on bandit problems is useful in practice? Every year there are a lot of papers published on this topic with small variations to existing settings. Seb Bubeck wrote in a blog post that at some point he thought there was not much left to do in bandits, however new ideas keep arising. What do you see as future direction that could be relevant in practice? What do you think about the model selection problem in contextual bandits?

MicrosoftResearch2 karma

Thanks for the question!

  • Things can be useful for at least two reasons. One is that it can introduce new ideas to the field even if the algorithm is not directly useful in practice. The other is that the algorithm or the ideas are directly useful in practice. Obviously I cannot comment on every paper, but there are definitely still some new ideas appearing in the bandit literature and I do think understanding the bandit version of a problem is an important pre-requisite for addressing the RL problem. There is also definitely some incremental work, but this seems true for many fields. I am sympathetic though, since it is very hard to predict what research will be valuable in advance.

  • Well, I love the model selection problem and I think it is super important. It's a tragedy that we do not know how to do cross validation for contextual bandits. (Note that cross validation is perhaps the most universal idea in supervised learning, arguably more so than GD/SGD.) And many real problems we face with deployments are model selection problems in disguise. So I definitely think this is relevant to practice and would be thrilled to see a solution. -Akshay

MicrosoftResearch2 karma

"How do you view the marginal costs and tradeoffs incurred by specifying and implementing 1) more complicated reward functions/agents and 2) more complicated environments? Naturally it depends on the application, but in your experience have you found a useful abstraction when making this determination conditioned on the application?"

I'm somewhat hardcore in that it's hard for me personally to be interested in artificial environments, so I basically never spend time implementing them. When something needs to be done for a paper, either taking existing environments or some mild adaptation of existing datasets/environments (with a preference for real-world complexity) are my go-to approaches. This also applies to rewards---I want reward feedback to be representative of a real problem.

This hardcore RL approach means that often we aren't creating slick-but-fragile demos. Instead, we are working to advance the frontier of consistently solvable problems. W.r.t. agents themselves, I prefer approaches which I can ground foundationally. Sometimes this means 'simple' and sometimes 'complex'. At a representational level, there is quite a bit of evidence that a graduated complexity approach (where complexity grows with the amount of data) is helpful. - John

thosehippos2 karma

On the note of exploration: Even if we were able to get provably correct exploration strategies from tabular learning (like R-max) to work in function approximation settings, it seems like the number of states to explore in a real-ish domain is too high to exhaustively explore. How do you think priors play into this, especially with respect to provability and guarantees?

Thanks!

MicrosoftResearch5 karma

A few comments here:

  • Inductive bias does seem quite important. This can come in many forms like a prior or architectural choices in your function approximator.

  • A research program we are pushing involves finding/learning more compact latent spaces in which to explore. Effectively the objects the agent operates on are "observations" which may be high dimensional/noisy/too-many-to-exhaustively-explore, etc., but the underlying dynamics are governed by a simpler "latent state" which may be small enough to exhaustively explore. The example is a visual navigation task. While the number of images you might see is effectively infinite, there are not too many locations you can be in the environment. Such problems are provably tractable with minimal inductive bias (see https://arxiv.org/abs/1911.05815).

  • I also like the Go-Explore paper as a proof of concept w.r.t. state abstraction. In the hard Atari games like Montezuma's revenge and Pitfall, downsampling the images yields a tractable tabular problem. This is a form of state abstraction. The point is that there are not-too-many downsampled images! -Akshay
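
As a toy sketch of that last point (the grid size, gray levels, and bonus form below are illustrative assumptions, not the Go-Explore settings), downsampled frames can serve as hashable tabular states for count-based exploration:

```python
import numpy as np
from collections import Counter

visit_counts = Counter()

def abstract_state(frame, cells=8, levels=8):
    # Downsample a grayscale frame (H, W array with values 0-255) to a
    # tiny quantized grid, in the spirit of Go-Explore's cell idea.
    h, w = frame.shape
    frame = frame[: h - h % cells, : w - w % cells]
    small = frame.reshape(cells, h // cells, cells, w // cells).mean(axis=(1, 3))
    return ((small / 256) * levels).astype(int).tobytes()  # hashable key

def exploration_bonus(frame):
    # Rarely-seen abstract states earn a larger novelty bonus.
    s = abstract_state(frame)
    visit_counts[s] += 1
    return 1.0 / np.sqrt(visit_counts[s])
```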

MicrosoftResearch2 karma

Ok, I'll bite: What is "Responsible reinforcement learning"? What is "Strategic exploration"? Are you using Linux? :))))

From last to first: I (Akshay) use OS X and I think John uses Linux with a Windows VM.

Strategic exploration was this name we cooked up to mean roughly "provably sample efficient exploration." We wanted to differentiate from the empirical work on exploration which sometimes is motivated by the foundations, but typically does not come with theoretical guarantees. Strategic is supposed to evoke the notion that the agent is very deliberate about trying to acquire new information. This is intended to contrast with more myopic approaches like Boltzmann exploration or epsilon-greedy. One concern with the adjective is that strategic often means game-theoretic in the CS literature, which it does not in this context.

Responsible reinforcement learning is about integrating principles of fairness, accountability, transparency, and ethics (FATE) into our RL algorithms. This is of utmost importance when RL is deployed in scenarios that impact people and society, which I would argue is a very common case. We want to ensure that our decision making algorithms do not further systemic injustices, inequities, and biases. This is a highly complex problem and definitely not something I (Akshay) am an expert in, so I typically look to my colleagues in the FATE group in our lab for guidance on these issues. -Akshay

Human-Sugar18552 karma

How close are we to having home robots that can function almost as well as a human companion? Like just having someone/thing to talk to that could sustain a natural conversation.

MicrosoftResearch3 karma

Quite far in my view. The existing systems that we have (like GPT3) are sort of intelligent babblers. To have a conversation with someone, there really needs to be a persistent state / point of view with online learning and typically some grounding in the real world. There are many directions of research here which need to come to fruition. -John

spidergeorge2 karma

Is reinforcement learning suited to only certain types of problems or could it be used for computer vision or natural language processing?

I have used RL as part of the Unity ML agents package which makes it easy to make game AI with using RL but haven't seen many other use cases.

MicrosoftResearch2 karma

I think of RL as a way to get information for the purpose of learning. Thus, it's not associated with any particular domain (like vision), and is potentially applicable in virtually all domains. W.r.t. vision and language in particular, there is a growing body of work around 'instruction following' where agents learn to use all of these modalities together to accomplish a task, often with RL elements. -John

MicrosoftResearch1 karma

"Akshay your MINERVA integrated knowledge bases with RL https://arxiv.org/abs/1711.05851 Do you see that as promising going forward, and can you comment about progress in that direction since?"

I haven't really tracked the KB QA field too carefully since that paper. But I talked to Manzil Zaheer recently and he told me that "non-RL" methods are currently doing much better than MINERVA in these problems. Perhaps the reason is that this is not really an RL problem; RL is just used as a computationally-friendly form of graph search. But the graph is completely known in advance, so this is largely about computational efficiency. Indeed, Manzil told me that a "template matching" approach is actually better, but computational tricks are required to scale this up (https://www.aclweb.org/anthology/2020.findings-emnlp.427.pdf). To that end, I'm inclined to say that non-RL methods will dominate here. -Akshay

MicrosoftResearch1 karma

Best free resources to learn RL for games? (like chess)

I first learned about this stuff from Dan Klein's AI course at UC Berkeley which I took back in 2008 (here is a more recent iteration https://inst.eecs.berkeley.edu/~cs188/fa18/staff.html). The basic principles are quite well established and you can try things out yourself on much smaller games, like tic-tac-toe as a baby example. (I really enjoy learning by doing.) -Akshay

AbradolfLinclar1 karma

How do you deal with a machine learning task for which the data is not available or hard to get per se?

MicrosoftResearch2 karma

The practical answer is that I avoid it unless the effort of getting the data is worth the difficulty. Healthcare is notorious here because access to data is both very hard and potentially very important. -John

ndtquang1 karma

What is latent state discovery and why do you think it is important in real-world RL?

MicrosoftResearch1 karma

Latent state discovery is an approach for getting reinforcement learning to provably scale to complex domains. The basic idea is to decouple the dynamics, which are determined by a simple latent state space, from an observation process, which could be arbitrarily complex. The natural example is a visual navigation task: there are far fewer locations in the world than visual inputs you might see at those locations. The "discovery" aspect is that we don't want to know this latent state space in advance, so we need to learn how to map observations to latent states if we want to plan and explore. Essentially this is a latent dynamics modeling approach, where we use the latent state to drive exploration (such ideas are also gaining favor in the Deep-RL literature).

The latent state approach has enabled us to develop essentially the only provably efficient exploration methods for such complex environments (using arbitrary nonlinear function approximation). In this sense, it seems like a promising approach for real world settings where exploration is essential. -Akshay

neurog33k1 karma

Hi! Thanks for doing this AMA.

What is the status of Real World RL? What are the practical areas that RL is being applied to in the real world right now?

MicrosoftResearch1 karma

There are certainly many deployments of real world RL.  This blog post: https://blogs.microsoft.com/ai/reinforcement-learning/ covers a number related to work at Microsoft.  In terms of where we are, I'd say "at the beginning".  There are many applications that haven't even been tried, a few that have, and lots of room for improvement. -John

MicrosoftResearch1 karma

A commonly cited example of where one could use reinforcement learning is in the space of self-driving cars. It seems, at first, like a reasonable idea since this can easily be seen as a sequence of decisions that need to be made at every timestep, but we are still far away from self-driving cars being controlled by end-to-end reinforcement learning systems. Instead, these systems seem to be made up of many smaller machine learning models that don't necessarily even use any reinforcement learning and focus primarily on aspects of computer vision and favour other models for making decisions. The question here is how far away do you think we are from having actual end-to-end systems which are controlled by reinforcement learning and what do you think is the key advancement that will take us there?

Actual end-to-end systems controlled by RL have existed for over a decade (see http://arxiv.org/abs/1003.0146 ). These days, you can set up your own in a few minutes (see http://aka.ms/personalizer). Of course, these are operating at a different level of complexity than a self-driving car. When will we have a self-driving car level of complexity in end-to-end RL agents? There are some serious breakthroughs required.

It's difficult to imagine success without addressing model-based reinforcement learning much more effectively than we have done so far.   On top of that some form of model-based risk-aversion is required.  Cooperation is also a key element of car movement which is very natural for humans and required for any kind of self-driving car mass deployment.  A fourth element is instructability and to some degree comprehensibility.   When will all of this come together in a manner which is better than more engineered approaches?  I'm not sure, but this seems pretty far out in the decade+ category. -John

dumbjock251 karma

Hello, do you have any events in New York? I've been teaching myself for the last couple years on ML and AI theory and practice but would love to accelerate my learning by working on stuff (could be for free). I have 7 years of professional programming experience and work as a lead for a large financial company.

MicrosoftResearch1 karma

Well, we have "Reinforcement Learning day" each year. I'm really looking forward to the pandemic being over because we have a beautiful new office at 300 Lafayette---more might start happening when we can open up. -John

greyasshairs1 karma

Can you share some real examples of how your work has made its way into MS products? Is this a requirement for any work that happens at MSR or is it more like an independent entity and is not always required to tie back into something within Microsoft?

MicrosoftResearch1 karma

A simple answer is that Vowpal Wabbit (http://vowpalwabbit.org ) is used by the personalizer service (http://aka.ms/personalizer ). Many individual research projects have impacted Microsoft in various ways as well. However, many research projects have not. In general, Microsoft Research exists to explore possibilities. Inherent in the exploration of possibilities is the discovery that many possibilities do not work. - John

CodyByTheSea1 karma

How is ML/AI improving Microsoft product? Is it applied outside of Microsoft and benefiting the society as a whole? Thank you

MicrosoftResearch2 karma

There isn't a simple answer here, but to a close approximation I think you should imagine that ML is improving every product, or that there are plans / investigations around doing so. Microsoft's mission is to empower everyone, so "yes" with respect to society as a whole. Obviously people tend to benefit more directly when interacting with the company, but not even that is necessary. For example, Microsoft has supported public research across all of computer science for decades. -John

MicrosoftResearch1 karma

How do you expect RL to evolve in the next years?

There are several threads of "RL". On the product track, I expect more and more application domains to be addressed via RL because the fundamental nature of "make decisions to optimize performance" simply matches problems better than other formulations. On the research side, I expect serious breakthroughs in model-based approaches which can be very useful in robotics-like domains which are highly stateful. I also expect serious breakthroughs in human interaction domains where the goal is interpreting and acting on what someone wants. -John

MicrosoftResearch1 karma

Is RL in the real world limited today to problems where you can generate infinite data (e.g., games) and where failure is not costly/risky (e.g., not autonomous driving)? Or can it be applied also in other contexts? Would it be applicable to optimization of a sequential manufacturing process? For example, Additive Manufacturing is sequential by its own nature (it proceeds in layers). How would you go about applying RL to such a problem? Finally, Sutton & Barto is probably the most widely recommended reference for RL, even though its coverage of some topics such as Deep RL or offline (not off-policy) RL is seriously lacking. Which other references would you recommend?

It is applicable in manufacturing settings and Microsoft has some efforts in this direction (https://www.microsoft.com/en-us/ai/autonomous-systems-project-bonsai). The current approach Bonsai is taking is a simulation based approach, which leverages the fact that manufacturing pipelines are typically highly controlled environments, thus making it easier to build a simulator. Once you have a high-fidelity simulator, you are back into the "infinite data" case. (The challenge of course is to build high-fidelity simulators in a scalable manner, and Bonsai has some techniques and infrastructure to make this easier.)

Well, Sutton and Barto is classical and hence a somewhat old reference. It also doesn't cover many of the advances on the theoretical foundations of RL. For theory, I recommend the unpublished monograph of Agarwal-Jiang-Kakade-Sun (https://rltheorybook.github.io/). John, Alekh and I also did a tutorial for FOCS on the theoretical foundations (https://hunch.net/~tforl/). Deep-RL is a very new topic so I don't think any books have come out yet (Books often lag a few years behind the field and this field is moving extremely fast). For this I would recommend scanning the tutorials at the ML conferences (e.g., https://icml.cc/2016/tutorials/deep_rl_tutorial.pdf, which already seems outdated!) - Akshay

Knightmaster85021 karma

How do you think that reinforcement learning will affect gaming in the future? Will there be super smart NPC's that act almost like a player that fit into a particular world or do you think that A.I will be implemented differently?

MicrosoftResearch1 karma

I enjoyed watching Terminator too, but I find it unrealistic.  Part of this is simply because we are a long ways off from actually being able to build that kind of intelligence.   You see this more directly when you are working on the research first-hand.  It's also unrealistic because AI doesn't beat crypto---as far as we can tell super-intelligence doesn't mean the ability to hack any system.   

Given these things, I think it's more natural to be concerned about humans destroying the world.   Another aspect to consider here is AI salvation. How do you manage interstellar travel and colonisation? Space is incredibly inhospitable to humans and the timescales involved are outrageous on a human lifespan, so a natural answer is through AI. - John

MicrosoftResearch1 karma

There are some efforts in this direction already, and indeed this seems like the obvious way to plug RL/AI into games. But I imagine there are many other possibilities that may emerge as we start deploying these things. In part this is because games are quite diverse, so there should be many potential applications. -Akshay

The_Nightbeard1 karma

With the xbox series X having hardware for machine learning, what kind of applications of this apply to gaming?

MicrosoftResearch1 karma

An immediate answer is to use RL to control non-player-characters. -Akshay

MicrosoftResearch1 karma

Why do you seem to only hire PhDs? Getting a PhD is not accessible for many.

We actually have a team of engineers and data scientists working with researchers. They are incredibly valuable because they allow each person to specialize in their expertise. The research aspect certainly does tend to require a PhD. Part of this is about how you develop enough familiarity with an area of research to contribute meaningfully, and part of a PhD is about learning how to reason about and investigate the unknown. Anyone who has mastered those two elements could contribute to research. However, that's quite a mountain to climb without training. -John

MicrosoftResearch1 karma

Domain randomization has been shown to be powerful to improve generalization. Do you think DR will scale up to let us handle many factors of variation, or is it more of a band-aid for now?

In the long term, I expect any simulator-specific technology to be a band-aid in situations where we really need to learn from real-world interaction. With that said, I think it's plausible that when/where we learn to create RL agents with an internal model of the world, some form of robust policy encouragement likely makes sense, and it may be a derivative of domain randomization. -John

MicrosoftResearch1 karma

Can you comment about longer term plans for vowpal wabbit? Is the idea it will contain more SOTA RL or is it more focused on supporting existing features. Thanks!

Vowpal Wabbit is designed for interactive online learning. It seems very valuable to continue to improve capabilities here.   An analogy that I like to think of here is car vs train. In this analogy, a train is like batch-oriented supervised learning because it came first and is very capable of getting from some setup point A to some setup point B. Reinforcement learning (and, more generally interactive learning) is more like a car. It comes online later because more development is required, but it's much more adaptable, able to get you from many, many more points to many, many others. -John

MicrosoftResearch1 karma

Recently, there have been a few publications that try to apply Deep RL to computer networking management. Do you think this is a promising domain for RL applications? What are the biggest challenges that will need to be tackled before similar approaches can be used in the real world?

One of the things I find fascinating is the study of the human immune system.  Is network security going to converge on something like the human immune system?  If so, we'll see quite a bit of adaptive reinforcement-like learning (yes, the immune system learns).

In another vein, choosing supply for demand is endemic to computer operating systems and easily understood as a reinforcement learning problem. Will reinforcement learning approaches exceed the capabilities of existing hand-crafted heuristics here? Plausibly yes, but I'd expect that to happen first in situations where the computational cost of RL need not be taken into account. -John

Yeskar1 karma

Hello, during my last semester at college I did some research and implementation of an AI that used Hierarchical Reinforcement Learning to become a better bot at a shooting game (Unreal Tournament 2004) by practicing against other bots. I haven't followed the more recent updates in this topic (last 5 years), but I remember this approach to RL being promising due to its capabilities of making the environment (combinations of states) hierarchical and reducing computation time. Has HRL become a thing or was it forgotten in its original paper? Also do you have openings in your area for a software developer?

MicrosoftResearch2 karma

HRL is still around. Our group had a paper on it recently (https://arxiv.org/abs/1803.00590), but I think Doina Precup's group has been pushing on this steadily since the original paper. I haven't been tracking this sub-area recently but one concern I had with the earlier work was that in most setups the hierarchical structure needed to be specified to the agent in advance. At least the older methods therefore require quite a lot of domain expertise, which is somewhat limiting.

We usually list our job postings here: https://www.microsoft.com/en-us/research/theme/reinforcement-learning-group/#!opportunities - Akshay

croesys1 karma

Dr. Langford & Dr. Krishnamurthy,

Thank you for this AMA. My question:

From what I understand about RL, there are trade-offs one must consider between computational complexity and sample efficiency for given RL algorithms. What do you both prioritize when developing your algorithms?

MicrosoftResearch1 karma

I tend to think first about statistical/sample efficiency. The basic observation is that computational complexity is gated by sample complexity because minimally you have to read in all of your samples. Additionally, understanding what is possible statistically seems quite a bit easier than understanding this computationally (e.g., computational lower bounds are much harder to prove than statistical ones). Obviously both are important, but you can't have a computationally efficient algorithm that requires exponentially many samples to achieve near-optimality, while you can have the converse (a statistically efficient algorithm that requires exponential time to achieve near-optimality). This suggests you should go after the statistics first. -Akshay

ZazieIsInTheHouse1 karma

How can I prepare in order to become part of Microsoft Research in Reinforcement Learning?

MicrosoftResearch1 karma

This depends on the role you are interested in. We try to post new reqs here (http://aka.ms/rl_hiring ) and have hired in researcher, engineer, and applied/data scientist roles. For a researcher role, a phd is typically required. The other roles each have their own reqs. -John

NIRPL1 karma

After autonomous cars are fully developed, what will the next captcha subject be?

MicrosoftResearch5 karma

CAPTCHAs will eventually become obsolete as a technology concept. -John

edjez1 karma

What are some of the obstacles getting in the way of wide-spread applications of online and offline RL learning for real-world scenarios, and what research avenues look promising to you that could chip away at, or sidestep, the obstacles?

MicrosoftResearch1 karma

I suppose there are many obstacles and the most notable one is that we don't have sample efficient algorithms that can operate at scale. There are other issues like safety, stability, etc., that will matter depending on the application.

The community is working on all of these issues, but in the meantime, I like all of the side-stepping ideas people are trying: leveraging strong inductive bias (via a pre-trained representation or state abstraction or prior), sim-to-real, imitation learning. These all seem very worthwhile to pursue. I am in favor of trying everything and seeing what sticks, because different problems might admit different structures, so it's important to have a suite of tools at our disposal.

On sample efficiency, I like the model based approach as it has many advantages (obvious supervision signal, offline planning, zero-shot transfer to a new reward function, etc.). So (a) fitting accurate dynamics models, (b) efficient planning in such models, and (c) using them to explore, all seem like good questions to study. We have some recent work on this approach (https://arxiv.org/abs/2006.10814) -Akshay

tchlux1 karma

AFAIK most model-based reinforcement learning algorithms are more data efficient than model-free ones (which don't create an explicit model of the environment). However, all the model-based techniques I've seen eventually "throw away" data and stop using it for model training.

Could we do better (lower sample complexity) if we didn't throw away old data? I imagine an algorithm that keeps track of all past observations as "paths" through perception space, and can use something akin to nearest neighbor to identify when it is seeing a similar "path" again in the future.

I.e., what if the model learned a compression from perception space into a lower dimension representation (like the first 10 principal components), could we then record all data and make predictions about future states with nearest neighbor? This method would benefit from "immediate learning". Does this direction sound promising?

MicrosoftResearch1 karma

Definitely. This is highly related to the latent space discovery research direction, on which we've had several recent papers at ICLR, NeurIPS, and ICML. There are several challenging elements here. You need to learn nonlinear maps, you need to use partial learning to gather information for more learning, and it all needs to be scalable. -John
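
A toy sketch of the proposal in the question, under strong simplifying assumptions (PCA as the compression, a single nearest neighbor, actions ignored for brevity):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def fit_latent_nn_model(observations, next_observations, dim=10):
    # observations, next_observations: aligned (N, D) arrays of raw
    # percepts from logged transitions.
    pca = PCA(n_components=dim).fit(observations)
    Z = pca.transform(observations)
    Z_next = pca.transform(next_observations)
    index = NearestNeighbors(n_neighbors=1).fit(Z)

    def predict_next(obs):
        # Match the closest stored point on a past "path" and return
        # its recorded successor in the latent space.
        z = pca.transform(np.asarray(obs).reshape(1, -1))
        _, idx = index.kneighbors(z)
        return Z_next[idx[0, 0]]

    return predict_next
```

The nonlinear-maps caveat above is visible even in this sketch: a linear PCA compression can conflate observations that look similar but behave differently, which is part of what the latent space discovery work aims to fix.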

MicrosoftResearch1 karma

What do you think about progress and research in meta-learning and algorithms like E-MAML? What would you say are downsides and upsides of meta-learning approaches?

The concept of 'meta-learning' confuses me somewhat, because I can't distinguish it from 'learning' very well. From what I can tell, people mean something like 'meta-learning is solving a sequence of tasks', but what creates the task definitions?  If the task definitions are given to the agent that seems a little artificial. 

If we think of the task definitions as embedded in the environment, then from the agent's view it is more like one big nonstationary problem.   Solving nonstationary learning problems better seems very important in practice because nonstationarity is endemic to real-world problems. -John

mayboss1 karma

I have a few questions:

What are your biggest fears in relation to ML or AI?

Where do you see the world heading in this field?

How dependent are we currently on ML and how dependent will we be in the next 10 to 15 years?

What is the coolest AI movie?

MicrosoftResearch2 karma

One of my concerns about ML is personal---there are some big companies that employ a substantial fraction of researchers. If something goes wrong at one of those companies, suddenly many of my friends could be in a difficult situation. Another concern is more societal: ML is powerful and so just like any powerful tool there are ways to use it well and vice-versa. How do we guide towards using it well? That's a question that we'll be asking and partially answering over and over because I see the world heading towards pervasive use of ML. In terms of dependence, my expectation is that it's more a question of dependence on computers than ML per se, with computers being the channel via which ML is delivered. -John

coredweller17851 karma

How are you going to make sure Microsoft doesn't just use it for more surveillance capitalism?

MicrosoftResearch1 karma

There are certainly categories of use for RL which fit 'surveillance capitalism', but there are many others as well: helping answer help questions, optimising system parameters, making logistics work, etc... are all good application domains. We work on the things that we want to see created in the future. -John

Buggi_San1 karma

RL seems to be more strategy-oriented/original than the papers I observe in other areas of ML and Deep Learning, which seem to be more about adding layers upon layers to get slightly better metrics. What is your opinion about it?

Secondly, I would love to know the role of RL in real-world applications.

MicrosoftResearch1 karma

By strategy I guess you mean "algorithmic." I think both areas are fairly algorithmic in nature. There have been some very cool computational advancements involved in getting certain architectures (like transformers) to scale and similarly there are many algorithmic advancements in domain adaptation, robustness, etc. RL is definitely fairly algorithmically focused, which I like =)

RL problems are kind of ubiquitous, since optimizing for some value is a basic primitive. The question is whether "standard RL" methods should be used to solve these problems or not. I think this requires some trial-and-error and, at least with current capabilities, a deeper understanding of the specific problem you are interested in. -Akshay

MicrosoftResearch1 karma

What are the biggest opportunities where RL can be applied? What are the biggest challenges standing in the way of more applications?

It's actually hard to pin down the "biggest" opportunity because it's such a target-rich environment and because the nature of RL is that it's tricky to know how much you'll win until you try it. Reinforcement learning is fundamental because it's the problem of learning to make decisions to optimize value. We are simply naturally inclined to try to make things better.

With that said, I believe it's natural to solve problems of steadily increasing complexity.  Maybe that begins with ranking results on the web, then grows to optimising system parameters, handling important domains like logistics, and eventually delves into robotics? Or maybe it looks like learning to nudge people into healthy habits, amplify e-learning, and mastering making a computer behave as you want?  The far path isn't clear, but perhaps as long as we can discover the next step on the path we'll get there.   Wrt obstacles, I think the primary obstacle is the imagination to try new ways to do things and the secondary obstacle is the infrastructure necessary to support that. -John

GhostOfCadia0 karma

I’m so technologically illiterate I have no idea what 90% of what you said even means. I just have one question.

When can you upload me into a robot?

MicrosoftResearch3 karma

Never sounds like a good bet to me. -John