Interpretable Machine Learning through Teaching

We've designed a method that encourages AIs to teach each other with examples that also make sense to humans. Our approach automatically selects the most informative examples to teach a concept — for instance, the best images to describe the concept of dogs — and experimentally we found our approach to be effective at teaching both AIs and humans.


Some of the most transformative applications of powerful AI will come from computers and humans collaborating, but getting them to speak a common language is hard. Think about trying to guess the shape of a rectangle when you’re only shown a collection of random points inside that rectangle: it’s much faster to figure out the correct dimensions of the rectangle when you’re given points at the corners of the rectangle instead. Our machine teaching approach works as a cooperative game played between two agents, with one functioning as a student and the other as a teacher. The goal of the game is for the student to guess a particular concept (i.e. “dog”, “zebra”) based on examples of that concept (such as images of dogs), and the goal of the teacher is to learn to select the most illustrative examples for the student.

It's far more efficient to teach someone the dimensions of a rectangle by defining the opposing points (right), rather than scattering points randomly within its borders (left).

Our two-stage technique works like this: a 'student' neural network is given randomly selected input examples of concepts and is trained from those examples using traditional supervised learning methods to guess the correct concept labels. In the second step, we let the 'teacher' network — which has an intended concept to teach and access to labels linking concepts to examples — to test different examples on the student and see which concept labels the student assigns them, eventually converging on the smallest set of examples it needs to give to let the student guess the intended concept. These examples end up looking interpretable because they are still grounded to the concepts (via the student trained in step one).

In contrast, if we train the student and teacher jointly (as is done in a lot of current communication games), the student and teacher can collude to communicate via arbitrary examples that do not make sense to humans. For instance, the concept of a "dog" might end up being encoded through some arbitrary vectors that may be showing images of llamas and motorcycles, or a rectangle could be composed of two dots that look random to a human, but encode a specific rectangle's dimensions.

To understand why our technique works, consider what happens when we use our method to teach the student to recognize concepts from example images that vary based on four properties: size (small, medium, large), color (red, blue, green), shape (square vs circle), and border (solid vs none).


In this case, a concept is a set of properties that define a subset of the examples as belonging to that concept; for example, if the concept is red circles, then red circles of any size or border fit the concept. Our teacher network eventually learns to pick examples whose only common properties are the ones required by the concept, so that the student can rule out the irrelevant properties. To impart the concept of “red”, for instance, our teacher selects a large red square with no border and then a small red circle with a border. The only property the two shapes have in common is red, so the concept must only consist of red.

Our system learns to pick examples that efficiently teach concepts to students.

This approach works across boolean, hierarchical, probabilistic, and rule-based concepts, with the teaching techniques invented by the teacher network frequently mirroring optimal strategies designed by humans. We also evaluated our approach on humans by giving them examples generated by the teacher network. We found that human subjects on mechanical turk given examples by our machine teacher were able to guess the correct concept more often than if they'd just been presented with random examples to guide them.

While we only looked at teaching via examples in this work, the ideas can apply to the creation of interpretable communication between agents or other ways in which we would wish to make interaction between agents to be more understandable by humans. If you’re interested in working on such efforts, consider joining OpenAI!