New modes of interaction for Flip videos Part 3

This past semester I’ve been experimenting with new modes of interaction for video. I’ve written about 2 previous test sessions here and here.

Annotating video is hard. Video is sound and imagery moving through time. It’s an immersive and, some might say, brain-short-circuiting medium. Watching 3 videos simultaneously may be the norm today. However, if you’re truly engaged in watching video content, in particular content that is chock full of new and complex ideas, it’s hard to do much else.

Watching video content makes our brains go bonkers.

“’Every possible visual area is just going nuts,’ she adds. What does this mean? It shows that the human brain is anything but inactive when it’s watching television. Instead, a multitude of different cortexes and lobes are lighting up and working with each other…”

“She” is Joy Hirsch, Dir. of fMRI Research at Columbia U, being cited by the National Cable & Telecommunications Association, which interprets her results to mean that watching tv is good for our brains, like Sudoku. I’m not sure about that, but it’s reasonable to conclude that consuming video content occupies quite a lot of our brain.

Of course no one is saying reading doesn’t engage the brain. However, one key difference between text and video makes all the difference when it comes to annotation: With reading, we control the pace of reading, slowing down and speeding up constantly as we scale difficult passages or breeze through easy ones.

Video runs away from us on its own schedule whether or not we can keep up. Sure we can pause and play, fast-forward and slow down, but our ability to regulate video playback can only be clunky when compared to the dexterity with which we can control the pace of reading.

In fact the way researchers describe brain activity while watching tv sounds a lot like trying to keep up with a speeding train. All areas of the brain light up just to keep up with the action.

So what does that mean for those of us building video annotation tools?

Video annotation has all the same cognitive challenges as text annotation, but it comes with additional physiological hurdles as well.

STEM v. The Humanities

I’ve been working off the assumption that responding to STEM material is fundamentally different from The Humanities. For STEM subjects, the range of relevant responses is much more limited. It essentially amounts to different flavors of “I’m confused.” and “I’m not confused.”

I’m confused because:

  • e.g. I need to see more examples to understand this.
  • Syntax! I don’t know the meaning of this word.
  • How? I need this broken down step-by-step.
  • Why? I want to know why this is so.
  • Scale. I need a point of comparison to understand the significance of this.

I get this because:

  • Apt! Thank you. This is a great example.
  • Got it! This was a really clear explanation.

Humor is a commonly wielded weapon in the arsenal of good teaching so being able to chuckle in response to the material is relevant as well.

But as is often the case when trying to define heuristics, it’s more complicated than simply STEM versus not-STEM.

Perhaps a more helpful demarcation of territory would be to speak in terms of the manner and tone of the content (text or video) and more or less ignore subject matter altogether. In other words: The way in which I respond to material depends on how the material is talking to me.

For example, the manner and tone with which the speaker addresses the viewer varies dramatically depending on whether the video is a:

  • “How-to” Tutorial
  • Expository Lecture
  • Editorializing Opinion
  • Edu-tainment

The tutorial giver is explaining how to get from A to Z by following the intervening steps B through Y. First you do this, then you do that.

The lecturer is a combination of explanatory and provocative. This is how you do this, but here’s some food for thought to get you thinking about why that’s so.

The editorializing opinion-giver is trying to persuade you of a particular viewpoint.

Edu-tainment is, well, exactly that: delivering interesting information in an entertaining format.

And of course, the boundaries between these categories are sometimes blurry. For example, is this Richard Feynman lecture an Expository Lecture or an Editorializing Opinion?

I would argue it falls somewhere in the middle. He’s offering a world view, not just statements of fact. You might say that the best lecturers are always operating in this gray area between fact and opinion.

The Test Session

So in our 3rd test session, unlike the previous 2, I chose 3 very different types of video content to test.

Documentary on the Stanford Prison Experiment (Category: Edu-tainment)

A 10-minute segment of the Biden v. Ryan 2012 Vice Presidential Debate re: Medicare starting at ~32:00. (Category: Editorializing Opinion)

Dan Shiffman’s Introduction to Inheritance from Nature of Code (Category: Expository Lecture)

You can try annotating these videos on Ponder yourself:

  1. Dan Shiffman’s Introduction to Inheritance from Nature of Code.
  2. Biden v. Ryan Vice-Presidential Debate.
  3. The Stanford Prison Experiment documentary.

The Set-up

There were 5 test subjects, watching 3 different videos embedded in the Ponder video annotation interface in the same room, each on their own laptop with headphones. That means unlike previous test sessions, each person was able to control the video on their own.

Each video was ~10 minutes long. The prompt was to watch and annotate with the intention of summarizing the salient points of the video.

2 students watched Dan Shiffman’s Nature of Code (NOC) video. 2 students watched the documentary on the Stanford Prison Experiment. And 1 student watched the debate.

The Results

The Stanford Prison Experiment documentary had the most annotations (15 per user, versus 12 for NOC and 5 for the debate) and the most varied use of annotation tags (22 different tags in total, versus 5 for NOC and 4 for the debate).

Unsurprisingly the prison documentary provoked a lot of emotional reactions (50% of the responses were emotional – 12 different kinds compared to 0 emotional reactions to the debate).

Again unsurprisingly, the most common response to the NOC lecture was “{ chuckle }”: it accounted for 12 of the 25 responses. There was only 1 point of confusion, a matter of unfamiliarity with syntax: “What is extends?”

This was a pattern I noted in the previous sessions: in many STEM subjects, everything makes perfect sense in the “lecture.” The problem is that, oftentimes, as soon as you try to do it on your own, confusion sets in.

I don’t think there’s any way around this problem other than to bake “problem sets” into video lectures and allow the points of confusion to bubble up through “active trying” rather than “passive listening.”

[Figures: Intro to Inheritance (NOC); Biden v. Ryan Vice-Presidential Debate; Stanford Prison Experiment]

Less is More?

There are 2 annotation modes in Ponder. 1 displays a small set of annotation tags (9) in a Hollywood Squares arrangement. A second displays a much larger set of tags. Again the documentary watchers were the only ones to dive into the 2nd set of more nuanced tags.

[Figure: Less v. More]

However, neither student watching the documentary made use of the text elaboration field, where you can write a response in addition to applying a tag (they didn’t see it until the end), whereas the Nature of Code and Biden-Ryan debate watchers did. This made me wonder how having the elaboration field as an option changes the rate and character of the responses.

Everyone reported pausing the video more than they normally would in order to annotate. Much of the pausing and starting simply had to do with the clunkiness of applying your annotation to the right moment in time on the timeline.

It’s all in the prompt.

As with any assignment, designing an effective prompt is half the battle.

When I tested without software, the prompt I used was: Raise your hand if something’s confusing. Raise your hand if something is especially clear.

This time, the prompt was: Annotate to summarize.

In retrospect, summarization is a lot harder than simply noting when you’re confused versus when you’re interested.

Summarization is a forest-for-the-trees kind of exercise. You can’t really know moment-to-moment as you watch a video what the salient points are going to be. You need to consume the whole thing, reflect on it, perhaps re-watch parts or all of it and construct a coherent narrative out of what you took in.

By contrast, noting what’s confusing and what’s interesting is decision-making you can do “in real-time” as you watch.

When I asked people for a summary of their video, no one was prepared to give one (in spite of the exercise), and I understand why.

However, one of the subjects who watched the Stanford Prison Experiment documentary was able to pinpoint the exact sentence uttered by one of the interviewees that he felt summed up the whole thing.

Is Social Always Better?

All 3 tests I’ve conducted were done together, sitting in a classroom. At Ponder, we’ve been discussing the idea of working with schools to set up structured flip study periods. It would be interesting to study the effect of socialization on flip. Do students pay closer attention to the material in a study hall environment versus studying alone at home?

The version of Ponder video we used for the test session shows other users’ activity on the same video in real-time. As you watch and annotate, you see other people’s annotations popping up on the timeline.

For the 2 people watching the Stanford documentary, that sense of watching with someone else was fun and engaging. They both reported being spurred on to explore the annotation tags when they saw the other person using a new one. (e.g. “Appreciates perspicacity? Where’s that one?”)

By contrast, for the 2 people trying to digest Shiffman’s lecture, the real-time feedback was distracting.

I assigned an annotation exercise to another test subject to be done on her own time. The set-up was less social in two senses: she was not sitting in a room with other people watching and annotating videos, and she was not annotating the video with anyone else via the software.

I gave the same prompt. Interestingly, from the way she described it, she approached the task much like a personal note-taking exercise. She also watched Shiffman’s Nature of Code video. For her, assigning predefined annotation tags got in the way of note-taking.

Interaction Learnings

  • The big challenge with video (and audio) is that they are a black box content-wise. As a result, the mechanism that works so well for text (simply tagging an excerpt of text with a predefined response tag) does less well on video where the artifact (an annotation tag attached to timecode) is not so compelling. So I increased emphasis on the elaboration field, keeping it open at all times to encourage people to write more.
  • On the other hand, the forest-for-the-trees view offered on the video timeline is, I think, more interesting to look at than the underline heatmap visualization for text, so I’ll be looking for ways to build on that.
  • There was unanimous desire to be able to drag the timecode tick marks after they had already submitted a response. We implemented that right away.
  • There was also universal desire to be able to attach a response to a span of time (as opposed to a single moment in time). The interaction for this is tricky, so we’ve punted this feature for now.
  • One user requested an interaction feature we had implemented but removed after light testing because we weren’t sure if it would prove to be more confusing than convenient: automatically pausing the video whenever you make mouse gestures indicating that you intend to create an annotation, and then resuming playback as soon as you finish submitting. I’m still not sure what to do about this, but it supports the idea that the difficulty of pacing video consumption makes annotating and responding to it more onerous than doing the same with text.
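
To make the timecode ideas above a bit more concrete, here is a minimal Python sketch of what a video annotation record might look like. It is not Ponder’s actual data model: the Annotation class, its field names and the drag helper are all hypothetical. It assumes a response is a predefined tag plus an optional text elaboration, attached either to a single moment or to a span of time, and that dragging a tick mark simply moves the response along the timeline without leaving the video.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One response on a video timeline (all names here are hypothetical)."""
    user: str
    tag: str                      # predefined response tag, e.g. "Syntax!"
    start: float                  # timecode in seconds
    end: Optional[float] = None   # None = a single moment; otherwise a span
    elaboration: str = ""         # optional free-text response

    def __post_init__(self) -> None:
        if self.start < 0:
            raise ValueError("start must be non-negative")
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not precede start")


def drag(annotation: Annotation, new_start: float, video_length: float) -> Annotation:
    """Return a copy moved to new_start, clamped so it stays inside the video.

    A span keeps its duration; a single-moment annotation stays a single moment.
    """
    duration = 0.0 if annotation.end is None else annotation.end - annotation.start
    clamped = max(0.0, min(new_start, video_length - duration))
    new_end = None if annotation.end is None else clamped + duration
    return replace(annotation, start=clamped, end=new_end)
```

For example, the “What is extends?” response could be stored as Annotation(user="a", tag="Syntax!", start=95.0, elaboration="What is extends?") (the timecode is made up) and later nudged with drag(note, 120.0, 600.0). Treating a span as an optional end time is just one possible design; it keeps single-moment responses cheap while leaving room for the span feature.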


  1. Annotating video is hard to do, so any interaction affordance that makes it easier helps.
  2. Dense material (e.g. Shiffman’s lecture) is more challenging to annotate. Primary sources (e.g. the debate) are also challenging to annotate. The more carefully produced and pre-digested the material (e.g. the documentary), the easier it is to annotate.
  3. With video, we should be encouraging more writing (text elaborations of response tags) to give people more of a view into the content.
  4. Real-time interaction with other users is not always desirable. Users should be given a way to turn it on/off for different situations.
  5. There may be a benefit to setting up “study halls” (virtual or physical) for consuming flip content, but this is mere intuition right now and needs to be tested further.
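
The on/off idea in point 4 could be as simple as a per-viewer filter on what the timeline shows. The Python sketch below is only an illustration under that assumption: the visible_annotations function, the show_others flag and the dictionary keys are hypothetical, not part of Ponder.

```python
from typing import Dict, List


def visible_annotations(
    annotations: List[Dict], viewer: str, show_others: bool
) -> List[Dict]:
    """Return the annotations a viewer should see, ordered by timecode.

    Each annotation is assumed to be a dict with "user", "tag" and "time" keys.
    With show_others off, only the viewer's own annotations remain.
    """
    if show_others:
        shown = list(annotations)
    else:
        shown = [a for a in annotations if a["user"] == viewer]
    return sorted(shown, key=lambda a: a["time"])
```

A viewer who finds the live activity fun would leave show_others on; someone working through dense material could switch it off and still see their own notes.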

Last but not least, thank you to everyone at ITP who participated in these informal test sessions this semester, and to Shawn Van Every and Dan Shiffman for their interest and support.

