Automated Character Annotation in Multimedia

author: Andrew Zisserman, University of Oxford
published: Feb. 14, 2008, recorded: February 2008

Slides

0:00 Automated Character Annotation in Multimedia
0:18 The Objective - 1
0:44 The Objective - 2
1:08 Multimedia (Vision and Text) Approach
1:52 The Need
2:57 Outline
4:02 Names and Faces in the News
5:08 Weak Supervision from Text
5:18 Running Example: Use Episodes from Buffy the Vampire Slayer
5:56 Textual Annotation: Subtitles/Closed-Captions
6:33 Textual Annotation: Script
7:07 Alignment by Dynamic Time Warping
7:18 Subtitle/Script Alignment
8:01 Virtually Free Source of Annotation
8:27 Ambiguity
10:23 Face Representation and Matching
10:28 Why This is Difficult: Uncontrolled Viewing Conditions
10:55 Matching Faces - 1
11:09 Matching Faces - 2
12:06 The Benefits of Video
12:39 Three Steps
12:56 Obtaining Sets of Faces Using Tracking within Shots
12:57 Face Detection
13:29 "Tracking" by Face Detection
13:44 Face Association
14:21 Connecting Face Detections Temporally
14:46 Face Association
14:56 Example Face Tracks
15:22 Face Vector Representation
15:23 Matching Faces
15:51 Detect Face Features for Rectification
16:08 Eyes/Nose/Mouth Detectors
16:15 Constellation Like Appearance/Shape Model
16:24 Face Normalization
17:21 Representing Faces
17:31 SIFT Descriptor
17:44 Face Feature Vector - Summary
17:58 Matching Face Sets - 1
18:00 Matching Face Sets - 2
18:12 Matching Face Sets - 3
18:28 Matching Face Sets within a Shot
18:51 Example: Buffy the Vampire Slayer
20:04 Raw Face Detections
20:37 Face Tubes (Tracking Only)
21:15 Intra-Shot Matching
21:17 Face Tubes (Tracking Only)
21:41 Intra-Shot Matching
22:24 Ambiguity Again
23:01 Speaker Detection - 1
23:20 Speaker Detection - 2
24:15 Correct "Non-Speaking" Classifications
24:37 Error in Speaker Classification
24:58 Resolved Ambiguity
25:40 Semi-Supervised Learning
26:07 Exemplar Extraction
26:37 Classification by Exemplar Sets
27:22 "Refusal to Predict"
27:54 Experiments
28:18 Example Results - 1
28:39 Example Results - 2
28:48 Precision/Recall
29:33 Example Video
31:08 Quantitative Results
31:33 Using an SVM Classifier – Noisy Labels
32:45 Classification Results (Inter-Episode)
32:47 Extensions
32:48 Improving Coverage – Beyond Frontal Faces
32:58 Feature Localization & Speaker Detection
33:08 Profile Speaker Detection
33:41 - Questions



Description

We describe progress in automatically identifying characters in films and TV series using their detected faces together with readily available annotation in the form of subtitles and transcripts. We describe how the subtitles and transcript can be aligned to give weak supervision on the characters present in a shot (as well as on actions, emotions, locations, etc.). The supervision is weak because of correspondence problems between names and faces, and because the named character may not be visible. The visual problem of face recognition is challenging because faces appear in images at various sizes and poses, and also vary considerably in expression. Fortunately, videos contain multiple face examples of each person in a form that can easily be associated automatically using straightforward visual tracking. These multiple examples reduce the ambiguity of recognition. We show that the text supervision can be strengthened by speaker detection. Although the labelling is still incomplete and noisy, it is then sufficient to learn visual models for recognition and to achieve successful character identification. This is joint work with Mark Everingham and Josef Sivic.
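
The subtitle/transcript alignment mentioned above can be illustrated with a small sketch. Subtitles carry timing but no speaker names, while a script carries speaker names but no timing; the slide outline names dynamic time warping as the alignment method. Everything else here is an assumption made for illustration and is not taken from the lecture: the word-level 0/1 mismatch cost, the majority vote per subtitle line, the function names `tokenize`, `dtw_align` and `label_subtitles`, and the toy dialogue and speaker labels.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case a line and split it into words (assumed tokenisation)."""
    return re.findall(r"[a-z']+", text.lower())


def dtw_align(a, b):
    """Dynamic time warping between two word lists.

    Returns the minimum-cost warping path as (index_in_a, index_in_b) pairs,
    using an assumed local cost of 0 for identical words and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def label_subtitles(subtitles, script):
    """Transfer speaker names from script lines to timed subtitle lines.

    subtitles: list of (start_seconds, end_seconds, text)
    script:    list of (speaker, text)
    Returns a list of (start_seconds, end_seconds, speaker_or_None, text).
    """
    sub_words, sub_owner = [], []
    for k, (_, _, text) in enumerate(subtitles):
        for w in tokenize(text):
            sub_words.append(w)
            sub_owner.append(k)

    script_words, script_speaker = [], []
    for speaker, text in script:
        for w in tokenize(text):
            script_words.append(w)
            script_speaker.append(speaker)

    # Each exactly matching word pair on the path votes for a speaker.
    votes = [Counter() for _ in subtitles]
    for i, j in dtw_align(sub_words, script_words):
        if sub_words[i] == script_words[j]:
            votes[sub_owner[i]][script_speaker[j]] += 1

    labelled = []
    for (start, end, text), v in zip(subtitles, votes):
        speaker = v.most_common(1)[0][0] if v else None
        labelled.append((start, end, speaker, text))
    return labelled


# Toy data, invented for this sketch.
subtitles = [
    (1.0, 2.5, "Where did you go last night?"),
    (2.6, 4.0, "I went to the library."),
]
script = [
    ("SPEAKER_A", "Where did you go last night?"),
    ("SPEAKER_B", "I went to the library, obviously."),
]

for row in label_subtitles(subtitles, script):
    print(row)
```

The result pairs each timed subtitle with a proposed speaker name. As the description notes, such a label is only weak supervision: it says who is speaking during a time interval, not which detected face (if any) belongs to that speaker.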
