Learning to See

Published on Aug 23, 2016 · 10,016 Views

It is an exciting time for computer vision. With the success of new computational architectures for visual processing, such as deep neural networks (e.g., ConvNets) and access to image databases with …

Chapter list

Learning to see (00:00)
Exciting times for computer vision (00:14)
A bit of history… (00:31)
The early optimism (1960-1970) (00:38)
50 years ago... - 1 (01:24)
50 years ago... - 2 (01:28)
25 years ago... - 1 (01:47)
25 years ago... - 2 (01:51)
The vision crisis (1970-2000) - 1 (02:09)
The vision crisis (1970-2000) - 2 (02:25)
The vision crisis (1970-2000) - 3 (03:24)
The vision crisis (1970-2000) - 4 (03:58)
The vision crisis (1970-2000) - 5 (04:28)
But 15 years ago... - 1 (04:30)
But 15 years ago... - 2 (05:07)
But 15 years ago... - 3 (05:15)
Advances in computer vision (05:29)
A short story of image databases (05:49)
Big data (06:09)
The time of big data (09:47)
In 2010, a new student gets into computer vision (10:07)
Who’s to blame? - 1 (10:22)
Who’s to blame? - 2 (10:31)
What does a detector see? - 1 (10:39)
What does a detector see? - 2 (12:04)
Can you tell which ones are not the object? - 1 (13:56)
Can you tell which ones are not the object? - 2 (14:17)
HOG visualization predicts SVM performance (14:22)
http://mit.edu/vondrick/ihog/ (15:17)
Deep architectures (16:05)
Scene recognition demo (16:13)
Predictions - 1 (16:41)
Predictions - 2 (16:51)
Predictions - 3 (17:15)
Predictions - 4 (17:34)
Predictions - 5 (17:41)
Predictions - 6 (17:57)
Predictions - 7 (18:05)
Why is it working so well? (18:46)
Visualizing the internal representation (19:23)
Visualizing and Understanding Convolutional Networks (19:56)
Generative Adversarial Nets (20:01)
Generated images (21:11)
Unsupervised Representation Learning (21:41)
Generator (21:56)
Synthesizing the preferred inputs for neurons (22:47)
Two components - 1 (22:53)
Two components - 2 (23:18)
Synthesizing Images Preferred by CNN (24:05)
Object detection vs. Scene recognition (24:33)
Ontology of images (25:20)
Places (25:51)
Two large databases, two tasks (26:33)
ImageNet CNN and Places CNN (26:47)
Possible internal representations (27:02)
Learning to Recognize Objects (27:07)
Learning to Recognize Scenes (27:58)
Places and objects (28:32)
Preferred images (29:49)
Estimating the receptive field - 1 (32:20)
Estimating the receptive field - 2 (33:32)
Generating segmentations - 1 (34:41)
Generating segmentations - 2 (34:55)
Generating segmentations - 3 (35:00)
Generating segmentations - 4 (35:10)
Crowdsourcing units - 1 (36:00)
Crowdsourcing units - 2 (36:33)
Annotating the Semantics of Units - 1 (36:46)
Annotating the Semantics of Units - 2 (37:22)
Annotating the Semantics of Units - 3 (37:41)
Annotating the Semantics of Units - 4 (37:49)
Annotating the Semantics of Units - 5 (38:01)
1 - Simple elements and colors (38:45)
2 - Texture or materials (40:33)
3 - Regions and surfaces (40:44)
4 - Object parts (41:31)
5 - Objects (42:12)
6 - Scenes (43:08)
What objects are found? (43:23)
ImageNet CNN units (43:26)
Places CNN units - 1 (44:06)
Places CNN units - 2 (44:43)
Places CNN units - 3 (45:13)
Histogram of Emerged Objects in Pool5 - 1 (45:28)
Histogram of Emerged Objects in Pool5 - 2 (46:02)
Object detectors emerge inside the CNN - 1 (46:42)
Object detectors emerge inside the CNN - 2 (47:03)
Strategies for training for a new task - 1 (48:10)
Strategies for training for a new task - 2 (49:25)
Strategies for training for a new task - 3 (50:03)
Drawing Tool - 1 (51:12)
But what if you keep the task but change the input modality? (51:18)
Drawing Tool - 2 (52:10)
Line drawings - 1 (52:31)
Line drawings - 2 (52:35)
Line drawings - 3 (52:36)
Line drawings - 4 (52:37)
Line drawings - 5 (52:38)
Line drawings - 6 (52:41)
Aquarium (52:42)
Library (52:46)
Localized words - 1 (52:48)
Localized words - 2 (53:02)
Descriptions - 1 (53:07)
Descriptions - 2 (53:21)
Descriptions - 3 (53:24)
We collected a dataset (53:33)
Strategies for training for a new task - 4 (53:52)
Strategies for training for a new task - 5 (54:23)
Strategies for training for a new task - 6 (55:42)
Unit 115 (56:18)
Units in pool5 become multimodal (56:22)
Generating across modalities (56:27)
Cross-modal learning (57:08)
Zero-shot Learning - 1 (57:37)
Zero-shot Learning - 2 (57:40)
Zero-shot Learning - 3 (57:43)
Red-faced Cormorant - 1 (58:25)
Red-faced Cormorant - 2 (58:58)
Strong supervision (59:09)
Weak supervision (59:32)
Cross-modal: text and images - 1 (59:56)
Cross-modal: text and images - 2 (01:00:19)
Soft - hard - 2 (01:01:32)
Soft - hard - 1 (01:01:45)
Soft - hard - 3 (01:02:25)
Soft - hard - 4 (01:02:40)
Soft - hard - 5 (01:03:12)
Visually Indicated Sounds (01:03:30)
Collecting a dataset of physical interactions (01:03:42)
The Greatest Hits dataset (01:04:13)
Can we predict material properties from sound? - 1 (01:04:42)
Can we predict material properties from sound? - 2 (01:05:09)
Can we predict material properties from sound? - 3 (01:06:37)
Predicting audio features - 1 (01:07:06)
Predicting audio features - 2 (01:07:16)
Real-or-fake study - 1 (01:07:46)
Real-or-fake study - 2 (01:08:12)
Real-or-fake study - 3 (01:08:23)
Predicted sound - 1 (01:09:06)
Predicted sound - 2 (01:10:16)
Predicted sound - 3 (01:11:00)
Predicted sound - 4 (01:11:08)
Predicted sound - 5 (01:11:09)
Predicted sound - 6 (01:11:19)