Playing for Data: Ground Truth from Computer Games

lucb1e · on Aug 5, 2016

Watching the video, I'm not sure what I'm looking at. On the left are a number of buttons with objects, on the right a cursor colors corresponding objects in the image. It looks very human in behavior but tells me nothing of what is happening here.

They also claim to intercept communication with the GPU, but none of that is visible in the demo. It looks like a magic wand (like GIMP's) that selects pixels of similar color, applied to video, except much slower.

And finally the times mentioned: the first two images took an hour or more to label, the third seven minutes. I'm guessing that's their innovation but I'm wondering what object recognition program takes more than a few seconds to process a frame in the first place. They mention being 'pixel perfect' but any object recognition would be, given it can recognize each object in the image and thereby classify each part of the image.

L_ · on Aug 5, 2016

The first two images are from real-world datasets, where someone drove around a city, took pictures, and then labeled every picture manually. That usually takes 60-90 minutes per image because you have no information other than the picture itself (depth data from lidar or stereo is much sparser and does not help much with fine-grained outlining of objects). If you had an algorithm that could do this perfectly, you would not need this kind of dataset. So the purpose of these datasets is to serve as training data for object detectors and the like. The problem is that modern algorithms (e.g. CNNs) need tons of training data (the more the better), and that data is extremely costly if each image takes an hour.

Now they also create a dataset, but instead of recording and labeling the real world, they take images from GTA and use extracted mesh/texture/shader ids to automatically label all objects in an image.

However, the game does not provide any of these 'rendering resource to object class' associations by default (at least not at the level where they intercept the game/GPU communication). So someone has to create this annotation in the first place. That is the 'magic wand' tool: someone is still annotating, but the human effort is reduced by nearly 3 orders of magnitude (about 7 seconds per image instead of 60-90 minutes) compared to the conventional way of creating those datasets.
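To make the idea concrete, here is a minimal sketch of how one human annotation per rendering resource could label pixels across many frames. It is an illustration under assumptions, not the authors' tool: the `(mesh_id, texture_id, shader_id)` tuples, the ID values, the class names, and both function names are hypothetical, and a real frame would come from an intercepted rendering trace rather than a hand-written list.

```python
from collections import Counter

def label_frame(id_buffer, annotations):
    """Map each pixel's resource signature to a class name (None if not annotated yet)."""
    return [[annotations.get(sig) for sig in row] for row in id_buffer]

def unlabeled_signatures(id_buffer, annotations):
    """List (signature, pixel_count) pairs still waiting for a human label, largest first."""
    counts = Counter(
        sig for row in id_buffer for sig in row if sig not in annotations
    )
    return counts.most_common()

# Hypothetical resource signatures: (mesh_id, texture_id, shader_id).
ROAD = (1, 10, 100)
CAR = (2, 20, 200)
TREE = (3, 30, 300)

# Two tiny "frames" in which every pixel stores the signature that drew it.
frame_a = [[TREE, TREE, CAR],
           [ROAD, ROAD, CAR]]
frame_b = [[CAR, ROAD],
           [ROAD, ROAD]]

# A human clicks once on the road and once on the car in frame_a.
annotations = {ROAD: "road", CAR: "car"}

labels_a = label_frame(frame_a, annotations)
labels_b = label_frame(frame_b, annotations)  # fully labeled without any new clicks
todo_a = unlabeled_signatures(frame_a, annotations)
```

In this toy setup `frame_b` ends up fully labeled with no extra human work, which is where the per-image time savings would come from; only the tree signature in `frame_a` still needs a click.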

stevebmark · on Aug 5, 2016

The language on this page is surprisingly poor. Can someone explain what this paper is actually demonstrating? I assumed that PhD holders knew the rules for writing paper abstracts, but this abstract doesn't follow any of them.

Macuyiko · on Aug 5, 2016

It's actually pretty simple/clever. Constructing labeled imagery costs a lot of time and effort. I assume the current approach is to have a bunch of humans (undergrads, most likely) sit through every image and label (color) it: these pixels are trees, these are cars. It's probably relatively error-prone as well.

The authors propose to just use <some open world game> to capture a huge number of images. Since we're talking about a game, the computer has a perfect internal representation of entities, and hence of the things that can be considered cars, trees, streets, etc. We can thus obtain not only an image per frame that looks close to the real world, but also, immediately, one that is labelled.

Why is this helpful? To train computer vision models such as the ones used in self-driving cars. Of course, the assumption here is that the imagery obtained from a game is close enough to the real world that a trained model would continue to work in the real world. I haven't read the paper in full, but the authors' experiments show that this is the case. They still use some original imagery though, so perhaps it's not possible to use game imagery alone. I also don't think an experiment was performed to see whether this method would still hold up with games that have older, worse-looking engines (it would be interesting to see whether deep models could still generalize to the real world from this).

Finally, the authors spend a lot of hacky effort on forcing the game to output labelled images. As others have suggested here, they probably would have been better off contacting some mod authors (who could probably whip this up in a day) or even the game developer itself (though I don't think Rockstar would be particularly interested in collaborating on this).

socialist_coder · on Aug 5, 2016

Seems like they should be writing some shaders / rendering mod that did all this in realtime... I thought that was what their solution was going to be, but they're still doing it semi-manually per image with that annotation tool.

microcolonel · on Aug 5, 2016

Yeah, I found that very strange, too. And you'd think they would at least propagate object types from texture image bindings (assuming GTA V doesn't use virtual texturing, though that could be worked around as well).
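A rough sketch of what propagating object types from texture bindings might look like, assuming a trace of draw calls is available as `(mesh_id, texture_id)` pairs. The trace format, the IDs, the class names, and `propagate_from_textures` are all hypothetical; this is not how the paper's tool or GTA V actually represents its data.

```python
def propagate_from_textures(draw_calls, texture_classes):
    """Give each mesh the class of the labeled textures bound to it.

    draw_calls: iterable of (mesh_id, texture_id) pairs (hypothetical trace format).
    texture_classes: dict of texture_id -> class name, annotated by a human.
    Returns (mesh_classes, conflicts); conflicts lists meshes that were drawn
    with textures of more than one class and so still need a human decision.
    """
    seen = {}
    for mesh_id, texture_id in draw_calls:
        cls = texture_classes.get(texture_id)
        if cls is not None:
            seen.setdefault(mesh_id, set()).add(cls)
    # Keep only meshes whose labeled textures all agree on one class.
    mesh_classes = {m: next(iter(c)) for m, c in seen.items() if len(c) == 1}
    conflicts = sorted(m for m, c in seen.items() if len(c) > 1)
    return mesh_classes, conflicts

# Hypothetical trace: meshes 1, 2 and 4 use texture 10; meshes 3 and 4 use texture 20.
draw_calls = [(1, 10), (2, 10), (3, 20), (4, 10), (4, 20), (5, 30)]
texture_classes = {10: "building", 20: "vegetation"}

mesh_classes, conflicts = propagate_from_textures(draw_calls, texture_classes)
```

Here labeling two textures classifies three meshes; mesh 4 is flagged as ambiguous and mesh 5 stays unlabeled because its texture was never annotated. Virtual texturing would break the one-texture-per-object assumption this relies on.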

KidComputer · on Aug 4, 2016

Seems like building a GTA-style simulator in UE4 or Unity would be a better solution in the long run than hacking GPU resources like an aimbot developer.