If you have a 3D world and can identify the objects in it, then software can navigate its way around that world and do tasks.
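To make the navigation part concrete, here's a rough sketch, assuming the world has already been boiled down to a flat grid where identified obstacles are marked. The grid, the `S`/`G`/`#` markers, and the name `find_path` are all made up for illustration; a real 3D world would need a much richer representation, and this is just plain breadth-first search, not any particular robot's method.

```python
from collections import deque

# Hypothetical toy world: 'S' = start, 'G' = goal,
# '#' = an identified obstacle, '.' = free floor.
WORLD = [
    "S.#.",
    "..#.",
    "...G",
]

def find_path(world):
    """Breadth-first search from 'S' to 'G'.

    Returns the shortest list of (row, col) cells, or None if the goal
    cannot be reached.
    """
    rows, cols = len(world), len(world[0])
    start = goal = None
    for r in range(rows):
        for c in range(cols):
            if world[r][c] == "S":
                start = (r, c)
            elif world[r][c] == "G":
                goal = (r, c)

    came_from = {start: None}  # remembers how each cell was reached
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk backwards from the goal to rebuild the route.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and world[nr][nc] != "#" and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

if __name__ == "__main__":
    print(find_path(WORLD))
```

The point is only that once the hard part (building the map and labeling what's in it) is done, getting from A to B is a well-understood search problem.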
I'd reason that wiring up natural language to a large database of objects (nouns) would still be rather difficult, but not as difficult as turning camera feeds into a 3D world representation.
Finally, you need to build a body for the robot so it can do things in the world. Once it understands natural language, anyone can tell the robot what to do in their native tongue. Translation also gets more effective, because the AI can reason about the whole sentence and work out which meaning you intend when a word has two.
Sorry, every time I see these technologies that turn camera images into 3D worlds, I can't help but think about artificial intelligence. I'm a pretty good programmer, but that is the one piece of software I didn't want to develop myself. Back in 2002 I kinda put off actually making an AI until someone makes a nice piece of software that lets you walk around buildings and turn them into Quake levels.
And in the process of waiting for this software, I theorized that the biggest use of AI might be to teach people. Eventually I realized you don't actually need AI to teach people with a computer; all you need is digitized books, some videos, and some other tutorial software. So if I ever get enough money to buy the rights to books, or enough money to live off of, I'm going to try and see this vision through. You gotta realize $200 for a laptop is cheaper than thousands of dollars of books, and software can take the place of a teacher, so education is gonna be cheap enough that even people in poor countries will have access to it. The only limiting factors are getting the rights to books and writing the tutorial software. It is a high cost to do this, but once it is done, the benefits for society are several orders of magnitude greater.