Facebook believes that artificial intelligence needs to develop an “egocentric perspective” to work in augmented and virtual reality.
To that end, the company on Thursday announced Ego4D, a dataset of 2,792 hours of first-person video plus a set of benchmarks for neural networks, designed to spur the development of AI that better understands what it’s like to navigate virtual worlds from a first-person perspective.
The project is a collaboration between Facebook Reality Labs and scientists from 13 research institutions, including academic institutions and research laboratories. The details are laid out in a paper lead-authored by Facebook’s Kristen Grauman, “Ego4D: Around the World in 2.8 Thousand Hours of Egocentric Video.”
Grauman is a research scientist at Facebook AI Research. Her work as a professor at UT Austin has focused on computer vision and machine learning.
The idea is that the dataset will push researchers to develop neural networks that excel at tasks from a first-person perspective, in the same way that large datasets such as ImageNet advanced existing AI programs built around a “spectator” perspective.
The essence of egocentric perception is to tackle the problems a neural network runs into on basic tasks, such as image recognition, when the point of view shifts from the third person to the first person, according to Facebook.
Also: Facebook announces a $50 million investment in the “responsible” development of the metaverse
Most image-recognition systems that can detect objects seen from the sidelines have high failure rates when the object is instead shown from the perspective of the person encountering it.
The Ego4D initiative specifically targets the metaverse, the coming world of immersive social networking that Facebook CEO Mark Zuckerberg discussed on the company’s latest earnings call.
“These benchmarks will catalyze research on the building blocks necessary to develop smarter AI assistants that can understand and interact not just in the real world but also in the metaverse, where physical reality, AR, and VR all come together in a single space,” said Facebook.
The 2,792 hours of video were collected by Facebook staff using a variety of cameras. Vuzix’s Blade augmented-reality headset is just one; others include GoPro, Pupil Labs, ZShades, and Wee-view devices. The purpose of mixing different hardware is to avoid “overfitting,” Grauman and her colleagues write, a phenomenon in which a neural network memorizes the video footage rather than learning the commonalities that run across varied footage.
Facebook noted that the video “was shot by 750 unique camera wearers from 73 locations around the world in 9 different countries.” Some of it was shot by Facebook staff on the company’s campus, and some by university staff.
Also: Facebook brings the metaverse to work with Horizon Workrooms (and you thought Zoom fatigue was bad)
The “4D” in Ego4D refers to the temporal dimension of the video. Facebook staff spent 250,000 hours watching the footage and narrating it, summarizing what is happening on screen, with timestamps added.
Facebook says the narrations are “temporally dense,” averaging 13.2 sentences per minute of video, for a total of 3.85 million sentences. In total, the narrations describe the Ego4D videos using 1,772 unique verbs (activities) and 4,336 unique nouns (objects).
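Those timestamped narrations amount to a stream of sentence-level records attached to each video. As a rough illustration only, one such record might look like the following sketch; the field names are hypothetical and are not the actual Ego4D schema.

```python
from dataclasses import dataclass

@dataclass
class Narration:
    """One timestamped sentence describing a moment in an egocentric video.
    Field names are illustrative, not the real Ego4D annotation format."""
    video_id: str
    timestamp_sec: float  # when in the video the described event occurs
    sentence: str         # e.g. "C slices a tomato" (C = camera wearer)

# At a density of roughly 13 sentences per minute, even a short clip
# accumulates several records:
clip = [
    Narration("vid_001", 2.1, "C picks up a knife"),
    Narration("vid_001", 6.8, "C slices a tomato"),
    Narration("vid_001", 11.4, "C places the slices on a plate"),
]
print(len(clip))  # 3
```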
The dataset is meant to be used to develop neural networks that can handle a variety of new benchmark tests. To that end, Grauman and her colleagues describe several new tests they have created that require a neural network to reason about the past, such as recollection; the present, such as categorizing an ongoing activity; or the future, such as describing the outcome of an action.
For example, one task requires a neural network to respond to a natural-language query, which means the program must match the content of the query to a video frame. Ask the computer, “When did I read to my children?” and it must find the scene in which the camera wearer was reading to their children. The tasks are labeled by annotation staff, who are given a pre-formatted list of tags and must assign them to clips.
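The query task described above can be caricatured as matching a query’s content words against annotator-assigned clip tags. This is a minimal toy sketch under that assumption; the tag vocabulary, clip structure, and matching rule are all invented for illustration and bear no relation to the actual Ego4D benchmark interface.

```python
# Toy clips, each tagged by hypothetical annotators with activity labels.
clips = [
    {"video_id": "vid_001", "start_sec": 0,   "end_sec": 45,  "tags": {"cooking", "kitchen"}},
    {"video_id": "vid_001", "start_sec": 120, "end_sec": 300, "tags": {"reading", "children"}},
    {"video_id": "vid_002", "start_sec": 10,  "end_sec": 90,  "tags": {"gardening"}},
]

def answer_query(query_words, clips):
    """Return clips whose tags overlap the query's content words."""
    words = set(query_words)
    return [c for c in clips if c["tags"] & words]

# "When did I read to my children?" -> content words {"reading", "children"}
hits = answer_query({"reading", "children"}, clips)
for c in hits:
    print(c["video_id"], c["start_sec"], c["end_sec"])  # vid_001 120 300
```

A real system would embed the query and the video with a learned model rather than intersecting hand-assigned tags, but the input/output shape of the task is the same: a question in, a video segment out.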
Facebook said it has 74,000 queries composed this way, covering 800 hours of video.
In a forecasting test, the computer may need to predict which object in the video frame the camera wearer will interact with next. So, if they are rolling out dough at a table, the next likely action may be to grab another ball of dough from the table. The program makes its prediction by selecting from a preset list of verbs that annotation staff have attached to the footage, adding a time estimate, and outputting something like “take dough in 0.8 seconds.”
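The forecasting output described above pairs a verb from a fixed vocabulary with a time-to-contact estimate. The sketch below shows only that output format; the verb list, model scores, and function name are assumptions for illustration, not the Ego4D baseline.

```python
# Hypothetical closed vocabulary of interaction verbs (illustrative only).
VERBS = ["take", "roll", "cut", "place"]

def predict_next_interaction(scores, time_to_contact_sec):
    """Pick the highest-scoring verb and format the prediction string."""
    best = max(range(len(VERBS)), key=lambda i: scores[i])
    return f"{VERBS[best]} dough in {time_to_contact_sec:.1f} seconds"

# Made-up model scores for the dough-rolling scene: "take" dominates.
print(predict_next_interaction([0.7, 0.2, 0.05, 0.05], 0.8))
# -> take dough in 0.8 seconds
```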
Also: Facebook already has your memories, smart glasses will get more of them
The Ego4D datasets will be available on GitHub next month, according to Facebook. Users will need to sign a data-use agreement.