All-Day Wearable HMD?
Returning to the roots of what we are trying to achieve is always a good start. So here is a quote that I think is fundamental to the future of spatial computing.
“If you were supplied with a computer that worked all day long and was instantly responsive to every action you had, how much value could you derive from that?” (Doug Engelbart, 1968)
Introduction
This is the quote with which Engelbart opened the 1968 presentation now known as “the mother of all demos”. It was the first time a graphical user interface and a mouse had been demonstrated together and shown to the world: essentially a proto-Windows, PC-like experience with a spreadsheet and networked machines. Many of the pieces had been developed in different labs around the world, but this was the first time they had all been put together into a complete vision of a future computing environment where ‘paperwork’ could be done digitally.
This revelation led, between 1968 and 1973, to the development of the Xerox Alto, which then led to the PC and, ultimately, the Mac; Steve Jobs could see this was how people would interact with computers in the future. The mobile phone came along in the 1990s, and these days there is hardly a person without one in their pocket; instead of a mouse, it is a finger that drives the graphical user interface. It is connected to the internet, it underpins an industry worth trillions of dollars, and it floats the most powerful and influential company on the planet, a company essentially in the business of building those devices and selling them to billions of people.
Apple Vision Pro
With a spirit of open curiosity, I went to try the Apple Vision Pro. Rather than get a boozy ad-hoc ‘demo’ from one of my friends, I booked myself into the Apple store for the official version. I’m not a tech reviewer by any means (see Marques for that), but here are a few thoughts.
Let’s begin with the thing itself. The Vision Pro is — as you would expect — an engineering masterpiece. It’s unquestionably beautifully designed and manufactured, but if anything, the unit does feel a little too precious. It was brought to the table on a little wooden plinth and laid down like a newborn. Throughout the demo, I was encouraged to handle it gently. We’ve become somewhat accustomed to Apple products, which feel overtly luxurious and technical, but it’s hard to see how this device will age well during daily use. This is tricky territory, however, as adding rubber bumpers and protective details would make the Vision Pro feel off-brand compared to the rest of the Apple lineup, so I’m unsure whether there’s an effective compromise to be had here.
A well-designed cover clips neatly onto the glass front, and a lovely, squishy carry-case is available for an extra $200, but the whole thing feels more like a piece of delicate lab equipment than a consumer product. The woven head strap is nicely resolved, comfortable and easy to use, with a simple clickity-clackity dial for adjustment, which makes a nice change from buckles, clasps and velcro. The external battery pack feels like an annoying necessity, and Apple would dearly like to find a way around it. There’s very little love in the under-resolved aluminium blob encasing it, and I’m sure Apple have this high on the kill list for V2.
Given all the brouhaha about the weight of the Vision Pro, I was expecting it to feel heavier in the hand, but I was pleasantly surprised. The unit felt reassuringly solid but not ‘weighty’. That said, after the 20-minute demo, I had a pretty significant pressure pain above my left eye, which was remarkably uncomfortable, and I was left with a slight rut in my forehead. I don’t doubt that this would be improved with a better fitting light seal, better headband adjustment, or by trying the ‘other’ strap, but this was my initial experience.
The sound quality is pretty good and better than I’d imagined, given the external ‘Audio Pods’ positioned above each ear. I would probably plug in some earbuds for movies and immersive content, but hearing the content and my guide’s voice was pleasant.
At the start of the demo, a staff member took my spectacles and measured their prescription. When the Vision Pro arrived, it was pre-fitted with a pair of corrective Zeiss lenses, but I’m unsure if we got it right. I wear two different prescriptions, one for daily use and one for sitting in front of a computer. She recommended that we take the daily use prescription, and whilst everything I saw was pretty good, it wasn’t perfect. I feel like I’d like to try the demo again with my other prescription.
Regarding video quality, there were absolutely no pixel gaps in any of the applications we tried, and the content's resolution, brightness and crispness were pretty spectacular. Apple is optimising the image rendering based on where your attention is pointed, meaning that some of the peripheral stuff is a little out of focus. Still, there was no RGB separation or rendering lag, even when I flicked my eyes around the environment or moved my head quickly. Now, onto the main event…
My word, it’s good. It’s hard to explain in words, but screens and windows feel completely fixed, opaque and solid in space. They don’t move, wobble or flicker; they feel like well-lit, architecturally tethered objects once positioned. The first time the animated ‘hello’ text appeared floating before me, I was taken aback. It’s really very, very good.
The gestures are remarkably responsive, too. I stumbled a bit here and there (as I no doubt did with pinch and zoom in the early days), but quite quickly, I could look at a handle, grab it, move objects and scale windows. The interaction mapping for the digital crown wasn’t immediately clear to me, and there was a bit of press / long press / rotate confusion during the demo, which I think would become clearer in time. What’s most impressive is that I could sit with my hand in my lap — almost out of sight — and still pinch, drag and zoom easily. The cameras on the device's exterior do a great job of picking up this subtle motion, so there’s no need to have your arms wafting out in front of you like John Anderton.
The immersive video segment formed the crescendo of the demo, and I can see why. The movie started playing, and Alicia Keys was singing right there before me. I could look around the studio and see the cables running across the floor and the fingerprints on the piano. A lady walked towards me on a tightrope suspended high above Yosemite Valley. I could see the lichen on the rocks, hear the wind whipping around my ears, and feel more than a hint of vertigo. Then we were underwater, diving with sharks. I could see their gills flapping, the seaweed rippling, and bubbles rising all around me from divers nearby.
Then came the sport. The feeling of sitting directly behind the goal at an MLS match was perhaps the most impressive thing in the demo… and I hate soccer. Much has been made about how this product might dominate the movie business. Still, sports could be the killer opportunity, where extreme immersion, multiple angles, statistical analysis and extra content are king.
Overall, the Vision Pro is an impressive device. It really feels like a true step change in the goggle/glass/spatial/VR/AR segment, and my overriding impression after the demo was positive. The levels of immersion and stability far exceeded my expectations, and at no point did anything feel janky, erratic or unsure. The Vision Pro feels a little too precious and weighty in this first iteration, and I’d like to see more content ideas beyond ‘wow’ in the next demo.
All in all, this is a solid job by Apple. Keep going.
But what I hope Apple is aiming for
So here we are with virtual and mixed-reality devices. Apple has released the Vision Pro, and Meta has been shipping products in this space for the last eight years, so real progress is being made in VR and mixed reality.
2D rectangles (magic paper)
These wearable computers push beyond the capabilities of the 2D rectangles through which we currently touch the user interface: they embody you within the experience, so that your body becomes part of the control, and your perceptual system is liberated from the 2D windowing system into a full 3D experience.
This matters because it provides a new class of wearable rendering surface spanning three modalities, visual, auditory and haptic, that a 2D windowing system can barely exploit. Rendering visuals in 3D, of course, requires many of the technologies that the computer vision community has been building for the last thirty-odd years.
Spatial audio
Auditory experiences, including spatialised audio, become possible with these wearable devices in a way that laptops and mobile phones cannot match. And there is a future of haptic interaction too, in which you can interact with virtual and real objects and genuinely blend the two realities together.
Beyond the 2D screen
But something is missing from this experience. Whilst it is clearly an advance beyond our mobile phones, it is not everything. Let's return to Doug Engelbart's idea of a computer that works all day long and instantly responds to your every action. He was demonstrating the precursor of the PC and, eventually, the mobile phone.
What would it mean to have a computer that works all day? Firstly, it must go everywhere you go - to work, in the car, out for a coffee and so on - otherwise it cannot work all day long with you, unless you confine your experience of reality to, say, a desk with a PC on it.
And what does it mean to respond instantly to every action you take? Some things are essentially immediate-mode interactions, where you want the computer to give you an answer based on a very short amount of input, but most of the things you care about are contextualised not by a single moment of information but by minutes, hours, weeks, years, even decades of understanding you.
Context
Your friends and relatives understand you, and your colleagues work more effectively with you once they get to know you because of that long context window. So, whereas PCs and mobile phones respond instantly to a very simple interaction - a simple context of where your finger is on the screen, where your mouse cursor is or what your keyboard says - they cannot possibly be instantly responsive to an action that requires context that may be months or years in the making.
What does it mean to have a computer help you with your health condition? If all it can get is a very simple ‘Hey, Siri, I want you to help me with my health condition’, and it knows nothing else about you, that's not going to be possible.
So AI is here
But that is not the be-all and end-all of computing. Many of you will have had a ChatGPT experience, and with barely a year's worth of development that is already pretty interesting. But there is a gap here as well. If you have used GPT, you will know that it can answer those kinds of immediate questions or source information from the internet, but it cannot know about you unless you literally type it in: ‘Hey, you know, I'm Q. I'm 38 years old. I like cycling a lot.’
It is missing context, and contextualised AI is what we need to develop. That can only happen if you have two things: all-day wearable devices that capture your context, and machine perception that turns what those devices sense into signals an AI can actually work with. Because AI cannot consume an entire day's worth of raw egocentric imagery; it is simply too much.
A new class of problem
When you combine those two things, you get contextualised AI, but it is a very new class of problem to solve: you have to solve machine perception on wearable devices and combine it with AI. These fields are not separate; they are going to merge.
AI Challenge
The next great challenge facing us is the always-on AI challenge, but first let me put you back into reality. In reality, you might wake up and, like me, fumble around for your glasses or sunglasses depending on your day's first activity, or search for your phone to check your schedule.
Then the day starts in different ways for different people. What would a machine perception device be able to capture or understand that would actually help you? If we had your context throughout the entire day, multiplied that by weeks, months and years, and allowed AI to interact with it, we could enable contextualised AI that is always on.
Where AI is currently
There is a lot of AI today that amounts to very short-term contextualised AI: multimodal models that reason over a video clip lasting perhaps 20 seconds. That is not the same as machine perception and AI working together in a system that is never off. It has to be always on; otherwise it will miss the details that matter to you.
Quantified self
So what I claim, essentially, is that if we are going to enable useful, lifelong, always-on AI, we are going to have to solve these problems. We must close the context gap to get a human's context into the AI's memory. We must solve wearable machine perception to capture that context, because otherwise we cannot get the data from the user's life. And we must solve the new low-power, small-form-factor, constrained machine perception challenges to make that wearable device real.
Mixed reality size
We cannot walk around in those big mixed-reality headsets; they become painful after a few hours and the battery runs out. And that leads to a new class of machine perception that I think we are all just starting to become aware of. Perhaps this is one of the first times you have heard about distributed machine perception: what emerges when we look at the really hard problems of constrained form factors and ultra-low-power, always-on operation.
The Challenge
The question has to be whether we could have an algorithm that captures your context throughout your entire day, multiplies it across weeks, months and ultimately years, and brings it to bear with AI, enabling always-on, contextualised AI built on context and machine perception. Such a system is never off; it has to be on, otherwise it will miss the details that matter to you. So, what I claim is that if we are going to enable useful, lifelong, always-on AI, we are going to have to solve these problems.
When we look at the really hard problems of those constrained, ultra-low-power, always-on form factors, we find challenges that we do not think can be solved in a single device, but that may be solvable with a federation of devices working together on the machine perception problems. So I am going to leave these with you as, essentially, the new grounding challenges for computer vision. The first of them, wearable machine perception, is exactly where this area kicks off: going from something like a Quest Pro to a glasses form factor is non-trivial, and we are going to hear about the various challenges today.
These are the three problems I want to leave you with and what I believe researchers should be working on:
1. Build algorithms that compress reality sufficiently (world and user modelling)
2. Overcome wearable device limitations (sensing/computing on device)
3. Introduce distributed, always-on computer vision and machine perception
Together, these amount to closing the context gap and enabling contextualised AI for the next decade.
First of all, we have to get to grips with the fact that the data coming from these always-on devices is so massive that the current paradigm of end-to-end machine learning cannot work. You cannot capture a person's entire year of egocentric data with a head-mounted device currently on the market and hope to train a transformer on it end-to-end: that dataset is somewhere between one and ten petabytes for a single person. So we are going to have to solve the compression problem. That might be tokenisation, it might be approaches like Perceiver IO that compress into a latent space, or it might be very specialised, tuned machine perception algorithms such as SLAM and object detection. But that is the challenge of compressing reality: going from petabytes of egocentric data down to something that can be consumed by current and future generations of machine learning.
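To make the scale concrete, here is a rough back-of-envelope sketch in Python. The camera resolution, frame rate, wear time and token sizes are my own illustrative assumptions, not the specification of any shipping device; the point is simply that raw egocentric capture lands in the petabyte range mentioned above, while a heavily compressed, tokenised representation of the same year could be many orders of magnitude smaller.

```python
# Back-of-envelope estimate of a year of raw egocentric video versus a
# tokenised representation. All numbers are illustrative assumptions.

BYTES_PER_PIXEL = 1.5          # assume YUV420-style raw capture
WIDTH, HEIGHT = 1408, 1408     # assumed single RGB camera resolution
FPS = 30                       # assumed frame rate
HOURS_PER_DAY = 16             # assumed waking hours captured
DAYS_PER_YEAR = 365

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
raw_year_bytes = frame_bytes * FPS * 3600 * HOURS_PER_DAY * DAYS_PER_YEAR
print(f"Raw video, one camera, one year: {raw_year_bytes / 1e15:.1f} PB")

# Suppose machine perception distils each second of experience into a small
# set of tokens (objects seen, place, activity): say 32 tokens of 16 bytes.
TOKENS_PER_SECOND = 32
BYTES_PER_TOKEN = 16
compressed_year_bytes = (TOKENS_PER_SECOND * BYTES_PER_TOKEN
                         * 3600 * HOURS_PER_DAY * DAYS_PER_YEAR)
print(f"Tokenised context, one year: {compressed_year_bytes / 1e9:.1f} GB")
print(f"Compression ratio: {raw_year_bytes / compressed_year_bytes:,.0f}x")
```

With these assumptions the raw stream comes out at roughly two petabytes per person per year, while the tokenised version is around ten gigabytes, which is the kind of gap the compression problem has to bridge.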
Even if you can show in principle, with these datasets running on giant GPUs, that reality can be compressed, you then need to do it on these wearable form factors, which have on the order of milliwatts of power available.
To give you an idea of the sort of power we are talking about: we believe the device will have around 100 milliwatts available to you for the entire day, and a single LED already takes around five of those milliwatts.
A normal mobile phone camera takes over 100 milliwatts simply to capture 60-frames-per-second video, and the kind of data that comes off Project Aria devices today costs over 100 milliwatts too. So you will have to run all of those algorithms on a tiny amount of power. That necessitates an entirely new way of thinking about algorithms, and a very different view of the sources of information, than we are used to. Today we push datasets through big machine learning models and then crank the handle of quantisation and optimisation after the fact, but you cannot do that when you are constrained to tens of milliwatts of power; you need new classes of algorithms. And this is where we come back to the idea that maybe a single device does not need to solve everything all of the time. That is distributed machine perception.
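Before moving on, here is a tiny sketch of that power arithmetic. The battery capacity and wear time are assumptions chosen to reproduce the roughly 100 milliwatt envelope above; the LED and camera figures are the ones quoted in the text.

```python
# Rough power-budget sketch for all-day glasses. Battery and wear-time values
# are assumptions; the LED and camera draws are the figures quoted above.

BATTERY_MWH = 1600.0     # assumed small glasses battery (~0.43 Ah at 3.7 V)
WEAR_HOURS = 16.0        # assumed waking-hours usage

budget_mw = BATTERY_MWH / WEAR_HOURS
print(f"Average power available: {budget_mw:.0f} mW")   # ~100 mW, as above

led_mw = 5.0             # a single status LED
camera_mw = 100.0        # phone-class camera streaming 60 fps video

print(f"LED share of budget:    {100 * led_mw / budget_mw:.0f}%")
print(f"Camera share of budget: {100 * camera_mw / budget_mw:.0f}%")

# Streaming full-rate video alone exhausts the entire budget, which is why
# always-on perception needs duty-cycled sensing, low-power wake-up detectors
# and offloading rather than running everything on the glasses all the time.
```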
I really think this is where the next breakthroughs will come: combining machine perception with ultra-low-power, always-on form factors like glasses. If we can solve that, we can unlock contextualised AI.
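To illustrate what that distributed, federated approach might look like, here is a toy sketch. The class names, the wake threshold and the split between a glasses tier and a phone tier are all hypothetical; it is only meant to show the shape of the idea, cheap always-on sensing on the glasses that escalates to heavier perception on a nearby device only when something interesting happens.

```python
# Toy sketch of tiered, distributed machine perception. All names, thresholds
# and the two-tier split are hypothetical illustrations, not a real system.

from dataclasses import dataclass

@dataclass
class Frame:
    motion_score: float   # cheap on-glasses signal, e.g. from an IMU or frame diff
    pixels: bytes         # full image, only shipped off-device when needed

class GlassesNode:
    """Ultra-low-power tier: decides *whether* to look, not *what* it sees."""
    WAKE_THRESHOLD = 0.6  # hypothetical motion threshold

    def should_escalate(self, frame: Frame) -> bool:
        return frame.motion_score > self.WAKE_THRESHOLD

class PhoneNode:
    """Higher-power tier: runs the expensive model only on escalated frames."""
    def analyse(self, frame: Frame) -> str:
        # Placeholder for a real detector / SLAM / captioning model.
        return f"processed {len(frame.pixels)} bytes of imagery"

def pipeline(frames, glasses: GlassesNode, phone: PhoneNode):
    for frame in frames:
        if glasses.should_escalate(frame):
            yield phone.analyse(frame)   # rare, expensive path
        # otherwise: drop the frame on-device, spending almost no power

if __name__ == "__main__":
    demo = [Frame(0.1, b""), Frame(0.9, b"x" * 1024), Frame(0.2, b"")]
    print(list(pipeline(demo, GlassesNode(), PhoneNode())))
```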
If we solve this, we can really build human-oriented computing: we can move past these context-free 2D rectangles to devices that are useful to you everywhere you go, all of the time, for pretty much everything you want to do.
Conclusion
So, going from something like a Quest Pro to a glasses form factor is non-trivial, but that is the step that closes the context gap and makes always-on, contextualised AI possible.