We present a machine learning technique that recognizes hand gestures and estimates the metric depth of hands for 3D interaction, relying only on monocular RGB video input. We aim to enable spatial interaction with small, body-worn devices, where rich 3D input is desirable but conventional depth sensors are impractical due to their power consumption and size. We propose a hybrid classification-regression approach that learns and predicts a mapping from RGB colors to absolute, metric depth in real time. We also classify distinct hand gestures, enabling a variety of 3D interactions. We demonstrate our technique in three mobile interaction scenarios and evaluate the method quantitatively and qualitatively.
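To make the hybrid classification-regression idea concrete, the following is a minimal sketch of one plausible two-stage structure: first classify each color sample into a coarse depth bin, then regress a continuous residual within the predicted bin to obtain metric depth. This is an illustrative assumption, not the paper's actual pipeline; the features, models, bin count, and depth range below are all hypothetical.

```python
# Hypothetical sketch of a hybrid classification-regression depth predictor.
# Models, features, N_BINS, and DEPTH_RANGE are illustrative assumptions,
# not the authors' implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

N_BINS = 8                     # coarse depth bins (assumption)
DEPTH_RANGE = (0.10, 0.60)     # plausible hand-to-camera range in metres (assumption)

def make_bins(depth_m):
    """Quantize metric depth into coarse class labels and return bin edges."""
    edges = np.linspace(*DEPTH_RANGE, N_BINS + 1)
    labels = np.clip(np.digitize(depth_m, edges) - 1, 0, N_BINS - 1)
    return labels, edges

# Toy training data: per-sample RGB features with known metric depth.
rng = np.random.default_rng(0)
rgb = rng.random((5000, 3))                                              # stand-in RGB features
depth = DEPTH_RANGE[0] + (DEPTH_RANGE[1] - DEPTH_RANGE[0]) * rgb[:, 0]   # synthetic ground truth
labels, edges = make_bins(depth)

# Stage 1 (classification): predict the coarse depth bin from color.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(rgb, labels)

# Stage 2 (regression): predict the continuous residual within the bin.
bin_centres = 0.5 * (edges[:-1] + edges[1:])
residual = depth - bin_centres[labels]
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    np.column_stack([rgb, labels]), residual)

def predict_depth(rgb_feat):
    """Hybrid prediction: coarse bin centre + fine residual -> metric depth."""
    bins = clf.predict(rgb_feat)
    res = reg.predict(np.column_stack([rgb_feat, bins]))
    return bin_centres[bins] + res

print(predict_depth(rgb[:5]))
print(depth[:5])
```

The two-stage split reflects the "hybrid" framing in the abstract: the classifier constrains the prediction to a coarse, robust depth range, and the regressor refines it to a continuous metric value.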