Gestop - The Journey
Published:
Developing a customizable and extensible gesture recognition system.
Note: Originally published on Medium
The established way of interacting with most computers is a mouse and keyboard. On the other hand, hand gestures are an intuitive and effective touch-less way of interaction, among humans atleast. Hand gesture based systems have seen low adoption among end-users primarily due to numerous technical hurdles in detecting in-air gestures accurately. This article describes the development of Gestop, a customizable and extensible framework for controlling your computer through hand gestures.
In this article, I’ll be walking you through my journey of developing Gestop guided by my mentor Nishant and my experience working at OffNote Labs.
The Journey Begins!
Gestop was the result of my summer internship at OffNote Labs, where we explored cutting edge problems and came up with solutions for them. After an initial discussion of interesting problems that I would like to work on, Nishant pointed me towards Gesture Recognition, and an existing library MediaPipe. Among other things, MediaPipe includes MediaPipe Hands, a high-fidelity hand and finger tracking solution.
The first hurdle was the simple problem of mouse tracking i.e. move the cursor around the screen using my hand and not the mouse. I designated the tip of my index finger as the coordinate of the cursor and mouse tracking was ready! The joy of creating something new, no matter how trivial, is not to be underestimated. :)
Click!
For deeper understanding of the problem, my mentor then suggested that I build a proof-of-concept using only algorithmic rules. While I initially wanted to move ahead straightaway to neural networks, in hindsight, this was a valuable step as it helped me gain a low-level understanding of how the keypoints were generated and used. I set up a simple rule-based system which detected whether a finger was bent or straight depending on the sum of the angles of the finger joints. Following this, I only had to execute the click action when this rule was triggered and voila, I could now move my cursor around the cursor and perform left clicks.
Onwards to Data
The problem of gesture recognition is complex, and there are many ways of approaching it and designing a solution. We discussed the various aspects of the problem over multiple calls and my mentor helped evaluate my suggestions, while also providing input when I was stuck, which happened often. :)
Of course, designing the architecture is only the start. During implementation, more issues cropped up, such as such as figuring out how to discriminate between actual hand gestures and any random hand movement the user might make while the application is running.
For each problem I faced, I perused numerous research papers, obscure blogs and unmaintained open source projects in search of ideas that would help solve the problem at hand. Once I had a concrete idea, I would go ahead and discuss it with my mentor, who would point out issues I hadn’t foreseen or describe a better way of doing the same thing. With time and (plenty of) guidance, I came up with solutions that tackled the issues I faced. The so-far unnamed Gestop could now act as a substitute for the mouse.
Refactor and Optimize
At this point, over a month and a half of continuous development had led to some unmanageable code. I had iteratively developed and added code to one single file, and fixing issues was now far harder than it should have been.
Nishant laid the groundwork for me by describing an overall modular architecture and that gave me a lot of clarity. I got to work modularizing the code into logically coherent parts and essentially, made the development a whole lot easier.
Gestures were still not detected very well however, and I spent a lot of time diagnosing and improving it. My mentor pointed me towards using tools like Tensorboard to view weights and activations, WandB which logged the runs and provided information on changes in performance when tweaking anything and so on. As I’ve come to know, these are all essential tools when working on Deep Learning research problems and I would not have had exposure to them if not for this internship at OffNote Labs.
Homing In
At this point, Gestop was a fully functional application that could detect around 7 static gestures, 14 dynamic gestures and execute a predefined action mapped to them. This wasn’t compelling enough to use just yet, with a limited number of gestures and predefined actions. my mentor encouraged me to think about adding new actions and gestures, and pursuing this thought led to the current Gestop, a customizable and extensible framework.
Allowing users to remap functionality so different gestures can do different things, allowing addition of new functionality through python functions or shell scripts, and finally, allowing users to add their own gestures to be recognized were the things what elevated Gestop from a summer project to something that could be a useful library.
Penning down thoughts
Gestop was now feature complete! Nishant suggested we write a paper to submit to CODS-COMAD, one of India’s premier data science conferences.
The first draft flowed easily from my mind onto text. I wrote down my thoughts as they came, and the pages filled up. When I finished and showed it for an initial review, I noticed the stilted sentences, the leaps in logic I’d made that are lost on the reader, and other such discrepancies.
Refining the paper is a process of iteration, many iterations in fact :). Even when all of what you wish to convey is on the paper, it is non-trivial to organize it in a manner that is easy to parse through and understand. Eventually, the changes you make become more and more minor, and you are satisfied and the paper is finally ready to be submitted!
There and Back Again
Interning at OffNote Labs and developing Gestop was a fantastic learning experience the likes of which I haven’t ever had. I had the flexibility to work at my own pace, explore my own solutions, and most importantly, guidance when I could not make progress. Building a complete gesture recognition system over a few months has helped me enormously, from learning how to communicate effectively to organizing my code and so much more. Some of my biggest takeaways:
- Code Quality — Don’t underestimate how much of an impact this will have on the pace of your work. Refactor often :)
- Logging — The usefulness of logging while debugging cannot be overstated. Most of us use well placed print statements while debugging and I like to think of logging as ‘print statements on steroids’, boosting the speed at which you find bugs.
- Experiment Logging — Somewhat similar to the previous point, when working on a model, you will go through through numerous variations trying to improve the performance of your model. Tracking these variations with tools like WandB is vital.
- Paper writing — Writing a satisfactory paper is a long process of many many iterations, and takes far longer than you would think.
- Diminishing Returns — The further along the project, the tougher it is to make progress. When starting out, there is a vast body of literature to refer to which eases problem solving. As you move forward, this goes on shrinking until it is hard to find something that is even tangentially related. At this point, you have to start relying on your own solutions, and these solutions, being your own, are far likelier to fail.
- As challenging as it is rewarding — The idea of research can be daunting, as you venture out into the unknown in search of a solution, but achieving something concrete, that has not been accomplished yet, is an indescribable feeling. :)
Feel free to contact me if you have any questions!