Kinect gives you the skeletons it is tracking; you have to do the rest. Basically, you need to create a definition for each gesture you want and run it against the skeletons every time the SkeletonFrameReady event fires. This is not easy.
Gesture Definition
Defining gestures can be surprisingly difficult. The simplest gestures are those that happen at a single point in time and therefore do not rely on past positions of the limbs. For example, if you want to detect when the user has a hand raised above their head, this can be checked on each individual frame. More complicated gestures have to take a period of time into account. For your waving gesture, you cannot tell from a single frame whether a person is waving or just holding a hand up in front of them.
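A single-frame check like the raised hand can be sketched as follows. This is a minimal illustration in Python rather than Kinect SDK code: the `Joint` class is a hypothetical stand-in for a joint position, and the sketch assumes that y increases upward.

```python
from dataclasses import dataclass


@dataclass
class Joint:
    # Hypothetical stand-in for a joint position; y is assumed to point up.
    x: float
    y: float
    z: float


def hand_above_head(head: Joint, left_hand: Joint, right_hand: Joint) -> bool:
    """Return True if either hand is higher than the head in this one frame."""
    return left_hand.y > head.y or right_hand.y > head.y
```

Because the function needs nothing but the current frame, it can be called directly from the frame handler with no stored history.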
So now you need to store relevant information from the past, but which information is relevant? Should you keep the last 30 frames and run an algorithm against them? 30 frames only gets you one second's worth of information... maybe 60 frames? Or, for your 5 seconds, 150 frames? People don't move that fast, so maybe you could use every fifth frame, which would bring your 5 seconds back down to 30 frames. A better idea would be to pick and choose the relevant information out of the frames. For a waving gesture, the hand's current speed, how long it has been moving, how far it has moved, etc. could all be useful information.
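One way to keep only the relevant information is to reduce each frame to a derived value and store that in a bounded buffer. The sketch below is an assumption-heavy illustration: `HandMotionTracker`, its parameters, the 30 frames-per-second rate, and the choice of horizontal speed as the stored feature are all hypothetical, not part of the Kinect SDK.

```python
from collections import deque


class HandMotionTracker:
    """Keeps a short history of horizontal hand speed instead of raw frames."""

    def __init__(self, window_seconds=1.0, fps=30):
        self.fps = fps  # assumed skeleton frame rate
        # Old entries fall off automatically once the window is full.
        self.speeds = deque(maxlen=int(window_seconds * fps))
        self._last_x = None

    def update(self, hand_x):
        """Record one frame and return the signed speed in units per second."""
        speed = 0.0 if self._last_x is None else (hand_x - self._last_x) * self.fps
        self._last_x = hand_x
        self.speeds.append(speed)
        return speed

    def moving_time(self, min_speed=0.3):
        """Seconds within the window during which the hand moved fast enough."""
        return sum(1 for s in self.speeds if abs(s) >= min_speed) / self.fps

    def distance_travelled(self):
        """Total horizontal distance covered within the window."""
        return sum(abs(s) for s in self.speeds) / self.fps
```

Calling `update` once per frame with the hand's x position is enough; the gesture definition can then ask about speed, moving time, and distance without ever touching old skeleton frames.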
After you figure out how to get and save all the information related to your gesture, how do you turn these numbers into a definition? A wave may require a certain minimum speed, or a direction (left/right, not up/down), or a duration. However, this duration is not the 5-second duration you are interested in. It is the absolute minimum required to assume that the user is waving. As mentioned above, you cannot determine a wave from a single frame. You should not determine a wave from 2, 3, or 5 frames either, because that is simply not enough time. If my hand twitches for a split second, would you consider that a wave? There is probably a sweet spot where most people would agree that a left-to-right motion is a wave, but I certainly don't know it well enough to define it in an algorithm.
There is another problem with requiring the user to perform a certain gesture for a certain period of time. Most likely, not every frame in those five seconds will look like a wave, regardless of how well you write the definition. Whereas you can easily determine that someone held a hand over their head for five seconds (because that can be checked on a single-frame basis), it is much harder to do for complicated gestures. And while waving is not that complicated, it still shows this problem. When your hand changes direction at either side of a wave, it stops moving for a split second. Are you still waving then? If you answered yes, wave more slowly so that you pause a little longer at either side. Would that pause still be considered a wave? Most likely, at some point during that five-second gesture, the definition will fail to detect a wave. So now you need to build some leniency into the gesture duration. If a wave was detected during 95% of the last five seconds, is that good enough? 90%? 80%?
The point I'm trying to make here is that there is no easy way to do gesture recognition. You have to think through the gesture and come up with a definition that turns a bunch of joint positions (the skeleton data) into a gesture. You will need to keep track of relevant data from past frames, but understand that your gesture definition likely won't be perfect.
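The leniency idea can be sketched as a sliding window of per-frame results with a required fraction. Everything here is an illustrative assumption: the class name, the default of 150 frames (five seconds at an assumed 30 frames per second), and the 90% threshold are placeholders to tune, not recommended values.

```python
from collections import deque


class LenientDurationCheck:
    """Reports success when enough of the recent frames matched the gesture."""

    def __init__(self, window_frames=150, required_fraction=0.9):
        self.flags = deque(maxlen=window_frames)  # one bool per frame
        self.required_fraction = required_fraction

    def update(self, detected_this_frame: bool) -> bool:
        self.flags.append(detected_this_frame)
        if len(self.flags) < self.flags.maxlen:
            return False  # not enough history to judge the full duration yet
        return sum(self.flags) / len(self.flags) >= self.required_fraction
```

A brief pause at the edge of a wave then costs a few `False` frames without resetting the whole gesture.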
Consider the Users
So, now that I have explained why a five-second wave will be difficult to detect, let me at least give my thoughts on how to do it: don't. You should not force users to repeat a motion-based gesture for a set period of time (the five-second wave). It is surprisingly tiring and not what people expect or want from computers. Pointing and clicking is instant; as soon as we click, we expect a response. No one wants to hold a click down for five seconds before they can open Minesweeper. Repeating a gesture over a period of time is fine if it continuously performs an action. For example, when a gesture is used to scroll through a list, the user will understand that they must keep making the gesture to move further through the list. This even simplifies the gesture definition, because instead of needing information about the last 5 seconds, you only need enough information to tell whether the user is making the gesture right now.
If you want the user to hold a gesture for a set period of time, make it a stationary gesture (holding your hand in some position for x seconds is much easier than waving). It is also a very good idea to give some visual feedback to show that the timer has started. If the user messes up the gesture (wrong hand, wrong place, etc.) and stands there for 5 or 10 seconds waiting for something to happen, they will not be happy, but that is outside the scope of this question.
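A held pose with feedback can be sketched as a timer that returns its progress every frame, so the UI has something to draw from the first moment the pose is recognized. The `HoldGestureTimer` name and the two-second default are assumptions for illustration; `pose_held` is expected to come from some single-frame check of your own.

```python
class HoldGestureTimer:
    """Tracks how long a stationary pose has been held without interruption."""

    def __init__(self, hold_seconds=2.0):
        self.hold_seconds = hold_seconds
        self.elapsed = 0.0

    def update(self, pose_held: bool, dt: float) -> float:
        """Advance by dt seconds and return progress between 0.0 and 1.0."""
        # Breaking the pose resets the timer; holding it accumulates time.
        self.elapsed = self.elapsed + dt if pose_held else 0.0
        return min(self.elapsed / self.hold_seconds, 1.0)
```

The returned progress could drive a filling ring or bar; the action fires when it reaches 1.0, and a value stuck at 0.0 tells the user immediately that the pose is not being recognized.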
Starting with Kinect Gestures
Start small... really small. First, make sure you know your way around the SkeletonData class. Each skeleton has 20 joints, and each joint has a TrackingState. This tracking state shows whether Kinect can actually see the joint (Tracked), whether it is estimating the joint's position from the rest of the skeleton (Inferred), or whether it has given up trying to find the joint entirely (NotTracked). These states are important. You do not want to conclude that the user is standing on one leg simply because Kinect cannot see the other leg and is reporting a made-up position for it. Each joint has a position, which is how you know where the user is. Piece of cake. Get familiar with the coordinate system.
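The standing-on-one-leg trap can be sketched like this. The enum below only mirrors the three state names mentioned above; it, `TrackedJoint`, `foot_is_raised`, and the 0.15 threshold are hypothetical illustrations and not the SDK's own types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TrackingState(Enum):
    # Mirrors the three states named in the text; not the SDK's own enum.
    NOT_TRACKED = 0
    INFERRED = 1
    TRACKED = 2


@dataclass
class TrackedJoint:
    y: float  # height of the joint; y is assumed to point up
    state: TrackingState


def foot_is_raised(foot: TrackedJoint, other_foot: TrackedJoint,
                   min_height: float = 0.15) -> Optional[bool]:
    """Return True/False, or None when the data is too unreliable to decide."""
    if (foot.state is not TrackingState.TRACKED
            or other_foot.state is not TrackingState.TRACKED):
        return None  # refuse to guess from inferred or missing joints
    return foot.y - other_foot.y >= min_height
```

Returning `None` instead of `False` lets the caller distinguish "not raised" from "cannot tell", which matters if the result feeds a duration-based gesture.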
After you learn the basics of how skeleton data is reported, try some simple gestures. Print a message on the screen when the user raises a hand above their head. This only requires comparing each hand joint to the head joint and checking whether either hand is higher than the head in the coordinate plane. Once you get that working, move on to something more complex. I suggest trying a swipe motion (the hand in front of the body moves right to left or left to right over some minimum distance). This requires information from past frames, so you will have to think about what information to store. If you can get that working, you can try detecting a series of swipes within a short period of time and interpreting that as a wave.
tl;dr: Gestures are tough. Start small and build your way up. Do not force users to make repeated movements for a single action; it is tiring and annoying. Include visual feedback for duration-based gestures. Read the rest of this post.
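A very rough swipe detector might look like the sketch below. It is only one possible definition: it looks at the net horizontal displacement of the hand over a short window, assumes positive x means "right", and uses made-up thresholds. `SwipeDetector` and `looks_like_wave` are hypothetical names.

```python
from collections import deque


class SwipeDetector:
    """Reports a swipe when the hand moves far enough within a few frames."""

    def __init__(self, min_distance=0.3, max_frames=15):
        self.xs = deque(maxlen=max_frames)  # recent hand x positions
        self.min_distance = min_distance

    def update(self, hand_x):
        """Feed one frame; returns 'right', 'left', or None."""
        self.xs.append(hand_x)
        displacement = self.xs[-1] - self.xs[0]
        if abs(displacement) >= self.min_distance:
            self.xs.clear()  # start fresh so one motion is reported once
            return "right" if displacement > 0 else "left"
        return None


def looks_like_wave(recent_swipes, min_swipes=3):
    """True if the last min_swipes swipes alternate in direction."""
    last = recent_swipes[-min_swipes:]
    return (len(last) == min_swipes
            and all(a != b for a, b in zip(last, last[1:])))
```

The caller would collect the non-`None` results from `update`, drop old ones after a short time, and pass the remainder to `looks_like_wave`.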