Their ANN is a classifier that answers the question "Does this frame include the face of a cat?". Like any other classifier, it needs a training set for the unsupervised training, and that set has to be somewhat balanced. However, taking random YouTube frames will probably give you a very skewed dataset (far too many negative samples). To get a more balanced training set, they probably used keywords in the video title, or manual selection of videos, to get more positive samples and fewer negative ones.
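To make the keyword idea concrete, here is a minimal sketch of how one might pick a roughly balanced set of videos from their titles. This is only an illustration of the guess above, not what they actually did: the keyword list and the function names `likely_has_cat` and `balanced_sample` are made up for this example.

```python
import random

# Hypothetical keyword list; the actual keywords (if any were used) are unknown.
CAT_KEYWORDS = ("cat", "cats", "kitten", "kittens", "kitty")


def likely_has_cat(title):
    """Crude guess from the title: True if any keyword appears as a whole word."""
    words = title.lower().split()
    return any(word.strip(".,!?") in CAT_KEYWORDS for word in words)


def balanced_sample(titles, per_class, seed=0):
    """Return (likely_positive, likely_negative) title lists of equal size."""
    rng = random.Random(seed)
    likely_pos = [t for t in titles if likely_has_cat(t)]
    likely_neg = [t for t in titles if not likely_has_cat(t)]
    # Cap both classes at the size of the smaller one so neither dominates.
    n = min(per_class, len(likely_pos), len(likely_neg))
    return rng.sample(likely_pos, n), rng.sample(likely_neg, n)
```

Frames would then be drawn from the selected videos. A title match is only a weak hint, so the result is "more balanced" rather than cleanly labeled, which is all the argument above needs.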
Diego