I want to be clear with the OP that this answer does not constitute a formal description of the approach in order to intuitively describe the intended approach.
Suppose that a video consists of n frames and that each of them can be represented as a three-dimensional tensor (height, width, channel). Convolutional neural networks (CNNs) can be used to create a hidden view for each frame.
(f_1, f_2,..., f_n). (RNN). RNN , CNN. (f_1, f_2,..., f_n) , RNN, ( RNN).
Yatube-8M dataset, , , . , , RNN, , c, :
alpha = softmax(FNN(f_1), FNN(f_2), ..., FNN(f_n))
c = f_1 * alpha_1 + f_2 * alpha_2 + ... + f_n * alpha_n
FNN , f_i f_i , , . c, .
, , . , . :
, , , , , . , , OP, , .