TL;DR
I need help understanding some parts of a particular structured-data classification algorithm. I am also open to suggestions for alternative algorithms for this purpose.
Hello everybody!
I am currently working on a system that involves classifying structured data (I would prefer not to disclose more than that), for which I use the plain backpropagation through structure (BPTS) algorithm. I plan to port the code to the GPU later for additional speed, but at the moment I am looking for algorithms that perform better than BPTS.
I recently came across this article → [1] and was amazed by the results. I decided to try it, but I am having trouble understanding some parts of the algorithm, since its description is not very clear. I have already emailed some of the authors requesting clarification, but have not heard back, so I would really appreciate any insight you can offer.
A high-level description of the algorithm can be found on page 787. There, in step 1, the authors randomly initialize the network weights and then propagate the input attributes of each node through the data structure, from the frontier nodes forward to the root, thereby obtaining the output of the root node. I understand that step 1 is never repeated, since it is an initialization step. But the part I paraphrased indicates that a one-time forward activation also happens here. Which element of the training set is used for this activation? And should this activation really happen only once? For comparison, in the BPTS algorithm I use, a new neural network whose topology mirrors the current element (data structure) is created on the fly and activated for each element in the training set. The error is then propagated back, the weights are updated and stored, and the temporary network is destroyed.
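For concreteness, here is a minimal sketch of the per-element BPTS loop I described above, using a toy recursive network over binary trees. All names, sizes, and the squared-error loss are my own illustration, not the paper's formulation, and gradients are taken numerically to keep the sketch short; a real implementation would backpropagate the error through the unfolded structure instead:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_H = 3, 4  # attribute and hidden sizes (arbitrary for the sketch)

# Shared weights; the network is "unfolded" to match each tree's topology.
params = {
    "Wx": rng.normal(0, 0.3, (D_H, D_IN)),  # node attributes -> hidden
    "Wl": rng.normal(0, 0.3, (D_H, D_H)),   # left-child state -> hidden
    "Wr": rng.normal(0, 0.3, (D_H, D_H)),   # right-child state -> hidden
    "b":  np.zeros(D_H),
    "wo": rng.normal(0, 0.3, D_H),          # root hidden state -> scalar output
}

def forward(node, p):
    """Forward sweep from the frontier nodes up to the root."""
    x, left, right = node
    h = p["Wx"] @ x + p["b"]
    if left is not None:
        h = h + p["Wl"] @ forward(left, p)
    if right is not None:
        h = h + p["Wr"] @ forward(right, p)
    return np.tanh(h)

def loss(tree, target, p):
    return 0.5 * (p["wo"] @ forward(tree, p) - target) ** 2

def train_step(tree, target, p, lr=0.1, eps=1e-5):
    """One BPTS-style update on a single element (numerical gradients)."""
    for w in p.values():
        g = np.zeros_like(w)
        it = np.nditer(w, flags=["multi_index"])
        for _ in it:
            i = it.multi_index
            w[i] += eps; lp = loss(tree, target, p)
            w[i] -= 2 * eps; lm = loss(tree, target, p)
            w[i] += eps
            g[i] = (lp - lm) / (2 * eps)
        w -= lr * g

# Two toy elements with *different* topologies, as in my question.
leaf = lambda: (rng.normal(size=D_IN), None, None)
t1 = (rng.normal(size=D_IN), leaf(), leaf())
t2 = (rng.normal(size=D_IN), (rng.normal(size=D_IN), leaf(), None), None)
data = [(t1, 1.0), (t2, -1.0)]

loss_before = sum(loss(t, y, params) for t, y in data)
for epoch in range(50):
    for tree, target in data:  # one update per element, per epoch
        train_step(tree, target, params)
loss_after = sum(loss(t, y, params) for t, y in data)
```

Note how the temporary "network" is just the recursion itself here: its shape is determined by each tree, while the parameters are shared across all elements.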
Another thing that bothers me is step 3b. There, the authors state that they update the parameters {A, B, C, D} NT times using equations (17), (30), and (34). I understand that NT stands for the number of elements in the training set. But equations (17), (30), and (34) already sum over ALL elements of the training set, so what is the point of evaluating them NT times?
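To make my question concrete, here is a toy stand-in (my own illustration, not the paper's equations) contrasting the two readings of step 3b. Since the equations already aggregate over the whole training set, "update NT times" could mean either that NT appears only inside the update itself, or that the same full-batch update is literally repeated NT times:

```python
import numpy as np

def g(theta, data, lr=0.2):
    # Stand-in for eqs. (17)/(30)/(34): a step computed from ALL elements
    # (here, the gradient of the mean of 0.5 * (theta - x)^2).
    return lr * np.mean(theta - data)

def full_batch_update(theta, data):
    return theta - g(theta, data)               # reading 1: applied once

def nt_repeated_update(theta, data):
    for _ in range(len(data)):                  # reading 2: the same
        theta = full_batch_update(theta, data)  # update repeated NT times
    return theta

data = np.array([1.0, 2.0, 3.0])  # toy training set (optimum: theta = 2.0)
once = full_batch_update(0.0, data)   # 0.4
nt   = nt_repeated_update(0.0, data)  # 0.4 -> 0.72 -> 0.976
```

Both readings are well defined (the second is simply NT full-batch steps per iteration), which is exactly why I cannot tell from the text which one the authors intend.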
Another thing I could not work out is how exactly their algorithm accounts for the (possibly) different structure of each element in the training set. I know how this works in BPTS (I described it above), but it is very unclear to me how it works in their algorithm.
Well, that's all for now. If anyone knows what is going on with this algorithm, I would be very interested to hear (or rather, read) about it. Also, if you are aware of other promising algorithms and/or network architectures for classifying structured data (could long short-term memory (LSTM) networks be used?), please feel free to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf