I have usually seen neural networks used for this kind of recognition task, e.g. here, here, here, and here. Since a simple Google search turns up so many hits for neural networks in OCR, I assume you are set on using an HMM (a design constraint, right?). Regardless, these links may offer some insight into gridding the image and extracting image features.
Your approach to turning the grid into a sequence of observations is reasonable. Just make sure that you do not confuse observations with states. The features that you extract from one block should be collected into a single observation, i.e. a feature vector. (Compared to speech recognition, your block's feature vector is analogous to the feature vector associated with a speech phoneme.) You do not really have much information about the underlying states. That is the hidden aspect of an HMM, and the training process should teach the model how likely one feature vector is to follow another for a given character, which it captures through the transition probabilities between hidden states.
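To make the observation/state distinction concrete, here is a minimal sketch of how a trained HMM scores an observation sequence (the forward algorithm) and how one model per character could be used for classification. It assumes the feature vectors have already been quantised to integer symbols, which is a simplification of mine; a Gaussian-emission HMM would score the vectors directly. The names `log_likelihood`, `classify`, and the `(start, trans, emit)` layout are illustrative, not from any particular library.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    obs         : non-empty list of integer symbols (the observations)
    start[i]    : P(first hidden state = i)
    trans[i][j] : P(next state = j | current state = i)
    emit[i][k]  : P(symbol k is observed | state i)
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the hidden states, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob

def classify(obs, models):
    """Pick the label whose HMM gives the sequence the highest likelihood.

    models maps a character label to its (start, trans, emit) tuple.
    """
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))
```

Note that the states never appear in `obs`: the code only ever sums over them, which is exactly what "hidden" means here.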
Since this is an offline process, do not worry about the temporal aspects of how the characters are actually drawn. For the purposes of your task, you have imposed a temporal order on the sequence of observations by reading the blocks left to right, top to bottom. This should work fine.
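As a rough sketch of that step, the following turns a binary character image into one feature vector per block, ordered left to right, top to bottom. The three features (ink density and normalised ink centroid) are placeholders I picked for illustration, not a recommendation, and the code assumes a square image, given as a list of rows of 0/1 values, whose side length is divisible by the grid size.

```python
def block_features(block):
    """Hypothetical per-block features: (ink density, centroid row, centroid col).

    The centroid is normalised to [0, 1]; an empty block reports the centre.
    """
    h, w = len(block), len(block[0])
    ink = [(r, c) for r in range(h) for c in range(w) if block[r][c]]
    if not ink:
        return (0.0, 0.5, 0.5)
    n = len(ink)
    density = n / (h * w)
    centroid_row = sum(r for r, _ in ink) / n / max(h - 1, 1)
    centroid_col = sum(c for _, c in ink) / n / max(w - 1, 1)
    return (density, centroid_row, centroid_col)

def image_to_observations(image, grid=8):
    """Return grid*grid feature vectors in raster order (one per block)."""
    step = len(image) // grid  # assumes the side length is divisible by grid
    observations = []
    for grid_row in range(grid):          # top to bottom
        for grid_col in range(grid):      # left to right
            block = [
                row[grid_col * step:(grid_col + 1) * step]
                for row in image[grid_row * step:(grid_row + 1) * step]
            ]
            observations.append(block_features(block))
    return observations
```

Each element of the returned list is a single observation; the position in the list plays the role that time plays in speech.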
As for HMM performance: choose a reasonable vector of salient features. In speech recognition, the dimensionality of a feature vector can be high (> 10). (The linked literature may also help here.) Set aside a percentage of the training data so that you can test the model properly. First train the model, and then evaluate it on the training set. How well are your characters classified? If it does poorly, re-evaluate the feature vector. If it does well on the training data, check how well the classifier generalizes by running it on the reserved test data.
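The evaluation loop described above could be sketched like this. It assumes each sample is an `(observations, label)` pair and that `classify` is whatever function wraps your trained models; both names, the 20% hold-out fraction, and the fixed seed are my own choices for illustration.

```python
import random

def split_data(samples, held_out_fraction=0.2, seed=0):
    """Shuffle the samples and reserve a fraction of them for testing.

    Returns (train, test). The reserved part must not be used for training.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * held_out_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classify, samples):
    """Fraction of (observations, label) pairs that classify gets right."""
    if not samples:
        raise ValueError("no samples to evaluate")
    correct = sum(1 for obs, label in samples if classify(obs) == label)
    return correct / len(samples)
```

In use, you would train on `train`, call `accuracy(classify, train)` first, and only look at `accuracy(classify, test)` once the training-set score is acceptable. A large gap between the two numbers suggests the model has memorized the training characters rather than learned something general.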
As for the number of states, I would start with something derived heuristically. Assuming your character images are scaled and normalized, maybe something like 40% (?) of the blocks are occupied. This is a rough guess on my part, since no source image was provided. For an 8x8 grid, that would mean about 25 of the 64 blocks are occupied. We could then start with 25 states, but that is probably naive: empty blocks can convey information (so the number of states might go up), while some feature sets may be observed in similar states (so the number of states might go down). If it were me, I would pick something like 20 states. Having said that, be careful not to confuse features with states. Your feature vector is a representation of what is observed in a particular state. If the tests described above show that your model is performing poorly, adjust the number of states up or down and try again.
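If you would rather measure the occupancy than rely on my 40% guess, something like the following could produce a starting point and a few neighbouring values to try. It assumes square binary images (lists of rows of 0/1 values) whose side length is divisible by the grid size; the function names and the step of 5 between candidates are arbitrary choices of mine.

```python
def occupied_blocks(image, grid=8):
    """Count the grid blocks that contain at least one ink pixel."""
    step = len(image) // grid  # assumes the side length is divisible by grid
    count = 0
    for grid_row in range(grid):
        for grid_col in range(grid):
            rows = range(grid_row * step, (grid_row + 1) * step)
            cols = range(grid_col * step, (grid_col + 1) * step)
            if any(image[r][c] for r in rows for c in cols):
                count += 1
    return count

def initial_state_guess(images, grid=8):
    """Average number of occupied blocks over the images, rounded."""
    mean = sum(occupied_blocks(img, grid) for img in images) / len(images)
    return max(1, round(mean))

def state_candidates(guess, step=5, n_each_side=2):
    """State counts to try around the initial guess, dropping non-positive ones."""
    values = [guess + k * step for k in range(-n_each_side, n_each_side + 1)]
    return [v for v in values if v > 0]
```

For example, `state_candidates(20)` gives `[10, 15, 20, 25, 30]`; you would train one model per candidate and keep the state count that scores best in the tests described earlier.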
Good luck.