Oftentimes when I give talks, especially ones concerning layout or the visual grammar of sequences of images, someone inevitably raises an objection along the lines of "But there's no guarantee that the reader will view a page in the proper order." The high variability of possible readings of a comic page makes it hard for people to accept a steadfast theory of comprehension.
However, a similar issue was at play in linguistics back in the 1950s, and it was one that Noam Chomsky addressed with his distinction between competence and performance. Competence refers to the (idealized) organization of rules and constraints in our minds that guides us to understand language. Performance is the vast variability that happens in real-life exchanges.
For example, someone might say something like this over the phone:
I ...uh... I went *cough cough* to the store *STATIC*--oday and, like, ...um... saw *CAR HORN*--ohn from my class in the check-*hiccup*-out line.
There are lots of interruptions, unclear portions and distractions. However, most likely a listener would glean from this a sentence like:
I went to the store today and saw John from my class in the checkout line.
The rules in your head are not bothered by the messiness of the context — your attentional system can filter out a lot of it.
The same is true of reading a comic page. Let's say you start in one panel and go to another, then realize that the second panel shouldn't have come next. You're not violating the mental rules that go into comprehension. In fact, those rules are what tell you it's the wrong order. These rules of comprehension operate outside your conscious awareness.
Your (unconscious) competence wins out over the messiness and variability of performance.
This same issue may be at play in comparisons of comics to film. Yes, film and comics are presented differently (one static, one moving), but that doesn't necessarily mean that their comprehension in people's minds is entirely different. The difference in presentation may be a "performance" issue, while the comprehension is a "competence" issue. (Though in my mind there is bound to be at least some variance due to that difference in presentation, motion vs. static, which will need experiments to explore... yay science!)
I should also point out that, in linguistics, there is some debate over how clean this split really is. For example, for a long time it was argued that words like "um" and "uh" are just performance clutter. However, research has shown that these actually hold meaning for the discourse (essentially signaling how long a pause the speaker is going to make before continuing to talk).
Nevertheless, for many issues facing the comprehension of "comics" (and/or film), it is an important split to make.