The Predictive Mind by Jakob Hohwy
Chapter 1 – Perception as Causal Inference
Presented by Lisa Bortolotti
In the book, Jakob Hohwy presents and defends the theory that “the
brain is a sophisticated hypothesis-testing mechanism, which is constantly
involved in minimizing the error of its predictions of the sensory input it
receives from the world.” According to Hohwy, the theory is supported by philosophical
argument and empirical evidence. Its greatest appeal is its unifying power. The
basic idea is that the mind “arises in, and is shaped by, prediction.”
Inferentialism A visual phenomenon that seems to support an inferentialist account of perception is binocular rivalry. When different letters or images are presented to each eye, with a partition placed between the eyes so that each eye sees only one of them, what we perceive is not a blend of the two objects but one or the other, often in alternation. The thought is that perception is not driven by the stimulus in a bottom-up way, but by an unconscious inference the brain performs. In some versions of this phenomenon, as Hohwy puts it, the brain “overrules the actual input in order to make sense of the world.”
Objections A possible objection to the approach that Hohwy anticipates is
the charge of neuro-anthropomorphism:
Can the brain really make inferences and apply Bayesian rules? For Hohwy this is
not a serious concern. The making of inferences and the application of Bayesian
rules is supposed to happen unconsciously and be realised by neuronal activity.
He uses an analogy: we can talk about the brain making inferences as we talk
about the computer computing. Another possible objection is that an
inferentialist account of perception may explain conceptual categorisation of
experience, but cannot really contribute to our understanding of the phenomenology of perception. To respond
to this objection, Hohwy offers a more detailed account of how perception
works, starting from a discussion of regularities and causal connections.
Causal interactions Causal interactions are key to perception in this model, in three ways: (a) causal interactions make perceptual inference difficult, because there is no simple one-to-one relation between the sensory input and its cause; (b) causal interactions between objects, and between object and perceiver, give rise to the first-person perspective on experience; (c) the perceiver’s causal interactions with the world are planned on the basis of perception. Perception relies on a causal hierarchy, and there are differences between the regularities on which perception is based.
Perceptual hierarchy One such difference is variance: some regularities (such as “people age with time”) are invariant and are rarely or never disconfirmed by new experience, so perceptual inferences about them are likely to overlap across distinct agents. Other regularities are more variant and dependent on the agent’s perspective (e.g., how someone’s face changes when they smile), and can easily be disconfirmed by new experience. Since all inferences rely on regularities, agents “probe deeper into the causal structure of the world”, and they do so through two-way interactions among regularities. Lower-level (more variant) regularities help select hypotheses about higher-level regularities, and hypotheses about lower-level regularities are constrained by higher-level (less variant) regularities (see graph on page 32). “Top-down priors guide perceptual inference, and perceptual inference shapes the priors”. We need Bayes’ rule to account for the way in which prior beliefs guide perceptual inference, and for how perceptual inference shapes prior beliefs.
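For reference, Bayes’ rule in its standard form (my notation, not a quotation from the book) says that the posterior probability of a hypothesis h given evidence e is the likelihood of e under h, weighted by the prior on h and normalised by the overall probability of e:

$$p(h \mid e) = \frac{p(e \mid h)\,p(h)}{p(e)}$$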
Some questions (likely to be answered in the rest of the book):
- Can the model allow for some fluidity in the variance of the regularities observed?
- Acknowledging that prior beliefs have a role in perceptual inference, do we need to endorse the view that the way in which they constrain inference is dictated by Bayes’ rule? Isn’t it serendipitous that something we came up with to account for the rationality of updating beliefs is actually the way in which our brain unconsciously works?
- Have we really got an account of the phenomenology of perception by reference to causal hierarchies, or have those just explained potential differences between individual perceivers’ perspectives?
Lisa writes:
“the brain is a sophisticated hypothesis-testing mechanism, which is constantly involved in minimizing the error of its predictions of the sensory input it receives from the world.”
--But isn't the point of the hierarchical structure that the brain is not only responding to discrepancies between sensory input and its "predictions" about sensory input, but also to discrepancies with an entire progression of states that it will be in?
As I understand Hohwy’s theory, the distinctive mark of predictive coding is that the brain starts out with “predictions” about what states it will subsequently be in, such as a retinal image, a primal sketch, a 2.5D sketch, and a 3D sketch. (The progression borrowed from Marr is just an example – what’s important is that there is a standard progression, not that it includes these specific states). If the retinal image prediction made at time t about time t+ turns out to be false, then at time t+, instead of getting into the predicted primal sketch, the brain will get into a primal sketch that better “matches” the actual retinal image. (Any clarification about what ‘matching’ comes to would be welcome). So there’s a revised primal sketch at t+ . And now the brain ‘compares’ the revised primal sketch at t+ with the prediction, made at t, about the 2.5D sketch. Is the predicted 2.5D sketch one that would be computed from the revised primal sketch? If so, then there’s no discrepancy (no error signal), and the brain ends up at time t+ with the 2.5D sketch that (at time t) was (correctly) predicted to occur at t+. But if the 2.5D sketch predicted (at time t) to occur at time t+ differs from the 2.5D that would be computed from the primal sketch at t+, then the brain ends up in the revised 2.5D sketch. And so on up the structure. The brain checks for a discrepancy between the revised 2.5D sketch and the 3D sketch predicted (at time t) to occur at time t+. If there is a discrepancy, then the brain ends up at time t+ in a revised 3D sketch that fits better with the revised 2.5D sketch.
Hohwy doesn’t talk in terms of Marr’s theory, but as I understand the hierarchical structure he appeals to, the theory assumes that there is a progression from one type of informational state to another, and the heart of the predictive coding is the idea that the brain at time t predicts an entire progression of states that it will be in at subsequent times, and checks all of those predictions – not just its prediction about the initial sensory input. On this picture, the main difference with “bottom up” processing is that there is not always a need to compute a subsequent stage from a previous one (e.g., it is not always necessary to compute a 2.5D sketch from a primal sketch). Such computations are necessary only if there’s a discrepancy between the predicted state ‘one level up’ and the actual state ‘one level down’. Absent such discrepancy, the progression of states can follow the prediction.
Jakob, can you correct any mistakes in this interpretation?
These are really interesting comments, and great for generating some discussion I hope. I am excited (and a little terrified) by this terrific idea of having an open reading group on the book.
Lisa mentions three points at the end of her account of Ch1. All three seem to me to relate to some deep issues in the book and the PEM framework.
1. Can there be fluidity in the system? The answer to this I think is ‘yes’ but it is an open question what it means for phenomenology. Fluidity can relate to imprecision in the probability distributions at some level in the hierarchy. This means that predictions are not very precise. Likewise there can be imprecision in the bottom-up prediction errors. This means that the ‘focus’ of perceptual inference can shift up and down in the hierarchy. Attention also shifts such focus around (e.g., by ‘shrinking’ the receptive field properties quite high in the hierarchy, thereby making perception more variant or perspective dependent). Precision is crucial to much that I discuss in the book.
2. Should we really believe that Bayes’ rule dictates something as fundamental as perception? This is a worry that I hear a lot. As things are explained in Ch 1, Bayes may seem somewhat optional, though a lot of perceptual processes have been shown to be suspiciously close to optimal Bayesian inference. Ch 2 will go a little bit deeper into the dark heart of the framework and intimate that in fact Bayes’ brain is an inevitable upshot of prediction error minimization (I hasten to say that this is not my result, but Friston’s; briefly, surprisal is approximated in a process that makes the chosen hypothesis the true posterior of the hypothesis given the evidence and the model, which is Bayesian inference).
3. Is this an account of phenomenology, or just of individual differences in perspective? Setting consciousness aside, I am trying to make it an account of the shape and structure of phenomenology. This is why I am so attracted to PEM. But it may be that this project falls short, and we just get the more sterile picture. Some of the later chapters try to make good on the idea, e.g., by looking at binding, cognitive penetrability, illusions and mental illness. But I am prepared to be challenged on this!
Susanna probes the role of the hierarchy in PEM, and asks whether this is sequential up through the hierarchy rather than just a matter of what happens at the sensory “surface”. Susanna also suggests this would mean that PEM is just something like Marr-with-priors.
It is right, I think, that when there is prediction error then this is passed up through the system in a sequential manner (residual prediction error is shunted up until it can be explained away at some level). But what happens at the sensory surface is what matters: the role of higher levels is to modulate sensory predictions to take account of non-linearities in the evolution of the input (e.g., take account of the fact that the cat is behind the picket fence). So all the top-down activity is directed at getting the complex sensory predictions right. Crucially, this makes the levels dependent on each other: remove a high level and lower levels all the way at the “bottom” will begin to generate more prediction error (e.g., doesn’t know how cats and fences interact). (There is a further level of complexity here, because after convolving causes in this top-down manner, the brain must invert the model to know which causes are actually out there (deconvolve things). This can be done with PEM, not Marr).
Another way to speak to this is in terms of what higher levels “know”. On the Marr-example, they know more, and lower levels are more “stupid”. But on PEM, higher levels only know more in the sense that they have a longer prediction horizon, so they can take more causal interaction into account. But they know less because their level of detail is low. So perceptual content depends on all levels, which is not Marr-like. (For this reason the 2d-2.5d-3d example is not something that would fit PEM).
Part of the motivation for PEM is also that the Marr picture is unworkable and doesn’t gel with the functional role of all the backwards connections in the brain. The moment we augment Marr with Bayesian priors we are on the way to a full PEM account.
Susanna also reasonably asks for clarification about what ‘matching’ means. Top-down predictions inhibit bottom-up signals, and this inhibition is the matching. To do it efficiently, learning of priors is useful, and so is optimization of the precisions of the inhibitory efforts, which is PEM. This is a rather appealing train of thought, I think: the brain is simply slowing down the causal impact it receives (so it can survive longer), and this slowing is done best with Bayes. (Mechanistically, matching is optimized in neuronal plasticity at various time scales).
This is very helpful stuff. A couple of thoughts:
- On the innateness of Bayes' rule. As I understand it, some Bayesians (including probabilists like Ramsey, de Finetti, van Fraassen, Joyce) think of Bayesianism as something which is inevitably going to characterize any rational epistemic agent. We get Dutch book arguments and representation theorems deployed to support this conception. I'll be very interested to see how these sorts of arguments relate to the Friston results in ch.2.
- On the metaphor of higher levels 'knowing more'. It seems like there's at least some sense in which the higher levels know more than the lower; intuitively, they apply more conceptually-sophisticated distinctions and so can be better at picking out explanatorily-important causal structure in the environment. One reasonably familiar option - not in the spirit of PEM - would be to try and capture that asymmetry via the levels representing different natural subject-matters (modelled by sets of worlds if we like), with higher levels directed towards (or matching, etc.) more coarse-grained subject-matters and lower levels directed towards finer-grained subject matters. Hopefully the relation of this kind of thought to the 'time-scale of the prediction horizon' will become clear as we go along.
Hi Al, sorry I didn't see the comments until now. I am also interested to see how some of the traditional, more philosophical concerns about dutch books and so on relate to this machine learning literature. I am not sure how this story will go. One, rather ambitious, idea in the background here is that creatures that survive (maintain themselves in their expected states) must be minimizing their prediction error in the long run (their free energy), and this entails they must be doing Bayes. I give some of the background, in terms of the link between prediction error and Bayes below in the comments, but there is much more to it than that. Interestingly, I think this means that accepting a dutch book will increase prediction error...
Perhaps the levels idea you introduce is not so foreign to PEM. The hierarchy in the brain must recapitulate the causal structure of the world, so if there is this kind of depth to the causal structure of the world it will be reflected in the brain.
Thanks for this. The dutch book results and representation theorems rely on assumptions about an agent's preferences over choices (bets) - and I'd been struggling to see how to think about the preferences had by a particular level of the hierarchy. Your replies below and to the post on the next chapter help to clarify that, I think; and the results don't obviously seem any less applicable or less instructive when applied at the subpersonal level.
I was probably too quick when I said 'not in the spirit of PEM' - apologies. As you suggest, there seems no in-principle reason why any level-structure to reality shouldn't be adapted to by an evolving PEM brain.
Can you clarify the relationship between Bayesian updating and the prediction error minimization framework? Where is prediction error located in Bayes’ theorem? I know that this is something you take yourself to have explained in the book so I'll say why I have trouble generating the answer.
Bayesian theories of perception are motivated by promising to solve the underdetermination problem. Given an input that underdetermines what the external-world causes are, we can select the most probable hypothesis that ‘explains’ the input. If our perceptual systems represent likelihoods of the input given various hypotheses about external world causes, and unconditional prior probabilities of those hypotheses, then using Bayes’s theorem we can calculate a new distribution of conditional probabilities over those hypotheses, given the input. If one of those conditional probabilities gets selected (e.g., because it’s highest), then by conditionalization we can get an unconditional probability for that hypothesis. And it can be the content of the perceptual state we’re trying to analyze.
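To make the story just sketched concrete, here is a minimal toy version in code (the hypotheses, priors, and likelihood numbers are all invented for illustration; nothing here is from the book or claimed as Hohwy's own formulation):

```python
# Toy Bayesian "perception": select the most probable external-world hypothesis
# given an ambiguous input. Hypotheses and numbers are purely illustrative.
priors = {"convex_bump": 0.7, "concave_dent": 0.3}            # p(h)
likelihoods = {                                               # p(input | h)
    "shading_dark_on_top": {"convex_bump": 0.2, "concave_dent": 0.6},
}

def posterior(input_datum):
    # Bayes' theorem: p(h | e) = p(e | h) p(h) / p(e)
    unnormalised = {h: likelihoods[input_datum][h] * priors[h] for h in priors}
    p_e = sum(unnormalised.values())                          # p(e), the denominator
    return {h: v / p_e for h, v in unnormalised.items()}

post = posterior("shading_dark_on_top")
best = max(post, key=post.get)     # the hypothesis selected to 'explain' the input
print(post, best)
```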
--Where does the error signal enter into this story? Presumably it is a datum in what’s described above. But how does an error signal present incomplete information, in the way that intuitively ‘sensory input’ does? As I understand it, in the PEM framework an error signal simply says Yes or No to a hypothesis that is “proposed” by the brain (which is a ‘hypothesis testing machine’). If it says Yes, then the error signal is as unambiguous and determinate as the hypothesis. But the hypothesis H is the thing that is supposed to resolve ambiguities, so Yes won't be less determinate than H. So while an error signal can play the role of incoming information in the Bayesian updating described above, its playing that role doesn’t seem to capture the motivation for applying the updating to the under-determination problem.
What if the error signal says No to the proposed hypothesis? (Perhaps it always says No, in which case the problem above never arises). Is the signal of the form not-H, where H is the proposed hypothesis? Not-H leaves open everything except H.
How does this apply to the case of binding (discussed in chapter 5)? Suppose a hypothesis is: “X is a red cube”. For the input to generate an error signal it has to already be about X. That entails that the input is already bound – and so minimizing prediction error (for all we’ve said so far) wouldn’t be solving the binding problem after all. If the input is feature-location information such as “yellow at location L”, then it won’t warrant an error signal of No, because the input doesn’t say anything about X. X could be a red cube even if there is something yellow at location L (and even if X is at location L+ that includes location L). At a minimum, there would have to be an error signal generated between location information in the hypothesis about an object, and location information in the input at the level of unbound features. Generally, these locations won’t be the same, so the input won’t entail No.
Do the error signals themselves come in the form of a distribution, so that e.g. ‘Yes’ (X is a red cube at L+) could get a low score, and ‘No’ a high complementary score, so that the two answers sum to 1? And then are both error signals ‘passed up the hierarchy’, leading to parallel updating processes?
This brings me back to the overarching question: Could you clarify where the error signal fits in to Bayesian updating?
(A last point: in saying that a hypothesis is proposed by the brain, I’m skirting over the hierarchical structure, where the brain would propose H at a “level”, and propose it to the input one “level down”. I don’t think that makes a difference to the question. I am also still unclear what the ‘levels’ are, not least because ultimately the hypotheses are meant to be “proposed to” the input closest to the sensory surface.)
Hi Jakob,
Like Susanna, I would appreciate clarification about how prediction error signals apply to Bayes’ theorem according to the PEM theory. I can see that the prior probabilities for the reception of various sensory data are addressed in the denominator of the right side of Bayes’ theorem. But the priors for the sensory data don’t seem to include an error signal; they’re among the expectations that need to be supervised by the world. In addition, if the world simply supplies a sensory datum, that could yield a value for p(d) in the denominator of the right side of Bayes’ theorem as a constant. But the PEM story doesn’t seem to be that a value of p(d) is simply supplied by the world as a constant. Instead, according to the PEM theory, as I currently understand it, there seems to be some interaction between expectations concerning what the sensory data are (or will be) and the data actually received; the interaction is required for an error signal. And somehow that interaction yields an application of Bayes. I haven’t grasped yet exactly how that interaction fits into an application of Bayes.
It might help to work with a simple example, like Susanna’s yellow vase example. Susanna suggests that the error signals might come in a Yes/No format. She also suggests that they might come in the form of a probability distribution. I wasn’t sure if the error signals need to come in the same format at every level of the perceptual hierarchy, though maybe they are always in the same format on your view. In any case, suppose we look at the lowest level of the hierarchy. That might include, in the case of vision, the retinal images themselves and some expectation about what the retinal images are. What is the error signal in that case? (Or if that isn’t how you’re thinking of the lowest level of the hierarchy, it would be great to clarify.) And then, once we have a grip on how the error signal works at the lowest level, can you say more about how it’s the same or different at various higher levels?
There is a nice issue here, concerning how prediction error connects to Bayes’ rule. In the book I first introduce Bayes and muse about how people have applied it to perception, and then I begin talking about prediction error (Ch. 2), followed by stuff on precisions in Ch. 3. Both prediction error and precisions are needed to make the connection to Bayes. It can help (it helped me) to think first about prediction error and model fitting, and then about Bayes.
Consider first that the basic task (for a neuronal population, say) is to estimate the mean of some signal it receives from the world (very roughly, and setting uncertainty aside, this is the notion of representation we are working with). One way of doing this is to collect enough samples from the world, add them up and then divide by the number of samples. For various reasons, this may not be how it is best done. For example, it may be that it is too difficult to remember that many samples, or it may be that this method is too slow.
Instead, the mean could be updated on a running basis, such that every new sample gives rise to an adjustment of it. With this more incremental method, the new updated mean (on observing the new sample) equals the old mean + the difference between the old mean and the new sample, divided by the number of samples up to that point. This makes sense, because in the end we want the mean being the sum of all samples, divided by the total number of samples.
It should be clear that by thinking about something as simple as calculating the mean in this piecemeal fashion, we have already (i) introduced roughly the components of prediction error minimization (the old mean is the prior, the difference between the old mean and the new sample is the prediction error, and the division by the number of samples up to that point (1/i for the ith sample) is a weight (or, very roughly, expected precision)); and (ii) spoken of the process in a very Bayesian fashion of updating an estimate (a hypothesis) in the light of some new evidence. (continued next comment)
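A tiny numerical sketch of the incremental estimate just described (my own toy code, not from the book): updating the old mean with the prediction error, weighted by 1/i on the ith sample, reproduces the batch mean exactly.

```python
# Incremental ("prediction error") estimate of a mean vs. the batch calculation.
# Sample values are invented for illustration.
samples = [5, 7, 4, 6, 9]

mean = samples[0]                       # initial estimate = first sample
for i, x in enumerate(samples[1:], start=2):
    prediction_error = x - mean         # new sample minus old estimate
    mean = mean + prediction_error / i  # weight = 1/i, the "learning rate"

batch_mean = sum(samples) / len(samples)
print(mean, batch_mean)                 # both 6.2
```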
(continuing on...) The issue is then the update rule (or the learning rate), and this relates to both Susanna’s and Jona’s comments. If we were just doing numbers, then it would be fine to update the old estimate weighted by the ith sample – but this is because there is no noise, uncertainty, or change in numbers. If the current mean is 5 and the 2nd sample number is 7, then that’s a “no” and the new mean should be updated with the prediction error, 2, weighted by ½, giving 6 as the new update.
The world is full of uncertainty, so this simple update rule won’t work (one problem is that the learning rate drops as the number of trials go up, which is fine for numbers but not for other, changing things). A lot of work has gone into defining new update rules (both Grush and Eliasmith have worked philosophically on this). A key move here is to work with normal distributions, Gaussians (as Susanna anticipates), and consider not only the mean but also the precision of these (where precision is the inverse of the variance of the distribution). This means priors and likelihoods can be expressed as Gaussians (e.g., a low prior is an imprecise Gaussian) and it turns out that the components of Bayes’ rule can be recovered in the prediction error formulation, as precision weighting of the prediction error, which guarantees that the updating will follow Bayes. Basically, the bigger the precision on the likelihood the more the posterior will change, and (roughly) the bigger precision on the prior the less the posterior will change. This is just as it should be, in Bayes.
So part of the answer to Jona’s and Susanna’s question is that Bayes is identical to precision-weighted prediction error updating. Part of the answer to Susanna’s question is that the updating is dictated by the learning rate, which is Bayesian. A “no” carries a lot of information, in virtue of precisions of priors and likelihoods, and this determines how much or how little the prior changes in the light of the “no”.
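A small sketch (my own code and invented numbers, not from the book) of the identity just claimed: for a Gaussian prior and likelihood, updating the prior mean with the precision-weighted prediction error gives exactly the conjugate Bayesian posterior mean.

```python
# Precision-weighted prediction error update vs. exact conjugate Gaussian Bayes.
# All numbers are invented for illustration.
mu_prior, pi_prior = 0.0, 1.0   # prior mean and precision (1/variance)
x, pi_like = 2.0, 4.0           # sensory sample and likelihood precision

# Prediction error formulation: shift the prior towards the sample,
# weighted by the likelihood precision relative to the posterior precision.
prediction_error = x - mu_prior
pi_post = pi_prior + pi_like
mu_pem = mu_prior + (pi_like / pi_post) * prediction_error

# Standard conjugate Gaussian posterior mean (precision-weighted average).
mu_bayes = (pi_prior * mu_prior + pi_like * x) / (pi_prior + pi_like)

print(mu_pem, mu_bayes)         # both 1.6: the two formulations coincide
```

The higher the likelihood precision, the further the estimate moves towards the sample; the higher the prior precision, the less it moves, just as the comment above says.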
Overall, the prediction error minimization scheme builds on this but takes into account that there are changing levels of noise and uncertainty in the world, which must themselves be learned. This invites a hierarchical structure. It also takes into account that the system doing the estimation may itself change (e.g., we move around an object to see it better), forcing internal modeling of itself.
All this can then be turned around, such that a system that continually, over time and on average keeps prediction error low will be engaging in Bayesian inference. In the book, I try to describe some of these elements even more heuristically, in terms of model fitting, before tying it in with the idea of approximating surprisal. I agree it is good to have the more direct link between Bayes and PEM on the table. Al, this is what happens in Ch 2 so perhaps this comment will be useful there.
Susanna links this issue to the example of binding, from Ch 5. I think this comment is long enough for now, so perhaps I’ll try and respond to the binding issue a bit later (which will give me more time to think about it). Jona also asks about the updating up through the hierarchy; the answer is that the rule is the same between all levels, but to address this comment properly I think I should think a bit more. Thanks again for all these comments!
Hi Jakob,
Thanks for clarifying how the PEM updating process is consistent with Bayes’ rule. I’m still a bit fuzzy on the form the error signal takes when we’re speaking at a computational level of explanation. My fuzziness makes me think I’m failing to understand a crucial part of the PEM mechanism. I want to begin by sketching one form the error signal and the prediction error minimization process might take. I’m hoping that you can then tell me where I am going wrong, if I am. I’ll conclude by mentioning one upshot of the sketch for the explanatory scope of the PEM account.
Here’s one way to understand how prediction error minimization might work. On one hand, there is the (Gaussian) prior probability distribution over the relevant hypotheses about the world. That’s the fantasy that needs supervising. Call it the prior distribution over worldly hypotheses. On the other hand, there is a distribution over worldly hypotheses generated as a signal from the world. I’ll call this the error signal distribution. If we understand both the prior and the error signal as distributions over worldly hypotheses, prediction error occurs to the extent that the prior distribution fails to match the error signal distribution. Resolving prediction error to generate a perceptual representation for a single case--and to update the priors--takes into account the mismatch (prediction error) between the prior distribution and the error signal distribution. The resolution of the mismatch could all be done in a way that is consistent with Bayes’ rule. So for example, ceteris paribus, the greater the precision on the likelihood, the more effect the error signal has on the posterior. And, ceteris paribus, the greater the precision on the prior, the less effect the error signal has on the posterior.
The above makes sense to me. But I’m not sure it’s your view. Here’s the crucial point of clarification I’m looking for: I’m not sure if you would characterize the information contained in the error signal as a distribution over worldly hypotheses, and, if so, whether you think of it as Gaussian. You say in your recent reply that “A key move here is to work with normal distributions, Gaussians (as Susanna anticipates)...This means priors and likelihoods can be expressed as Gaussians.” You say that the prior and likelihood are Gaussian but don’t directly address how to characterize the worldly error signal at a computational level of explanation. (I think Susanna was asking at one point whether the *error signal* is a distribution.) I’m most concerned at the moment to figure out what the worldly error signal is at a computational level of explanation. Only then can I tell how the worldly signal can be compared to the prior to identify and resolve a prediction error.
Here’s the upshot about the scope of the PEM account that I said I’d conclude with. On the interpretation just sketched, PEM isn’t intended to explain how the brain/nervous system generates a distribution over worldly hypotheses from e.g. retinal images and dermal pressures. Rather, the PEM mechanism only compares and resolves differing distributions over worldly hypotheses. If so, PEM is not the key to how the brain solves at least one thing that many people think of as a crucial problem of perception. PEM on my sketch doesn’t explain how the nervous system moves from retinal images, dermal pressures, etc. to a representation of the world (i.e. the representation contained in the error signal). Is it true that the PEM mechanism isn’t intended to explain that part of the process?
I am first picking up again on some of Susanna's comments further up.
Prediction error can usefully be understood in terms of the likelihood. If h generates little prediction error and h* generates much prediction error, then p(e|h)>p(e|h*); (the sensory evidence e is less surprising given h than given h*). Since h can explain away more of e than h*, it has stronger evidence in its favour than h*. So in this sense e says ‘yes’ to h and ‘no’ to h*. The evidence e is still ‘incomplete’, since p(h|e) < 1. So another hypothesis could in principle explain the evidence better, with less prediction error (i.e., h can still be refined).
Underdetermination is uncertainty: x or x’ might cause e, so given e I don’t know whether h or h* is true (where h postulates x and h* postulates x’). But if h generates less prediction error than h*, then e is more likely given h, and the underdetermination can be addressed.
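A toy illustration of the point that less prediction error corresponds to a higher likelihood (my own code; the Gaussian likelihoods and all numbers are assumptions made purely for illustration):

```python
import math

# The hypothesis whose prediction lies closer to the evidence (smaller
# prediction error) assigns that evidence a higher Gaussian likelihood p(e|h).
def gaussian_likelihood(e, predicted, sigma=1.0):
    return math.exp(-((e - predicted) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

e = 3.0                           # observed sensory evidence
pred_h, pred_h_star = 3.2, 5.0    # predictions made by h and by h*

print(gaussian_likelihood(e, pred_h) > gaussian_likelihood(e, pred_h_star))  # True: e favours h
```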
What about the prediction error simply saying ‘no’ to h, and just leaving everything open (any not-h hypothesis could then be on the table)? This is a question about inference, and therefore the learning rate. In the case of prediction error, h should be updated in a Bayesian fashion, that is, the prediction error is precision weighted (as per the previous post). If I get a prediction error for h, then I am not free to pick any not-h hypothesis, I must pick the one that Bayes tells me to pick (which happens to be the one that minimizes prediction error, hence PEM). This is not too far from the toy case where I expect the next number to be 5 but pull out a 7 (a not-5). Maths tells me to update the expectation in a very specific way (to 6, if the 7 was the second sample).
Cases with unequivocal yes and no have little interest because then there is no uncertainty and no use for Bayes (if only h can give rise to e, then other hypotheses are explained away conclusively). It is crucial that prediction errors do not simply say yes or no to hypotheses.
I still have a feeling I am missing something here. Perhaps it is more a question about where the content of the hypotheses comes from, rather than the ability to choose amongst hypotheses? So, in terms of the binding issue, which Susanna mentions, why is there a red-cube hypothesis in the first place? This makes it a question about model selection and inference. The worry is that PEM only gives model selection but not inference. My answer to that is that it is just the precision-weighted prediction error updating that is inference because it is what determines content (in virtue of shaping parameters of the distributions). So in the binding case we might start with the hypothesis that there is something red and something cubical but that they are distinct. If they are distinct then pushing one should leave the other unchanged. However, when this prediction is tested, there is prediction error: they change together. So h is updated to h’ which says they have a spring connecting them, but again we get prediction error (because there is not the expected spring-induced non-linearity in the red and cubical input), and h’ is updated to h’’ which says that they co-occur in time and space (i.e., they are bound), and h’’ wins and we experience them as bound. The beauty of this story is that it is no different from any other Bayesian updating, so binding just becomes inference. The prediction error becomes a learning signal that generates content, something the input itself (seen in isolation from the predictions from h) cannot deliver.
As to the hierarchy, the prediction error is processed the same way all the way up; there is never any unequivocal yes/no situation, since all levels are precision weighted (in order to be Bayesian). I’ll try and say something more about the hierarchy later. It is crucial because the rich way the hierarchy is structured makes it more plausible to hold the ‘mutual information’ view of perceptual content.
I hope some of the comments on Susanna’s comment above address some of the issues you raise too, Jona. PEM is very much intended to explain how the brain uses its sensory input to represent the world. The prediction error is the result of the comparison of h and the input e (if x is the input and mu is the expected value, then x minus mu is the prediction error, and mu is updated with this difference, weighted according to the ratio between the likelihood precision and the posterior precision (which is the sum of the prior precision and the likelihood precision)). The prediction error is minimized (resolved) by such Bayesian updating of h (roughly speaking only, since this doesn’t take varying uncertainty into account).
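In symbols (my own notation, with $\pi$ for precision), the update described in the parenthesis above is, roughly:

$$\mu_{\text{new}} = \mu + \frac{\pi_{e}}{\pi_{\mu} + \pi_{e}}\,(x - \mu)$$

where $x - \mu$ is the prediction error, $\pi_e$ the likelihood precision, $\pi_\mu$ the prior precision, and $\pi_\mu + \pi_e$ the posterior precision.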
What happens in this process is that h must be carrying information about e (variability in e predicts variability in h). If we assume that e carries information about worldly causes y, then h comes to carry information about y. The shape of the hypotheses entertained by the brain must recapitulate the causal structure of the world (since variability in y is a matter of causal interactions, and variability in h covaries with variability in the world).
I think I am missing something in your question here, Jona, because the Bayesian process of updating hypotheses is just, I think, the process of explaining how “the brain/nervous system generates a distribution over worldly hypotheses from e.g. retinal images and dermal pressures.” (I don’t see how this is different from “PEM mechanism only compares and resolves differing distributions over worldly hypotheses”, given the element of inference I have sketched).